Introduction to
Big Data technologies
Radosław Stankiewicz
Hacker | BD & DS FTW! | Technical Lead | Trainer
2
Src: computing.co.uk , https://www.flickr.com/photos/barron/15483113 , tech.co
Agenda
Intro -> Map Reduce -> Pig -> Hive -> Ambari
4
Introduction
5
Volume
6
Variety
7
A|123|10$
B|555|20$
Y|333|15$
{
  "type": "A",
  "id": 123,
  "amount": "10$"
}
Velocity
(diagram: Batch, Interactive analytics, Streaming, Real Time, OLAP)
8
Data storage
• files (batch and interactive analysis): flat files, CSV; JSON; AVRO; columnar formats
• NoSQL (random access): (rowid, col, time) -> value: Accumulo, HBase; Cassandra; MongoDB; graph databases
• indexes: Solr, Elastic Search
9
Value
11
Problem classification
• A database of Warsaw streets, data in JSON format,
optimizing waste collection for one of the service providers.
• Events from a transactional database and credit cards,
to better detect fraud
• A system that finds good car offers across many
sites - web crawling, data parsing, analysis
of car price trends
• A central repository of contract scans, TBs of data,
several hundred new documents arriving every day
12
13
BI/BigData/EDH
14
(Venn diagram: BI, BD, EDH)
Origins
• too much data
• server crashes
• slow relational databases
15
16
17
18
Architecture
19 source: Hortonworks
The Hadoop ecosystem
20 source: Hortonworks
Introduction to MapReduce,
using the Hadoop platform as an example
21
Architecture
22
HDFS
Inspired by GFS (shown on the right)
Key characteristics:
• Fault tolerant
• Commodity, low cost hardware
• Batch processing
• High throughput, not low latency
• Write Once, Read Many
23
HDFS - Namenode,
Datanode
24
HDFS - replication
Datanodes
Namenode
25
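As a small illustration, the replication factor can be read and changed per path from the shell (the path below is hypothetical):
hdfs dfs -stat %r /user/tmp/data      # print the current replication factor
hdfs dfs -setrep -w 2 /user/tmp/data  # set it to 2 and wait for re-replication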
● User Commands
o dfs
o fsck
● Administration Commands
o datanode
o dfsadmin
o namenode
dfs:
appendToFile cat chgrp chmod chown copyFromLocal copyToLocal count cp du
dus expunge get getfacl getfattr getmerge ls lsr mkdir moveFromLocal
moveToLocal mv put rm rmr setfacl setfattr setrep stat tail test text touchz
hdfs dfs -put localfile1 localfile2 /user/tmp/hadoopdir
hdfs dfs -getmerge /user/hadoop/output/ localfile
commands
26
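For instance, an administrator might check the health and block placement of a directory with fsck (the path is hypothetical):
hdfs fsck /user/hadoop -files -blocks -locations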
Permissions
POSIX - Knox - Ranger
27
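On the POSIX layer this works just like a Unix filesystem; Knox (perimeter security) and Ranger (fine-grained policies) build on top of it. A minimal sketch, with a hypothetical user, group and path:
hdfs dfs -chown analyst:bigdata /data/reports
hdfs dfs -chmod 750 /data/reports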
YARN architecture
28
Map Reduce Framework
29
Map Reduce Framework
30
(diagram: map tasks (M) feeding reduce tasks (R))
Mapper
#!/usr/bin/env python
import sys

# emit "word<TAB>1" for every word on stdin
for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print '%s\t%s' % (word, 1)

line = "Ala ma kota"
Ala 1
ma 1
kota 1
31
Reducer
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input arrives sorted by key, so counts for one word are adjacent
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s,%s' % (current_word, current_count)
        current_count = count
        current_word = word

# flush the last word
if current_word == word:
    print '%s,%s' % (current_word, current_count)
ala 1
ala 1
bela 1
dela 1
ala,2
bela,1
dela,1
32
Launching a streaming job
cat input.txt | ./mapper.py | sort | ./reducer.py

bin/yarn jar [..]/hadoop-*streaming*.jar \
  -file mapper.py -mapper ./mapper.py \
  -file reducer.py -reducer ./reducer.py \
  -input /tmp/wordcount/input -output /tmp/wordcount/output
33
Map Reduce in Java
(input) <k1, v1> -> map -> <k2, v2> -> combine ->
<k2, v2> -> reduce -> <k3, v3> (output)
1) Mapper
2) Reducer
3) run
public class WordCount extends Configured
implements Tool {
public static class TokenizerMapper{...}
public static class IntSumReducer{...}
public int run(...){...}
}
34
Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
public static class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }

  public void setup(...) {...}
  public void cleanup(...) {...}
  public void run(...) {...}
}

value = "Ala ma kota"
Ala,1
ma,1
kota,1
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
public static class IntSumReducer
    extends Reducer<Text,IntWritable,Text,IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }

  public void setup(...) {...}
  public void cleanup(...) {...}
  public void run(...) {...}
}
kota,(1,1,1,1)
kota,4
Main
public int run(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  // return the job status instead of calling System.exit() here,
  // so ToolRunner in main() decides the process exit code
  return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(new Configuration(), new WordCount(), args);
  System.exit(res);
}
yarn jar wc.jar WordCount /tmp/wordcount/input /tmp/wordcount/output
What's next?
• Map Reduce in Java
• Testing with MRUnit
• Joins
• Avro
• Custom Key, Value
• Chaining multiple jobs
• Custom Input, Output
38
Workshop
39
Introduction to data processing,
using Pig as an example
40
Pig architecture
41
Is it worth it?
Top 5 sites visited by
users aged 18 to 25
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;
public class MRExample {
public static class LoadPages extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable k, Text val,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// Pull the key out
String line = val.toString();
int firstComma = line.indexOf(',');
String key = line.substring(0, firstComma);
String value = line.substring(firstComma + 1);
Text outKey = new Text(key);
// Prepend an index to the value so we know which file
// it came from.
Text outVal = new Text("1" + value);
oc.collect(outKey, outVal);
}
}
public static class LoadAndFilterUsers extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable k, Text val,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// Pull the key out
String line = val.toString();
int firstComma = line.indexOf(',');
String value = line.substring(firstComma + 1);
int age = Integer.parseInt(value);
if (age < 18 || age > 25) return;
String key = line.substring(0, firstComma);
Text outKey = new Text(key);
// Prepend an index to the value so we know which file
// it came from.
Text outVal = new Text("2" + value);
oc.collect(outKey, outVal);
}
}
public static class Join extends MapReduceBase
implements Reducer<Text, Text, Text, Text> {
public void reduce(Text key,
Iterator<Text> iter,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// For each value, figure out which file it's from and
// store it accordingly.
List<String> first = new ArrayList<String>();
List<String> second = new ArrayList<String>();
while (iter.hasNext()) {
Text t = iter.next();
String value = t.toString();
if (value.charAt(0) == '1')
first.add(value.substring(1));
else second.add(value.substring(1));
reporter.setStatus("OK");
}
// Do the cross product and collect the values
for (String s1 : first) {
for (String s2 : second) {
String outval = key + "," + s1 + "," + s2;
oc.collect(null, new Text(outval));
reporter.setStatus("OK");
}
}
}
}
public static class LoadJoined extends MapReduceBase
implements Mapper<Text, Text, Text, LongWritable> {
public void map(
Text k,
Text val,
OutputCollector<Text, LongWritable> oc,
Reporter reporter) throws IOException {
// Find the url
String line = val.toString();
int firstComma = line.indexOf(',');
int secondComma = line.indexOf(',', firstComma);
String key = line.substring(firstComma, secondComma);
// drop the rest of the record, I don't need it anymore,
// just pass a 1 for the combiner/reducer to sum instead.
Text outKey = new Text(key);
oc.collect(outKey, new LongWritable(1L));
}
}
public static class ReduceUrls extends MapReduceBase
implements Reducer<Text, LongWritable, WritableComparable,
Writable> {
public void reduce(
Text key,
Iterator<LongWritable> iter,
OutputCollector<WritableComparable, Writable> oc,
Reporter reporter) throws IOException {
// Add up all the values we see
long sum = 0;
while (iter.hasNext()) {
sum += iter.next().get();
reporter.setStatus("OK");
}
oc.collect(key, new LongWritable(sum));
}
}
public static class LoadClicks extends MapReduceBase
implements Mapper<WritableComparable, Writable, LongWritable,
Text> {
public void map(
WritableComparable key,
Writable val,
OutputCollector<LongWritable, Text> oc,
Reporter reporter) throws IOException {
oc.collect((LongWritable)val, (Text)key);
}
}
public static class LimitClicks extends MapReduceBase
implements Reducer<LongWritable, Text, LongWritable, Text> {
int count = 0;
public void reduce(
LongWritable key,
Iterator<Text> iter,
OutputCollector<LongWritable, Text> oc,
Reporter reporter) throws IOException {
// Only output the first 100 records
while (count < 100 && iter.hasNext()) {
oc.collect(key, iter.next());
count++;
}
}
}
public static void main(String[] args) throws IOException {
JobConf lp = new JobConf(MRExample.class);
lp.setJobName("Load Pages");
lp.setInputFormat(TextInputFormat.class);
lp.setOutputKeyClass(Text.class);
lp.setOutputValueClass(Text.class);
lp.setMapperClass(LoadPages.class);
FileInputFormat.addInputPath(lp, new
Path("/user/gates/pages"));
FileOutputFormat.setOutputPath(lp,
new Path("/user/gates/tmp/indexed_pages"));
lp.setNumReduceTasks(0);
Job loadPages = new Job(lp);
JobConf lfu = new JobConf(MRExample.class);
lfu.setJobName("Load and Filter Users");
lfu.setInputFormat(TextInputFormat.class);
lfu.setOutputKeyClass(Text.class);
lfu.setOutputValueClass(Text.class);
lfu.setMapperClass(LoadAndFilterUsers.class);
FileInputFormat.addInputPath(lfu, new
Path("/user/gates/users"));
FileOutputFormat.setOutputPath(lfu,
new Path("/user/gates/tmp/filtered_users"));
lfu.setNumReduceTasks(0);
Job loadUsers = new Job(lfu);
JobConf join = new JobConf(MRExample.class);
join.setJobName("Join Users and Pages");
join.setInputFormat(KeyValueTextInputFormat.class);
join.setOutputKeyClass(Text.class);
join.setOutputValueClass(Text.class);
join.setMapperClass(IdentityMapper.class);
join.setReducerClass(Join.class);
FileInputFormat.addInputPath(join, new
Path("/user/gates/tmp/indexed_pages"));
FileInputFormat.addInputPath(join, new
Path("/user/gates/tmp/filtered_users"));
FileOutputFormat.setOutputPath(join, new
Path("/user/gates/tmp/joined"));
join.setNumReduceTasks(50);
Job joinJob = new Job(join);
joinJob.addDependingJob(loadPages);
joinJob.addDependingJob(loadUsers);
JobConf group = new JobConf(MRExample.class);
group.setJobName("Group URLs");
group.setInputFormat(KeyValueTextInputFormat.class);
group.setOutputKeyClass(Text.class);
group.setOutputValueClass(LongWritable.class);
group.setOutputFormat(SequenceFileOutputFormat.class);
group.setMapperClass(LoadJoined.class);
group.setCombinerClass(ReduceUrls.class);
group.setReducerClass(ReduceUrls.class);
FileInputFormat.addInputPath(group, new
Path("/user/gates/tmp/joined"));
FileOutputFormat.setOutputPath(group, new
Path("/user/gates/tmp/grouped"));
group.setNumReduceTasks(50);
Job groupJob = new Job(group);
groupJob.addDependingJob(joinJob);
JobConf top100 = new JobConf(MRExample.class);
top100.setJobName("Top 100 sites");
top100.setInputFormat(SequenceFileInputFormat.class);
top100.setOutputKeyClass(LongWritable.class);
top100.setOutputValueClass(Text.class);
top100.setOutputFormat(SequenceFileOutputFormat.class);
top100.setMapperClass(LoadClicks.class);
top100.setCombinerClass(LimitClicks.class);
top100.setReducerClass(LimitClicks.class);
FileInputFormat.addInputPath(top100, new
Path("/user/gates/tmp/grouped"));
FileOutputFormat.setOutputPath(top100, new
Path("/user/gates/top100sitesforusers18to25"));
top100.setNumReduceTasks(1);
Job limit = new Job(top100);
limit.addDependingJob(groupJob);
JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
jc.addJob(loadPages);
jc.addJob(loadUsers);
jc.addJob(joinJob);
jc.addJob(groupJob);
jc.addJob(limit);
jc.run();
}
}
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
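For comparison with the Java version, the whole script can be submitted with a single command (the script file name is made up; -x selects the execution mode):
pig -x mapreduce top5sites.pig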
Pig architecture
47
Mode of operation
Interactive or Batch
48
Mode of operation
Local or Distributed
49
Mode of operation
Map Reduce or Tez
50
Data types
51
int long float double
chararray datetime boolean
bytearray biginteger bigdecimal
Complex types
52
tuple bag map
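A small sketch of the complex types in a load schema (the 'students' file and its fields are made up):
A = LOAD 'students' AS (name:chararray,
                        grades:bag{t:tuple(course:chararray, mark:int)},
                        info:map[]);
B = FOREACH A GENERATE name, info#'city', FLATTEN(grades);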
Pig Latin basics -
case sensitivity
• A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
B = GROUP A BY f1;
C = FOREACH B GENERATE COUNT ($0);
DUMP C;
• The names A, B, and C (so-called aliases) are case sensitive.
• Case also matters for:
• field names f1, f2, and f3
• alias names A, B, C
• function names PigStorage, COUNT
• Except for the keywords: LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE, and DUMP
53
Keywords
assert, and, any, all, arrange, as, asc, AVG, bag,
BinStorage, by, bytearray, BIGINTEGER, BIGDECIMAL,
cache, CASE, cat, cd, chararray, cogroup, CONCAT,
copyFromLocal, copyToLocal, COUNT, cp, cross,
datetime, %declare, %default, define, dense, desc,
describe, DIFF, distinct, double, du, dump, e, E,
eval, exec, explain, f, F, filter, flatten, float,
foreach, full, generate, group, help, if, illustrate,
import, inner, input, int, into, is, join, kill, l, L,
left, limit, load, long, ls, map, matches, MAX, MIN,
mkdir, mv, not, null, onschema, or, order, outer,
output, parallel, pig, PigDump, PigStorage, pwd, quit,
register, returns, right, rm, rmf, rollup, run,
sample, set, ship, SIZE, split, stderr, stdin, stdout,
store, stream, SUM, TextLoader, TOKENIZE, through,
tuple, union, using, void
54
First steps
data = LOAD 'input' AS (query:CHARARRAY);
A = LOAD 'data' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
STORE A INTO '/tmp/result' USING PigStorage(';');
55
First steps
SAMPLE
DESCRIBE
DUMP
EXPLAIN
ILLUSTRATE
56
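A quick sketch of these diagnostic operators in action (data and schema hypothetical):
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int);
S = SAMPLE A 0.1;  -- keep roughly 10% of the rows
DESCRIBE S;        -- print the schema
DUMP S;            -- print the rows
EXPLAIN S;         -- show the execution plan
ILLUSTRATE S;      -- trace example rows through the plan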
Next steps - operations on data
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int,
scholarship:float);
B = FILTER A BY age > 20;
57
Next steps - operations on data
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int,
scholarship:float);
B = FILTER A BY age > 20;
C = LIMIT B 5;
58
Next steps - operations on data
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int,
scholarship:float);
B = FILTER A BY age > 20;
C = LIMIT B 5;
D = FOREACH C GENERATE name, scholarship*semestre as funds;
59
Next steps - operations on data
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int,
scholarship:float);
E = GROUP A by age;
60
Next steps - operations on data
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int,
scholarship:float);
E = GROUP A by age;
F = FOREACH E GENERATE group as age, AVG(A.scholarship);
61
Performance
Tez, projections, filtering, joins
62
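A sketch of the usual wins - project and filter as early as possible, and use a replicated join when one relation fits in memory (all names are made up):
A = LOAD 'clicks' AS (user:chararray, url:chararray, ts:long);
B = FOREACH A GENERATE user, url;   -- early projection: drop unused columns
C = FILTER B BY url IS NOT NULL;    -- early filtering: shrink the data first
U = LOAD 'users' AS (user:chararray, age:int);
J = JOIN C BY user, U BY user USING 'replicated';  -- small relation goes last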
What's next?
UDFs, PigUnit, integrations
63
Workshop
64
Introduction to data analysis,
using Hive as an example
65
Architecture
66
Unique features of Hive
SQL queries over flat files, e.g. CSV
67
Unique features of Hive
Significantly faster analysis - no need to write Map Reduce
Optimization: parts of the work execute in memory instead of as MR
68
Unique features of Hive
Unlimited forms of integration - MongoDB, Elastic Search,
HBase, Solr
69
Unique features of Hive
Integration with BI tools and other applications via JDBC
70
Hive CLI
Interactive mode:
hive
Batch mode:
hive -e 'select foo from bar'
hive -f '/path/to/my/script.q'
hive -f 'hdfs://namenode:port/path/to/my/script.q'
more options: hive --help
71
Data types
INT, TINYINT, SMALLINT, BIGINT
BOOLEAN
DECIMAL
FLOAT, DOUBLE
STRING
BINARY
TIMESTAMP
ARRAY, MAP, STRUCT, UNION
DATE
CHAR
VARCHAR
72
Query syntax
SELECT, INSERT, UPDATE
GROUP BY
UNION
LEFT, RIGHT, INNER, FULL OUTER JOIN
OVER, RANK
(NOT) IN, HAVING
(NOT) EXISTS
73
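As a sketch tying a few of these together (the threshold is made up; page_view is the table created a few slides later):
SELECT country, COUNT(*) AS views,
       RANK() OVER (ORDER BY COUNT(*) DESC) AS rnk
FROM page_view
GROUP BY country
HAVING COUNT(*) > 100;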
Data Definition Language
• CREATE DATABASE/SCHEMA, TABLE, VIEW, FUNCTION, INDEX
• DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX
• TRUNCATE TABLE
• ALTER DATABASE/SCHEMA, TABLE, VIEW
• MSCK REPAIR TABLE (or ALTER TABLE RECOVER PARTITIONS)
• SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES,
PARTITIONS, FUNCTIONS, INDEX[ES], COLUMNS, CREATE TABLE
• DESCRIBE DATABASE/SCHEMA, table_name, view_name
74
Tables
CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;
75
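Because the table is partitioned by dt and country, filtering on those columns lets Hive read only the matching partitions (the literal values are made up):
SELECT page_url, COUNT(*) AS views
FROM page_view
WHERE dt = '2014-01-01' AND country = 'PL'
GROUP BY page_url;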
First steps in Hive
CREATE TABLE tablename1 (foo INT, bar STRING) PARTITIONED BY (ds STRING);
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename1;
INSERT INTO TABLE tablename1 PARTITION (ds='2014') select_statement1 FROM
from_statement;
76
First steps in Hive
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
DELETE FROM tablename [WHERE expression]
77
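One caveat: UPDATE and DELETE only work on transactional (ACID) tables, which in this Hive generation means bucketed ORC tables with transactions enabled. A minimal sketch, assuming ACID is also switched on in the Hive configuration:
CREATE TABLE tablename (id INT, col STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');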
Other file formats? SerDe
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
CREATE TABLE apachelog (
  host STRING, identity STRING, user STRING, time STRING, request STRING, status STRING,
  size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;
78
Other file formats? SerDe
CREATE TABLE table (
  foo STRING, bar STRING)
STORED AS TEXTFILE; ← or SEQUENCEFILE, ORC, AVRO or PARQUET
79
Pros, cons, comparison
Hive | Pig
declarative | procedural
temporary tables | pipelines
we rely on the optimizer | we have more control over the implementation
UDF, Transform | UDF, streaming
SQL drivers | data pipeline splits
80
Stinger
http://hortonworks.com/labs/stinger/
81
Tips & Tricks
hive.vectorized.execution.enabled=true
ORC
hive.execution.engine=tez
John Lund Stone Getty Images
82
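A sketch of applying these in one session and converting a table to ORC (the target table name is made up):
SET hive.execution.engine=tez;
SET hive.vectorized.execution.enabled=true;
CREATE TABLE page_view_orc STORED AS ORC
AS SELECT * FROM page_view;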
What's next?
• Integrations with Solr, Elastic, MongoDB
• UDFs
• multi table inserts
• JDBC
83
Workshop
source: HikingArtist
84
Monitoring and managing a cluster,
using Ambari as an example
85
CLI
• Yarn Administration Commands
• resourcemanager nodemanager proxyserver rmadmin daemonlog
• HDFS
• User Commands
• dfs
• fsck
• Administration Commands
• datanode dfsadmin namenode
• HBase
• start-hbase.sh, stop-hbase.sh
• HCatalog, HiveServer,
• Kafka, Storm, Tez, Spark, Oozie and others
• monitoring, configuration, upgrades
86
Ambari
• cluster management
• monitoring console
• installing new nodes
• configuration
• decommissioning servers
Ambari
88
Now what?
90
Want to know more?
Our trainings allow for individual work with every
participant
• we work in groups of 4-8 people
• the program can be tailored to the group's
expectations
• we solve and answer participants' individual
questions
• we have much more time :)
Training tailored
to you
Are you an architect or a team leader?
• a comprehensive 5-day course covers and exercises
the whole Hadoop ecosystem
• a dedicated course for architects discusses
designing Big Data systems
Are you an analyst?
• a dedicated course lets you practice Pig and Hive in detail
and solve sample analytical problems
Training tailored
to you
Are you a developer?
• a 3-day course covers in detail advanced aspects of
MapReduce programming in Java and programming in
the streaming approach
Interested in the whole Big Data picture?
• Big Data processing with Apache Spark
• NoSQL databases - Cassandra
• NoSQL databases - MongoDB
sources
• HikingArtist.com - drawings
• hortonworks.com - HDP architecture
• apache.org - Pig, Hive and Hadoop graphics
thank you
questions?
96
Introduction to
Big Data technologies
Radosław Stankiewicz - radoslaw@zagwozdka.com
www.sages.com.pl

More Related Content

What's hot

Using Cerberus and PySpark to validate semi-structured datasets
Using Cerberus and PySpark to validate semi-structured datasetsUsing Cerberus and PySpark to validate semi-structured datasets
Using Cerberus and PySpark to validate semi-structured datasetsBartosz Konieczny
 
The Ring programming language version 1.3 book - Part 84 of 88
The Ring programming language version 1.3 book - Part 84 of 88The Ring programming language version 1.3 book - Part 84 of 88
The Ring programming language version 1.3 book - Part 84 of 88Mahmoud Samir Fayed
 
PyCon KR 2019 sprint - RustPython by example
PyCon KR 2019 sprint  - RustPython by examplePyCon KR 2019 sprint  - RustPython by example
PyCon KR 2019 sprint - RustPython by exampleYunWon Jeong
 
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak   CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak PROIDEA
 
ClojureScript for the web
ClojureScript for the webClojureScript for the web
ClojureScript for the webMichiel Borkent
 
Modern technologies in data science
Modern technologies in data science Modern technologies in data science
Modern technologies in data science Chucheng Hsieh
 
Writing native bindings to node.js in C++
Writing native bindings to node.js in C++Writing native bindings to node.js in C++
Writing native bindings to node.js in C++nsm.nikhil
 
Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡Bartosz Konieczny
 
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
Monitoring Your ISP Using InfluxDB Cloud and Raspberry PiMonitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
Monitoring Your ISP Using InfluxDB Cloud and Raspberry PiInfluxData
 
Unleash your inner console cowboy
Unleash your inner console cowboyUnleash your inner console cowboy
Unleash your inner console cowboyKenneth Geisshirt
 
RestMQ - HTTP/Redis based Message Queue
RestMQ - HTTP/Redis based Message QueueRestMQ - HTTP/Redis based Message Queue
RestMQ - HTTP/Redis based Message QueueGleicon Moraes
 
RxJS Evolved
RxJS EvolvedRxJS Evolved
RxJS Evolvedtrxcllnt
 
All you need to know about the JavaScript event loop
All you need to know about the JavaScript event loopAll you need to know about the JavaScript event loop
All you need to know about the JavaScript event loopSaša Tatar
 
Jakarta Commons - Don't re-invent the wheel
Jakarta Commons - Don't re-invent the wheelJakarta Commons - Don't re-invent the wheel
Jakarta Commons - Don't re-invent the wheeltcurdt
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)Subhas Kumar Ghosh
 
Apache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customizationApache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customizationBartosz Konieczny
 

What's hot (20)

V8
V8V8
V8
 
Using Cerberus and PySpark to validate semi-structured datasets
Using Cerberus and PySpark to validate semi-structured datasetsUsing Cerberus and PySpark to validate semi-structured datasets
Using Cerberus and PySpark to validate semi-structured datasets
 
The Ring programming language version 1.3 book - Part 84 of 88
The Ring programming language version 1.3 book - Part 84 of 88The Ring programming language version 1.3 book - Part 84 of 88
The Ring programming language version 1.3 book - Part 84 of 88
 
PyCon KR 2019 sprint - RustPython by example
PyCon KR 2019 sprint  - RustPython by examplePyCon KR 2019 sprint  - RustPython by example
PyCon KR 2019 sprint - RustPython by example
 
Full Stack Clojure
Full Stack ClojureFull Stack Clojure
Full Stack Clojure
 
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak   CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
 
ClojureScript for the web
ClojureScript for the webClojureScript for the web
ClojureScript for the web
 
Modern technologies in data science
Modern technologies in data science Modern technologies in data science
Modern technologies in data science
 
Writing native bindings to node.js in C++
Writing native bindings to node.js in C++Writing native bindings to node.js in C++
Writing native bindings to node.js in C++
 
Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡
 
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
Monitoring Your ISP Using InfluxDB Cloud and Raspberry PiMonitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
 
Unleash your inner console cowboy
Unleash your inner console cowboyUnleash your inner console cowboy
Unleash your inner console cowboy
 
RestMQ - HTTP/Redis based Message Queue
RestMQ - HTTP/Redis based Message QueueRestMQ - HTTP/Redis based Message Queue
RestMQ - HTTP/Redis based Message Queue
 
RxJS Evolved
RxJS EvolvedRxJS Evolved
RxJS Evolved
 
Python GC
Python GCPython GC
Python GC
 
Node.js extensions in C++
Node.js extensions in C++Node.js extensions in C++
Node.js extensions in C++
 
All you need to know about the JavaScript event loop
All you need to know about the JavaScript event loopAll you need to know about the JavaScript event loop
All you need to know about the JavaScript event loop
 
Jakarta Commons - Don't re-invent the wheel
Jakarta Commons - Don't re-invent the wheelJakarta Commons - Don't re-invent the wheel
Jakarta Commons - Don't re-invent the wheel
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)
 
Apache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customizationApache Spark in your likeness - low and high level customization
Apache Spark in your likeness - low and high level customization
 

Viewers also liked

Jak zacząć przetwarzanie małych i dużych danych tekstowych?
Jak zacząć przetwarzanie małych i dużych danych tekstowych?Jak zacząć przetwarzanie małych i dużych danych tekstowych?
Jak zacząć przetwarzanie małych i dużych danych tekstowych?Sages
 
Wprowadzenie do Big Data i Apache Spark
Wprowadzenie do Big Data i Apache SparkWprowadzenie do Big Data i Apache Spark
Wprowadzenie do Big Data i Apache SparkSages
 
Budowa elementów GUI za pomocą biblioteki React - szybki start
Budowa elementów GUI za pomocą biblioteki React - szybki startBudowa elementów GUI za pomocą biblioteki React - szybki start
Budowa elementów GUI za pomocą biblioteki React - szybki startSages
 
Technologia Xamarin i wprowadzenie do Windows IoT core
Technologia Xamarin i wprowadzenie do Windows IoT coreTechnologia Xamarin i wprowadzenie do Windows IoT core
Technologia Xamarin i wprowadzenie do Windows IoT coreSages
 
Zrób dobrze swojej komórce - programowanie urządzeń mobilnych z wykorzystanie...
Zrób dobrze swojej komórce - programowanie urządzeń mobilnych z wykorzystanie...Zrób dobrze swojej komórce - programowanie urządzeń mobilnych z wykorzystanie...
Zrób dobrze swojej komórce - programowanie urządzeń mobilnych z wykorzystanie...Sages
 
Bezpieczne dane w aplikacjach java
Bezpieczne dane w aplikacjach javaBezpieczne dane w aplikacjach java
Bezpieczne dane w aplikacjach javaSages
 
Szybkie wprowadzenie do eksploracji danych z pakietem Weka
Szybkie wprowadzenie do eksploracji danych z pakietem WekaSzybkie wprowadzenie do eksploracji danych z pakietem Weka
Szybkie wprowadzenie do eksploracji danych z pakietem WekaSages
 
Wprowadzenie do technologii Big Data
Wprowadzenie do technologii Big DataWprowadzenie do technologii Big Data
Wprowadzenie do technologii Big DataSages
 
Architektura aplikacji android
Architektura aplikacji androidArchitektura aplikacji android
Architektura aplikacji androidSages
 
Podstawy AngularJS
Podstawy AngularJSPodstawy AngularJS
Podstawy AngularJSSages
 
Wprowadzenie do technologii Puppet
Wprowadzenie do technologii PuppetWprowadzenie do technologii Puppet
Wprowadzenie do technologii PuppetSages
 
Michał Dec - Quality in Clouds
Michał Dec - Quality in CloudsMichał Dec - Quality in Clouds
Michał Dec - Quality in Cloudskraqa
 
Vert.x v3 - high performance polyglot application toolkit
Vert.x v3 - high performance  polyglot application toolkitVert.x v3 - high performance  polyglot application toolkit
Vert.x v3 - high performance polyglot application toolkitSages
 
[WebMuses] Big data dla zdezorientowanych
[WebMuses] Big data dla zdezorientowanych[WebMuses] Big data dla zdezorientowanych
[WebMuses] Big data dla zdezorientowanychPrzemek Maciolek
 
Warsaw Data Science - Recsys2016 Quick Review
Warsaw Data Science - Recsys2016 Quick ReviewWarsaw Data Science - Recsys2016 Quick Review
Warsaw Data Science - Recsys2016 Quick ReviewBartlomiej Twardowski
 
Nie bój się analizy danych! Fakty i mity o big data i Business Intelligence.
Nie bój się analizy danych! Fakty i mity o big data i Business Intelligence.Nie bój się analizy danych! Fakty i mity o big data i Business Intelligence.
Nie bój się analizy danych! Fakty i mity o big data i Business Intelligence.Mateusz Muryjas
 
Prezentacja z Big Data Tech 2016: Machine Learning vs Big Data
Prezentacja z Big Data Tech 2016: Machine Learning vs Big DataPrezentacja z Big Data Tech 2016: Machine Learning vs Big Data
Prezentacja z Big Data Tech 2016: Machine Learning vs Big DataBartlomiej Twardowski
 
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017Michael Noll
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 

Viewers also liked (20)

Jak zacząć przetwarzanie małych i dużych danych tekstowych?
Jak zacząć przetwarzanie małych i dużych danych tekstowych?Jak zacząć przetwarzanie małych i dużych danych tekstowych?
Jak zacząć przetwarzanie małych i dużych danych tekstowych?
 
Wprowadzenie do Big Data i Apache Spark
Wprowadzenie do Big Data i Apache SparkWprowadzenie do Big Data i Apache Spark
Wprowadzenie do Big Data i Apache Spark
 
Budowa elementów GUI za pomocą biblioteki React - szybki start
Budowa elementów GUI za pomocą biblioteki React - szybki startBudowa elementów GUI za pomocą biblioteki React - szybki start
Budowa elementów GUI za pomocą biblioteki React - szybki start
 
Technologia Xamarin i wprowadzenie do Windows IoT core
Technologia Xamarin i wprowadzenie do Windows IoT coreTechnologia Xamarin i wprowadzenie do Windows IoT core
Technologia Xamarin i wprowadzenie do Windows IoT core
 
Zrób dobrze swojej komórce - programowanie urządzeń mobilnych z wykorzystanie...
Zrób dobrze swojej komórce - programowanie urządzeń mobilnych z wykorzystanie...Zrób dobrze swojej komórce - programowanie urządzeń mobilnych z wykorzystanie...
Zrób dobrze swojej komórce - programowanie urządzeń mobilnych z wykorzystanie...
 
Bezpieczne dane w aplikacjach java
Bezpieczne dane w aplikacjach javaBezpieczne dane w aplikacjach java
Bezpieczne dane w aplikacjach java
 
Szybkie wprowadzenie do eksploracji danych z pakietem Weka
Szybkie wprowadzenie do eksploracji danych z pakietem WekaSzybkie wprowadzenie do eksploracji danych z pakietem Weka
Szybkie wprowadzenie do eksploracji danych z pakietem Weka
 
Wprowadzenie do technologii Big Data
Wprowadzenie do technologii Big DataWprowadzenie do technologii Big Data
Wprowadzenie do technologii Big Data
 
Architektura aplikacji android
Architektura aplikacji androidArchitektura aplikacji android
Architektura aplikacji android
 
Podstawy AngularJS
Podstawy AngularJSPodstawy AngularJS
Podstawy AngularJS
 
Wprowadzenie do technologii Puppet
Wprowadzenie do technologii PuppetWprowadzenie do technologii Puppet
Wprowadzenie do technologii Puppet
 
Michał Dec - Quality in Clouds
Michał Dec - Quality in CloudsMichał Dec - Quality in Clouds
Michał Dec - Quality in Clouds
 
Big data w praktyce
Big data w praktyceBig data w praktyce
Big data w praktyce
 
Vert.x v3 - high performance polyglot application toolkit
Vert.x v3 - high performance  polyglot application toolkitVert.x v3 - high performance  polyglot application toolkit
Vert.x v3 - high performance polyglot application toolkit
 
[WebMuses] Big data dla zdezorientowanych
[WebMuses] Big data dla zdezorientowanych[WebMuses] Big data dla zdezorientowanych
[WebMuses] Big data dla zdezorientowanych
 
Warsaw Data Science - Recsys2016 Quick Review
Warsaw Data Science - Recsys2016 Quick ReviewWarsaw Data Science - Recsys2016 Quick Review
Warsaw Data Science - Recsys2016 Quick Review
 
Nie bój się analizy danych! Fakty i mity o big data i Business Intelligence.
Nie bój się analizy danych! Fakty i mity o big data i Business Intelligence.Nie bój się analizy danych! Fakty i mity o big data i Business Intelligence.
Nie bój się analizy danych! Fakty i mity o big data i Business Intelligence.
 
Prezentacja z Big Data Tech 2016: Machine Learning vs Big Data
Prezentacja z Big Data Tech 2016: Machine Learning vs Big DataPrezentacja z Big Data Tech 2016: Machine Learning vs Big Data
Prezentacja z Big Data Tech 2016: Machine Learning vs Big Data
 
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 

Similar to Wprowadzenie do technologii Big Data / Intro to Big Data Ecosystem

Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and MonoidsHugo Gävert
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with ClojureDmitry Buzdin
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data ManagementAlbert Bifet
 
ITT 2015 - Saul Mora - Object Oriented Function Programming
ITT 2015 - Saul Mora - Object Oriented Function ProgrammingITT 2015 - Saul Mora - Object Oriented Function Programming
ITT 2015 - Saul Mora - Object Oriented Function ProgrammingIstanbul Tech Talks
 
Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...Andrew Yongjoon Kong
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomyDongmin Yu
 
EuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopEuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopMax Tepkeev
 
RESTful API using scalaz (3)
RESTful API using scalaz (3)RESTful API using scalaz (3)
RESTful API using scalaz (3)Yeshwanth Kumar
 
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInVitaly Gordon
 
Apache Commons - Don\'t re-invent the wheel
Apache Commons - Don\'t re-invent the wheelApache Commons - Don\'t re-invent the wheel
Apache Commons - Don\'t re-invent the wheeltcurdt
 
Reactive Programming Patterns with RxSwift
Reactive Programming Patterns with RxSwiftReactive Programming Patterns with RxSwift
Reactive Programming Patterns with RxSwiftFlorent Pillet
 
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...GeeksLab Odessa
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
 

Similar to Wprowadzenie do technologii Big Data / Intro to Big Data Ecosystem (20)

Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with Clojure
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
Solr @ Etsy - Apache Lucene Eurocon
Solr @ Etsy - Apache Lucene EuroconSolr @ Etsy - Apache Lucene Eurocon
Solr @ Etsy - Apache Lucene Eurocon
 
Hadoop + Clojure
Hadoop + ClojureHadoop + Clojure
Hadoop + Clojure
 
Spark workshop
Spark workshopSpark workshop
Spark workshop
 
Hw09 Hadoop + Clojure
Hw09   Hadoop + ClojureHw09   Hadoop + Clojure
Hw09 Hadoop + Clojure
 
ITT 2015 - Saul Mora - Object Oriented Function Programming
ITT 2015 - Saul Mora - Object Oriented Function ProgrammingITT 2015 - Saul Mora - Object Oriented Function Programming
ITT 2015 - Saul Mora - Object Oriented Function Programming
 
Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
EuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopEuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and Hadoop
 
Pune Clojure Course Outline
Pune Clojure Course OutlinePune Clojure Course Outline
Pune Clojure Course Outline
 
RESTful API using scalaz (3)
RESTful API using scalaz (3)RESTful API using scalaz (3)
RESTful API using scalaz (3)
 
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedIn
 
Apache Commons - Don\'t re-invent the wheel
Apache Commons - Don\'t re-invent the wheelApache Commons - Don\'t re-invent the wheel
Apache Commons - Don\'t re-invent the wheel
 
Reactive Programming Patterns with RxSwift
Reactive Programming Patterns with RxSwiftReactive Programming Patterns with RxSwift
Reactive Programming Patterns with RxSwift
 
Lambdas puzzler - Peter Lawrey
Lambdas puzzler - Peter LawreyLambdas puzzler - Peter Lawrey
Lambdas puzzler - Peter Lawrey
 
Time for Functions
Time for FunctionsTime for Functions
Time for Functions
 
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 

More from Sages

Python szybki start
Python   szybki startPython   szybki start
Python szybki startSages
 
Budowanie rozwiązań serverless w chmurze Azure
Budowanie rozwiązań serverless w chmurze AzureBudowanie rozwiązań serverless w chmurze Azure
Budowanie rozwiązań serverless w chmurze AzureSages
 
Docker praktyczne podstawy
Docker  praktyczne podstawyDocker  praktyczne podstawy
Docker praktyczne podstawySages
 
Angular 4 pragmatycznie
Angular 4 pragmatycznieAngular 4 pragmatycznie
Angular 4 pragmatycznieSages
 
Jak działa blockchain?
Jak działa blockchain?Jak działa blockchain?
Jak działa blockchain?Sages
 
Qgis szybki start
Qgis szybki startQgis szybki start
Qgis szybki startSages
 
Architektura SOA - wstęp
Architektura SOA - wstępArchitektura SOA - wstęp
Architektura SOA - wstępSages
 

More from Sages (7)

Python szybki start
Python   szybki startPython   szybki start
Python szybki start
 
Budowanie rozwiązań serverless w chmurze Azure
Budowanie rozwiązań serverless w chmurze AzureBudowanie rozwiązań serverless w chmurze Azure
Budowanie rozwiązań serverless w chmurze Azure
 
Docker praktyczne podstawy
Docker  praktyczne podstawyDocker  praktyczne podstawy
Docker praktyczne podstawy
 
Angular 4 pragmatycznie
Angular 4 pragmatycznieAngular 4 pragmatycznie
Angular 4 pragmatycznie
 
Jak działa blockchain?
Jak działa blockchain?Jak działa blockchain?
Jak działa blockchain?
 
Qgis szybki start
Qgis szybki startQgis szybki start
Qgis szybki start
 
Architektura SOA - wstęp
Architektura SOA - wstępArchitektura SOA - wstęp
Architektura SOA - wstęp
 

Recently uploaded

Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 

Recently uploaded (20)

Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 

Wprowadzenie do technologii Big Data / Intro to Big Data Ecosystem

  • 1. Wprowadzenie do technologii Big Data Radosław Stankiewicz
  • 2. HackerBD & DS FTW!Technical Lead Trainer 2 Src: computing.co.uk , https://www.flickr.com/photos/barron/15483113 , tech.co
  • 3.
  • 4. Agenda Wstęp -> Map Reduce -> Pig -> Hive -> Ambari 4
  • 9. Przechowywanie danych pliki (analiza batch i interaktywna) NoSQL (random access) Indeksy pliki płaskie, csv (rowid,col,czas)->value: Accumulo, HBase Cassandra MongoDB Solr Elastic Search JSON AVRO formaty kolumnowe Bazy grafowe 9
  • 10.
  • 12. Klasyfikacja problemu • Baza danych ulic Warszawy, Dane w formacie JSON, optymalizacja odbioru śmieci jednego z usługodawców. • Zdarzenia z bazy transakcyjnej i kart kredytowych w celu lepszego wykrywania fraudów • System wyszukujący dobre oferty samochodów z wielu serwisów - web crawling, parsowanie danych, analiza trendów cen samochodów • Centralne repozytorium skanów umów, TB danych, codziennie przybywa kilkaset nowych dokumentów 12
  • 15. Geneza • za dużo danych • pady serwerów • wolne relacyjne bazy danych 15
  • 16. 16
  • 17. 17
  • 18. 18
  • 21. Wprowadzenie do MapReduce na przykładzie platformy Hadoop 21
  • 23. HDFS Inspirowany GFS(po prawej) Główne cechy: • Fault tolerant • Commodity, low cost hardware • Batch processing • High throughput, not low latency • Write Once, Read Many 23
  • 26. ● User Commands o dfs o fsck ● Administration Commands o datanode o dfsadmin o namenode dfs: appendToFile cat chgrp chmod chown copyFromLocal copyToLocal count cp du dus expunge get getfacl getfattr getmerge ls lsr mkdir moveFromLocal moveToLocal mv put rm rmr setfacl setfattr setrep stat tail test text touchz hdfs dfs -put localfile1 localfile2 /user/tmp/hadoopdir hdfs dfs -getmerge /user/hadoop/output/ localfile komendy 26
  • 31. Mapper #!/usr/bin/env python import sys for line in sys.stdin: words = line.strip().split() for word in words: print '%st%s' % (word, 1) line = “Ala ma kota” Ala 1 ma 1 kota 1 31
  • 32. Reducer #!/usr/bin/env python import sys current_word = None current_count = 0 word = None for line in sys.stdin: line = line.strip() word, count = line.split('t', 1) count = int(count) if current_word == word: current_count += count else: if current_word: print '%s,%s' % (current_word, current_count) current_count = count current_word = word if current_word == word: print '%s,%s' % (current_word, current_count) ala 1 ala 1 bela 1 dela 1 ala,2 bela,1 dela,1 32
  • 33. Uruchomienie streaming cat input.txt | ./mapper.py | sort | ./reducer.py bin/yarn jar [..]/hadoop-*streaming*.jar 
 -file mapper.py -mapper ./mapper.py -file reducer.py -reducer ./reducer.py 
 -input /tmp/wordcount/input -output /tmp/ wordcount/output 33
  • 34. Map Reduce w Java (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output) 1) Mapper 2) Reducer 3) run public class WordCount extends Configured implements Tool { public static class TokenizerMapper{...} public static class IntSumReducer{...} public int run(...){...} } 34
  • 35. Mapper<KEYIN,VALUEIN,KEY OUT,VALUEOUT> public static class TokenizerMapper
 extends Mapper<LongWritable, Text, Text, IntWritable>{
 
 private final static IntWritable one = new IntWritable(1);
 private Text word = new Text();
 
 public void map(LongWritable key, Text value, Context context
 ) throws IOException, InterruptedException {
 StringTokenizer itr = new StringTokenizer(value.toString());
 while (itr.hasMoreTokens()) {
 word.set(itr.nextToken());
 context.write(word, one);
 }
 } public void setup(...) {...} public void cleanup(...) {...} public void run(...) {...}
 } value = “Ala ma kota” Ala,1 ma,1 kota,1
  • 36. Reducer<KEYIN,VALUEIN,KEY OUT,VALUEOUT> public static class IntSumReducer
 extends Reducer<Text,IntWritable,Text,IntWritable> {
 private IntWritable result = new IntWritable();
 
 public void reduce(Text key, Iterable<IntWritable> values,
 Context context
 ) throws IOException, InterruptedException {
 int sum = 0;
 for (IntWritable val : values) {
 sum += val.get();
 }
 result.set(sum);
 context.write(key, result);
 } public void setup(...) {...} public void cleanup(...) {...} public void run(...) {...}
 } kota,(1,1,1,1) kota,4
  • 37. Main
public int run(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(new Configuration(), new WordCount(), args);
  System.exit(res);
}
yarn jar wc.jar WordCount /tmp/wordcount/input /tmp/wordcount/output
  • 38. What next? • MapReduce in Java • Testing with MRUnit (see the sketch below) • Joins • Avro • Custom key and value types • Chaining multiple jobs • Custom input and output formats 38
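To illustrate the MRUnit bullet, a minimal sketch of a mapper test; it assumes the TokenizerMapper from the WordCount slides above and the mrunit (hadoop2) artifact on the classpath:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class TokenizerMapperTest {
  @Test
  public void mapperEmitsOnePerWord() throws Exception {
    // drive the mapper in isolation, without a cluster
    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        MapDriver.newMapDriver(new WordCount.TokenizerMapper());
    driver.withInput(new LongWritable(0), new Text("Ala ma kota"))
          .withOutput(new Text("Ala"), new IntWritable(1))
          .withOutput(new Text("ma"), new IntWritable(1))
          .withOutput(new Text("kota"), new IntWritable(1))
          .runTest();
  }
}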
  • 40. Introduction to data processing, illustrated with Pig 40
  • 42. Is it worth it? The top 5 sites visited by users aged 18-25
  • 43. import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
  public static class LoadPages extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable k, Text val,
        OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
      // Pull the key out
      String line = val.toString();
      int firstComma = line.indexOf(',');
      String key = line.substring(0, firstComma);
      String value = line.substring(firstComma + 1);
      Text outKey = new Text(key);
      // Prepend an index to the value so we know which file it came from.
      Text outVal = new Text("1" + value);
      oc.collect(outKey, outVal);
    }
  }

  public static class LoadAndFilterUsers extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable k, Text val,
        OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
      // Pull the key out
      String line = val.toString();
      int firstComma = line.indexOf(',');
      String value = line.substring(firstComma + 1);
      int age = Integer.parseInt(value);
      if (age < 18 || age > 25) return;
      String key = line.substring(0, firstComma);
      Text outKey = new Text(key);
      // Prepend an index to the value so we know which file it came from.
      Text outVal = new Text("2" + value);
      oc.collect(outKey, outVal);
    }
  }

  public static class Join extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> iter,
        OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
      // For each value, figure out which file it's from and store it accordingly.
      List<String> first = new ArrayList<String>();
      List<String> second = new ArrayList<String>();
      while (iter.hasNext()) {
        Text t = iter.next();
        String value = t.toString();
        if (value.charAt(0) == '1') first.add(value.substring(1));
        else second.add(value.substring(1));
        reporter.setStatus("OK");
      }
      // Do the cross product and collect the values
      for (String s1 : first) {
        for (String s2 : second) {
          String outval = key + "," + s1 + "," + s2;
          oc.collect(null, new Text(outval));
          reporter.setStatus("OK");
        }
      }
    }
  }
  • 44. public static class LoadJoined extends MapReduceBase
      implements Mapper<Text, Text, Text, LongWritable> {
    public void map(Text k, Text val,
        OutputCollector<Text, LongWritable> oc, Reporter reporter) throws IOException {
      // Find the url
      String line = val.toString();
      int firstComma = line.indexOf(',');
      int secondComma = line.indexOf(',', firstComma + 1);
      String key = line.substring(firstComma + 1, secondComma);
      // drop the rest of the record, I don't need it anymore,
      // just pass a 1 for the combiner/reducer to sum instead.
      Text outKey = new Text(key);
      oc.collect(outKey, new LongWritable(1L));
    }
  }

  public static class ReduceUrls extends MapReduceBase
      implements Reducer<Text, LongWritable, WritableComparable, Writable> {
    public void reduce(Text key, Iterator<LongWritable> iter,
        OutputCollector<WritableComparable, Writable> oc, Reporter reporter) throws IOException {
      // Add up all the values we see
      long sum = 0;
      while (iter.hasNext()) {
        sum += iter.next().get();
        reporter.setStatus("OK");
      }
      oc.collect(key, new LongWritable(sum));
    }
  }

  public static class LoadClicks extends MapReduceBase
      implements Mapper<WritableComparable, Writable, LongWritable, Text> {
    public void map(WritableComparable key, Writable val,
        OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
      oc.collect((LongWritable)val, (Text)key);
    }
  }

  public static class LimitClicks extends MapReduceBase
      implements Reducer<LongWritable, Text, LongWritable, Text> {
    int count = 0;
    public void reduce(LongWritable key, Iterator<Text> iter,
        OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
      // Only output the first 100 records
      while (count < 100 && iter.hasNext()) {
        oc.collect(key, iter.next());
        count++;
      }
    }
  }
  • 45. public static void main(String[] args) throws IOException {
    JobConf lp = new JobConf(MRExample.class);
    lp.setJobName("Load Pages");
    lp.setInputFormat(TextInputFormat.class);
    lp.setOutputKeyClass(Text.class);
    lp.setOutputValueClass(Text.class);
    lp.setMapperClass(LoadPages.class);
    FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
    FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
    lp.setNumReduceTasks(0);
    Job loadPages = new Job(lp);

    JobConf lfu = new JobConf(MRExample.class);
    lfu.setJobName("Load and Filter Users");
    lfu.setInputFormat(TextInputFormat.class);
    lfu.setOutputKeyClass(Text.class);
    lfu.setOutputValueClass(Text.class);
    lfu.setMapperClass(LoadAndFilterUsers.class);
    FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
    FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
    lfu.setNumReduceTasks(0);
    Job loadUsers = new Job(lfu);

    JobConf join = new JobConf(MRExample.class);
    join.setJobName("Join Users and Pages");
    join.setInputFormat(KeyValueTextInputFormat.class);
    join.setOutputKeyClass(Text.class);
    join.setOutputValueClass(Text.class);
    join.setMapperClass(IdentityMapper.class);
    join.setReducerClass(Join.class);
    FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
    FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
    FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
    join.setNumReduceTasks(50);
    Job joinJob = new Job(join);
    joinJob.addDependingJob(loadPages);
    joinJob.addDependingJob(loadUsers);

    JobConf group = new JobConf(MRExample.class);
    group.setJobName("Group URLs");
    group.setInputFormat(KeyValueTextInputFormat.class);
    group.setOutputKeyClass(Text.class);
    group.setOutputValueClass(LongWritable.class);
    group.setOutputFormat(SequenceFileOutputFormat.class);
    group.setMapperClass(LoadJoined.class);
    group.setCombinerClass(ReduceUrls.class);
    group.setReducerClass(ReduceUrls.class);
    FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
    FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
    group.setNumReduceTasks(50);
    Job groupJob = new Job(group);
    groupJob.addDependingJob(joinJob);

    JobConf top100 = new JobConf(MRExample.class);
    top100.setJobName("Top 100 sites");
    top100.setInputFormat(SequenceFileInputFormat.class);
    top100.setOutputKeyClass(LongWritable.class);
    top100.setOutputValueClass(Text.class);
    top100.setOutputFormat(SequenceFileOutputFormat.class);
    top100.setMapperClass(LoadClicks.class);
    top100.setCombinerClass(LimitClicks.class);
    top100.setReducerClass(LimitClicks.class);
    FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
    FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
    top100.setNumReduceTasks(1);
    Job limit = new Job(top100);
    limit.addDependingJob(groupJob);

    JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
    jc.addJob(loadPages);
    jc.addJob(loadUsers);
    jc.addJob(joinJob);
    jc.addJob(groupJob);
    jc.addJob(limit);
    jc.run();
  }
}
  • 46. Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
  • 49. Operating mode - local or distributed 49
  • 50. Execution engine - MapReduce or Tez 50
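Both choices are made with the -x flag when launching Pig (the script name below is an example):
pig -x local script.pig       # single JVM, local filesystem - handy for development
pig -x mapreduce script.pig   # default: runs on the cluster as MapReduce jobs
pig -x tez script.pig         # runs on the cluster on the Tez engine (Pig 0.14+)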
  • 51. Data types: int, long, float, double, chararray, datetime, boolean, bytearray, biginteger, bigdecimal 51
  • 53. Pig Latin basics - letter case
A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
B = GROUP A BY f1;
C = FOREACH B GENERATE COUNT($0);
DUMP C;
• The relation names A, B and C (so-called aliases) are case sensitive.
• Case also matters for:
• the field names f1, f2 and f3
• the relation names A, B, C
• the function names PigStorage, COUNT
• The exceptions are the keywords LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE and DUMP 53
  • 54. Keywords: assert, and, any, all, arrange, as, asc, AVG, bag, BinStorage, by, bytearray, BIGINTEGER, BIGDECIMAL, cache, CASE, cat, cd, chararray, cogroup, CONCAT, copyFromLocal, copyToLocal, COUNT, cp, cross, datetime, %declare, %default, define, dense, desc, describe, DIFF, distinct, double, du, dump, e, E, eval, exec, explain, f, F, filter, flatten, float, foreach, full, generate, group, help, if, illustrate, import, inner, input, int, into, is, join, kill, l, L, left, limit, load, long, ls, map, matches, MAX, MIN, mkdir, mv, not, null, onschema, or, order, outer, output, parallel, pig, PigDump, PigStorage, pwd, quit, register, returns, right, rm, rmf, rollup, run, sample, set, ship, SIZE, split, stderr, stdin, stdout, store, stream, SUM, TextLoader, TOKENIZE, through, tuple, union, using, void 54
  • 55. First steps
data = LOAD 'input' AS (query:CHARARRAY);
A = LOAD 'data' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
STORE A INTO '/tmp/result' USING PigStorage(';'); 55
  • 57. Next steps - operations on data
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int, scholarship:float);
B = FILTER A BY age > 20; 57
  • 58. Next steps - operations on data
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int, scholarship:float);
B = FILTER A BY age > 20;
C = LIMIT B 5; 58
  • 59. Next steps - operations on data
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int, scholarship:float);
B = FILTER A BY age > 20;
C = LIMIT B 5;
D = FOREACH C GENERATE name, scholarship*semestre AS funds; 59
  • 60. Next steps - operations on data
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int, scholarship:float);
E = GROUP A BY age; 60
  • 61. Next steps - operations on data
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int, scholarship:float);
E = GROUP A BY age;
F = FOREACH E GENERATE group AS age, AVG(A.scholarship); 61
(a combined script is sketched below)
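Putting the steps above together, a minimal end-to-end sketch (the input file layout and the output path are assumptions):
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int, scholarship:float);
E = GROUP A BY age;
F = FOREACH E GENERATE group AS age, AVG(A.scholarship) AS avg_scholarship;
STORE F INTO 'scholarship_by_age' USING PigStorage(';');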
  • 63. What next? UDFs (see the sketch below), PigUnit, integrations 63
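A taste of the UDF bullet - a minimal EvalFunc sketch (the class name, jar name and usage are hypothetical):
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// trivial hypothetical UDF: upper-cases its first argument
public class ToUpper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) return null;
    return ((String) input.get(0)).toUpperCase();
  }
}
Used from a script: REGISTER myudfs.jar; B = FOREACH A GENERATE ToUpper(name);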
  • 65. Introduction to data analysis, illustrated with Hive 65
  • 67. Unique Hive features: SQL queries over flat files, e.g. CSV 67
  • 68. Unique Hive features: analysis gets much faster - no need to hand-write MapReduce; the optimizer can execute parts of a query in memory instead of as MR jobs 68
  • 69. Unique Hive features: virtually unlimited integration options - MongoDB, Elastic Search, HBase, Solr 69
  • 70. Unique Hive features: integration with BI tools and other applications via JDBC 70
  • 71. Hive CLI
Interactive mode: hive
Batch mode:
hive -e 'select foo from bar'
hive -f '/path/to/my/script.q'
hive -f 'hdfs://namenode:port/path/to/my/script.q'
more options: hive --help 71
  • 72. Data types: INT, TINYINT, SMALLINT, BIGINT, BOOLEAN, DECIMAL, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, ARRAY, MAP, STRUCT, UNION, DATE, CHAR, VARCHAR 72
  • 73. Query syntax: SELECT, INSERT, UPDATE; GROUP BY; UNION; LEFT, RIGHT, FULL INNER, FULL OUTER JOIN; OVER, RANK; (NOT) IN, HAVING; (NOT) EXISTS 73
  • 74. Data Definition Language • CREATE DATABASE/SCHEMA, TABLE, VIEW, FUNCTION, INDEX • DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX • TRUNCATE TABLE • ALTER DATABASE/SCHEMA, TABLE, VIEW • MSCK REPAIR TABLE (or ALTER TABLE RECOVER PARTITIONS) • SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES, PARTITIONS, FUNCTIONS, INDEX[ES], COLUMNS, CREATE TABLE • DESCRIBE DATABASE/SCHEMA, table_name, view_name 74
  • 75. Tables
CREATE TABLE page_view(viewTime INT, userid BIGINT,
  page_url STRING, referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE; 75
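Partition columns behave like ordinary columns in queries, and filtering on them lets Hive read only the matching directories; the values below are examples:
SELECT page_url, count(*)
FROM page_view
WHERE dt = '2015-06-01' AND country = 'PL'  -- prunes the scan to a single partition
GROUP BY page_url;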
  • 76. First steps in Hive
CREATE TABLE tablename1 (foo INT, bar STRING) PARTITIONED BY (ds STRING);
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename1;
INSERT INTO TABLE tablename1 PARTITION (ds='2014') select_statement1 FROM from_statement; 76
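A concrete variant of the templates above (the file path and values are assumptions); note that loading into a partitioned table requires naming the target partition:
CREATE TABLE tablename1 (foo INT, bar STRING)
PARTITIONED BY (ds STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/tmp/data.csv' INTO TABLE tablename1 PARTITION (ds='2014');
SELECT foo, bar FROM tablename1 WHERE ds = '2014' LIMIT 10;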
  • 77. First steps in Hive
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
DELETE FROM tablename [WHERE expression] 77
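A caveat worth remembering: UPDATE and DELETE only run against transactional (ACID) tables, which in Hive 0.14+ means ORC storage, bucketing and the transactional table property, with the transaction manager enabled on the server side. A sketch:
CREATE TABLE students_acid (name STRING, age INT)
CLUSTERED BY (name) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

UPDATE students_acid SET age = 21 WHERE name = 'Ala';
DELETE FROM students_acid WHERE age < 18;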
  • 78. Other file formats? SerDe
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
CREATE TABLE apachelog (
  host STRING, identity STRING, user STRING, time STRING, request STRING,
  status STRING, size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE; 78
  • 79. Other file formats? SerDe
CREATE TABLE my_table (
  foo STRING,
  bar STRING)
STORED AS TEXTFILE; ← or SEQUENCEFILE, ORC, AVRO or PARQUET 79
  • 80. Strengths, weaknesses, comparison - Hive vs Pig:
• declarative vs procedural
• temporary tables vs pipelines
• you rely on the optimizer vs you step into the implementation more
• UDFs and Transform vs UDFs and streaming
• SQL drivers vs data pipeline splits 80
  • 83. What next? • Integrations with Solr, Elastic, MongoDB • UDFs • multi-table inserts • JDBC (sketch below) 83
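A minimal sketch of the JDBC route (the host, port, credentials and table are placeholders; assumes the HiveServer2 driver org.apache.hive.jdbc.HiveDriver on the classpath):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 listens on port 10000 by default
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://namenode:10000/default", "hive", "");
         Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT foo, bar FROM tablename1 LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
      }
    }
  }
}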
  • 85. Cluster monitoring and management, illustrated with Ambari 85
  • 86. CLI • YARN administration commands • resourcemanager nodemanager proxyserver rmadmin daemonlog • HDFS • user commands • dfs • fsck • administration commands • datanode dfsadmin namenode • HBase • start-hbase.sh, stop-hbase.sh • HCatalog, HiveServer • Kafka, Storm, Tez, Spark, Oozie and others • monitoring, configuration, upgrades 86
  • 87. Ambari • cluster management • monitoring console • installation of new nodes • configuration • decommissioning servers 87
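Besides the web console, Ambari also exposes a REST API; two illustrative calls (the host, credentials and cluster name are placeholders):
curl -u admin:admin http://ambari-host:8080/api/v1/clusters                          # list managed clusters
curl -u admin:admin http://ambari-host:8080/api/v1/clusters/mycluster/services/HDFS  # state of a single service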
  • 92. Want to know more? Our training courses allow for individual work with each participant • we work in groups of 4-8 people • the programme can be adjusted to the group's expectations • we work through and answer each participant's individual questions • we have much more time :)
  • 93. A course tailored to you. Are you an architect or a team leader? • the comprehensive 5-day course covers and exercises the entire Hadoop ecosystem • the dedicated course for architects discusses the design of Big Data systems. Are you an analyst? • on a dedicated course you will practise Pig and Hive in detail and solve sample analytical problems
  • 94. A course tailored to you. Are you a developer? • the 3-day course lets you dig into advanced MapReduce programming in Java and into stream processing. Interested in the whole Big Data picture? • Big Data processing with Apache Spark • NoSQL databases - Cassandra • NoSQL databases - MongoDB
  • 95. Sources • HikingArtist.com - drawings • hortonworks.com - HDP architecture • apache.org - Pig, Hive and Hadoop graphics
  • 97. Wprowadzenie do technologii Big Data Radosław Stankiewicz - radoslaw@zagwozdka.com www.sages.com.pl