Big Data in Practice: 
The TrustYou Tech Stack 
Cluj Big Data Meetup, Nov 18th 
Steffen Wenz, CTO
Goals of today’s talk 
● Relate first-hand experiences with a big data tech 
stack 
● Introduce a few essential technologies beyond 
Hadoop: 
○ Hortonworks HDP 
○ Apache Pig 
○ Luigi
Who are we? 
● For each hotel on the 
planet, provide a 
summary of all reviews 
● Expertise: 
○ NLP 
○ Machine Learning 
○ Big Data 
● Clients: …
TrustYou Tech Stack 
Batch Layer 
● Hadoop (HDP 2.1) 
● Python 
● Pig 
● Luigi 
Service Layer 
● PostgreSQL 
● MongoDB 
● Redis 
● Cassandra 
[Diagram: Hadoop cluster (100 nodes) and application machines, connected by arrows labeled Data, Data, and Queries]
Hadoop cluster 
(includes all live and 
development machines)
Python ♥ Big Data 
Hadoop is Java-first, but: 
● Hadoop streaming 
cat input | ./map.py |  
sort | ./reduce.py > output 
○ MRJob, Luigi 
○ VirtualEnv 
● Pig: Python UDFs 
● Real-time processing: 
PySpark, PyStorm 
● Data processing: 
○ Numpy, SciPy 
○ Pandas 
● NLP: 
○ NLTK 
● Machine learning: 
○ Scikit-learn 
○ Gensim (word2vec)
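The Hadoop streaming pipeline above (map, sort, reduce over plain text lines) can be sketched in pure Python. This is a minimal word-count illustration, not TrustYou's actual map.py or reduce.py; the function names and the sample reviews are assumptions made for the example.

```python
import itertools


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" record per word, as a map.py would print."""
    for line in lines:
        for word in line.split():
            yield "{}\t1".format(word.lower())


def reducer(sorted_records):
    """Sum the counts of consecutive records sharing a key, as a reduce.py would."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in itertools.groupby(parsed, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)


def run_pipeline(lines):
    """Local stand-in for: cat input | ./map.py | sort | ./reduce.py"""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    # Hypothetical sample input
    for word, count in run_pipeline(["Nice room", "Room was not so great"]):
        print("{}\t{}".format(word, count))
```

The reducer only works because the records arrive sorted by key; on a cluster, Hadoop's shuffle phase provides that guarantee in place of the local sort.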
Use case: Semantic analysis 
● “Nice room” 
● “Room wasn’t so great” 
● “The air-conditioning 
was so powerful that we 
were cold in the room 
even when it was off.” 
● “อาหารรสชาติดี” 
● “خدمة جیدة” 
● 20 languages 
● Linguistic system 
(morphology, taggers, 
grammars, parsers …) 
● Hadoop: Scale out CPU 
● Python for ML & NLP 
libraries
Hortonworks Distribution 
● Hortonworks Data 
Platform: Enterprise 
architecture out of the 
box 
● Try out in VM: 
Hortonworks Sandbox 
● Alternatives: Cloudera 
CDH, MapR 
TrustYou & Hortonworks @ BITKOM Big Data Summit
Apache Pig 
● Define & execute parallel data flows 
on Hadoop 
○ Engine + Language (“Pig Latin”) + Shell (“Grunt”) 
● “SQL of big data” (bad comparison; many differences) 
● Goal: Make Pig Latin the native language of parallel data 
processing 
● Native support for: Projection, filtering, sort, group, join
Why not just MapReduce? 
● Projection 
SELECT a, b ... 
● Filter 
WHERE ... 
● Sort 
● Distinct 
● Group 
● Join 
Source: Hadoop: The Definitive Guide
Pig Example 
Load one day of raw GDELT data 
SQL: 
-- omitted: create table, insert 
select * from gdelt limit 10; 
Pig: 
gdelt = load '20141112.export.CSV'; 
gdelt = limit gdelt 10; 
dump gdelt; 
Pigs eat anything!
Specifying a schema 
gdelt = load '20141112.export.CSV' as ( 
event_id: chararray, 
sql_date: chararray, 
month_year: chararray, 
year: chararray, 
fraction_date: chararray, 
actor1: chararray, 
actor1_name: chararray, 
-- ... 59 columns in total ... 
event: int, 
goldstein_scale: float, 
date_added: chararray, 
source_url: chararray 
);
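What the schema declaration does to each input line can be sketched in Python. The sketch assumes tab-separated input and that a value which cannot be cast to the declared type becomes null instead of failing the job, which is how Pig generally behaves. The four-column schema is a shortened, hypothetical stand-in for the 59 GDELT columns.

```python
def cast(value, pig_type):
    """Convert one raw text field; return None (standing in for Pig's null) if the cast fails."""
    if pig_type == "chararray":
        return value
    converter = {"int": int, "float": float}[pig_type]
    try:
        return converter(value)
    except (TypeError, ValueError):
        return None


# Hypothetical, shortened schema; the real GDELT export has many more columns.
SCHEMA = [
    ("event_id", "chararray"),
    ("actor1_country", "chararray"),
    ("event", "int"),
    ("goldstein_scale", "float"),
]


def parse_line(line, schema=SCHEMA):
    """Split a tab-separated line and apply the declared type to each field."""
    fields = line.rstrip("\n").split("\t")
    # Pad short lines so that missing trailing fields also come out as None.
    fields += [None] * (len(schema) - len(fields))
    return {name: cast(raw, pig_type) for (name, pig_type), raw in zip(schema, fields)}
```

For example, `parse_line("1\tUSA\t42\t-2.5")` yields a typed record, while a non-numeric `event` field comes back as `None`.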
Pig Example 
Look at all non-empty actor countries 
SQL: 
select actor1_country 
from gdelt 
where actor1_country != ''; 
Pig: 
-- where: 
gdelt = filter gdelt 
    by actor1_country != ''; 
-- select: 
country = foreach gdelt 
    generate actor1_country; 
dump country;
Pig Example 
Get histogram of actor countries 
SQL: 
select actor1_country, count(*) 
from gdelt 
group by actor1_country; 
Pig: 
gdelt_grp = group gdelt 
    by actor1_country; 
gdelt_cnt = foreach gdelt_grp generate 
    group as country, 
    COUNT(gdelt) as count; 
dump gdelt_cnt;
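Unlike SQL, Pig splits the histogram into two explicit steps: `group` builds one bag of rows per key, and `foreach ... generate COUNT(...)` then aggregates each bag. A rough Python equivalent, with invented helper names and sample rows, might look like this:

```python
from collections import defaultdict


def group_by(rows, key):
    """Like Pig's `group ... by`: map each key value to the "bag" of rows sharing it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def count_per_group(groups):
    """Like `foreach ... generate group, COUNT(bag)`: one count per group."""
    return {group: len(bag) for group, bag in groups.items()}


# Hypothetical sample rows, not real GDELT data
rows = [
    {"actor1_country": "USA"},
    {"actor1_country": "DEU"},
    {"actor1_country": "USA"},
]
histogram = count_per_group(group_by(rows, "actor1_country"))
```

Keeping the grouped bags as an intermediate value is what lets a Pig script run several different aggregations over the same grouping.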
Pig Example 
Count total rows, count distinct event codes 
SQL: 
select count(*) from gdelt; 
select count(distinct event) from gdelt; 
Pig: 
gdelt_grp = group gdelt all; 
gdelt_cnt = foreach gdelt_grp generate 
    COUNT(gdelt); 
dump gdelt_cnt; -- 180793 
event = foreach gdelt generate event; 
event_dis = distinct event; 
event_grp = group event_dis all; 
event_cnt = foreach event_grp generate 
    COUNT(event_dis); 
dump event_cnt; -- 215
Things you can’t do in Pig 
i = 2; 
Top-level variables are bags 
(sort of like tables). 
if (x#a == 2) dump xs; 
None of the usual control 
structures. You define data 
flows. 
For everything else: UDFs 
(user-defined functions). 
Custom operators 
implemented in Java or 
Python. 
Also: Directly call Java 
static methods
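A Python UDF is an ordinary function plus a declaration of its output schema. The sketch below is an assumption-laden illustration: the function `normalize_country` and a file name such as `udfs.py` are invented, and the fallback decorator exists only so the function can be imported and tested outside of Pig.

```python
try:
    # Provided by Pig when the script runs as a CPython ("streaming_python") UDF.
    from pig_util import outputSchema
except ImportError:
    # Fallback for running outside of Pig: record the schema on the function.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("country:chararray")
def normalize_country(code):
    """Trim and upper-case a country code; map empty or missing input to None (null)."""
    if code is None:
        return None
    code = code.strip().upper()
    return code or None
```

In a Pig script, such a file would typically be made available with a statement along the lines of `register 'udfs.py' using jython as udfs;` and then called as `udfs.normalize_country(actor1_country)` inside a `foreach ... generate`.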
Cool, but where’s the parallelism? 
event = foreach gdelt generate event; 
-- map 
event_dis = distinct event 
parallel 50; -- reduce! 
event_grp = group event_dis all 
parallel 50; -- reduce! 
event_cnt = foreach event_grp generate 
COUNT(event_dis); -- map 
dump event_cnt;
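`parallel 50` sets the number of reducers. Which reducer handles a record is decided by hashing its key, so all records with the same key meet in the same place. The following sketch mimics that routing in Python; `zlib.crc32` is only a stand-in for Hadoop's own hash function, and the function names are invented.

```python
import zlib
from collections import defaultdict


def reducer_for(key, num_reducers):
    """Pick a reducer index from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(records, num_reducers):
    """Route (key, value) records to reducers; records with equal keys share a reducer."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[reducer_for(key, num_reducers)].append((key, value))
    return partitions


if __name__ == "__main__":
    # Hypothetical records: (event code, 1)
    records = [("010", 1), ("042", 1), ("010", 1), ("190", 1)]
    for index, part in sorted(shuffle(records, 50).items()):
        print(index, part)
```

More reducers only help when there are enough distinct keys: `group ... all` produces a single key, so that step ends up on one reducer regardless of the `parallel` setting.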
Pig’s execution engine 
$ pig -x local -e "explain -script gdelt.pig" 
#----------------------------------------------- 
# New Logical Plan: 
#----------------------------------------------- 
event_cnt: (Name: LOStore Schema: #131:long) 
ColumnPrune:InputUids=[69]ColumnPrune:OutputUids=[69] 
| 
|---event_cnt: (Name: LOForEach Schema: #131:long) 
| | 
| (Name: LOGenerate[false] Schema: #131:long) 
| | | 
| | (Name: UserFunc(org.apache.pig.builtin.COUNT) Type: long Uid: 131) 
| | | 
| | |---event_dis:(Name: Project Type: bag Uid: 67 Input: 0 Column: (*)) 
| | 
| |---event_dis: (Name: LOInnerLoad[1] Schema: event#27:int) 
| 
|---event_grp: (Name: LOCogroup Schema: group#66:chararray,event_dis#67:bag{#129:tuple(event#27:int)}) 
$ pig -x local -e "explain -script gdelt.pig -dot -out gdelt.dot" 
$ dot -Tpng gdelt.dot > gdelt.png
Pig advanced: Asymmetric country relations 
-- we're only interested in countries 
gdelt = filter ( 
foreach gdelt generate actor1_country, actor2_country, goldstein_scale 
) by actor1_country != '' and actor2_country != ''; 
gdelt_grp = group gdelt by (actor1_country, actor2_country); 
-- it's not necessary to aggregate twice - except that Pig doesn't allow self joins 
gold_1 = foreach gdelt_grp generate 
group.actor1_country as actor1_country, 
group.actor2_country as actor2_country, 
SUM(gdelt.goldstein_scale) as goldstein_scale; 
gold_2 = foreach gdelt_grp generate 
group.actor1_country as actor1_country, 
group.actor2_country as actor2_country, 
SUM(gdelt.goldstein_scale) as goldstein_scale; 
-- join both sums together, to get the Goldstein values for both directions in one row 
gold = join gold_1 by (actor1_country, actor2_country), gold_2 by (actor2_country, actor1_country);
Pig advanced: Asymmetric country relations 
-- compute the difference in Goldstein score 
gold = foreach gold generate 
gold_1::actor1_country as actor1_country, 
gold_1::actor2_country as actor2_country, 
gold_1::goldstein_scale as gold_1, 
gold_2::goldstein_scale as gold_2, 
ABS(gold_1::goldstein_scale - gold_2::goldstein_scale) as diff; 
-- keep only the values where one direction is positive, the other negative 
-- also, remove all duplicate rows 
gold = filter gold by gold_1 * gold_2 < 0 and actor1_country < actor2_country; 
gold = order gold by diff desc; 
dump gold;
Pig advanced: Asymmetric country relations 
(PSE,USA,93.49999961256981,-76.30000001192093,169.79999962449074) 
(NGA,USA,15.900000423192978,-143.5999995470047,159.49999997019768) 
(ISR,JOR,143.89999967813492,-12.700000494718552,156.60000017285347) 
(IRN,SYR,103.50000095367432,-50.50000023841858,154.0000011920929) 
(IRN,ISR,16.60000056028366,-112.40000087022781,129.00000143051147) 
(GBR,RUS,73.09999999403954,-41.99999952316284,115.09999951720238) 
(EGY,SYR,-87.60000020265579,12.0,99.60000020265579) 
(USA,YEM,-78.30000007152557,15.700000047683716,94.00000011920929) 
(ISR,TUR,2.4000001549720764,-90.60000002384186,93.00000017881393) 
(MYS,UKR,35.10000038146973,-52.0,87.10000038146973) 
(GRC,TUR,-47.60000029206276,36.5,84.10000029206276) 
(HTI,USA,34.99999976158142,-45.40000009536743,80.39999985694885)
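For readers who do not use Pig, the same data flow (filter, group and sum per direction, self join on the reversed pair, filter on opposite signs, sort by difference) can be sketched in plain Python. This is an in-memory illustration with an invented function name, not a replacement for the Pig script.

```python
from collections import defaultdict


def asymmetric_relations(rows):
    """rows: iterable of (actor1_country, actor2_country, goldstein_scale) tuples."""
    # filter + group by (actor1, actor2) + SUM(goldstein_scale)
    sums = defaultdict(float)
    for actor1, actor2, goldstein in rows:
        if actor1 and actor2:
            sums[(actor1, actor2)] += goldstein

    result = []
    for (a, b), forward in sums.items():
        # inner join with the reverse direction; skip pairs seen one way only
        backward = sums.get((b, a))
        if backward is None:
            continue
        # keep opposite signs only, and report each unordered pair once (a < b)
        if forward * backward < 0 and a < b:
            result.append((a, b, forward, backward, abs(forward - backward)))
    # order by diff desc
    return sorted(result, key=lambda row: row[4], reverse=True)
```

Because the dictionary of sums can be looked up in both directions, the Python version needs only one aggregation, whereas the Pig script aggregates twice to get two relations it can join.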
Apache Pig @ TrustYou 
● Before: 
○ Usage of Unix utilities (sort, cut, awk etc.) and 
custom tools (map_filter.py, reduce_agg.py) to 
transform data with Hadoop Streaming 
● Now: 
○ Data loading & transformation expressed in Pig 
○ PigUnit for testing 
○ Core algorithms still implemented in Python
Further Reading on Pig 
● O’Reilly Book - free online version 
See code samples on TrustYou GitHub account: 
https://github.com/trustyou/meetups/tree/master/big-data
Luigi 
● Build complex pipelines of 
batch jobs 
○ Dependency resolution 
○ Parallelism 
○ Resume failed jobs 
● Pythonic replacement for Apache Oozie 
● Not a replacement for Pig, Cascading, Hive
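Dependency resolution and resuming after failures both follow from one rule: a task whose output already exists is done. The toy scheduler below illustrates that rule with the standard library only. It is not Luigi's API or implementation; the `Task` class and `build` function are invented for this sketch.

```python
import os
import tempfile


class Task(object):
    """Toy stand-in for a batch job whose completion is marked by an output file."""

    def __init__(self, name, out_dir, requires=()):
        self.name = name
        self.path = os.path.join(out_dir, name)
        self.requires = list(requires)

    def complete(self):
        return os.path.exists(self.path)

    def run(self):
        with open(self.path, "w") as out:
            out.write("done\n")


def build(task, log):
    """Run dependencies first; skip any task whose output already exists."""
    if task.complete():
        return
    for dependency in task.requires:
        build(dependency, log)
    task.run()
    log.append(task.name)


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    crawl = Task("crawl", out_dir)
    extract = Task("extract", out_dir, requires=[crawl])
    log = []
    build(extract, log)
    print(log)
```

Calling `build` a second time does nothing, and deleting only the last output reruns only the last task. That is the behavior that makes a failed pipeline resumable.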
Anatomy of a Luigi task 
import luigi

class MyTask(luigi.Task): 
    # Parameters which control the behavior of the task. Same parameters = the task only needs to run once! 
    param1 = luigi.Parameter() 

    # These dependencies need to be done before this task can start. Can also be a list or dict 
    def requires(self): 
        # DependentTask is another luigi.Task, defined elsewhere 
        return DependentTask(self.param1) 

    # Path to output file (local or HDFS). If this file is present, Luigi considers this task to be done. 
    def output(self): 
        return luigi.LocalTarget("data/my_task_output_{}".format(self.param1)) 

    def run(self): 
        # To make task execution atomic, Luigi writes all output to a temporary file, and only renames when 
        # you close the target. 
        with self.output().open("w") as out: 
            out.write("foo")
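The write-then-rename idea mentioned in the comment can be shown with the standard library alone. This is a generic sketch of the technique, not Luigi's actual code, and `atomic_write` is an invented helper name.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to a temporary file, then rename it onto the final path."""
    directory = os.path.dirname(path) or "."
    # Create the temporary file in the target directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
        # On POSIX systems the rename is atomic: readers see either the old
        # file or the complete new one, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This matters for a scheduler that treats "the output file exists" as "the task is done": a job that crashes halfway must not leave a partial file behind under the final name.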
Luigi tasks vs. Makefiles 
class MyTask(luigi.Task): 
    def requires(self): 
        return DependentTask() 

    def output(self): 
        return luigi.LocalTarget("data/my_task_output") 

    def run(self): 
        with self.output().open("w") as out: 
            out.write("foo") 

data/my_task_output: DependentTask 
	run 
	run 
	run ...
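Luigi Hadoop integration 
# Module paths as on the slide; current Luigi releases provide these as 
# luigi.contrib.hadoop.JobTask and luigi.contrib.hdfs.HdfsTarget. 
class HadoopTask(luigi.hadoop.JobTask): 
    def output(self): 
        return luigi.HdfsTarget("output_in_hdfs") 

    def requires(self): 
        return { 
            "some_task": SomeTask(), 
            "some_other_task": SomeOtherTask() 
        } 

    def mapper(self, line): 
        # input lines are tab-separated key/value pairs 
        key, value = line.rstrip().split("\t") 
        yield key, value 

    def reducer(self, key, values): 
        yield key, ", ".join(values)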
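To see what this job computes without a cluster, the mapper and reducer can be run locally as plain functions, with a dictionary playing the role of Hadoop's shuffle. The `run_locally` helper and the sample lines are assumptions made for this sketch.

```python
from collections import defaultdict


def mapper(line):
    """Split a tab-separated line into a (key, value) pair."""
    key, value = line.rstrip("\n").split("\t")
    yield key, value


def reducer(key, values):
    """Join all values seen for one key into a single comma-separated string."""
    yield key, ", ".join(values)


def run_locally(lines):
    """Map every line, group values by key (the shuffle), then reduce each key."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output


if __name__ == "__main__":
    # Hypothetical input lines
    lines = ["hotel_a\tnice room\n", "hotel_b\tgreat pool\n", "hotel_a\tfriendly staff\n"]
    for key, joined in run_locally(lines):
        print("{}\t{}".format(key, joined))
```

Testing the two functions this way, before submitting the job, keeps the feedback loop short, since they contain all of the job's custom logic.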
Luigi example 
Crawl a URL, then extract all links from it! 
[Diagram: CrawlTask(url), then ExtractTask(url)]
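Luigi example: CrawlTask 
import hashlib 
import os 

import luigi 
import requests 

class CrawlTask(luigi.Task): 
    url = luigi.Parameter() 

    def output(self): 
        # hash the URL to get a unique, file-system-safe file name 
        url_hash = hashlib.md5(self.url.encode("utf-8")).hexdigest() 
        return luigi.LocalTarget(os.path.join("data", "crawl_" + url_hash)) 

    def run(self): 
        response = requests.get(self.url) 
        # the target is opened in text mode, so write the decoded text directly 
        with self.output().open("w") as out: 
            out.write(response.text)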
Luigi example: ExtractTask 
import hashlib 
import os 

import bs4 
import luigi 

class ExtractTask(luigi.Task): 
    url = luigi.Parameter() 

    def requires(self): 
        return CrawlTask(self.url) 

    def output(self): 
        url_hash = hashlib.md5(self.url.encode("utf-8")).hexdigest() 
        return luigi.LocalTarget(os.path.join("data", "extract_" + url_hash)) 

    def run(self): 
        # input() is the output target of the required CrawlTask 
        with self.input().open() as crawled: 
            soup = bs4.BeautifulSoup(crawled.read(), "html.parser") 
        with self.output().open("w") as out: 
            for link in soup.find_all("a"): 
                out.write(str(link.get("href")) + "\n")
Luigi example: Running it locally 
$ python luigi_demo.py --local-scheduler ExtractTask --url http://www.trustyou.com 
DEBUG: Checking if ExtractTask(url=http://www.trustyou.com) is complete 
INFO: Scheduled ExtractTask(url=http://www.trustyou.com) (PENDING) 
DEBUG: Checking if CrawlTask(url=http://www.trustyou.com) is complete 
INFO: Scheduled CrawlTask(url=http://www.trustyou.com) (PENDING) 
INFO: Done scheduling tasks 
INFO: Running Worker with 1 processes 
DEBUG: Asking scheduler for work... 
DEBUG: Pending tasks: 2 
INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) running CrawlTask(url=http://www.trustyou.com) 
INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) done CrawlTask(url=http://www.trustyou.com) 
DEBUG: 1 running tasks, waiting for next task to finish 
DEBUG: Asking scheduler for work... 
DEBUG: Pending tasks: 1 
INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) running ExtractTask(url=http://www.trustyou.com) 
INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) done ExtractTask(url=http://www.trustyou.com) 
DEBUG: 1 running tasks, waiting for next task to finish 
DEBUG: Asking scheduler for work...
Luigi @ TrustYou 
● Before: 
○ Bash scripts + cron 
○ Manual cleanup after 
failures due to network 
issues etc. 
● Now: 
○ Complex nested Luigi job 
graphs 
○ Failed jobs usually repair 
themselves
TrustYou wants you! 
We offer positions 
in Cluj & Munich: 
● Data engineer 
● Application developer 
● Crawling engineer 
Write me at swenz@trustyou.net, check out our website, 
or see you at the next meetup!
Backup
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Cluj Big Data Meetup - Big Data in Practice

  • 1. Big Data in Practice: The TrustYou Tech Stack Cluj Big Data Meetup, Nov 18th Steffen Wenz, CTO
  • 2. Goals of today’s talk ● Relate first-hand experiences with a big data tech stack ● Introduce a few essential technologies beyond Hadoop: ○ Hortonworks HDP ○ Apache Pig ○ Luigi
  • 3.
  • 4. Who are we? ● For each hotel on the planet, provide a summary of all reviews ● Expertise: ○ NLP ○ Machine Learning ○ Big Data ● Clients: …
  • 5.
  • 6. TrustYou Tech Stack Batch Layer ● Hadoop (HDP 2.1) ● Python ● Pig ● Luigi Service Layer ● PostgreSQL ● MongoDB ● Redis ● Cassandra Data Data Queries Hadoop cluster (100 nodes) Application machines
  • 7. Hadoop cluster (includes all live and development machines)
  • 8. Python ♥ Big Data Hadoop is Java-first, but: ● Hadoop streaming cat input | ./map.py | sort | ./reduce.py > output ○ MRJob, Luigi ○ VirtualEnv ● Pig: Python UDFs ● Real-time processing: PySpark, PyStorm ● Data processing: ○ NumPy, SciPy ○ Pandas ● NLP: ○ NLTK ● Machine learning: ○ Scikit-learn ○ Gensim (word2vec)
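The Hadoop streaming pipeline on slide 8 (`cat input | ./map.py | sort | ./reduce.py > output`) can be tried out in a single process. The following is a minimal sketch, assuming a word-count job; the `mapper`, `reducer` and `run_local` names and the sample reviews are illustrative and not taken from the talk.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated (word, 1) record per word, as a map.py script might."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per key; this only works because the input is sorted by key."""
    pairs = (record.split("\t") for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)


def run_local(lines):
    """Mimic `cat input | ./map.py | sort | ./reduce.py` without Hadoop."""
    return dict(reducer(sorted(mapper(lines))))


# Hypothetical input, loosely modelled on the review snippets in the talk.
reviews = ["Nice room", "Room was not so great", "Nice staff"]
word_counts = run_local(reviews)
```

The `sorted` call plays the role of the shuffle phase: the reducer never sees the whole data set at once, only runs of records that share a key.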
  • 9. Use case: Semantic analysis ● “Nice room” ● “Room wasn’t so great” ● “The air-conditioning was so powerful that we were cold in the room even when it was off.” ● “อาหารรสชาติดี” ● “خدمة جیدة” ● 20 languages ● Linguistic system (morphology, taggers, grammars, parsers …) ● Hadoop: Scale out CPU ● Python for ML & NLP libraries
  • 11. Hortonworks Distribution ● Hortonworks Data Platform: Enterprise architecture out of the box ● Try out in VM: Hortonworks Sandbox ● Alternatives: Cloudera CDH, MapR TrustYou & Hortonworks @ BITKOM Big Data Summit
  • 12. Apache Pig ● Define & execute parallel data flows on Hadoop ○ Engine + Language (“Pig Latin”) + Shell (“Grunt”) ● “SQL of big data” (bad comparison; many differences) ● Goal: Make Pig Latin native language of parallel data processing ● Native support for: Projection, filtering, sort, group, join
  • 13. Why not just MapReduce? ● Projection SELECT a, b ... ● Filter WHERE ... ● Sort ● Distinct ● Group ● Join Source: Hadoop: The Definitive Guide
  • 14. Pig Example Load one day of raw GDELT data -- omitted: create table, insert select * from gdelt limit 10; gdelt = load '20141112.export.CSV'; gdelt = limit gdelt 10; dump gdelt; Pigs eat anything!
  • 15. Specifying a schema gdelt = load '20141112.export.CSV' as ( event_id: chararray, sql_date: chararray, month_year: chararray, year: chararray, fraction_date: chararray, actor1: chararray, actor1_name: chararray, -- ... 59 columns in total ... event: int, goldstein_scale: float, date_added: chararray, source_url: chararray );
  • 16. Pig Example Look at all non-empty actor countries select actor1_country from gdelt where actor1_country != ''; -- where: gdelt = filter gdelt by actor1_country != ''; -- select: country = foreach gdelt generate actor1_country; dump country;
  • 17. Pig Example Get histogram of actor countries select actor1_country, count(*) from gdelt group by actor1_country; gdelt_grp = group gdelt by actor1_country; gdelt_cnt = foreach gdelt_grp generate group as country, COUNT(gdelt) as count; dump gdelt_cnt;
  • 18. Pig Example Count total rows, count distinct event IDs select count(*) from gdelt; select count(distinct event_id) from gdelt; gdelt_grp = group gdelt all; gdelt_cnt = foreach gdelt_grp generate COUNT(gdelt); dump gdelt_cnt; -- 180793 event = foreach gdelt generate event; event_dis = distinct event; event_grp = group event_dis all; event_cnt = foreach event_grp generate COUNT(event_dis); dump event_cnt; -- 215
  • 19. Things you can’t do in Pig i = 2; Top-level variables are bags (sort of like tables). if (x#a == 2) dump xs; None of the usual control structures. You define data flows. For everything else: UDFs (user-defined functions). Custom operators implemented in Java or Python. Also: Directly call Java static methods
  • 20. Cool, but where’s the parallelism? event = foreach gdelt generate event; -- map event_dis = distinct event parallel 50; -- reduce! event_grp = group event_dis all parallel 50; -- reduce! event_cnt = foreach event_grp generate COUNT(event_dis); -- map dump event_cnt;
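To make the `parallel` clause on slide 20 more concrete: each reduce step spreads keys across several reducers so that equal keys always land on the same one. Below is a rough sketch of that idea in plain Python. The `partition` function uses `zlib.crc32` as an illustrative stand-in and does not reproduce Hadoop's actual partitioner; the event codes are made up.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 4  # stands in for `parallel 50`; kept small so the result is easy to inspect


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick a reducer for a key; a stable hash keeps equal keys together."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def distinct_parallel(values, num_reducers=NUM_REDUCERS):
    # "Map" side: route every value to the reducer that owns its key.
    buckets = defaultdict(list)
    for value in values:
        buckets[partition(value, num_reducers)].append(value)
    # "Reduce" side: each reducer deduplicates only its own bucket.
    return [set(buckets[i]) for i in range(num_reducers)]


events = [10, 20, 10, 43, 20, 190, 43, 10]  # hypothetical event codes
per_reducer = distinct_parallel(events)
distinct_count = sum(len(bucket) for bucket in per_reducer)
```

Because no key can appear in two buckets, the per-reducer results can simply be added up, which is what the final `group ... all` and `COUNT` do in the Pig script.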
  • 21. Pig’s execution engine $ pig -x local -e "explain -script gdelt.pig" #----------------------------------------------- # New Logical Plan: #----------------------------------------------- event_cnt: (Name: LOStore Schema: #131:long) ColumnPrune:InputUids=[69]ColumnPrune:OutputUids=[69] | |---event_cnt: (Name: LOForEach Schema: #131:long) | | | (Name: LOGenerate[false] Schema: #131:long) | | | | | (Name: UserFunc(org.apache.pig.builtin.COUNT) Type: long Uid: 131) | | | | | |---event_dis:(Name: Project Type: bag Uid: 67 Input: 0 Column: (*)) | | | |---event_dis: (Name: LOInnerLoad[1] Schema: event#27:int) | |---event_grp: (Name: LOCogroup Schema: group#66:chararray,event_dis#67:bag{#129:tuple(event#27:int)}) $ pig -x local -e "explain -script gdelt.pig -dot -out gdelt.dot" $ dot -Tpng gdelt.dot > gdelt.png
  • 22. Pig advanced: Asymmetric country relations -- we're only interested in countries gdelt = filter ( foreach gdelt generate actor1_country, actor2_country, goldstein_scale ) by actor1_country != '' and actor2_country != ''; gdelt_grp = group gdelt by (actor1_country, actor2_country); -- it's not necessary to aggregate twice - except that Pig doesn't allow self joins gold_1 = foreach gdelt_grp generate group.actor1_country as actor1_country, group.actor2_country as actor2_country, SUM(gdelt.goldstein_scale) as goldstein_scale; gold_2 = foreach gdelt_grp generate group.actor1_country as actor1_country, group.actor2_country as actor2_country, SUM(gdelt.goldstein_scale) as goldstein_scale; -- join both sums together, to get the Goldstein values for both directions in one row gold = join gold_1 by (actor1_country, actor2_country), gold_2 by (actor2_country, actor1_country);
  • 23. Pig advanced: Asymmetric country relations -- compute the difference in Goldstein score gold = foreach gold generate gold_1::actor1_country as actor1_country, gold_1::actor2_country as actor2_country, gold_1::goldstein_scale as gold_1, gold_2::goldstein_scale as gold_2, ABS(gold_1::goldstein_scale - gold_2::goldstein_scale) as diff; -- keep only the values where one direction is positive, the other negative -- also, remove all duplicate rows gold = filter gold by gold_1 * gold_2 < 0 and actor1_country < actor2_country; gold = order gold by diff desc; dump gold;
  • 24. Pig advanced: Asymmetric country relations (PSE,USA,93.49999961256981,-76.30000001192093,169.79999962449074) (NGA,USA,15.900000423192978,-143.5999995470047,159.49999997019768) (ISR,JOR,143.89999967813492,-12.700000494718552,156.60000017285347) (IRN,SYR,103.50000095367432,-50.50000023841858,154.0000011920929) (IRN,ISR,16.60000056028366,-112.40000087022781,129.00000143051147) (GBR,RUS,73.09999999403954,-41.99999952316284,115.09999951720238) (EGY,SYR,-87.60000020265579,12.0,99.60000020265579) (USA,YEM,-78.30000007152557,15.700000047683716,94.00000011920929) (ISR,TUR,2.4000001549720764,-90.60000002384186,93.00000017881393) (MYS,UKR,35.10000038146973,-52.0,87.10000038146973) (GRC,TUR,-47.60000029206276,36.5,84.10000029206276) (HTI,USA,34.99999976158142,-45.40000009536743,80.39999985694885)
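The data flow of slides 22 and 23 (filter, group, sum, join on the reversed pair, filter, order) can be mirrored in plain Python on toy data to check one's understanding. This is a sketch only: the country codes and scores below are hypothetical, and `asymmetric_relations` is an illustrative name, not part of the talk's code.

```python
from collections import defaultdict

# Hypothetical rows: (actor1_country, actor2_country, goldstein_scale)
rows = [
    ("AAA", "BBB", 4.0),
    ("AAA", "BBB", 2.0),
    ("BBB", "AAA", -5.0),
    ("AAA", "CCC", 1.0),
    ("CCC", "AAA", 3.0),
    ("", "BBB", 9.0),  # dropped by the filter: empty actor1_country
]


def asymmetric_relations(rows):
    # filter + group + SUM: total Goldstein score per ordered country pair
    totals = defaultdict(float)
    for actor1, actor2, score in rows:
        if actor1 != "" and actor2 != "":
            totals[(actor1, actor2)] += score

    result = []
    for (actor1, actor2), forward in totals.items():
        # join: look up the sum for the reversed direction
        backward = totals.get((actor2, actor1))
        # keep each pair once (actor1 < actor2), and only when the
        # two directions have opposite signs
        if backward is not None and actor1 < actor2 and forward * backward < 0:
            result.append((actor1, actor2, forward, backward, abs(forward - backward)))

    # order by diff desc
    return sorted(result, key=lambda row: row[4], reverse=True)


relations = asymmetric_relations(rows)
```

Unlike the Pig script, the dictionary lookup makes the self join trivial, but it also assumes that all sums fit into memory on one machine.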
  • 25. Apache Pig @ TrustYou ● Before: ○ Usage of Unix utilities (sort, cut, awk etc.) and custom tools (map_filter.py, reduce_agg.py) to transform data with Hadoop Streaming ● Now: ○ Data loading & transformation expressed in Pig ○ PigUnit for testing ○ Core algorithms still implemented in Python
  • 26. Further Reading on Pig ● O’Reilly Book - free online version. See code samples on the TrustYou GitHub account: https://github.com/trustyou/meetups/tree/master/big-data
  • 27. Luigi ● Build complex pipelines of batch jobs ○ Dependency resolution ○ Parallelism ○ Resume failed jobs ● Pythonic replacement for Apache Oozie ● Not a replacement for Pig, Cascading, Hive
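The two ideas behind slide 27, dependency resolution and resuming failed jobs, fit into a few lines of plain Python. The sketch below is not Luigi's implementation: the `Task` class, the `build` function and the simulated failure are hypothetical stand-ins that only illustrate the principle.

```python
class Task:
    """Hypothetical stand-in for a batch job; this is not Luigi's Task class."""

    def __init__(self, name, requires=(), fail_once=False):
        self.name = name
        self.requires = list(requires)
        self.done = False
        self.fail_once = fail_once

    def run(self):
        if self.fail_once:
            self.fail_once = False
            raise RuntimeError(f"{self.name} failed (simulated network issue)")
        self.done = True


def build(task, log):
    """Run dependencies first and skip anything that is already done."""
    if task.done:
        return
    for dependency in task.requires:
        build(dependency, log)
    task.run()
    log.append(task.name)


crawl = Task("crawl")
extract = Task("extract", requires=[crawl], fail_once=True)
report = Task("report", requires=[extract])

log = []
try:
    build(report, log)  # first attempt: crawl succeeds, extract fails
except RuntimeError:
    pass
build(report, log)  # retry: crawl is skipped, extract and report run
```

On the retry only the unfinished part of the graph is executed, which is the property that removes manual cleanup after a failure.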
  • 28. Anatomy of a Luigi task class MyTask(luigi.Task): # Parameters which control the behavior of the task. Same parameters = the task only needs to run once! param1 = luigi.Parameter() # These dependencies need to be done before this task can start. Can also be a list or dict def requires(self): return DependentTask(self.param1) # Path to output file (local or HDFS). If this file is present, Luigi considers this task to be done. def output(self): return luigi.LocalTarget("data/my_task_output_{}".format(self.param1)) def run(self): # To make task execution atomic, Luigi writes all output to a temporary file, and only renames when you close the target. with self.output().open("w") as out: out.write("foo")
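Slide 28 relies on two conventions: a task counts as done once its output file exists, and output is written to a temporary file that is renamed at the end. A minimal stdlib sketch of that pattern follows; `atomic_write` and `is_done` are hypothetical helper names, and the code does not claim to match how Luigi implements its targets.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write to a temporary file in the target directory, then rename it into place."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
        # The rename is the commit point: readers see either no file or the full file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def is_done(path):
    """Completion check in the spirit of the slide: the output exists."""
    return os.path.exists(path)


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "data", "my_task_output_demo")
done_before = is_done(target)
atomic_write(target, "foo")
done_after = is_done(target)
```

Without the rename step, a job that crashes halfway would leave a partial file behind, and the existence check would wrongly report the task as complete.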
  • 29. Luigi tasks vs. Makefiles class MyTask(luigi.Task): def requires(self): return DependentTask() def output(self): return luigi.LocalTarget("data/my_task_output") def run(self): with self.output().open("w") as out: out.write("foo") data/my_task_output: DependentTask run run run ...
  • 30. Luigi Hadoop integration class HadoopTask(luigi.hadoop.JobTask): def output(self): return luigi.HdfsTarget("output_in_hdfs") def requires(self): return { "some_task": SomeTask(), "some_other_task": SomeOtherTask() } def mapper(self, line): key, value = line.rstrip().split("\t") yield key, value def reducer(self, key, values): yield key, ", ".join(values)
  • 31. Luigi example Crawl a URL, then extract all links from it! CrawlTask(url) ExtractTask(url)
  • 32. Luigi example: CrawlTask class CrawlTask(luigi.Task): url = luigi.Parameter() def output(self): url_hash = hashlib.md5(self.url).hexdigest() return luigi.LocalTarget(os.path.join("data", "crawl_" + url_hash)) def run(self): req = requests.get(self.url) res = req.text with self.output().open("w") as out: out.write(res.encode("utf-8"))
  • 33. Luigi example: ExtractTask class ExtractTask(luigi.Task): url = luigi.Parameter() def requires(self): return CrawlTask(self.url) def output(self): url_hash = hashlib.md5(self.url).hexdigest() return luigi.LocalTarget(os.path.join("data", "extract_" + url_hash)) def run(self): soup = bs4.BeautifulSoup(self.input().open().read()) with self.output().open("w") as out: for link in soup.find_all("a"): out.write(str(link.get("href")) + "\n")
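The link extraction and file naming in slides 32 and 33 can be tried without Luigi, requests or BeautifulSoup. The sketch below swaps in the standard library's `html.parser` for BeautifulSoup and assumes Python 3, where `hashlib.md5` needs bytes; `LinkCollector`, `extract_links`, `output_name` and the sample HTML are illustrative and not from the talk.

```python
import hashlib
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (None if it is missing)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs).get("href"))


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def output_name(prefix, url):
    # On Python 3 the URL has to be encoded before hashing.
    return prefix + hashlib.md5(url.encode("utf-8")).hexdigest()


# Hypothetical page content standing in for the crawled file.
sample_html = (
    '<p><a href="/about">About</a> '
    '<a href="http://example.com/">Example</a> <a>no href</a></p>'
)
links = extract_links(sample_html)
name = output_name("extract_", "http://www.trustyou.com")
```

Hashing the URL gives every parameter value its own output file, so the "output exists" check works per URL rather than per task class.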
  • 34. Luigi example: Running it locally $ python luigi_demo.py --local-scheduler ExtractTask --url http://www.trustyou.com DEBUG: Checking if ExtractTask(url=http://www.trustyou.com) is complete INFO: Scheduled ExtractTask(url=http://www.trustyou.com) (PENDING) DEBUG: Checking if CrawlTask(url=http://www.trustyou.com) is complete INFO: Scheduled CrawlTask(url=http://www.trustyou.com) (PENDING) INFO: Done scheduling tasks INFO: Running Worker with 1 processes DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 2 INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) running CrawlTask(url=http://www.trustyou.com) INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) done CrawlTask(url=http://www.trustyou.com) DEBUG: 1 running tasks, waiting for next task to finish DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 1 INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) running ExtractTask(url=http://www.trustyou.com) INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) done ExtractTask(url=http://www.trustyou.com) DEBUG: 1 running tasks, waiting for next task to finish DEBUG: Asking scheduler for work...
  • 35. Luigi @ TrustYou ● Before: ○ Bash scripts + cron ○ Manual cleanup after failures due to network issues etc. ● Now: ○ Complex nested Luigi job graphs ○ Failed jobs usually repair themselves
  • 36. TrustYou wants you! We offer positions in Cluj & Munich: ● Data engineer ● Application developer ● Crawling engineer Write me at swenz@trustyou.net, check out our website, or see you at the next meetup!