LARGE-SCALE ANALYTICS WITH 
APACHE SPARK 
THOMSON REUTERS R&D 
TWIN CITIES HADOOP USER GROUP 
FRANK SCHILDER 
SEPTEMBER 22, 2014
THOMSON REUTERS 
• The Thomson Reuters Corporation 
– 50,000+ employees 
– 2,000+ journalists at news desks world wide 
– Offices in more than 100 countries 
– $12 billion revenue/year 
• Products: intelligent information for professionals and enterprises 
– Legal: WestlawNext legal search engine 
– Financial: Eikon financial platform; Datastream real-time share price data 
– News: REUTERS news 
– Science: Endnote, ISI journal impact factor, Derwent World Patent Index 
– Tax & Accounting: OneSource tax information 
• Corporate R&D 
– Around 40 researchers and developers (NLP, IR, ML) 
– Three R&D sites in the US and one in the UK: Eagan, MN; Rochester, NY; 
New York City; and London 
– We are hiring… email me at frank.schilder@thomsonreuters.com
OVERVIEW 
• Speed 
– Data locality, scalability, fault tolerance 
• Ease of Use 
– Scala, interactive Shell 
• Generality 
– SparkSQL, MLLib 
• Comparing ML frameworks 
– Vowpal Wabbit (VW) 
– Sparkling Water 
• The Future
WHAT IS SPARK? 
Apache Spark is a fast and general engine 
for large-scale data processing. 
• Speed: runs iterative MapReduce-style jobs 
faster through in-memory computation with 
Resilient Distributed Datasets (RDDs) 
• Ease of use: enables interactive data analysis 
in Scala, Python, or Java; interactive Shell 
• Generality: offers libraries for SQL, Streaming 
and large-scale analytics (graph processing 
and machine learning) 
• Integrated with Hadoop: runs on Hadoop 2’s 
YARN cluster
ACKNOWLEDGMENTS 
• Matei Zaharia and the AMPLab and Databricks teams for 
their fantastic learning material and tutorials on Spark 
• Hiroko Bretz, Thomas Vacek, Dezhao Song, Terry 
Heinze for Spark and Scala support and running 
experiments 
• Adam Glaser for his time as a TSAP intern 
• Mahadev Wudali and Mike Edwards for letting us 
play in the “sandbox” (cluster)
SPEED
PRIMARY GOALS OF SPARK 
• Extend the MapReduce model to better support 
two common classes of analytics apps: 
– Iterative algorithms (machine learning, graphs) 
– Interactive data mining (R, Python) 
• Enhance programmability: 
– Integrate into Scala programming language 
– Allow interactive use from Scala interpreter 
– Make Spark easily accessible from other 
languages (Python, Java)
MOTIVATION 
• Acyclic data flow is inefficient for 
applications that repeatedly reuse a working 
set of data: 
– Iterative algorithms (machine learning, graphs) 
– Interactive data mining tools (R, Python) 
• With current frameworks, apps reload data 
from stable storage on each query
HADOOP MAPREDUCE VS SPARK
SOLUTION: Resilient 
Distributed Datasets (RDDs) 
• Allow apps to keep working sets in memory for 
efficient reuse 
• Retain the attractive properties of MapReduce 
– Fault tolerance, data locality, scalability 
• Support a wide range of applications
PROGRAMMING MODEL 
Resilient distributed datasets (RDDs) 
– Immutable, partitioned collections of objects 
– Created through parallel transformations (map, filter, 
groupBy, join, …) on data in stable storage 
– Functions follow the same patterns as Scala operations 
on lists 
– Can be cached for efficient reuse 
80+ Actions on RDDs 
– count, reduce, save, take, first, …
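
As a concrete, hedged illustration of this model (assuming only a SparkContext named sc, as in the shell, and an illustrative local file data.txt): transformations are chained lazily, the RDD can be cached, and actions trigger the actual work.

// A minimal sketch; the file name and contents are assumptions.
val nums = sc.textFile("data.txt")        // RDD[String], one element per line
  .map(_.trim)                            // transformation: recorded, not yet run
  .filter(_.nonEmpty)                     // transformation: still lazy
  .map(_.toInt)

nums.cache()                              // mark for in-memory reuse

// Actions force evaluation and return results to the driver:
val total = nums.reduce(_ + _)
val first5 = nums.take(5)
println("count=" + nums.count() + " sum=" + total + " first=" + first5.mkString(","))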
EXAMPLE: LOG MINING 
Load error messages from a log into memory, then 
interactively search for various patterns 
Base RDD and transformed RDDs: 

val lines = spark.textFile("hdfs://...") 
val errors = lines.filter(_.startsWith("ERROR")) 
val messages = errors.map(_.split('\t')(2)) 
val cachedMsgs = messages.cache() 

The driver ships tasks to the workers; each worker reads its HDFS block once and keeps the filtered messages in its cache, so repeated queries run against memory: 

cachedMsgs.filter(_.contains("timeout")).count 
cachedMsgs.filter(_.contains("license")).count 
. . . 

Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data) 
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
BEHAVIOR WITH NOT ENOUGH RAM 
[Chart: iteration time (s) vs. % of the working set in memory: cache disabled 68.8 s; 25% 58.1 s; 50% 40.7 s; 75% 29.7 s; fully cached 11.5 s.]
RDD Fault Tolerance 
RDDs maintain lineage information that can be used 
to reconstruct lost partitions 
Ex: 
messages = textFile(...).filter(_.startsWith("ERROR")) 
                        .map(_.split('\t')(2)) 

Lineage: HDFS File → Filtered RDD (filter, func = _.contains(...)) → Mapped RDD (map, func = _.split(...))
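
Because the lineage is just this chain of transformations, it can be inspected directly. A minimal sketch (assuming a SparkContext sc and a hypothetical HDFS path):

val messages = sc.textFile("hdfs://.../app.log")
  .filter(_.startsWith("ERROR"))
  .map(_.split('\t')(2))

// toDebugString prints the chain of parent RDDs that Spark would replay
// to rebuild a lost partition.
println(messages.toDebugString)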
Fault Recovery Results 
[Chart: iteration time (s) for iterations 1-10, comparing a run with no failure against one with a failure in the 6th iteration; most iterations take roughly 56-59 s, the first iteration takes 119 s, and the failed 6th iteration rises to 81 s while lost partitions are recomputed from lineage.]
EASE OF USE
INTERACTIVE SHELL 
• Data analysis can be done in the interactive shell. 
– Start from local machine or cluster 
– Use multiple local cores with local[n] 
– Spark context is already set up for you: SparkContext sc 
• Load data from anywhere (local, HDFS, 
Cassandra, Amazon S3 etc.): 
• Start analyzing your data right away (a minimal shell session is sketched below)
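
A hedged sketch of such a session (launch with, e.g., ./bin/spark-shell --master local[4]; the shell pre-creates the SparkContext sc, and the file paths below are only illustrations):

val logs = sc.textFile("/tmp/app.log")              // local file
// val logs = sc.textFile("hdfs://namenode/logs/")  // or HDFS, Amazon S3, Cassandra via connectors, ...

logs.filter(_.contains("ERROR")).count()            // processing starts here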
ANALYZE YOUR DATA 
• Word count in one line: 
• List the word counts: 
• Broadcast variables (e.g. a dictionary or stop-word list), because local 
variables would otherwise need to be distributed to the workers with every task 
(both are sketched below):
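
A hedged sketch, assuming the shell's sc and an illustrative input file and stop-word list:

// Word count in a few chained calls:
val counts = sc.textFile("input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(20).foreach(println)                    // list some of the counts

// Broadcast a read-only lookup structure so each worker receives one copy
// instead of shipping it with every task:
val stopWords = sc.broadcast(Set("the", "a", "of"))
val filtered = counts.filter { case (w, _) => !stopWords.value.contains(w) }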
RUN A SPARK SCRIPT
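
The original slide showed a screenshot; as a hedged stand-in, a standalone Spark 1.x-style application (all names hypothetical) can be packaged as a jar and launched with spark-submit, e.g. bin/spark-submit --class WordCount --master local[4] wordcount.jar input.txt:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    sc.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(0) + ".counts")

    sc.stop()
  }
}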
PYTHON SHELL & IPYTHON 
• The interactive shell can also be started as Python 
shell called pySpark: 
• Start analyzing your data in python now: 
• Since it's Python, you may want to use IPython 
– (a command shell for interactive programming in your 
browser):
IPYTHON AND SPARK 
• The iPython notebook environment and pySpark: 
– Document data analysis results 
– Carry out machine learning experiments 
– Visualize results with matplotlib or other visualization 
libraries 
– Combine with NLP libraries such as NLTK 
• PySpark does not offer the full functionality of 
Spark Shell in Scala (yet) 
• Some bugs (e.g. problems with unicode)
PROJECTS AT R&D USING SPARK 
• Entity linking 
– Alternative name extraction from 
Wikipedia, Freebase, free text, and ClueWeb12, a 
several-TB web collection (planned) 
• Large-scale text data analysis: 
– creating fingerprints for entities/events 
– Temporal slot filling: Assigning a begin and end time 
stamp to a slot filler (e.g. A is employee of company B 
from BEGIN to END) 
– Large-Scale text classification of Reuters News Archive 
articles (10 years) 
• Language model computation used for search 
query analysis
SPARK MODULES 
• Spark Streaming: 
– Processing real-time data streams (a minimal sketch follows after this list) 
• Spark SQL: 
– Support for structured data (JSON, Parquet) and 
relational queries (SQL) 
• MLlib: 
– Machine learning library 
• GraphX: 
– New graph processing API
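
As a hedged illustration of the streaming module (the socket source, port, and batch interval are assumptions, not from the talk):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(5))      // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)

lines.flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()                                          // show word counts per batch

ssc.start()
ssc.awaitTermination()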
SPARKSQL
SPARK SQL 
• Relational queries expressed in 
– SQL 
– HiveQL 
– Scala Domain specific language (DSL) 
• New type of RDD: SchemaRDD : 
– RDD composed of Row objects 
– Schema defined explicitly or inferred from a Parquet file, JSON 
data set, or data stored in Hive 
• SPARK SQL is in alpha: API may change in the 
future!
DEFINING A SCHEMA
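
The original slide showed code on screen; a minimal sketch in the Spark 1.x-era API (assuming the shell's sc and a hypothetical people.txt with "name,age" lines):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD        // implicit conversion RDD[Person] -> SchemaRDD

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")       // make the RDD queryable by name
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.map(row => "Name: " + row(0)).collect().foreach(println)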
MLLIB
MLLIB 
• A machine learning module that comes with Spark 
• Shipped since Spark 0.8.0 
• Provides various machine learning algorithms for 
classification and clustering 
• Sparse vector representation since 1.0.0 
• New features in recently released version 1.1.0: 
– Includes a standard statistics library (e.g. correlation, 
hypothesis testing, sampling) 
– More algorithms ported to Java and Python 
– More feature engineering: TF-IDF, Singular Value 
Decomposition (SVD)
MLLIB 
• Provides various machine learning algorithms: 
– Classification: 
• Logistic regression, support vector machine (SVM), naïve 
Bayes, decision trees 
– Regression: 
• Linear regression, regression trees 
– Collaborative Filtering: 
• Alternating least squares (ALS) 
– Clustering: 
• K-means 
– Decomposition 
• Singular value decomposition (SVD), Principal component 
analysis (PCA)
OTHER ML FRAMEWORKS 
• Mahout 
• LIBLINEAR 
• MATLAB 
• Scikit-learn 
• GraphLab 
• R 
• Weka 
• Vowpal Wabbit 
• BigML
LARGE-SCALE ML INFRASTRUCTURE 
• More data implies bigger training sets and richer 
feature sets. 
• More data with a simple ML algorithm often beats 
less data with a complicated ML algorithm 
• Large-scale ML requires big data infrastructure: 
– Faster processing: Hadoop, Spark 
– Feature engineering: Principal Component Analysis, 
Hashing trick, Word2Vec
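
As a hedged sketch of the hashing trick using MLlib's HashingTF and IDF (available from around Spark 1.1; the documents file is an assumption):

import org.apache.spark.mllib.feature.{HashingTF, IDF}

val documents = sc.textFile("docs.txt").map(_.split(" ").toSeq)

// Hash each term into one of 2^18 buckets instead of building a dictionary:
val hashingTF = new HashingTF(numFeatures = 1 << 18)
val tf = hashingTF.transform(documents)             // RDD[Vector] of term frequencies

tf.cache()
val tfidf = new IDF().fit(tf).transform(tf)         // re-weight by inverse document frequency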
PREDICTIVE ANALYTICS WITH MLLIB 
http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
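
The slides above walked through screenshots; as a hedged stand-in, training and evaluating a binary classifier with the MLlib RDD-based API might look like this (the LIBSVM file and split are illustrative):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

// LIBSVM-formatted data loads directly into sparse LabeledPoints:
val data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

val model = LogisticRegressionWithSGD.train(train, numIterations = 100)

val accuracy = test.map(p => (model.predict(p.features), p.label))
  .filter { case (pred, label) => pred == label }
  .count().toDouble / test.count()
println("test accuracy = " + accuracy)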
VW AND MLLIB COMPARISON 
• We compared Vowpal Wabbit and MLlib in 
December 2013 (work with Tom Vacek) 
• Vowpal Wabbit (VW) is a large-scale ML tool 
developed by John Langford (Microsoft) 
• Task: binary text classification task on Reuters 
articles 
– Ease of implementation 
– Feature Extraction 
– Parameter tuning 
– Speed 
– Accessibility of programming languages
VW VS. MLLIB 
• Ease of implementation 
– VW: a command-line tool designed for ML users, not a programming library 
– MLlib: a programming library, with some built-in support now (e.g. regularization) 
• Feature Extraction 
– VW: specific capabilities for bi-grams, prefix etc. 
– MLlib: no limit in terms of creating features 
• Parameter tuning 
– VW: no parameter search capability, but multiple parameters can be hand-tuned 
– MLlib: offers cross-validation 
• Speed 
– VW: highly optimized, very fast even on a single machine with multiple cores 
– MLlib: fast with lots of machines 
• Accessibility of programming languages 
– VW: written in C++, a few wrappers (e.g. Python) 
– MLlib: Scala, Python, Java 
• Conclusion end of 2013: VW had a slight advantage, but MLlib has caught up in at 
least some of the areas (e.g. sparse feature representation)
FINDINGS SO FAR 
• Large-scale extraction is a great fit for Spark when 
working with large data sets (> 1GB) 
• Ease of use makes Spark an ideal framework for 
rapid prototyping. 
• MLlib is a fast growing ML library, but “under 
development” 
• Vowpal Wabbit has been shown to crunch even 
large data sets with ease. 
[Chart: 0/1 loss and training time (s) for VW, LIBLINEAR, and Spark (local[4]).]
OTHER ML FRAMEWORKS 
• Internship by Adam Glaser compared various ML 
frameworks with 5 standard data sets (NIPS) 
– Mass-spectrometric data (cancer), handwritten digit 
detection, Reuters news classification, synthetic data sets 
– Data sets were not very big, but had up to 1,000,000 
features 
• Evaluated accuracy of the generated models and 
speed for training time 
• H2O, GraphLab, and Microsoft Azure showed strong 
performance in terms of accuracy and training 
time.
ACCURACY
SPEED
WHAT IS NEXT? 
• 0xdata plans to release Sparkling Water in October 
2014 
• Microsoft Azure also offers a strong platform with 
multiple ML algorithms and an intuitive user interface 
• GraphLab has GraphLab Canvas™ for visualizing your 
data and plans to incorporate more ML algorithms.
CAN’T DECIDE?
CONCLUSIONS
CONCLUSIONS 
• Apache Spark is the most active project in the Hadoop 
ecosystem 
• Spark offers speed and ease of use because of 
– RDDs 
– Interactive shell and 
– Easy integration of Scala, Java, Python scripts 
• Integrated in Spark are modules for 
– Easy data access via SparkSQL 
– Large-scale analytics via MLlib 
• Other ML frameworks enable analytics as well 
• Evaluate which framework is the best fit for your data 
problem
THE FUTURE? 
• Apache Spark will be a unified platform to run 
various workloads: 
– Batch 
– Streaming 
– Interactive 
• And connect with different runtime systems 
– Hadoop 
– Cassandra 
– Mesos 
– Cloud 
– …
THE FUTURE? 
• Spark will extend its offering of large-scale 
algorithms for doing complex analytics: 
– Graph processing 
– Classification 
– Clustering 
– … 
• Other frameworks will continue to offer similar 
capabilities. 
• If you can’t beat them, join them.
http://labs.thomsonreuters.com/about-rd-careers/ 
FRANK.SCHILDER@THOMSONREUTERS.COM
EXTRA SLIDES
Example: Logistic Regression 
Goal: find best line separating two sets of points 
[Diagram: two classes of points (+ and –) in the plane; a random initial line is iteratively adjusted toward the target separating line.]
Example: Logistic Regression 
// Simplified example from the original slides: Vector is the slides' helper
// class (not MLlib's), D is the feature dimensionality, and each point p has
// fields x (features) and y (label in {-1, +1}).
val data = spark.textFile(...).map(readPoint).cache() 
var w = Vector.random(D) 
for (i <- 1 to ITERATIONS) { 
  val gradient = data.map(p => 
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x 
  ).reduce(_ + _) 
  w -= gradient    // gradient step; `data` stays cached across iterations
} 
println("Final w: " + w)
Logistic Regression Performance 
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark; Hadoop takes ~127 s per iteration, while Spark takes 174 s for the first iteration and ~6 s for each further iteration.]
Spark Scheduler 
Dryad-like DAGs 
Pipelines functions within a stage 
Cache-aware work reuse & locality 
Partitioning-aware to avoid shuffles 
[Diagram: an example job DAG with map, union, groupBy, and join operators grouped into three stages; shaded partitions indicate cached data.]
Spark Operations 
Transformations (define a new RDD): 
map, filter, sample, groupByKey, reduceByKey, sortByKey, 
flatMap, union, join, cogroup, cross, mapValues 

Actions (return a result to the driver program): 
collect, reduce, count, save, lookupKey
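
A small hedged sketch exercising a few of the pair-RDD operations above (illustrative data; assumes sc, and outside the shell also import org.apache.spark.SparkContext._ for the pair functions):

val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
                                ("about.html", "3.4.5.6"),
                                ("index.html", "1.3.3.1")))
val pageNames = sc.parallelize(Seq(("index.html", "Home"),
                                   ("about.html", "About")))

visits.join(pageNames).collect().foreach(println)
// e.g. (index.html,(1.2.3.4,Home)), (about.html,(3.4.5.6,About)), ...

visits.cogroup(pageNames).mapValues {
  case (ips, names) => (ips.size, names.mkString(","))
}.collect().foreach(println)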
