Building a Unified Data Pipeline in Spark
Aaron Davidson
Slides adapted from Matei Zaharia
spark.apache.org
What is Apache Spark? 
Fast and general cluster computing system 
interoperable with Hadoop 
Improves efficiency through: 
»In-memory computing primitives 
»General computation graphs 
Improves usability through: 
»Rich APIs in Java, Scala, Python 
»Interactive shell 
Up to 100× faster 
(2-10× on disk) 
2-5× less code 
A Hadoop-compatible cluster computing system
Improves computing performance and usability
Project History 
Started at UC Berkeley in 2009, open 
sourced in 2010 
50+ companies now contributing 
»Databricks, Yahoo!, Intel, Cloudera, IBM, … 
Most active project in Hadoop ecosystem 
Born at UC Berkeley
More than 50 companies take part in its development as open source
A General Stack 
Spark, with modules built on top:
»Spark Streaming (real-time)
»Spark SQL (structured)
»GraphX (graph)
»MLlib (machine learning)
»…
Structured queries, real-time analytics, graph processing, machine learning
This Talk 
Spark introduction & use cases 
Modules built on Spark 
The power of unification 
Demo 
Now: Spark introduction and use cases
Why a New Programming 
Model? 
MapReduce greatly simplified big data 
analysis 
But once started, users wanted more: 
»More complex, multi-pass analytics (e.g. ML, 
graph) 
»More interactive ad-hoc queries 
»More real-time stream processing 
All 3 need faster data sharing in parallel apps
What users want after MapReduce:
more complex analytics, interactive queries, real-time processing
Data Sharing in MapReduce 
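[Diagram: iterative jobs (iter. 1, iter. 2, . . .) read their input from HDFS and write their output back to HDFS at every step; interactive queries (query 1, query 2, query 3, . . .) each read the input from HDFS again to produce result 1, result 2, result 3]
Slow due to replication, serialization, and disk IO
Data sharing in MapReduce is slow because of disk IO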
What We’d Like 
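[Diagram: the input is loaded once into distributed memory (one-time processing); iterations (iter. 1, iter. 2, . . .) and queries (query 1, query 2, query 3, . . .) then share data through memory]
10-100× faster than network and disk
We want data sharing about 10-100× faster than network and disk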
Spark Model 
Write programs in terms of transformations 
on distributed datasets 
Resilient Distributed Datasets (RDDs) 
»Collections of objects that can be stored in 
memory or disk across a cluster 
»Built via parallel transformations (map, filter, …) 
»Automatically rebuilt on failure 
Self-healing distributed datasets (RDDs)
RDDs can be transformed in parallel with methods such as map and filter
Example: Log Mining 
Load error messages from a log into memory, 
then interactively search for various patterns 
lines = spark.textFile("hdfs://...")                     # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()
messages.filter(lambda s: "foo" in s).count()            # action
messages.filter(lambda s: "bar" in s).count()
. . .
[Diagram: the driver sends tasks to three workers and collects their results; each worker reads one block of the file (Block 1-3) and keeps its part of messages in memory (Cache 1-3)]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Search interactively for various patterns; processing 1 TB goes from 170 sec to 5-7 sec
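The same filter / map / reuse pattern can be tried without a cluster. The sketch below is plain Python, not the Spark API: the sample log lines are made up for illustration, and an ordinary in-memory list stands in for the cached messages dataset.

```python
# Hypothetical tab-separated log lines: level, date, message.
log_lines = [
    "ERROR\t2014-09-06\tdisk foo failed",
    "INFO\t2014-09-06\tjob started",
    "ERROR\t2014-09-06\tbar timeout",
    "ERROR\t2014-09-07\tfoo unreachable",
]

# Keep only error lines, then extract the third tab-separated field.
errors = [s for s in log_lines if s.startswith("ERROR")]
messages = [s.split("\t")[2] for s in errors]  # held in memory, like messages.cache()

# Each query reuses the in-memory messages instead of re-reading the log.
foo_count = sum(1 for m in messages if "foo" in m)
bar_count = sum(1 for m in messages if "bar" in m)
print(foo_count, bar_count)
```

The point of the pattern is that the expensive step (reading and parsing the log) happens once, while each later query only scans the parsed messages.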
Fault Tolerance 
RDDs track lineage info to rebuild lost data 
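counts = (file.map(lambda rec: (rec.type, 1))
              .reduceByKey(lambda x, y: x + y)
              .filter(lambda kv: kv[1] > 10))  # kv = (type, count); Python 3 lambdas cannot unpack tuple parameters
[Diagram: Input file → map → reduce → filter]
Tracks **lineage** information to rebuild lost data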
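To make the lineage idea concrete, here is a toy single-process model in plain Python. It is a sketch only: ToyRDD and its methods are hypothetical names invented for this illustration and are not Spark's implementation. Each dataset remembers its parent and the function that derives it, so a lost result can be recomputed instead of restored from a replica.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, parent=None, transform=None, source=None):
        self.parent = parent        # lineage: dataset this one was derived from
        self.transform = transform  # lineage: how to derive it from the parent
        self.source = source        # only set on the root (the "input file")
        self._data = None           # materialized rows; may be lost

    def map(self, f):
        return ToyRDD(self, lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(self, lambda rows: [r for r in rows if pred(r)])

    def reduce_by_key(self, f):
        def run(rows):
            acc = {}
            for k, v in rows:
                acc[k] = f(acc[k], v) if k in acc else v
            return sorted(acc.items())
        return ToyRDD(self, run)

    def collect(self):
        # Rebuild from lineage whenever the materialized rows are missing.
        if self._data is None:
            if self.parent is None:
                self._data = list(self.source)
            else:
                self._data = self.transform(self.parent.collect())
        return self._data

    def lose(self):
        """Simulate losing this dataset's in-memory rows."""
        self._data = None


records = ["GET", "POST", "GET", "PUT", "GET", "POST"]
counts = (ToyRDD(source=records)
          .map(lambda rec: (rec, 1))
          .reduce_by_key(lambda x, y: x + y)
          .filter(lambda kv: kv[1] > 1))

first = counts.collect()
counts.lose()               # the filtered result is gone...
rebuilt = counts.collect()  # ...and is recomputed from its parent
print(first, rebuilt)
```

In this toy, recomputation starts from the nearest ancestor that still holds its rows, which is why losing one result does not force a re-read of the input.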
Example: Logistic Regression
[Chart: running time (s), 0-4000, against number of iterations (1, 5, 10, 20, 30), Hadoop vs. Spark]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
Behavior with Less RAM 
% of working set in memory    Iteration time (s)
Cache disabled                68.8
25%                           58.1
50%                           40.7
75%                           29.7
Fully cached                  11.5
Behavior when less data fits in the cache
Spark in Scala and Java 
// Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()
// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");  // same pattern as the Scala version
  }
}).count();
// Java 8:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(s -> s.contains("ERROR")).count();
Supported Operators 
map 
filter 
groupBy 
sort 
union 
join 
leftOuterJoin 
rightOuterJoin 
reduce 
count 
fold 
reduceByKey 
groupByKey 
cogroup 
cross 
zip 
sample 
take 
first 
partitionBy 
mapWith 
pipe 
save 
...
Spark Community 
250+ developers, 50+ companies contributing 
Most active open source project in big data 
[Chart: commits in the past 6 months (axis 0-1400) for MapReduce, YARN, HDFS, Storm, and Spark]
The most active open source project in the big data field
Continuing Growth 
[Chart: contributors per month to Spark; source: ohloh.net]
The number of contributors keeps growing
Get Started 
Visit spark.apache.org for docs & tutorials 
Easy to run on just your laptop 
Free training materials: spark-summit.org 
You can get started with a single laptop
This Talk 
Spark introduction & use cases 
Modules built on Spark 
The power of unification 
Demo 
Now: Modules built on Spark
The Spark Stack 
Spark, with modules built on top:
»Spark Streaming (real-time)
»Spark SQL (structured)
»GraphX (graph)
»MLlib (machine learning)
»…
Spark SQL
Evolution of the Shark project
Allows querying structured data in Spark
From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()
From JSON:
c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")
tweets.json:
{"text": "hi",
 "user": {
   "name": "matei",
   "id": 123
 }}
The successor to Shark; queries structured data in Spark
Spark SQL 
Integrates closely with Spark’s language APIs 
c.registerFunction("hasSpark", lambda text: "Spark" in text)
c.sql("select * from tweets where hasSpark(text)")
Uniform interface for data access
[Diagram: SQL, Python, Scala, and Java as interfaces; Hive, Parquet, JSON, Cassandra, … as data sources]
Integration with Spark's language APIs
A uniform interface over a variety of data sources
Spark Streaming 
Stateful, fault-tolerant stream processing 
with the same API as batch jobs 
sc.twitterStream(...)
  .map(tweet => (tweet.language, 1))
  .reduceByWindow("5s", _ + _)
[Chart: throughput … (axis 0-35), Storm vs. Spark]
Stateful, fault-tolerant stream processing
Same API as batch jobs
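The windowed count on this slide can be illustrated with a small plain-Python sketch. It is not Spark Streaming: windowed_counts and the sample events are hypothetical, and the window here slides on every event, whereas Spark Streaming evaluates windows per batch interval. The stateful part is the running count that is updated as events enter and leave the window.

```python
from collections import Counter, deque


def windowed_counts(events, window):
    """Yield (time, counts) after each event, where counts covers keys
    seen in the half-open interval (time - window, time].

    `events` is an iterable of (time, key) pairs sorted by time.
    """
    recent = deque()    # events still inside the window
    counts = Counter()  # running state, updated incrementally
    for t, key in events:
        recent.append((t, key))
        counts[key] += 1
        # Evict events that have fallen out of the window.
        while recent and recent[0][0] <= t - window:
            _, old_key = recent.popleft()
            counts[old_key] -= 1
            if counts[old_key] == 0:
                del counts[old_key]
        yield t, dict(counts)


# Hypothetical (second, tweet language) events.
events = [(0, "en"), (1, "ja"), (2, "en"), (6, "ja"), (7, "en")]
snapshots = list(windowed_counts(events, window=5))
for t, c in snapshots:
    print(t, c)
```

Updating the counts incrementally, rather than recounting the whole window, is what makes the computation stateful.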
MLlib 
Built-in library of machine learning 
algorithms 
»K-means clustering 
»Alternating least squares 
»Generalized linear models (with L1 / L2 reg.) 
»SVD and PCA 
»Naïve Bayes 
points = sc.textFile(...).map(parsePoint) 
model = KMeans.train(points, 10) 
Built-in machine learning library
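For readers unfamiliar with k-means, the following is a minimal single-machine sketch of the standard assign-and-average iteration (Lloyd's algorithm) in plain Python. It is not MLlib's implementation: the function names and sample points are hypothetical, and the initial centers are simply the first k points, a simplification that real libraries replace with better seeding.

```python
def closest(centers, p):
    """Index of the center nearest to point p (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(centers[i], p)))


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(xs) / n for xs in zip(*cluster))


def kmeans(points, k, iterations=10):
    centers = list(points[:k])  # simplistic initialization (assumption)
    for _ in range(iterations):
        # Assignment step: group points by nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(centers, p)].append(p)
        # Update step: move each center to the mean of its cluster;
        # an empty cluster keeps its previous center.
        centers = [mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers


# Two well-separated groups of hypothetical 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = kmeans(points, k=2)
print(centers)
```

Each iteration passes over all points again, which is the kind of multi-pass workload that benefits from keeping the dataset in memory.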
This Talk 
Spark introduction & use cases 
Modules built on Spark 
The power of unification 
Demo 
Now: The power of a unified stack
Big Data Systems Today 
General batch processing: MapReduce
Specialized systems (iterative, interactive and streaming apps): Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, …
Today: a proliferation of specialized big data systems
Spark’s Approach 
Instead of specializing, generalize MapReduce 
to support new apps in same engine 
Two changes (general task DAG & data 
sharing) are enough to express previous 
models! 
Unification has big benefits 
»For the engine
»For users
[Diagram: Spark Streaming, GraphX, Shark, MLbase, … on top of Spark]
Spark's approach: do not specialize
Support new apps on one general-purpose foundation
What it Means for Users 
Separate frameworks:
HDFS read → ETL → HDFS write → HDFS read → train → HDFS write → HDFS read → query → HDFS write → …
Spark:
HDFS read → ETL → train → query, plus interactive analysis
All processing is completed on Spark, and interactive analysis is possible too
Combining Processing Types
// Load data using SQL
val points = ctx.sql(
  "select latitude, longitude from historic_tweets")
// Train a machine learning model
val model = KMeans.train(points, 10)
// Apply it to a stream
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)
Combine different processing types: SQL, machine learning, applying a model to a stream
This Talk 
Spark introduction & use cases 
Modules built on Spark 
The power of unification 
Demo 
Now: Demo
The Plan 
[Diagram: Raw JSON Tweets, SQL, Machine Learning, Streaming]
-Read raw JSON from HDFS
-Extract the tweet text with Spark SQL
-Extract feature vectors and train a model with k-means
-Cluster the tweet stream with the trained model
Demo!
Summary: What We Did 
[Diagram: Raw JSON, SQL, Machine Learning, Streaming]
-Read raw JSON from HDFS
-Extract the tweet text with Spark SQL
-Extract feature vectors and train a model with k-means
-Cluster the tweet stream with the trained model
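// Part 1 (interactive shell): explore the tweets with SQL, then train and save a model
import org.apache.spark.sql._
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
val ctx = new org.apache.spark.sql.SQLContext(sc)
val tweets = sc.textFile("hdfs:/twitter")
// JsonTable.fromRDD is kept as shown in the demo; the slide passed `sqlContext`, which is `ctx` here
val tweetTable = JsonTable.fromRDD(ctx, tweets, Some(0.1))
tweetTable.registerAsTable("tweetTable")
ctx.sql("SELECT text FROM tweetTable LIMIT 5").collect.foreach(println)
ctx.sql("SELECT lang, COUNT(*) AS cnt FROM tweetTable " +
  "GROUP BY lang ORDER BY cnt DESC LIMIT 10").collect.foreach(println)
val texts = ctx.sql("SELECT text FROM tweetTable").map(_.head.toString)
def featurize(str: String): Vector = { ... }
val vectors = texts.map(featurize).cache()
val model = KMeans.train(vectors, 10, 10)  // 10 clusters, 10 iterations
sc.makeRDD(model.clusterCenters, 10).saveAsObjectFile("hdfs:/model")

// Part 2 (separate streaming application): load the saved model and filter live tweets.
// modelFile and clusterNumber are not defined on the slide; presumably the application supplies them.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
val ssc = new StreamingContext(new SparkConf(), Seconds(1))
val model = new KMeansModel(
  ssc.sparkContext.objectFile[Vector](modelFile).collect())
// Streaming
val tweets = TwitterUtils.createStream(ssc, /* auth */)
val statuses = tweets.map(_.getText)
val filteredTweets = statuses.filter {
  t => model.predict(featurize(t)) == clusterNumber
}
filteredTweets.print()
ssc.start()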
Conclusion 
Big data analytics is evolving to include: 
»More complex analytics (e.g. machine learning) 
»More interactive ad-hoc queries 
»More real-time stream processing 
Spark is a fast platform that unifies these 
apps 
Learn more: spark.apache.org 
Big data analytics is evolving toward complex, interactive, and real-time workloads
Spark is the fastest platform that unifies these apps

Building a Unified Data Pipeline in Spark / Unifying the Big Data Pipeline with Apache Spark

  • 1. Building a Unified Data Pipeline in Spark. Aaron Davidson. Slides adapted from Matei Zaharia. spark.apache.org
  • 2. What is Apache Spark? Fast and general cluster computing system, interoperable with Hadoop. Improves efficiency through: » In-memory computing primitives » General computation graphs. Improves usability through: » Rich APIs in Java, Scala, Python » Interactive shell. Up to 100× faster (2-10× on disk); 2-5× less code. (A Hadoop-compatible cluster computing system that improves computing performance and usability.)
  • 3. Project History. Started at UC Berkeley in 2009, open sourced in 2010. 50+ companies now contributing: » Databricks, Yahoo!, Intel, Cloudera, IBM, … Most active project in the Hadoop ecosystem. (Born at UC Berkeley; more than 50 companies take part in its development as open source.)
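  • 4. A General Stack. [Diagram: Spark at the base, with Spark Streaming (real-time), Spark SQL (structured), GraphX (graph), MLlib (machine learning), … built on top.] (Structured queries, real-time analytics, graph processing, machine learning.)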
  • 5. This Talk: Spark introduction & use cases; Modules built on Spark; The power of unification; Demo. (Spark introduction and use cases)
  • 6. Why a New Programming Model? MapReduce greatly simplified big data analysis. But once started, users wanted more: » More complex, multi-pass analytics (e.g. ML, graph) » More interactive ad-hoc queries » More real-time stream processing. All 3 need faster data sharing in parallel apps. (What users want after MapReduce: more complex analytics, interactive queries, real-time processing.)
  • 7. Data Sharing in MapReduce. [Diagram: iterative jobs (iter. 1, iter. 2, …) do an HDFS read before and an HDFS write after each iteration; ad-hoc queries (query 1, 2, 3) each do their own HDFS read of the input to produce result 1, 2, 3.] Slow due to replication, serialization, and disk IO. (Data sharing in MapReduce is slow because of disk IO.)
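  • 8. What We’d Like. [Diagram: the input goes through one-time processing into distributed memory, which is then shared by the iterations (iter. 1, iter. 2, …) and by the queries (query 1, 2, 3, …).] 10-100× faster than network and disk. (We want it to be roughly 10 to 100 times faster than network and disk.)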
  • 9. Spark Model. Write programs in terms of transformations on distributed datasets. Resilient Distributed Datasets (RDDs): » Collections of objects that can be stored in memory or on disk across a cluster » Built via parallel transformations (map, filter, …) » Automatically rebuilt on failure. (Self-healing distributed datasets (RDDs). RDDs can be transformed in parallel with methods such as map and filter.)
  • 10. Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns.
      lines = spark.textFile("hdfs://...")                      # base RDD
      errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
      messages = errors.map(lambda s: s.split('\t')[2])
      messages.cache()
      messages.filter(lambda s: "foo" in s).count()             # action
      messages.filter(lambda s: "bar" in s).count()
      . . .
      [Diagram: the driver sends tasks to three workers and receives results; each worker holds one block (Block 1-3) and one cache (Cache 1-3).]
      Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data). Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data). (Interactive search with various patterns. Processing time for 1 TB drops from 170 sec to 5-7 sec.)
  • 11. Fault Tolerance. RDDs track lineage info to rebuild lost data.
      file.map(lambda rec: (rec.type, 1)).reduceByKey(lambda x, y: x + y).filter(lambda kv: kv[1] > 10)   # kv is a (type, count) pair
      [Diagram: Input file → map → reduce → filter.] (Tracks **lineage** information to rebuild lost data.)
  • 12. Fault Tolerance. RDDs track lineage info to rebuild lost data.
      file.map(lambda rec: (rec.type, 1)).reduceByKey(lambda x, y: x + y).filter(lambda kv: kv[1] > 10)   # kv is a (type, count) pair
      [Diagram: Input file → map → reduce → filter.] (Tracks **lineage** information to rebuild lost data.)
  • 13. Example: Logistic Regression. [Chart: running time (s), axis 0 to 4000, against number of iterations (1, 5, 10, 20, 30), Hadoop vs Spark.] Hadoop: 110 s / iteration. Spark: first iteration 80 s, further iterations 1 s.
  • 14. Behavior with Less RAM. [Chart: iteration time (s) against % of working set in memory.] Cache disabled: 68.8; 25%: 58.1; 50%: 40.7; 75%: 29.7; fully cached: 11.5. (Behavior when the cache is reduced.)
  • 15. Spark in Scala and Java.
      // Scala:
      val lines = sc.textFile(...)
      lines.filter(s => s.contains("ERROR")).count()
      // Java:
      JavaRDD<String> lines = sc.textFile(...);
      lines.filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return s.contains("ERROR"); }
      }).count();
  • 16. Spark in Scala and Java.
      // Scala:
      val lines = sc.textFile(...)
      lines.filter(s => s.contains("ERROR")).count()
      // Java 8:
      JavaRDD<String> lines = sc.textFile(...);
      lines.filter(s -> s.contains("ERROR")).count();
  • 17. Supported Operators: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
  • 18. Spark Community. 250+ developers, 50+ companies contributing. Most active open source project in big data. [Chart: commits in the past 6 months, axis 0 to 1400, for MapReduce, YARN, HDFS, Storm, and Spark.] (The most active OSS project in the big data field.)
  • 19. Continuing Growth. [Chart: contributors per month to Spark; source: ohloh.net.] (The number of contributors keeps growing.)
  • 20. Get Started. Visit spark.apache.org for docs & tutorials. Easy to run on just your laptop. Free training materials: spark-summit.org. (You can get started with a single laptop.)
  • 21. This Talk: Spark introduction & use cases; Modules built on Spark; The power of unification; Demo. (Modules built on Spark)
  • 22. The Spark Stack. [Diagram: Spark at the base, with Spark Streaming (real-time), Spark SQL (structured), GraphX (graph), MLlib (machine learning), … built on top.]
  • 23. Spark SQL. Evolution of the Shark project. Allows querying structured data in Spark.
      From Hive:
      c = HiveContext(sc)
      rows = c.sql("select text, year from hivetable")
      rows.filter(lambda r: r.year > 2013).collect()
      From JSON (tweets.json holds records such as {"text": "hi", "user": {"name": "matei", "id": 123}}):
      c.jsonFile("tweets.json").registerAsTable("tweets")
      c.sql("select text, user.name from tweets")
      (The successor to Shark. Queries structured data in Spark.)
  • 24. Spark SQL. Integrates closely with Spark’s language APIs:
      c.registerFunction("hasSpark", lambda text: "Spark" in text)
      c.sql("select * from tweets where hasSpark(text)")
      Uniform interface for data access. [Diagram: Python, Scala, Java, and SQL on one side of Spark; Hive, Parquet, JSON, Cassandra, … on the other.] (Integration with Spark's language APIs. Provides a unified interface to a variety of data sources.)
  • 25. Spark Streaming. Stateful, fault-tolerant stream processing with the same API as batch jobs.
      sc.twitterStream(...)
        .map(tweet => (tweet.language, 1))
        .reduceByWindow("5s", _ + _)
      [Chart: throughput, axis 0 to 35, Storm vs Spark.] (Stateful, fault-tolerant stream processing. The same API as batch jobs.)
  • 26. MLlib. Built-in library of machine learning algorithms: » K-means clustering » Alternating least squares » Generalized linear models (with L1 / L2 reg.) » SVD and PCA » Naïve Bayes.
      points = sc.textFile(...).map(parsePoint)
      model = KMeans.train(points, 10)
      (A built-in machine learning library.)
  • 27. This Talk: Spark introduction & use cases; Modules built on Spark; The power of unification; Demo. (The power of a unified stack)
  • 28. Big Data Systems Today. General batch processing: MapReduce. Specialized systems (iterative, interactive and streaming apps): Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, … (Today: a proliferation of specialized big data systems.)
  • 29. Spark’s Approach. Instead of specializing, generalize MapReduce to support new apps in the same engine. Two changes (general task DAG & data sharing) are enough to express previous models! Unification has big benefits: » For the engine » For users. [Diagram: Streaming, GraphX, Shark, MLbase, … on top of Spark.] (Spark's approach: do not specialize. Support new apps on the same general-purpose engine.)
  • 30. What it Means for Users. Separate frameworks: HDFS read → ETL → HDFS write → HDFS read → train → HDFS write → HDFS read → query → HDFS write → … Spark: HDFS read → ETL → train → query, plus interactive analysis. (All processing is completed on Spark, with interactive analysis as well.)
  • 31. Combining Processing Types.
      // Load data using SQL
      val points = ctx.sql("select latitude, longitude from historic_tweets")
      // Train a machine learning model
      val model = KMeans.train(points, 10)
      // Apply it to a stream
      sc.twitterStream(...)
        .map(t => (model.closestCenter(t.location), 1))
        .reduceByWindow("5s", _ + _)
      (Combine different processing types, such as SQL, machine learning, and applying a model to a stream.)
  • 32. This Talk: Spark introduction & use cases; Modules built on Spark; The power of unification; Demo. (Demo)
  • 33. The Plan. Raw JSON Tweets, SQL, Streaming, Machine Learning. - Read raw JSON from HDFS - Extract the tweet text with Spark SQL - Extract feature vectors and train a model with k-means - Cluster the tweet stream with the trained model
  • 34. Demo!
  • 35. Summary: What We Did. Raw JSON, SQL, Streaming, Machine Learning. - Read raw JSON from HDFS - Extract the tweet text with Spark SQL - Extract feature vectors and train a model with k-means - Cluster the tweet stream with the trained model
  • 36.
      // Spark SQL: load the raw JSON tweets from HDFS and query them
      import org.apache.spark.sql._
      val ctx = new org.apache.spark.sql.SQLContext(sc)
      val tweets = sc.textFile("hdfs:/twitter")
      val tweetTable = JsonTable.fromRDD(ctx, tweets, Some(0.1))
      tweetTable.registerAsTable("tweetTable")
      ctx.sql("SELECT text FROM tweetTable LIMIT 5").collect.foreach(println)
      ctx.sql("SELECT lang, COUNT(*) AS cnt FROM tweetTable GROUP BY lang ORDER BY cnt DESC LIMIT 10").collect.foreach(println)

      // MLlib: featurize the tweet text, train k-means, save the cluster centers
      import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
      import org.apache.spark.mllib.linalg.Vector
      val texts = ctx.sql("SELECT text FROM tweetTable").map(_.head.toString)
      def featurize(str: String): Vector = { ... }
      val vectors = texts.map(featurize).cache()
      val model = KMeans.train(vectors, 10, 10)
      sc.makeRDD(model.clusterCenters, 10).saveAsObjectFile("hdfs:/model")

      // Spark Streaming: a separate program; modelFile and clusterNumber are assumed to be defined elsewhere
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.twitter.TwitterUtils
      val ssc = new StreamingContext(new SparkConf(), Seconds(1))
      val model = new KMeansModel(ssc.sparkContext.objectFile[Vector](modelFile).collect())
      val tweets = TwitterUtils.createStream(ssc, /* auth */)
      val statuses = tweets.map(_.getText)
      val filteredTweets = statuses.filter { t => model.predict(featurize(t)) == clusterNumber }
      filteredTweets.print()
      ssc.start()
  • 37. Conclusion. Big data analytics is evolving to include: » More complex analytics (e.g. machine learning) » More interactive ad-hoc queries » More real-time stream processing. Spark is a fast platform that unifies these apps. Learn more: spark.apache.org. (Big data analytics is evolving toward complex, interactive, real-time workloads. Spark is the fastest platform that unifies these apps.)
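
The RDD model from slides 9 to 12 (transformations, caching, and rebuilding lost data from lineage) can be illustrated with a small single-machine toy. This is a sketch in plain Python, not Spark's implementation or API: `ToyRDD` and its attributes are hypothetical names, the log lines are made up, and "losing" the cache is simulated by clearing a field.

```python
class ToyRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical, not Spark code).

    Each dataset remembers its parent and the function that derives it,
    so a lost cached copy can be recomputed from that lineage.
    """

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source      # base data (root dataset only)
        self.parent = parent      # dataset this one is derived from
        self.fn = fn              # function applied to the parent's items
        self.want_cache = False   # set by cache()
        self.cached = None        # materialized items, once computed

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda items: [f(x) for x in items])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda items: [x for x in items if pred(x)])

    def cache(self):
        self.want_cache = True
        return self

    def collect(self):
        if self.cached is not None:
            return self.cached
        if self.parent is None:
            items = list(self.source)
        else:
            # Walk the lineage: recompute from the parent on demand.
            items = self.fn(self.parent.collect())
        if self.want_cache:
            self.cached = items
        return items

    def count(self):
        return len(self.collect())


# Made-up log lines in the tab-separated shape used on the log-mining slide.
log = [
    "ERROR\tdisk\tfoo failed",
    "INFO\tnet\tall good",
    "ERROR\tnet\tbar timed out",
    "ERROR\tdisk\tfoo and bar failed",
]
lines = ToyRDD(source=log)
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2]).cache()

foo_count = messages.filter(lambda s: "foo" in s).count()
messages.cached = None   # simulate losing the cached data
bar_count = messages.filter(lambda s: "bar" in s).count()   # rebuilt via lineage
```

The second query still succeeds after the cache is cleared because `messages` only needs its parent chain and the two functions to recompute itself, which is the idea the Fault Tolerance slides describe.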

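The train-then-filter flow of the demo (slides 31 and 33 to 36) can also be sketched without a cluster. The following is a plain-Python toy, not Spark or MLlib code: `kmeans`, `predict`, and `filter_stream` are hypothetical helpers, the 2-D points stand in for featurized tweets, and the centers are naively initialized from the first k vectors (real libraries use smarter seeding).

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(centers, v):
    """Index of the center closest to v."""
    return min(range(len(centers)), key=lambda i: dist2(centers[i], v))


def kmeans(vectors, k, iterations=10):
    """Lloyd's algorithm over a list of vectors held in memory."""
    centers = list(vectors[:k])   # naive initialization (assumption)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[predict(centers, v)].append(v)
        # New center = mean of its group; keep the old center if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


def filter_stream(batches, centers, cluster):
    """For each incoming batch, keep the vectors assigned to `cluster`."""
    for batch in batches:
        yield [v for v in batch if predict(centers, v) == cluster]


# "Batch" phase: train on historical data that is already in memory.
training = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers = kmeans(training, k=2)

# "Streaming" phase: reuse the same model on small batches as they arrive.
target = predict(centers, (5.0, 5.0))
stream = [[(0.1, 0.1), (5.1, 5.0)], [(4.8, 5.2)], [(0.0, 0.3)]]
filtered = list(filter_stream(stream, centers, target))
```

The point of the sketch is that the same `predict` function serves both the training loop and the stream filter, mirroring how the demo reuses one model across its batch and streaming parts.
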
Editor's Notes

  1. TODO: Apache incubator logo
  2. Each iteration is, for example, a MapReduce job
  3. Add “variables” to the “functions” in functional programming
  4. 100 GB of data on 50 m1.xlarge EC2 machines
  5. Alibaba, Tencent. At Berkeley, we have been working on a solution since 2009. This solution consists of a software stack for data analytics, called the Berkeley Data Analytics Stack. The centerpiece of this stack is Spark. Spark has seen significant adoption, with hundreds of companies using it, of which around sixteen have contributed code back. In addition, Spark has been deployed on clusters that exceed 1,000 nodes.
  6. Despite Hadoop having been around for 7 years, the Spark community is still growing; to us this shows that there’s still a huge gap in making big data easy to use and contributors are excited about Spark’s approach here
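
Note 2 and the logistic regression slide both concern iterative jobs that pass over the same data many times. A minimal sketch of that access pattern, in plain Python rather than Spark: the data is loaded once and every gradient-descent iteration reuses the in-memory copy. `load_points`, `train`, and `predict` are hypothetical names, and the four training points are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def load_points():
    """Stand-in for an expensive load; counts how often it runs."""
    load_points.calls += 1
    # Each item is (features, label); the last feature is a constant bias term.
    return [((-2.0, 1.0), 0), ((-1.0, 1.0), 0), ((1.0, 1.0), 1), ((2.0, 1.0), 1)]


load_points.calls = 0


def train(points, iterations=50, lr=0.5):
    """Batch gradient descent for logistic regression over in-memory points."""
    w = [0.0] * len(points[0][0])
    for _ in range(iterations):
        grad = [0.0] * len(w)
        for x, y in points:          # every iteration re-reads the same list
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * gj / len(points) for wi, gj in zip(w, grad)]
    return w


points = load_points()               # loaded once, reused by all iterations
weights = train(points)


def predict(x):
    """True if the model assigns label 1 to the scalar input x."""
    return sigmoid(weights[0] * x + weights[1]) > 0.5
```

In a design where each iteration is a separate job, the equivalent of `load_points` would run once per iteration; keeping the working set in memory is what removes that repeated cost.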