The Future of Apache Spark

Patrick Wendell

First Things First…
Recruit Technologies, NTT Data and Hadoop Conference Japan (#HCJ2014), thank you for your hospitality.
I managed to translate this slide myself!

A Week in Spark Development
One week of activity in the Spark development community:
500 patch submissions and updates
200 updates to our issue tracker (comments on JIRA/GitHub, etc.)
140 user list e-mails
80 merged patches

Spark’s Future
Spark has seen rapid growth in the last year… where are we going now?
Spark releases and developer process
Technical roadmap over future releases

Goal of the Spark project
Empower data scientists and engineers
Expressive, clean APIs
Unified runtime across many environments
Powerful standard libraries

API stability
In 1.0+, Spark has well-defined public APIs and well-defined experimental APIs
Apps written against the Spark API will be portable to new versions
Patches that break our API automatically fail our build
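
The idea of a build that fails on API breaks can be sketched in a few lines. This is a minimal illustration, not Spark's actual build tooling: the signature strings, the "public"/"experimental" labels, and the method name `RDD.fancyNewOp` are all hypothetical.

```python
# Hypothetical sketch of an API-compatibility gate; not Spark's real tooling.


def find_breaking_changes(old_api, new_api):
    """Return the sorted public signatures that the new version no longer offers.

    Each API is a dict mapping a signature string to a stability level,
    either "public" or "experimental" (both labels are assumptions).
    """
    breaks = []
    for signature, level in old_api.items():
        if level != "public":
            continue  # experimental APIs are allowed to change or disappear
        if new_api.get(signature) != "public":
            breaks.append(signature)
    return sorted(breaks)


def check_build(old_api, new_api):
    """Raise (that is, fail the build) if any public signature was broken."""
    breaks = find_breaking_changes(old_api, new_api)
    if breaks:
        raise RuntimeError("API break: " + ", ".join(breaks))
    return True


# Made-up API surface of a released version.
released = {
    "RDD.map(f)": "public",
    "RDD.filter(f)": "public",
    "RDD.fancyNewOp(x)": "experimental",
}
# Drops only the experimental method, so the build passes.
safe_patch = {"RDD.map(f)": "public", "RDD.filter(f)": "public"}
# Removes a public method, so the build should fail.
breaking_patch = {"RDD.map(f)": "public"}
```

Calling `check_build(released, safe_patch)` returns `True`, while `check_build(released, breaking_patch)` raises, which is the behavior a compatibility gate needs: stable APIs are protected and experimental ones stay free to evolve.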

Developer-friendly release cadence
Minor releases every 3 months: 1.1 (August), 1.2, 1.3
Maintenance releases with fixes as necessary: 1.0.1, 1.0.2, etc.
Extremely conservative about patch releases

The Spark Stack
Spark Streaming: real-time
Spark SQL: relational operators
GraphX: graph processing
MLlib: machine learning
Spark Runtime
Cluster Managers: YARN, Mesos, AWS
Data Sources: HDFS, S3, Cassandra, Hana
The Spark Stack
Spark Streaming: real-time
Spark SQL: relational operators
GraphX: graph processing
MLlib: machine learning
Spark Runtime
More mature components: focus on optimization and pluggability
Newer components: focused on adding capabilities
The future of Spark is libraries
Critical component of any successful runtime
Packaged and distributed with Spark to provide full interoperability
Led by experts in their respective fields, highly curated and integrated with the Spark core API

Spark SQL
Growing faster than any other component
Support for the SQL language and the notion of typed-schema RDDs
Focuses going forward:
- Optimization (code gen, faster joins, etc.)
- Language extensions (towards SQL92)
- Integration (next slide…)
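
To make "typed schema plus SQL" concrete, here is a small stand-in using only the Python standard library. It does not use Spark SQL: `sqlite3` plays the role of the query engine, and the `people` records, the `SQL_TYPES` mapping, and the helper names are assumptions made for illustration. It infers a column schema from JSON records and then queries them with SQL.

```python
import json
import sqlite3

# Assumed mapping from Python value types to SQL column types.
# Only int, float and str values are handled in this sketch.
SQL_TYPES = {int: "INTEGER", float: "REAL", str: "TEXT"}


def infer_schema(records):
    """Infer an ordered {column: sql_type} schema from a list of dicts."""
    schema = {}
    for record in records:
        for name, value in record.items():
            sql_type = SQL_TYPES[type(value)]
            if schema.setdefault(name, sql_type) != sql_type:
                raise TypeError("conflicting types for column " + name)
    return schema


def load_table(conn, table, records):
    """Create a typed table from the inferred schema and insert the records."""
    schema = infer_schema(records)
    columns = ", ".join(f"{name} {sql_type}" for name, sql_type in schema.items())
    conn.execute(f"CREATE TABLE {table} ({columns})")
    placeholders = ", ".join("?" for _ in schema)
    rows = [tuple(r.get(name) for name in schema) for r in records]
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return schema


# Hypothetical sample data: one JSON object per line.
lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 19}',
    '{"name": "carol", "age": 52}',
]
people = [json.loads(line) for line in lines]

conn = sqlite3.connect(":memory:")
schema = load_table(conn, "people", people)
adults = conn.execute(
    "SELECT name FROM people WHERE age >= 21 ORDER BY name"
).fetchall()
```

The point of the sketch is the division of labor: once records carry a schema, a relational engine can plan and run queries over them instead of treating each record as an opaque object.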

Spark SQL and SchemaRDD
Will facilitate deeper integration with other systems
Stack: Spark SQL on top of the Spark Runtime
Systems shown: Hadoop, NoSQL, RDBMS, Parquet, JSON

Spark SQL and Shark
Releases shown: Shark 0.8 with Spark 0.8; Shark 0.9 with Spark 0.9; Spark 1.0; Spark 1.0.1 + JDBC; Spark 1.1
Spark 1.1+ will provide a JDBC/ODBC server allowing direct upgrade for Shark users.
A preview release of this server is packaged with Spark 1.0.1.

The Spark Stack
Spark Streaming: real-time
Spark SQL: relational operators
GraphX: graph processing
MLlib: machine learning
Spark Runtime
More mature components: focus on optimization and pluggability
Newer components: focused on adding capabilities
MLlib
Second-fastest-growing component
MLlib 1.0 has about 15 algorithms
MLlib 1.1 should roughly double that…
Traditional descriptive statistics: sampling, correlation, estimators, tests
Learning algorithms: NMF, Sparse SVD, LDA…
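
As a reminder of what two of the descriptive statistics named above compute, here is a single-machine sketch of Pearson correlation and Bernoulli sampling in plain Python. It is not MLlib code and the function names are made up; a distributed version would compute the same sums across partitions.

```python
import math
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`.

    A fixed seed makes the sample reproducible across runs.
    """
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]


perfect = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly correlated
inverse = pearson([1, 2, 3], [3, 2, 1])         # perfectly anti-correlated
sample = bernoulli_sample(list(range(100)), 0.1, seed=42)
```

Both functions need only per-element work plus a few running sums, which is why statistics of this kind fit naturally into a data-parallel library.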

SparkR
Make SparkR “production ready” (Alteryx and Databricks).
Integration with MLlib.
Consolidating the data frame and RDD concepts.
Fast, Scalable, Expressive, Numerical, Interactive, Packages

Notable trends
Hardware
Memory prices continue to fall; 256+ GB machines are not uncommon
SSDs are becoming widely deployed
Software
Tachyon and other cluster-wide memory managers

Spark Core
Allow extension/innovation by defining internal APIs:
Internal storage API: support for SSDs; shared memory systems like Tachyon, and (eventually) HDFS caching/DDMs (distributed data management)
Spark shuffle API: sort-based shuffle, pipelined shuffle (shuffle behavior becomes pluggable)
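
For readers unfamiliar with the term, a sort-based shuffle can be sketched as follows. This is a heavily simplified, in-memory illustration under stated assumptions, not Spark's implementation: each map task sorts its output by (partition, key) into a single run with a per-partition offset index, and each reducer merges its slice from every run. The function names and the CRC32 partitioner are choices made for this example.

```python
import heapq
import zlib


def partition_of(key, num_partitions):
    """Deterministically assign a string key to a reduce partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def map_side_write(records, num_partitions):
    """Sort one map task's (key, value) output into a single run.

    Returns the run sorted by (partition, key) plus an offsets list where
    run[offsets[p]:offsets[p + 1]] holds partition p. These stand in for
    one data file and one index file per map task.
    """
    tagged = [(partition_of(k, num_partitions), k, v) for k, v in records]
    run = sorted(tagged, key=lambda t: (t[0], t[1]))
    counts = [0] * num_partitions
    for pid, _, _ in run:
        counts[pid] += 1
    offsets = [0]
    for count in counts:
        offsets.append(offsets[-1] + count)
    return run, offsets


def reduce_side_read(map_outputs, partition):
    """Merge the key-sorted slices for one partition from every map output."""
    slices = []
    for run, offsets in map_outputs:
        start, end = offsets[partition], offsets[partition + 1]
        slices.append([(k, v) for _, k, v in run[start:end]])
    return list(heapq.merge(*slices, key=lambda kv: kv[0]))


def word_count(map_inputs, num_partitions):
    """Count words across map tasks by shuffling (word, 1) pairs."""
    map_outputs = [
        map_side_write([(w, 1) for w in words], num_partitions)
        for words in map_inputs
    ]
    totals = {}
    for p in range(num_partitions):
        for key, value in reduce_side_read(map_outputs, p):
            totals[key] = totals.get(key, 0) + value
    return totals


counts = word_count([["a", "b", "a"], ["b", "c"]], num_partitions=3)
```

The design point is that each map task produces one sorted output rather than one output per reducer, so the number of intermediate files grows with the number of map tasks instead of map tasks times reducers.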

Timeline
Spark 1.0.1: JSON support in Spark SQL
Spark 1.1: Generalized shuffle interface, MLlib stats algorithms, JDBC server, Sort-based shuffle*
Spark 1.2: Refactored storage support
Spark 1.3+: SparkR

I’ve only scratched the surface…
Streaming: new data sources and tighter Flume integration
GraphX: optimizations and API stability
Core: elastic scaling on YARN, user-defined metrics and counters
[Your work here]

Should also mention: Databricks Cloud
Provision a Spark cluster instantly in the cloud
Interactive workspace with the full power of Spark: notebooks, dashboards, and scheduled jobs
In private beta now; you can sign up at databricks.com/cloud (or find me!)

Wrapping it all up
Spark will grow substantially in the next year
Focus is on libraries and improving core internals for future innovation
Release process and cadence provide users with stable releases despite fast growth
