Short introduction to ML frameworks on Hadoop

Yuya Takashina, 2016

Hadoop (2011-)
• De facto standard for distributed storage and parallel processing of big data in real-world applications.
• Used by Google, Yahoo, Facebook, IBM, Twitter, …
• The largest Hadoop cluster in the world has 4,500 nodes (Yahoo).
• Consists of two parts:
  • Hadoop Distributed File System (HDFS)
  • MapReduce
• Several replacements for MapReduce exist.
[Figure: MapReduce data flow, with a "Barrier" marked in the diagram. Source: https://dzone.com/articles/how-hadoop-mapreduce-works]
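
To make the MapReduce model and its barrier concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and word counting is chosen only because it is the customary example. The point it illustrates is that grouping by key cannot finish until every mapper has finished, which is where the barrier sits.

```python
from collections import defaultdict


def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    pairs = []
    for doc in documents:
        for word in doc.split():
            pairs.append((word.lower(), 1))
    return pairs


def shuffle(pairs):
    """Barrier: group values by key. This needs the output of all mappers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the grouped values of each key independently."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    # The three phases run strictly one after another in this sketch.
    return reduce_phase(shuffle(map_phase(documents)))
```

In a real cluster the map and reduce calls run in parallel on many nodes, but the reducers still have to wait for the slowest mapper before they can complete.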

Spark (2014-)
• Framework for data analytics on Hadoop.
• Caches data in memory.
• Up to 10x faster than MapReduce for certain applications:
  • Machine learning
  • Graph computation
  • Stream processing
• APIs for Scala, Java, Python, and R.
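
Why in-memory caching helps iterative workloads such as machine learning can be sketched without Spark itself. The Dataset class below is a hypothetical stand-in for data on a distributed file system (it is not the Spark API, although Spark's RDDs do offer a cache() method for the same purpose), and the disk_reads counter is an assumption added purely for illustration. The algorithm scans the full dataset once per iteration, so an uncached dataset is re-read every time.

```python
class Dataset:
    """Hypothetical stand-in for a dataset stored on a distributed file system."""

    def __init__(self, values):
        self._values = list(values)
        self._cache = None
        self.disk_reads = 0  # counts simulated full reads from storage

    def _read_from_disk(self):
        self.disk_reads += 1
        return list(self._values)

    def cache(self):
        """Load the data once and keep it in memory for later scans."""
        self._cache = self._read_from_disk()
        return self

    def records(self):
        if self._cache is not None:
            return self._cache
        return self._read_from_disk()


def iterative_mean(dataset, iterations=10, step=0.5):
    """Gradient descent toward the mean; every iteration scans all records."""
    m = 0.0
    for _ in range(iterations):
        data = dataset.records()
        grad = sum(m - x for x in data) / len(data)
        m -= step * grad
    return m
```

With ten iterations, the uncached dataset is read ten times and the cached one only once, while both runs compute the same estimate.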

Petuum (2015-)
• Framework for machine learning on Hadoop.
• Reported to be faster than Spark on ML workloads.
• Barrier synchronization is a bottleneck in MapReduce and Spark.
• Adopts a P2P, async-like communication strategy to reduce network communication costs.
• Guarantees theoretical convergence to the optimal value by exploiting the distinctive characteristics of ML programs:
  • optimization-centric
  • iterative-convergent
• Implemented in C++.
• Provides a deep learning API.
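
The idea that an iterative-convergent ML program can tolerate slightly outdated parameters, and therefore does not need a barrier at every step, can be illustrated with a small simulation. This is a simplified sketch and not Petuum's API: the function name, the fixed per-worker lag, and the toy objective (estimating a mean) are all assumptions made for illustration. Each worker computes its gradient from a parameter value that may be up to `staleness` clocks old.

```python
def stale_gradient_descent(shards, staleness=2, lr=0.2, clocks=200):
    """Estimate the global mean using gradients computed on stale parameters.

    Worker p reads a parameter value that is lag_p clocks old, with
    lag_p <= staleness, instead of waiting for all workers at a barrier.
    """
    history = [0.0]  # history[t] is the shared parameter after t clocks
    num_workers = len(shards)
    for t in range(clocks):
        update = 0.0
        for p, shard in enumerate(shards):
            lag = p % (staleness + 1)           # hypothetical fixed lag per worker
            stale_w = history[max(0, t - lag)]  # possibly outdated view of w
            grad = sum(stale_w - x for x in shard) / len(shard)
            update += grad / num_workers
        history.append(history[-1] - lr * update)
    return history[-1]
```

With equally sized shards and a small step size, the estimate still converges to the global mean despite the stale reads. The bound on staleness matters: with unbounded delays or a large step size, this kind of update can diverge.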

References
• Powered by Apache Hadoop: https://wiki.apache.org/hadoop/PoweredBy
• The Hadoop Ecosystem Table: https://hadoopecosystemtable.github.io/
• A New Look at the System, Algorithm and Theory Foundations of Distributed Machine Learning: https://petuum.github.io/papers/SysAlgTheoryKDD2015.pdf
• Strategies and Principles of Distributed Machine Learning on Big Data: https://arxiv.org/abs/1512.09295
