Spark is a fast, general-purpose engine for large-scale data processing. It runs on Hadoop clusters through YARN, on Mesos, or in standalone mode. For certain applications Spark is up to 100x faster than Hadoop MapReduce because it keeps data in memory rather than on disk, and it supports iterative algorithms through its Resilient Distributed Dataset (RDD) abstraction. The presenter demos Spark's word count in Scala, Java, and Python to illustrate how easy Spark is to use across languages.
9. Opinion from Cloudera
• The leading candidate for “successor to
MapReduce” today is Apache Spark
• No vendor — no new project — is likely to catch
up. Chasing Spark would be a waste of time,
and would delay availability of real-time analytic
and processing services for no good reason.
• From http://0rz.tw/y3OfM
10. What is Spark
• From UC Berkeley AMP Lab
• The most active open source big data project since Hadoop
15. Spark vs Hadoop
• Spark runs on YARN, Mesos, or in standalone mode
• Spark’s main concept is based on MapReduce
• Spark can read from
• HDFS: data locality
• HBase
• Cassandra
25. PageRank algorithm on 1 billion URL records
• First iteration: 200 sec
• 2nd iteration: 20 sec
• 3rd iteration: 20 sec
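The PageRank timings above can be turned into a rough cost comparison. The sketch below is only arithmetic on the two numbers from the slide (200 sec and 20 sec). The "uncached" case is an assumption, not a measurement from the talk: it supposes that every iteration would cost as much as the first one if the data had to be reloaded each time. The function name `total_seconds` is made up for illustration.

```python
FIRST_ITERATION_SEC = 200  # from the slide: first iteration
LATER_ITERATION_SEC = 20   # from the slide: 2nd and 3rd iterations


def total_seconds(iterations, cached=True):
    """Estimate total run time for a number of PageRank iterations."""
    if iterations <= 0:
        return 0
    if cached:
        # Pay the slow first pass once, then the fast rate afterwards.
        return FIRST_ITERATION_SEC + LATER_ITERATION_SEC * (iterations - 1)
    # ASSUMPTION: without in-memory reuse, every pass costs like the first.
    return FIRST_ITERATION_SEC * iterations


for n in (1, 3, 10):
    print(n, total_seconds(n), total_seconds(n, cached=False))
```

Under that assumption, the gap widens with every extra iteration, which is why iterative algorithms are the showcase workload here.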
26. RDD
• Resilient Distributed Dataset
• Collections of objects spread across a cluster,
stored in RAM or on disk
• Built through parallel transformations
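To make the RDD bullets above concrete, here is a toy, single-process model in plain Python. It is not Spark's API: `ToyRDD` and its methods are hypothetical names that only mimic the idea of a partitioned collection built up through transformations, evaluated lazily, and optionally kept in memory after the first computation. It runs a word count, the same example the talk demos.

```python
from collections import defaultdict


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's real API)."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self._partitions = partitions  # source data (root dataset only)
        self._parent = parent          # lineage: dataset this one derives from
        self._fn = fn                  # function applied to each parent partition
        self._use_cache = False
        self._cached = None

    @classmethod
    def parallelize(cls, items, num_partitions=2):
        items = list(items)
        # Spread the items across partitions round-robin.
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def flat_map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [y for x in part for y in f(x)])

    def cache(self):
        # Ask for the result to be kept in memory once it has been computed.
        self._use_cache = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        if self._parent is None:
            parts = self._partitions
        else:
            # Each partition is transformed independently of the others.
            parts = [self._fn(p) for p in self._parent._compute()]
        if self._use_cache:
            self._cached = parts
        return parts

    def collect(self):
        # Action: triggers the actual computation.
        return [x for part in self._compute() for x in part]


calls = {"n": 0}  # counts how often the tokenizer really runs


def tokenize(line):
    calls["n"] += 1
    return line.split()


def word_count(rdd):
    counts = defaultdict(int)
    for word, n in rdd.collect():
        counts[word] += n
    return dict(counts)


lines = ToyRDD.parallelize(["to be or", "not to be"])
pairs = lines.flat_map(tokenize).map(lambda w: (w, 1)).cache()
calls_before_action = calls["n"]  # still 0: transformations are lazy

first = word_count(pairs)   # computes and caches the partitions
second = word_count(pairs)  # served from the cached partitions
print(first, calls["n"])
```

In this toy, the tokenizer runs once per input line on the first `collect()` and not at all on the second, which is the in-memory reuse that the iteration timings illustrate. Real Spark additionally distributes partitions across machines and can rebuild lost ones from lineage, which this sketch does not attempt.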