Lightning Fast Big Data Analytics using
Apache Spark
Manish Gupta
Solutions Architect – Product Engineering and Development
30th Jan 2014 - Delhi
What is Hadoop?
It’s open-source software for distributed storage of large datasets on commodity-class hardware in a highly fault-tolerant, scalable and flexible way.
HDFS
It also provides a programming model/framework for processing these large datasets in a massively parallel, fault-tolerant and data-location-aware fashion.
MR
[Diagram: Input → Map tasks → Reduce tasks → Output]
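As a rough illustration of the Map/Reduce flow above, here is a minimal, hypothetical word-count sketch written against plain Scala collections (no Hadoop APIs); the input data and names are made up, but the map, group/shuffle and reduce steps mirror the diagram.

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val input = Seq("spark is fast", "hadoop is reliable", "spark is scalable")

    // "Map" phase: emit (word, 1) pairs from each input record
    val mapped = input.flatMap(_.split(" ")).map(word => (word, 1))

    // Shuffle/group phase: bring all values for the same key together
    val grouped = mapped.groupBy(_._1)

    // "Reduce" phase: sum the counts for each word
    val counts = grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    counts.foreach(println)
  }
}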
Limitations of Map Reduce
[Diagram: each iteration (iter. 1, iter. 2, …) reads its input from HDFS, runs Map and Reduce, and writes its output back to HDFS before the next iteration can start]
Slow due to replication, serialization, and disk IO
Inefficient for:
• Iterative algorithms (Machine Learning, Graphs & Network Analysis)
• Interactive Data Mining (R, Excel, Ad-hoc Reporting, Searching)
Approach: Leverage Memory?
Memory bus >> disk & SSDs
Many datasets fit into memory
1TB = 1 billion records @ 1 KB
Memory capacity also follows Moore’s Law:
A single 8 GB stick of RAM is about $80 right now; in 2021, you’d be able to buy a single 64 GB stick for the same price.
Spark
“A big data analytics cluster-computing framework written in Scala.”
Open source, originally developed in the AMPLab at UC Berkeley.
Provides in-memory analytics, which is faster than Hadoop/Hive (up to 100x).
Designed for running Iterative algorithms & Interactive analytics
Highly compatible with Hadoop’s Storage APIs.
- Can run on your existing Hadoop Cluster Setup.
Developers can write driver programs using multiple programming languages.
…
Spark
A simple analytical operation:
1. Equivalent SQL: Select count(*) from pagecounts

val pagecount = sc.textFile("/wiki/pagecounts")
pagecount.count()

2. Equivalent SQL: Select Col1, sum(Col4) from pagecounts where Col2 = "en" group by Col1

val englishPages = pagecount.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count()
val englishTuples = englishPages.map(line => line.split(" "))
val englishKeyValues = englishTuples.map(line => (line(0), line(3).toInt))
englishKeyValues.reduceByKey(_ + _, 1).collect
Shark
HIVE on SPARK = SHARK
A large scale data warehouse system just like Apache Hive.
Highly compatible with Hive (HQL, metastore, serialization formats, and
UDFs)
Built on top of Spark (thus a faster execution engine).
Supports creating in-memory materialized tables (cached tables).
Cached tables use columnar storage instead of row storage.
Row Storage:
1 | ABC | 4.1
2 | XYZ | 3.5
3 | PPP | 6.4

Column Storage:
1   | 2   | 3
ABC | XYZ | PPP
4.1 | 3.5 | 6.4
Spark Programming Model
Driver Program
sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.filter(…)
rDD.cache
rDD.count
rDD.map
The user (developer) writes the driver program, which creates and manipulates RDDs.

RDD (Resilient Distributed Dataset):
• Immutable data structure
• In-memory (explicitly)
• Fault tolerant
• Parallel data structure
• Controlled partitioning to optimize data placement
• Can be manipulated using a rich set of operators
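For illustration, a minimal, self-contained driver program corresponding to the pseudocode above might look as follows; the master URL, app name and HDFS path are placeholders (assumptions, not from the slides).

import org.apache.spark.SparkContext

object DriverSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "DriverSketch")

    // Base RDD backed by a file on HDFS (path is illustrative)
    val lines = sc.textFile("hdfs://namenode:8020/data/events.log")

    // Transformation: keep only error lines (lazy, nothing runs yet)
    val errors = lines.filter(_.contains("ERROR"))

    // Persistence: keep the filtered RDD in memory for reuse
    errors.cache()

    // Actions: trigger execution of the lineage built so far
    println("error count: " + errors.count())
    errors.map(_.toUpperCase).take(5).foreach(println)

    sc.stop()
  }
}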
RDD
Programming interface: the programmer can perform 3 types of operations:

Transformations
• Create a new dataset from an existing one.
• Lazy in nature: they are executed only when some action is performed.
• Example: map(func), filter(func), distinct()

Actions
• Return a value to the driver program, or export data to a storage system, after performing a computation.
• Example: count(), reduce(func), collect(), take()

Persistence
• For caching datasets in memory for future operations.
• Option to store on disk or RAM or mixed (Storage Level).
• Example: persist(), cache()
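A small, hedged sketch (dataset and names are illustrative, not from the slides) showing the three kinds of operations side by side:

import org.apache.spark.SparkContext

object RddOperationsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "RddOperationsSketch")

    // Parallelized collection used as a tiny example dataset
    val numbers = sc.parallelize(1 to 1000, 4)

    // Transformations (lazy): nothing is computed yet
    val evens = numbers.filter(_ % 2 == 0)
    val squares = evens.map(n => n * n)

    // Persistence: keep the intermediate result in memory for reuse
    squares.cache()

    // Actions: trigger the actual computation
    println("count = " + squares.count())
    println("sum = " + squares.reduce(_ + _))
    println("first 5 = " + squares.take(5).mkString(", "))

    sc.stop()
  }
}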
Spark
How Spark Works:
RDD: Parallel collection with partitions
User applications create RDDs, transform them, and run actions.
This results in a DAG (Directed Acyclic Graph) of operators.
The DAG is compiled into stages.
Each stage is executed as a series of tasks (one task for each partition).
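As a hedged sketch of how a chain of operations builds up the DAG, the example below (made-up data, local master) uses toDebugString to print the RDD's lineage before any action runs; the map step stays in one stage, while reduceByKey introduces a shuffle and hence a new stage.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD functions on older Spark versions

object DagSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "DagSketch")

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 3)

    // Chain of transformations; each step only records a new node in the DAG
    val counts = words
      .map(w => (w, 1))        // narrow dependency, same stage
      .reduceByKey(_ + _)      // wide dependency: shuffle, new stage

    // Inspect the lineage that the scheduler will compile into stages
    println(counts.toDebugString)

    // The action submits the job: one task per partition in each stage
    counts.collect().foreach(println)

    sc.stop()
  }
}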
Example & Demo
Dataset:
Wiki Page View Stats
20 GB of webpage view counts
3 days worth of data
<date_time> <project_code> <page_title> <num_hits> <page_size>
Base RDD for all Wiki pages
val allPages = sc.textFile("/wiki/pagecounts")
allPages.take(10).foreach(println)
allPages.count()
Transformed RDD for all English pages (cached)
val englishPages = allPages.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count()
englishPages.count()
(The first count() materializes the cached RDD; the second count() reads from memory and is much faster.)
Example & Demo
Dataset:
Wiki Page View Stats
20 GB of webpage view counts
3 days worth of data
<date_time> <project_code> <page_title> <num_hits> <page_size>
Select date, sum(pageviews) from pagecounts group by date
englishPages.map(line => line.split(" ")).map(line => (line(0).substring(0, 8), line(3).toInt)).reduceByKey(_+_, 1).collect.foreach(println)
Select date, count(distinct pageURL) from pagecounts group by date
englishPages.map(line => line.split(" ")).map(line => (line(0).substring(0, 8), line(2))).distinct().countByKey().foreach(println)
Select distinct(datetime) from pagecounts order by datetime
englishPages.map(line => line.split(" ")).map(line => (line(0), 1)).distinct().sortByKey().collect().foreach(println)
Example & Demo
Dataset:
Network Datasets
Directed and Bi-directed Graphs
One small Facebook Social Network
127 nodes (Friends)
1668 Edges (Friendships)
Bi-directed graph
Google’s internal site network
15713 Nodes (web pages)
170845 Edges (hyperlinks)
Directed Graph
Example & Demo
Page Rank Calculation:
• Estimates the importance of a node.
• Each directed link from A -> B is a vote to B from A.
• The more links to a page, the more important the page is.
• When a page with a higher PR points to something, its vote weighs more.
1. Start each page at a rank of 1.
2. On each iteration, have page p contribute (rank of p) / (no. of neighbors of p) to its neighbors.
3. Set each page’s rank to 0.15 + 0.85 × contribs.
Example & Demo
Scala Code:
var iters = 100
val lines = sc.textFile("/dataset/google/edges.csv", 1)

// Build (source, list-of-destinations) pairs and keep them cached,
// since the link structure is reused on every iteration
val links = lines.map { s =>
  val parts = s.split("\t")
  (parts(0), parts(1))
}.distinct().groupByKey().cache()

// Start every page at rank 1.0
var ranks = links.mapValues(v => 1.0)

for (i <- 1 to iters) {
  // Each page contributes rank / outDegree to each of its neighbors
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    val size = urls.size
    urls.map(url => (url, rank / size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

// Sort by rank (descending) and print the top 20 pages
val output = ranks.map(l => (l._2, l._1)).sortByKey(false).map(l => (l._2, l._1))
output.take(20).foreach(tup => println(tup._2 + " : " + tup._1))
Conclusion
Because of In-memory processing, computations are very fast. Developers can
write iterative algorithms without writing out a result set after each pass
through the data.
Suitable for scenarios where sufficient memory is available in your cluster.
It provides an integrated framework for advanced analytics like Graph
processing, Stream Processing, Machine Learning etc. This simplifies
integration.
Its community is expanding and development is happening very actively.
It’s comparatively newer than Hadoop and has relatively few users so far.
Benchmarks
AMPLab performed a quantitative and qualitative comparison of 4 systems:
Hive, Impala, Redshift and Shark
Done on Common Crawl Corpus Dataset
81 TB size
Consists of 3 tables:
Page Rankings
User Visits
Documents
Data was partitioned in such a way that each node had:
25GB of User Visits
1GB of Ranking
30GB of Web Crawl (document)
Source: https://amplab.cs.berkeley.edu/benchmark/#
Solutions Architect at GlobalLogic. Been working for the last 10 years on large databases, data warehouses, ETLs and data mining, and for around 2-3 years on Big Data Analytics, Machine Learning & distributed systems. GlobalLogic is a 6000+ headcount company that is into full product life-cycle services and one of the fastest growing R&D services firms. Provides advisory, professional services, engineering and support services to 250+ customers globally. Will speak about an in-memory cluster computing framework that can really nitrogen-boost your existing Hadoop-based Big Data setup for analytics.
Quickly touch upon Hadoop, what it does, HDFS, Map Reduce, and some of its limitations. Introduce Spark and one of the tools built on top of Spark called Shark (the SQL interface to Spark). A little bit on Spark’s architecture and its basic programming model. Showcase a demo of Spark and Shark’s functionality. Will speak a bit about the future of Spark, where it’s heading, and about some of its existing customers and contributors.
Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure. “Large” here is basically 10-100 GBs and above. It is the driving force behind the growth of the big data industry. It provides 2 basic components: HDFS, a large-scale storage system, and Map Reduce, a distributed cluster-computing framework. A typical Hadoop setup comprises a cluster of a particular Hadoop distribution; tools like Hive, Pig and Mahout running on top of Hadoop (internally processing HDFS data using Map Reduce jobs); and a set of tools for importing/exporting data into HDFS from/to external systems like RDBMSs or server logs.
One of the reasons why Map Reduce is criticized is its restricted programming framework: MapReduce tasks must be written as acyclic dataflow programs, a stateless mapper followed by a stateless reducer, executed by a batch job scheduler. Repeated querying of datasets becomes difficult, and thus it is hard to write iterative algorithms. After each iteration of Map Reduce, data has to be persisted to disk for the next iteration to proceed with processing.
SparkContext: represents the connection to a Spark cluster and provides the entry point for interacting with Spark and distributing our jobs.
Driver program: the process running the main() function of the application and creating the SparkContext.
Cluster manager: an external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Worker node: any node that can run application code in the cluster.
Executor: a process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task: a unit of work that will be sent to one executor.
Job: a parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage: each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
(A small sketch of a driver connecting to a cluster manager follows this note.)
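A hedged sketch of how a driver wires itself to a cluster manager via SparkConf; the master URL and app name are placeholder assumptions, and "local[2]" could be used instead for a single-machine test.

import org.apache.spark.{SparkConf, SparkContext}

object ClusterConnectSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ClusterConnectSketch")
      // Standalone cluster manager URL (illustrative); Mesos or YARN masters work similarly
      .setMaster("spark://master-host:7077")

    // The SparkContext created in the driver is the entry point to the cluster;
    // the cluster manager allocates executors on worker nodes for this application
    val sc = new SparkContext(conf)

    // Any action spawns a job, which is split into stages and tasks sent to executors
    val total = sc.parallelize(1 to 1000, 10).reduce(_ + _)
    println("sum = " + total)

    sc.stop()
  }
}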
Resilient Distributed Datasets, or RDDs, are the distributed memory abstraction that lets programmers perform in-memory parallel computations on large clusters, and that too in a highly fault-tolerant manner. This is the main concept around which the whole Spark framework revolves. Currently there are 2 types of RDDs. Parallelized collections: created by calling the parallelize method on an existing Scala collection; the developer can specify the number of slices to cut the dataset into, ideally 2-3 slices per CPU. Hadoop datasets: distributed datasets created from any file stored on HDFS or other storage systems supported by Hadoop (S3, HBase etc.); these are created using SparkContext’s textFile method, and the default number of slices in this case is 1 slice per file block. (A short sketch of both follows this note.)
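An illustrative sketch of the two ways of creating RDDs described above; the path, slice counts and master URL are placeholders.

import org.apache.spark.SparkContext

object RddCreationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "RddCreationSketch")

    // 1. Parallelized collection: cut an in-memory Scala collection into slices
    val data = 1 to 100000
    val parallelized = sc.parallelize(data, 8)   // e.g. ~2 slices per CPU core

    // 2. Hadoop dataset: by default, one slice per HDFS block of the file
    val hadoopDataset = sc.textFile("hdfs://namenode:8020/logs/access.log")

    println("parallelized count = " + parallelized.count())
    println("lines in file = " + hadoopDataset.count())

    sc.stop()
  }
}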
Transformations: like map – take an RDD as input, pass and process each element through a function, and return a new transformed RDD as output.
By default, each transformed RDD is recomputed each time you run an action on it, unless you specify the RDD to be cached in memory; Spark will then try to keep the elements around the cluster for faster access. RDDs can be persisted on disk as well. Caching is the key tool for iterative algorithms.
Using persist, one can specify the storage level for persisting an RDD. Cache is just shorthand for the default storage level, which is MEMORY_ONLY.
MEMORY_ONLY: Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK: Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER: Store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY: Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the levels above, but replicate each partition on two cluster nodes.
Which storage level is best? A few things to consider: try to keep as much as possible in memory; try not to spill to disk unless recomputing your datasets is expensive; use replication only if you want fault tolerance. (A small persistence sketch follows this note.)
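A hedged sketch of cache() versus an explicit storage level; the dataset path is illustrative and the chosen levels are just examples of the options listed above.

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object PersistenceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "PersistenceSketch")

    val lines = sc.textFile("hdfs://namenode:8020/wiki/pagecounts")
    val parsed = lines.map(_.split(" "))

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
    parsed.cache()

    // Explicit storage level: keep serialized partitions, spilling to disk
    // if they do not fit in memory, instead of recomputing them
    val english = parsed.filter(fields => fields(1) == "en")
    english.persist(StorageLevel.MEMORY_AND_DISK_SER)

    println("english pages: " + english.count())

    sc.stop()
  }
}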
PageRank is an algorithm used by Google Search to rank websites in their search engine results. PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of measuring the importance of website pages. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
Spark Streaming: for stream processing. Continuously executes various parallel operations on an input stream of data. The system receives continuous data and divides it into batches, and each batch is considered and processed as an RDD.
GraphX: a distributed graph system, designed to efficiently execute graph algorithms using Spark’s parallel and in-memory computation framework.
MLBase: the goal of MLBase is to make distributed machine learning easy.
BlinkDB: an approximate query engine. Allows a trade-off between accuracy and response time, and is highly interactive on very large datasets. It is in the process of being deployed at Facebook. AMPLab has demonstrated how complex queries on 17 TB of data (running on a 100-node cluster) can be completed in less than 2 seconds! You specify queries with a time bound, e.g. Select avg(SessionTime) from tblSession where UserGender=‘MALE’ within 2 SECONDS.
(A small Spark Streaming sketch follows this note.)
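A minimal, hypothetical Spark Streaming word-count sketch (host, port and batch interval are placeholder assumptions, and the spark-streaming dependency is assumed to be on the classpath); each batch of lines received on the socket is processed as an RDD.

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream functions on older versions

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "StreamingSketch")
    val ssc = new StreamingContext(sc, Seconds(10))

    // Each 10-second batch of lines from the socket becomes one RDD
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}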
Interpreter: it’s actually the Scala command line (interpreter) that has been modified for Spark.
Hadoop I/O: for reading/writing from HDFS.
Standalone: custom resource manager.
Operators: map, join, group by etc. on RDDs.
Networking: replication, caching, graph.
Block Manager: a very simple key-value store that is used as a cache.
Broadcaster: sending/receiving events, heartbeats etc.