Workshop
December 9, 2015
LBS College of Engineering
www.sarithdivakar.info | www.csegyan.org
Big data processing with Apache Spark
1. BIG DATA PROCESSING WITH
APACHE SPARK
December 9, 2015
LBS College of Engineering
www.sarithdivakar.info | www.csegyan.org
2. WHAT IS BIG DATA?
• Terabytes of Data
• Petabytes of Data
• Exabytes of Data
• Yottabytes of Data
• Brontobytes of Data
• Geobytes of Data
3. WHERE DOES BIG DATA COME FROM?
• A huge amount of data is created every day!
• It comes from us!
• Non-digitized processes become digitized
• Digital India
• Programme to transform India into a digitally empowered society and knowledge economy
4. EXAMPLES OF DIGITIZATION
• Online banking
• Online shopping
• E-learning
• Emails
• Social media
• The decreasing cost of storage & data-capture technology opens up a new era of data revolution
5. TRENDS IN BIG DATA
• Digitalization of virtually everything, e.g. one's personal life
7. KEY ENABLERS OF BIG DATA
• Increase in storage capacities
• Increase in processing power
• Availability of data
8. FEATURES OF BIG DATA GENERATED
• Digitally generated
• Passively produced
• Automatically collected
• Geographically or temporally trackable
• Continuously analyzed
10. DIMENSIONS OF BIG DATA
• Volume: Every minute, 72 hours of video are uploaded to YouTube
• Variety: Excel tables & databases (structured); pure text, photo, audio, video, web, GPS data, sensor data, documents, SMS, etc. New data formats arrive with new applications
• Velocity: Batch processing is not always possible as data is streamed
• Veracity/variability: Uncertainty inherent in some types of data
• Value: The economic/business value of different data may vary
11. CHALLENGES IN BIG DATA
• Capture
• Storage
• Search
• Sharing
• Transfer
• Analysis
• Visualization
12. NEED FOR BIG DATA ANALYTICS
• Big Data needs to be captured, stored, organized and analyzed
• It is large & complex
• It cannot be managed with current methodologies or data mining tools
THEN:
• Data warehousing, data mining & database technologies
• Did not analyze email, PDF and video files
• Worked with huge amounts of data
• Prediction based on data
NOW:
• Analyzing semi-structured and unstructured data
• Access and store all the huge data created
13. BIG DATA ANALYTICS
• Big Data analytics refers to tools and methodologies that aim to transform massive quantities of raw data into "data about data" for analytical purposes
• Discovery of meaningful patterns in data
• Used for decision making
14. EXCAVATING HIDDEN TREASURES FROM BIG DATA
• Insights into data can provide business advantage
• Some key early indications can mean fortunes to a business
• More data enables more precise analysis
• Integrate Big Data with traditional data to enhance business intelligence analysis
16. UNSTRUCTURED DATA TYPES
• Email and other forms of electronic communication
• Web-based content (click streams, social media)
• Digitized audio and video
• Machine-generated data (RFID, GPS, sensor-generated data, log files) and IoT
17. APPLICATIONS OF BIG DATA ANALYSIS
• Business: Customer personalization, customer needs
• Technology: Reduce process time
• Health: DNA mining to detect hereditary diseases
• Smart cities: Cities with good economic development and high quality of life can be analyzed
• Oil and gas: Analyzing sensor-generated data for production optimization, cost management, risk management, and healthy and safe drilling
• Telecommunications: Network analytics and optimization from device, sensor and GPS data to enhance social and promotional services
18. OPPORTUNITIES BIG DATA OFFERS
• Early warning
• Real-time awareness
• Real-time feedback
19. CHALLENGES IN BIG DATA
• Heterogeneity and incompleteness
• Scale
• Timeliness
• Privacy
• Human collaboration
20. BIG DATA AND CLOUD: CONVERGING TECHNOLOGIES
• Big data: Extracting value out of the "variety, velocity and volume" of the available unstructured information
• Cloud: On-demand, elastic, scalable, pay-per-use self-service model
21. ANSWER THESE BEFORE MOVING TO BIG DATA ANALYSIS
• Do you have an effective big data problem?
• Can the business benefit from using Big Data?
• Do your data volumes really require these distributed mechanisms?
22. TECHNOLOGY TO HANDLE BIG DATA
• Google was the first company to effectively use big data
• Engineers at Google created massively distributed systems
• They collected and analyzed massive collections of web pages and the relationships between them, and created the "Google Search Engine", capable of querying billions of pages
24. FIRST GENERATION OF DISTRIBUTED SYSTEMS
• Proprietary
• Custom hardware and software
• Centralized data
• Hardware-based fault recovery
• E.g.: Teradata, Netezza, etc.
25. SECOND GENERATION OF DISTRIBUTED SYSTEMS
• Open source
• Commodity hardware
• Distributed data
• Software-based fault recovery
• E.g.: Hadoop, HPCC
26. APACHE HADOOP
• Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
38. WHY DO WE NEED A NEW GENERATION?
• A lot has changed since 2000
• Both hardware and software have gone through changes
• Big data has become a necessity now
• Let's look at what changed over the decade
39. CHANGES IN HARDWARE
| State of hardware in 2000 | State of hardware now |
| Disk was cheap, so disk was the primary source of data | RAM is the king |
| Network was costly, so data locality mattered | RAM is the primary source of data; disk is used as fallback |
| RAM was very costly | Network is speedier |
| Single-core machines were dominant | Multi-core machines are commonplace |
40. SHORTCOMINGS OF THE SECOND GENERATION
• Batch processing is the primary objective
• Not designed to change depending upon use cases
• Tight coupling between API and runtime
• Does not exploit new hardware capabilities
• Too complex
41. MAPREDUCE LIMITATIONS
• If you wanted to do something complicated, you would have to string together a series of MapReduce jobs and execute them in sequence.
• Each of those jobs has high latency, and none can start until the previous job has finished completely.
• The job output data between each step has to be stored in the distributed file system before the next step can begin.
• Hence, this approach tends to be slow due to replication & disk storage.
42. HADOOP VS SPARK
| HADOOP | SPARK |
| Stores data on disk | Stores data in memory (RAM) |
| Commodity hardware can be utilized | Needs higher-end systems with more RAM |
| Uses replication to achieve fault tolerance | Uses a different storage model (e.g. RDD lineage) to achieve fault tolerance |
| Processing speed is lower due to disk reads and writes | Up to 100x faster than Hadoop for in-memory workloads |
| MapReduce jobs are written primarily in Java | Supports Java, Python, R, Scala, etc.; ease of programming is high |
| Everything is just Map and Reduce | Supports map, reduce, SQL, streaming, etc. |
| Data should be in HDFS | Data can be in HDFS, Cassandra, HBase or S3; runs on Hadoop, the cloud, Mesos or standalone |
43. THIRD GENERATION DISTRIBUTED SYSTEMS
• Handle both batch processing and real time
• Exploit RAM as much as disk
• Multi-core aware
• Do not reinvent the wheel; they use:
  • HDFS for storage
  • Apache Mesos / YARN for distribution
• Play well with Hadoop
44. APACHE SPARK
• Open-source Big Data processing framework
• Apache Spark started as a research project in the AMPLab at UC Berkeley (whose team later founded Databricks), which focuses on big data analytics.
• Open sourced in early 2010.
• Many of the ideas behind the system are presented in various research papers.
46. SPARK FEATURES
• Spark gives us a comprehensive, unified framework
• Manages big data processing requirements across a variety of data sets
• Diverse in nature (text data, graph data, etc.)
• Diverse in source (batch vs. real-time streaming data)
• Spark lets you quickly write applications in Java, Scala, or Python.
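Spark's canonical first application is word count. A minimal sketch of the same flatMap/map/reduceByKey pipeline in plain Python (no cluster assumed; in PySpark each step maps to an RDD transformation):

```python
from collections import defaultdict

def word_count(lines):
    # flatMap: split every line into individual words
    words = [w for line in lines for w in line.split()]
    # map: pair each word with an initial count of 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: sum the counts per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(word_count(["spark is fast", "spark is simple"]))
# → {'spark': 2, 'is': 2, 'fast': 1, 'simple': 1}
```

In PySpark the same pipeline would read `rdd.flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`.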
47. DIRECTED ACYCLIC GRAPH (DAG)
• Spark allows programmers to develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern.
• It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data.
49. WHY UNIFICATION MATTERS?
• Good for developers: one platform to learn
• Good for users: take apps everywhere
• Good for distributors: more apps
50. UNIFICATION BRINGS ONE ABSTRACTION
• All the different processing systems in Spark share the same abstraction, called the RDD
• RDD stands for Resilient Distributed Dataset
• As they share the same abstraction, you can mix and match different kinds of processing in the same application
52. RUNS EVERYWHERE
• You can run Spark on top of any distributed system
• It can run on:
  • Hadoop 1.x
  • Hadoop 2.x
  • Apache Mesos
  • Its own cluster
• It's just a user-space library
53. SMALL AND SIMPLE
• Apache Spark is highly modular
• The original version contained only 1600 lines of Scala code
• The Apache Spark API is extremely simple compared to the Java API of MapReduce
• The API is concise and consistent
55. DATA STORAGE
• Spark uses the HDFS file system for data storage purposes.
• It works with any Hadoop-compatible data source, including HDFS, HBase, Cassandra, etc.
56. API
• The API allows application developers to create Spark-based applications using a standard API interface. Spark provides APIs for the Scala, Java, and Python programming languages.
57. RESOURCE MANAGEMENT
• Spark can be deployed as a stand-alone server, or on a distributed computing framework like Mesos or YARN.
59. SPARK RUNNING ARCHITECTURE
• Connects to a cluster manager, which allocates resources across applications
• Acquires executors on cluster nodes (worker processes) to run computations and store data
• Sends app code to the executors
• Sends tasks for the executors to run
60. SPARK RUNNING ARCHITECTURE
Your program (runs in the Spark client / app master):
sc = new SparkContext
f = sc.textFile("…")
f.filter(…)
.count()
...
The Spark client holds the RDD graph, scheduler, block tracker and shuffle tracker. It talks to the cluster manager, which allocates Spark workers; each worker runs task threads and a block manager, backed by storage such as HDFS, HBase, …
61. SCHEDULING PROCESS
• RDD Objects: build the operator DAG
• DAGScheduler: splits the graph into stages of tasks and submits each stage as it becomes ready (agnostic to operators)
• TaskScheduler: launches the TaskSet for each stage via the cluster manager, and retries failed or straggling tasks (doesn't know about stages)
• Worker: executes tasks in threads; its block manager stores and serves blocks
62. RDD – RESILIENT DISTRIBUTED DATASET
• Resilient Distributed Datasets (RDDs) are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel
• A big collection of data with the following properties:
  • Immutable
  • Lazily evaluated
  • Cacheable
  • Type inferred
63. RDD – RESILIENT DISTRIBUTED DATASET – TWO TYPES
• Parallelized collections: take an existing Scala collection and run functions on it in parallel
• Hadoop datasets / files: run functions on each record of a file in the Hadoop distributed file system or any other storage system supported by Hadoop
64. SPARK COMPONENTS & ECOSYSTEM
• Spark driver (context)
• Spark DAG scheduler
• Cluster management systems:
  • YARN
  • Apache Mesos
• Data sources:
  • In memory
  • HDFS
  • NoSQL
68. IN MEMORY
• In Spark, you can cache HDFS data in the main memory of worker nodes
• Spark analysis can be executed directly on in-memory data
• Shuffling can also be done from memory
• Fault tolerant
69. INTEGRATION WITH HADOOP
• No separate storage layer
• Integrates well with HDFS
• Can run on Hadoop 1.0 and on Hadoop 2.0 YARN
• Excellent integration with ecosystem projects like Apache Hive, HBase, etc.
70. MULTI-LANGUAGE API
• Written in Scala, but the API is not limited to it
• Offers APIs in:
  • Scala
  • Java
  • Python
• You can also do SQL using Spark SQL
75. PYTHON EXAMPLES
READ
f = open('demo.txt','r')
data = f.read()
print(data)
WRITE
f = open('demo.txt','a')
f.write('I am trying to write a file')
f.close()
78. RDD CREATION – FROM COLLECTIONS
Creating a collection:
A = range(1,100000)
print(A)
Creating an RDD from the collection:
raw_data = sc.parallelize(A)
Count the number of elements in the RDD:
raw_data.count()
View the sample data:
raw_data.take(5)
79. RDD CREATION – FROM FILES
Getting the data file (Python 2 urllib; in Python 3 use urllib.request.urlretrieve):
import urllib
f = urllib.urlretrieve("https://sparksarith.azurewebsites.net/Sarith/test.csv", "tv.csv")
Creating an RDD from the file:
data_file = "./tv.csv"
raw_data = sc.textFile(data_file)
Count the number of lines in the loaded file:
raw_data.count()
View the sample data:
raw_data.take(5)
80. IMMUTABILITY
• Immutability means once created, it never changes
• Big data is by default immutable in nature
• Immutability helps with:
  • Parallelism
  • Caching
const int a = 0; // immutable
int b = 0;       // mutable
b++;             // in-place update
c = a + 1;       // copy
• Immutability is about the value, not the reference
81. IMMUTABILITY IN COLLECTIONS
Mutable:
var collection = [1,2,4,5]
for (i = 0; i < collection.length; i++) {
  collection[i] += 1;
}
Uses a loop for updating; the collection is updated in place.
Immutable:
val collection = [1,2,4,5]
val newCollection = collection.map(value => value + 1)
Uses a transformation for change; creates a new copy of the collection and leaves the original intact.
82. CHALLENGES OF IMMUTABILITY
• Immutability is great for parallelism but not good for space
• Doing multiple transformations results in:
  • Multiple copies of the data
  • Multiple passes over the data
• In big data, multiple copies and multiple passes have poor performance characteristics.
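A plain-Python analogy of the cost (this is an illustration, not Spark itself): each eager transformation materializes a full intermediate copy of the data, while fusing the steps makes a single pass with no intermediate copy.

```python
data = list(range(5))

# eager style: two transformations, two full intermediate copies of the data
step1 = [x + 1 for x in data]     # first pass, first copy
step2 = [x * 2 for x in step1]    # second pass, second copy

# fused style: both steps combined into a single pass, no intermediate copy
fused = [(x + 1) * 2 for x in data]

assert step2 == fused == [2, 4, 6, 8, 10]
```

Spark's lazy evaluation (next slide) lets the engine do this fusion automatically across chained transformations.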
83. LET'S GET LAZY
• Laziness means not computing a transformation until it is needed
• Laziness defers evaluation
• Laziness allows separating execution from evaluation
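The idea can be illustrated in plain Python with a generator (only an analogy; Spark's laziness works on RDD lineage, not generators): building the pipeline does no work, and evaluation happens only when the result is consumed.

```python
calls = []

def double(x):
    calls.append(x)      # record when the function actually runs
    return x * 2

# building the pipeline is the "transformation": nothing executes yet
pipeline = (double(x) for x in [1, 2, 3])
assert calls == []       # deferred: no work has been done so far

# consuming the result is the "action": evaluation is forced now
result = list(pipeline)
assert result == [2, 4, 6]
assert calls == [1, 2, 3]
```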
84. LAZINESS AND IMMUTABILITY
• You can be lazy only if the underlying data is immutable
• You cannot combine transformations if a transformation has side effects
• So combining laziness and immutability gives better performance and distributed processing.
85. CHALLENGES OF LAZINESS
• Laziness poses challenges in terms of data types
• If laziness defers execution, determining the type of a variable becomes challenging
• If we can't determine the right type, semantic issues can slip through
• Running big data programs and getting semantic errors is not fun.
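A small Python illustration of the point (a generator again standing in for a lazy RDD transformation): a type error in a deferred pipeline surfaces only when evaluation is forced, far from where the mistake was made.

```python
def add_one(records):
    # the bug (adding an int to a str) is NOT caught while building the pipeline
    return (r + 1 for r in records)

pipeline = add_one(["not a number"])   # no error here: evaluation is deferred

try:
    list(pipeline)                     # forcing evaluation triggers the error
    error_seen = False
except TypeError:
    error_seen = True

assert error_seen
```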
86. TRANSFORMATIONS
• Transformations are operations on an RDD that return a new RDD
• Using the map transformation in Spark, we can apply a function to every element of an RDD
• collect brings all the elements of the RDD into memory so we can work with them
csv_data = raw_data.map(lambda x: x.split(','))
all_data = csv_data.collect()
all_data
len(all_data)
87. SET OPERATIONS ON RDDS
• Spark supports many of the operations we have on mathematical sets, such as union and intersection, even when the RDDs themselves are not properly sets
• The union of RDDs doesn't remove duplicates
a = [1,2,3,4,5]
b = [1,2,3,6]
dist_a = sc.parallelize(a)
dist_b = sc.parallelize(b)
subtract_data = dist_a.subtract(dist_b)
subtract_data.take(10)
union_data = dist_a.union(dist_b)
union_data.take(10)
[1, 2, 3, 4, 5, 1, 2, 3, 6]
distinct_data = union_data.distinct()
distinct_data.take(10)
[2, 4, 6, 1, 3, 5]
88. KEY-VALUE PAIRS – RDD
• Spark provides specific functions to deal with RDDs whose elements are key/value pairs
• They are commonly used for grouping and aggregations
data = ['nithin,25','appu,40','anil,20','nithin,35','anil,30','anil,50']
raw_data = sc.parallelize(data)
raw_data.collect()
key_value = raw_data.map(lambda line: (line.split(',')[0], int(line.split(',')[1])))
grouped_data = key_value.reduceByKey(lambda x,y: x+y)
grouped_data.collect()
grouped_data.keys().collect()
grouped_data.values().collect()
sorted_data = grouped_data.sortByKey()
sorted_data.collect()
89. CACHING
• Immutable data allows you to cache data for a long time
• Lazy transformations allow data to be recreated on failure
• Transformations can also be saved
• Caching data improves execution engine performance
raw_data.cache()
raw_data.collect()
90. SAVING YOUR DATA
• saveAsTextFile(path) is used for storing the RDD on your hard disk
• The path is a directory, and Spark will output multiple files under that directory. This allows Spark to write output from multiple nodes.
raw_data.saveAsTextFile('opt')
91. SPARK EXECUTION MODEL
• Create a DAG of RDDs to represent the computation
• Create a logical execution plan for the DAG
• Schedule and execute individual tasks
102. REFERENCES
• 1. "Data Mining and Data Warehousing", M. Sudheep Elayidom, SOE, CUSAT
• 2. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012. Best Paper Award.
• 3. "What is Big Data", https://www-01.ibm.com/software/in/data/bigdata/
• 4. "Apache Hadoop", https://hadoop.apache.org/
• 5. "Apache Spark", http://spark.apache.org/
• 6. "Spark: Cluster Computing with Working Sets". Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June 2010.
103. CREDITS
• Dr. M Sudheep Elayidom, Associate Professor, Div of Computer Science & Engg, SOE, CUSAT
• Nithink K Anil, Quantiph, Mumbai, Maharashtra, India
• Lija Mohan, Div of Computer Science & Engg, SOE, CUSAT