What is Big Data?
Big Data is a term used to describe a collection of data that is huge
in volume and growing exponentially with time. In short, such data is
so large and complex that none of the traditional data management
tools can store or process it efficiently.
Examples Of Big Data
The New York Stock Exchange generates about one terabyte of new trade data
per day.
Social Media
Statistics show that 500+ terabytes of new data are ingested into the
databases of the social media site Facebook every day. This data is mainly
generated through photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight
time. With many thousands of flights per day, data generation reaches many
petabytes.
Types Of Big Data
Big Data can be found in three forms:
•Structured
•Unstructured
•Semi-structured
Structured
Any data that can be stored, accessed and processed in a fixed format
is termed 'structured' data. Over time, computer science has achieved
great success in developing techniques for working with such data
(where the format is well known in advance) and deriving value from
it. However, we now foresee issues as such data grows to enormous
sizes, typically in the range of multiple zettabytes.
Examples Of Structured Data
An 'Employee' table in a database is an example of structured data:

Employee_ID  Employee_Name    Gender  Department  Salary_In_lacs
2365         Rajesh Kulkarni  Male    Finance     650000
3398         Pratibha Joshi   Female  Admin       650000
7465         Shushil Roy      Male    Admin       500000
7500         Shubhojit Das    Male    Finance     500000
7699         Priya Sane       Female  Finance     550000
Unstructured
Any data whose form or structure is unknown is classified as
unstructured data. In addition to its sheer size, unstructured data
poses multiple challenges when it comes to processing it to derive
value. A typical example of unstructured data is a heterogeneous data
source containing a combination of simple text files, images, videos, etc.
Organizations today have a wealth of data available to them but,
unfortunately, do not know how to derive value from it, since the data
is in raw, unstructured form.
Examples Of Unstructured Data
The output returned by 'Google Search'
Semi-structured
Semi-structured data can contain both forms of data. It appears
structured, but it is not actually defined by, e.g., a table
definition in a relational DBMS. A typical example of semi-structured
data is data represented in an XML file.
Examples Of Semi-structured Data
Personal data stored in an XML file-
<rec><name>PrashantRao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
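Records like these can be parsed with standard XML tooling; a minimal Python sketch using the standard library (the `<people>` wrapper element is an addition of this sketch, since the fragments above have no single root):

```python
import xml.etree.ElementTree as ET

# Two of the <rec> fragments from above; well-formed XML needs one root,
# so we wrap them in a <people> element before parsing.
records = """
<rec><name>PrashantRao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
"""
root = ET.fromstring("<people>" + records + "</people>")

# The tags carry the structure inside the data itself, rather than a
# schema defined up front -- the hallmark of semi-structured data.
people = [(r.findtext("name"), int(r.findtext("age"))) for r in root]
print(people)  # [('PrashantRao', 35), ('Seema R.', 41)]
```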
Characteristics Of Big Data
(i) Volume – The name Big Data itself is related to an enormous size.
The size of data plays a crucial role in determining its value, and
whether particular data can actually be considered Big Data at all
depends on its volume. Hence, 'Volume' is one characteristic which
needs to be considered while dealing with Big Data.
(ii) Variety – The next aspect of Big Data is its variety.
Variety refers to heterogeneous sources and the nature of data,
both structured and unstructured. In earlier days, spreadsheets
and databases were the only data sources considered by most
applications. Nowadays, data in the form of emails, photos,
videos, monitoring devices, PDFs, audio, etc. is also considered
in analysis applications. This variety of unstructured data poses
certain issues for storing, mining and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of data
generation. How fast data is generated and processed to meet demands
determines the real potential of the data.
Big Data velocity deals with the speed at which data flows in from
sources like business processes, application logs, networks, social
media sites, sensors, mobile devices, etc. The flow of data is
massive and continuous.
(iv) Variability – This refers to the inconsistency the data can show
at times, which hampers the ability to handle and manage it
effectively.
(v) Value – Value is the most important aspect of Big Data. Although
the potential value of big data is huge, having access to big data is
useless unless we can turn it into value. Implementing IT
infrastructure to store big data is very costly, and businesses will
require a return on that investment.
Big Data Examples: Applications of Big Data in Real Life
Big Data has revolutionized the way businesses and organizations
work. The following sections look at the major Big Data applications
in various sectors and industries, and at how these sectors benefit
from them.
Big Data in Education Industry
The education industry is flooded with huge amounts of data related to
students, faculty, courses, results, and more. We have now realized
that proper study and analysis of this data can provide insights that
improve the operational effectiveness of educational institutions.
Big Data in Healthcare Industry
Healthcare is yet another industry which is bound to generate a huge
amount of data.
Following are some of the ways in which big data has contributed to
healthcare:
Big data reduces the cost of treatment, since there is a lower chance
of having to perform unnecessary diagnoses.
It helps in predicting outbreaks of epidemics and in deciding what
preventive measures could be taken to minimize their effects.
It helps avoid preventable diseases by detecting them in early stages,
before they worsen, which in turn makes their treatment easier and
more effective.
Patients can be provided with evidence-based medicine, identified and
prescribed after research on past medical results.
Big Data in Government Sector
Governments of every country come face to face with huge amounts of
data on an almost daily basis, since they have to keep track of
various records and databases regarding their citizens, growth,
energy resources, geographical surveys, and more. All of this
contributes to big data, and its proper study and analysis helps
governments in countless ways. A few of them are as follows:
Welfare Schemes
•In making faster and informed decisions regarding various political
programs
•To identify areas that are in immediate need of attention
•To stay up to date in the field of agriculture by keeping track of all
existing land and livestock.
•To overcome national challenges such as unemployment, terrorism,
energy resources exploration, and much more.
Cyber Security
•Big Data is heavily used for fraud detection.
•It is also used in catching tax evaders.
Big Data in Media and Entertainment Industry
With people having access to various digital gadgets, the generation
of large amounts of data is inevitable, and this is the main cause of
the rise of big data in the media and entertainment industry.
Beyond this, social media platforms are another source of huge
amounts of data. Businesses in the media and entertainment industry
have realized the importance of this data and have been able to
benefit from it for their growth.
Some of the benefits extracted from big data in the media and
entertainment industry are given below:
Predicting the interests of audiences
Optimized or on-demand scheduling of media streams in digital
media distribution platforms
Getting insights from customer reviews
Effective targeting of the advertisements
Big Data in Weather Patterns
There are weather sensors and satellites deployed all around the
globe. A huge amount of data is collected from them, and then this
data is used to monitor the weather and environmental conditions.
All of the data collected from these sensors and satellites contribute
to big data and can be used in different ways such as:
•In weather forecasting
•To study global warming
•In understanding the patterns of natural disasters
•To make necessary preparations in the case of crises
•To predict the availability of usable water around the world
Big Data in Transportation Industry
Since the rise of big data, it has been used in various ways to make
transportation more efficient and easy. Following are some of the
areas where big data contributes to transportation.
Route planning: Big data can be used to understand and estimate
users’ needs on different routes and across multiple modes of
transportation, and then to plan routes that reduce their wait time.
Congestion management and traffic control: Using big data, real-time
estimation of congestion and traffic patterns is now possible. For
example, people use Google Maps to locate the least traffic-prone
routes.
Safety level of traffic: Real-time processing of big data and
predictive analysis can identify accident-prone areas, helping to
reduce accidents and increase the safety level of traffic.
Big Data in Banking Sector
The amount of data in the banking sector is skyrocketing every
second. According to a GDC prognosis, this data was estimated to
grow by 700 percent within a year. Proper study and analysis of this
data can help detect illegal activities being carried out, such as:
Misuse of credit/debit cards
Venture credit hazard treatment
Business clarity
Customer statistics alteration
Money laundering
Risk mitigation
BIG DATA PROGRAMMING MODEL
Design of Hadoop Distributed File System (HDFS)
• Master-Slave design
• Master Node
– Single NameNode for managing metadata
• Slave Nodes
– Multiple DataNodes for storing data
• Other
– Secondary NameNode, which checkpoints the NameNode’s metadata
(not a hot standby, despite the name)
HDFS Architecture
[Diagram: one NameNode (plus a Secondary NameNode) coordinates many
DataNodes; clients contact the NameNode for metadata, and DataNodes
exchange heartbeats, commands and data with it.]
The NameNode keeps the metadata: the name, location and directory.
DataNodes provide storage for blocks of data.
HDFS
[Diagram: a file is split into blocks B1–B4; each block is stored on
several different DataNodes, so every block has multiple replicas
spread across the cluster.]
What happens if node(s) fail?
Replication of blocks provides fault tolerance.
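The recovery idea can be sketched in plain Python (a toy model, not HDFS code; the node and block names are invented for illustration):

```python
REPLICATION = 3  # target number of replicas per block

# Toy cluster: which blocks each (hypothetical) node currently holds.
cluster = {
    "node1": {"B1", "B2"},
    "node2": {"B2", "B3"},
    "node3": {"B1", "B3", "B4"},
    "node4": {"B4", "B1"},
    "node5": {"B2", "B3", "B4"},
}

def replicas(block):
    """Nodes currently holding a replica of `block`."""
    return {n for n, blocks in cluster.items() if block in blocks}

def fail(node):
    """Simulate a node failure, then re-replicate under-replicated blocks."""
    lost = cluster.pop(node)
    for block in lost:
        while len(replicas(block)) < min(REPLICATION, len(cluster)):
            # Copy the block to any surviving node that lacks it.
            target = next(n for n in cluster if block not in cluster[n])
            cluster[target].add(block)

fail("node3")
# Every block is back to 3 replicas, rebuilt from the surviving copies.
assert all(len(replicas(b)) >= 3 for b in ("B1", "B2", "B3", "B4"))
```

Because no block lives on only one node, losing a node never loses data; the cluster just copies the affected blocks to other nodes until the replication target is met again.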
Map:
Apply a function to all the elements of a list
list1 = [1, 2, 3, 4, 5]
square = lambda x: x * x
list2 = list(map(square, list1))
print(list2)
-> [1, 4, 9, 16, 25]
Reduce:
Combine all the elements of a list into a summary
from functools import reduce
list1 = [1, 2, 3, 4, 5]
A = reduce(lambda a, b: a + b, list1)
print(A)
-> 15
Map Reduce Paradigm
• Map and Reduce are based on functional programming
[Diagram: input data is split across nodes; each node applies Map,
the intermediate pairs are shuffled and sorted by key, and Reduce
nodes combine them into the output.]
MapReduce Word Count Example
Input: "I am Sam" / "Sam I am"
Map (per line): (I,1) (am,1) (Sam,1) / (Sam,1) (I,1) (am,1)
Shuffle & Sort: group the pairs by key
Reduce: (I,2) (am,2) (Sam,2)
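The word-count flow above can be simulated in a few lines of plain Python (a sketch of the paradigm, not Hadoop code):

```python
from collections import defaultdict
from functools import reduce

lines = ["I am Sam", "Sam I am"]

# Map phase: each line independently emits (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle & sort phase: gather all values belonging to the same key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine each key's values into a summary count.
counts = {word: reduce(lambda a, b: a + b, vals)
          for word, vals in groups.items()}
print(counts)  # {'I': 2, 'am': 2, 'Sam': 2}
```

Because the map phase touches each line independently and the reduce phase touches each key independently, both phases can be spread across many nodes; only the shuffle in between moves data.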
SPARK Outline
• Introduction to Apache Hadoop and Spark for developing
applications
• Components of Hadoop, HDFS, MapReduce and HBase
• Capabilities of Spark and the differences from a typical
MapReduce solution
• Some Spark use cases for data analysis
Cloud and Distributed Computing
• Another major trend is the pervasiveness of cloud-based storage and
computational resources
– For processing these big datasets
• Cloud characteristics
– Provide a scalable standard environment
– On-demand computing
– Pay as you need
– Dynamically scalable
– Cheaper
One Solution is Apache Spark
• A newer general framework which solves many of the shortcomings of
MapReduce
• It is capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN, HBase,
S3, …
• Has many other operations, e.g. join, filter, flatMap, distinct, groupByKey,
reduceByKey, sortByKey, collect, count, first…
– (around 30 efficient distributed operations)
• In-memory caching of data (for iterative, graph, and machine learning
algorithms, etc.)
• Native Scala, Java, Python, and R support
• Supports interactive shells for exploratory data analysis
• Spark API is extremely simple to use
• Developed at AMPLab UC Berkeley, now by Databricks.com
Spark Uses Memory instead of Disk
Hadoop: uses disk for data sharing — each iteration reads its input
from HDFS and writes its output back to HDFS.
Spark: in-memory data sharing — after the initial HDFS read,
intermediate results stay in memory between iterations.
Sort competition
Sort benchmark, Daytona Gray: sort of 100 TB of data (1 trillion records)

                         Hadoop MR Record (2013)   Spark Record (2014)
Data Size                102.5 TB                  100 TB
Elapsed Time             72 mins                   23 mins
# Nodes                  2100                      206
# Cores                  50400 physical            6592 virtualized
Cluster disk throughput  3150 GB/s (est.)          618 GB/s
Network                  dedicated data center,    virtualized (EC2)
                         10Gbps                    10Gbps network
Sort rate                1.42 TB/min               4.27 TB/min
Sort rate/node           0.67 GB/min               20.7 GB/min

Spark: 3x faster with 1/10 the nodes
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Apache Spark
Apache Spark supports data analysis, machine learning, graphs, streaming data, etc. It
can read/write from a range of data types and allows development in multiple
languages.
Libraries on Spark Core: Spark Streaming, MLlib, GraphX, ML Pipelines,
Spark SQL, DataFrames
Language bindings: Scala, Java, Python, R, SQL
Data Sources: Hadoop HDFS, HBase, Hive, Amazon S3, Streaming, JSON, MySQL, and HPC-style (GlusterFS, Lustre)
Resilient Distributed Datasets (RDDs)
• RDDs (Resilient Distributed Datasets) are Spark’s data containers
• All the different processing components in Spark
share the same abstraction, the RDD
• Because applications share the RDD abstraction, you can
mix different kinds of transformations to create new
RDDs
• Created by parallelizing a collection or reading a file
• Fault tolerant
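The fault-tolerance point deserves a sketch: RDDs recover lost partitions by replaying their lineage (the recorded transformations) over the source data, not by restoring backup copies. A toy model in plain Python (illustrative only, not PySpark code):

```python
# Source data, split into partitions, plus the lineage: the list of
# transformations applied so far (here a single squaring map).
source = [[1, 2], [3, 4], [5, 6]]
lineage = [lambda x: x * x]

def compute(partition_index):
    """Rebuild one partition by replaying the lineage over the source."""
    part = source[partition_index]
    for fn in lineage:
        part = [fn(x) for x in part]
    return part

partitions = [compute(i) for i in range(len(source))]
partitions[1] = None        # simulate losing one partition on a failed node
partitions[1] = compute(1)  # recover it by recomputation, not from a backup
print(partitions)  # [[1, 4], [9, 16], [25, 36]]
```

Since each partition is recomputable independently, only the lost pieces are redone; the rest of the dataset is untouched.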
DataFrames & SparkSQL
• DataFrames (DFs) are distributed datasets organized into named
columns
• Similar to a relational database table, a Python Pandas DataFrame
or R’s data frames
– Immutable once constructed
– Track lineage
– Enable distributed computations
• Ways to construct DataFrames
– Read from file(s)
– Transform an existing DF (Spark or Pandas)
– Parallelize a Python collection (list)
– Apply transformations and actions
DataFrame example
# Create a new DataFrame that contains only “students”
students = users.filter(users.age < 21)
# Alternatively, using Pandas-like syntax
students = users[users.age < 21]
# Count the number of student users by gender
students.groupBy("gender").count()
# Join young students with another DataFrame called logs
students.join(logs, logs.userId == users.userId, "left_outer")
RDDs vs. DataFrames
• RDDs provide a low level interface into Spark
• DataFrames have a schema
• DataFrames are cached and optimized by Spark
• DataFrames are built on top of the RDDs and the core
Spark API
Spark Operations
Transformations (create a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, intersection,
flatMap, union, join, cogroup, cross, mapValues
Actions (return results to the driver program):
collect, reduce, count, first, take, takeOrdered, takeSample,
countByKey, save, lookupKey, foreach
Directed Acyclic Graphs (DAG)
[Diagram: RDDs A–F connected by transformation arrows.]
DAGs track dependencies (also known as Lineage):
• nodes are RDDs
• arrows are Transformations
Narrow vs. Wide transformations
Narrow (e.g. map): each output partition depends on a single input
partition, so no data moves between partitions.
Wide (e.g. groupByKey): an output partition depends on several input
partitions, e.g. (A,1) and (A,2) must be gathered into (A,[1,2]),
which requires a shuffle.
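The narrow/wide distinction can be shown with a toy model in plain Python (illustrative only; keys and values are made up):

```python
from collections import defaultdict

# Toy partitions of (key, value) pairs.
partitions = [[("A", 1), ("B", 1)], [("A", 2), ("B", 3)]]

# Narrow transformation (map-like): each partition is processed
# independently, so no data moves between partitions.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide transformation (groupByKey-like): a key's values may live in
# several partitions, so a shuffle must gather them before grouping.
shuffled = defaultdict(list)
for part in partitions:
    for k, v in part:
        shuffled[k].append(v)
print(dict(shuffled))  # {'A': [1, 2], 'B': [1, 3]}
```

This is why wide transformations are the expensive ones in Spark: the shuffle forces data movement across the cluster, while narrow transformations stay node-local.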
Actions
• What is an action
– The final stage of the workflow
– Triggers the execution of the DAG
– Returns the results to the driver
– Or writes the data to HDFS or to a file
Spark Workflow
[Diagram: the driver program creates a SparkContext; transformations
such as flatMap, map and groupByKey build up the DAG, and an action
such as collect returns the results to the driver.]
Python RDD API Examples
• Word count
text_file = sc.textFile("hdfs://usr/godil/text/book.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://usr/godil/output/wordCount.txt")
• Logistic Regression
# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])
# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)
# Fit the model to the data.
model = lr.fit(df)
# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()
Examples from http://spark.apache.org/
RDD Persistence and Removal
• RDD Persistence
– RDD.persist()
– Storage level:
• MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER,
DISK_ONLY,…….
• RDD Removal
– RDD.unpersist()
Broadcast Variables and Accumulators
(Shared Variables )
• Broadcast variables allow the programmer to keep a read-only
variable cached on each node, rather than sending a copy of it
with tasks
>>> broadcastV1 = sc.broadcast([1, 2, 3, 4, 5, 6])
>>> broadcastV1.value
[1, 2, 3, 4, 5, 6]
• Accumulators are variables that are only “added” to through
an associative operation and can be efficiently supported in
parallel
accum = sc.accumulator(0)
accum.add(x)
accum.value
Spark’s Main Use Cases
• Streaming Data
• Machine Learning
• Interactive Analysis
• Data Warehousing
• Batch Processing
• Exploratory Data Analysis
• Graph Data Analysis
• Spatial (GIS) Data Analysis
• And many more
Spark Use Cases
• Fingerprint Matching
– Developed a Spark based fingerprint minutia
detection and fingerprint matching code
• Twitter Sentiment Analysis
– Developed a Spark based Sentiment Analysis code
for a Twitter dataset
Spark in the Real World (I)
• Uber – the online taxi company gathers terabytes of event data from its
mobile users every day.
– Uses Kafka, Spark Streaming, and HDFS to build a continuous ETL
pipeline
– Converts raw unstructured event data into structured data as it is collected
– Uses it further for more complex analytics and optimization of operations
• Pinterest – Uses a Spark ETL pipeline
– Leverages Spark Streaming to gain immediate insight into how users all
over the world are engaging with Pins—in real time.
– Can make more relevant recommendations as people navigate the site
– Recommends related Pins
– Determine which products to buy, or destinations to visit
Spark in the Real World (II)
Here are a few other real-world use cases:
• Conviva – 4 million video feeds per month
– This streaming video company is second only to YouTube.
– Uses Spark to reduce customer churn by optimizing video streams and
managing live video traffic
– Maintains a consistently smooth, high quality viewing experience.
• Capital One – is using Spark and data science algorithms to understand customers
in a better way.
– Developing next generation of financial products and services
– Find attributes and patterns of increased probability for fraud
• Netflix – leverages Spark for insights into user viewing habits and then
recommends movies to them.
– User data is also used for content creation
Spark: when not to use
• Even though Spark is versatile, that doesn’t mean Spark’s
in-memory capabilities are the best fit for all use cases:
– For many simple use cases Apache MapReduce and
Hive might be a more appropriate choice
– Spark was not designed as a multi-user environment
– Spark users need to know whether the memory they have is
sufficient for a dataset
– Adding more users adds complications, since the users will have to
coordinate memory usage to run code
HPC and Big Data Convergence
• Clouds and supercomputers are collections of computers
networked together in a datacenter
• Clouds have different networking, I/O, CPU and cost trade-offs
than supercomputers
• Cloud workloads are data oriented vs. computation oriented
and are less closely coupled than supercomputers
• Principles of parallel computing same on both
• Apache Hadoop and Spark vs. Open MPI
HPC and Big Data K-Means example
MPI definitely outpaces Hadoop, but Hadoop can be boosted using a hybrid
approach with other technologies that blend HPC and big data, including Spark
and HARP. – Dr. Geoffrey Fox, Indiana University (http://arxiv.org/pdf/1403.1528.pdf)
Conclusion
• Hadoop (HDFS, MapReduce)
– Provides an easy solution for processing Big Data
– Brings a paradigm shift in programming distributed systems
• Spark
– Has extended MapReduce for in-memory computation
– for streaming, interactive, iterative and machine learning
tasks
• Changing the World
– Made data processing cheaper, more efficient and scalable
– Is the foundation of many other tools and software

Weitere ähnliche Inhalte

Ähnlich wie big-data.pdf

Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.saranya270513
 
Analysis on big data concepts and applications
Analysis on big data concepts and applicationsAnalysis on big data concepts and applications
Analysis on big data concepts and applicationsIJARIIT
 
Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...
Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...
Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...Arab Federation for Digital Economy
 
IRJET- Scope of Big Data Analytics in Industrial Domain
IRJET- Scope of Big Data Analytics in Industrial DomainIRJET- Scope of Big Data Analytics in Industrial Domain
IRJET- Scope of Big Data Analytics in Industrial DomainIRJET Journal
 
Big data analytics and its impact on internet users
Big data analytics and its impact on internet usersBig data analytics and its impact on internet users
Big data analytics and its impact on internet usersStruggler Ever
 
Analysis of Big Data
Analysis of Big DataAnalysis of Big Data
Analysis of Big DataIRJET Journal
 
Big Data in Economics An Introduction
Big Data in Economics An IntroductionBig Data in Economics An Introduction
Big Data in Economics An Introductionijtsrd
 
A SURVEY OF BIG DATA ANALYTICS
A SURVEY OF BIG DATA ANALYTICSA SURVEY OF BIG DATA ANALYTICS
A SURVEY OF BIG DATA ANALYTICSijistjournal
 
UNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdfUNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdfvvpadhu
 

Ähnlich wie big-data.pdf (20)

Unit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdfUnit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdf
 
Unit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdfUnit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdf
 
Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.
 
Analysis on big data concepts and applications
Analysis on big data concepts and applicationsAnalysis on big data concepts and applications
Analysis on big data concepts and applications
 
Big data Paper
Big data PaperBig data Paper
Big data Paper
 
Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...
Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...
Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...
 
Big Data.pptx
Big Data.pptxBig Data.pptx
Big Data.pptx
 
IRJET- Scope of Big Data Analytics in Industrial Domain
IRJET- Scope of Big Data Analytics in Industrial DomainIRJET- Scope of Big Data Analytics in Industrial Domain
IRJET- Scope of Big Data Analytics in Industrial Domain
 
Big data analytics and its impact on internet users
Big data analytics and its impact on internet usersBig data analytics and its impact on internet users
Big data analytics and its impact on internet users
 
Analysis of Big Data
Analysis of Big DataAnalysis of Big Data
Analysis of Big Data
 
Lecture #03
Lecture #03Lecture #03
Lecture #03
 
Big Data in Economics An Introduction
Big Data in Economics An IntroductionBig Data in Economics An Introduction
Big Data in Economics An Introduction
 
new.pptx
new.pptxnew.pptx
new.pptx
 
A SURVEY OF BIG DATA ANALYTICS
A SURVEY OF BIG DATA ANALYTICSA SURVEY OF BIG DATA ANALYTICS
A SURVEY OF BIG DATA ANALYTICS
 
Big Data Analysis
Big Data AnalysisBig Data Analysis
Big Data Analysis
 
UNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdfUNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdf
 
Bigdata
BigdataBigdata
Bigdata
 
Big data assignment
Big data assignmentBig data assignment
Big data assignment
 
Big Data Challenges faced by Organizations
Big Data Challenges faced by OrganizationsBig Data Challenges faced by Organizations
Big Data Challenges faced by Organizations
 
BIG DATA AND HADOOP.pdf
BIG DATA AND HADOOP.pdfBIG DATA AND HADOOP.pdf
BIG DATA AND HADOOP.pdf
 

Kürzlich hochgeladen

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 

Kürzlich hochgeladen (20)

Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 

big-data.pdf

  • 2. What is Big Data? Big Data is a term used to describe collections of data that are huge in volume and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently. The New York Stock Exchange generates about one terabyte of new trade data per day. Social Media: statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day, mainly generated by photo and video uploads, message exchanges, comments, etc. A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time; with many thousands of flights per day, data generation reaches many petabytes.
  • 3. Types Of Big Data Big Data can be found in three forms: •Structured •Unstructured •Semi-structured
  • 4. Structured Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and deriving value from it. However, we now foresee issues when such data grows to a huge extent, with typical sizes in the range of multiple zettabytes. Examples Of Structured Data: An 'Employee' table in a database is an example of structured data. Employee_ID, Employee_Name, Gender, Department, Salary_In_lacs: 2365, Rajesh Kulkarni, Male, Finance, 650000; 3398, Pratibha Joshi, Female, Admin, 650000; 7465, Shushil Roy, Male, Admin, 500000; 7500, Shubhojit Das, Male, Finance, 500000; 7699, Priya Sane, Female, Finance, 550000.
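As an illustration of why a fixed format makes such data easy to store and query, the 'Employee' table above can be sketched with Python's built-in sqlite3 module (an in-memory database is used here purely for illustration; only the table and column names come from the slide):

```python
import sqlite3

# In-memory relational database holding the slide's example 'Employee' table
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Employee (
    Employee_ID INTEGER PRIMARY KEY,
    Employee_Name TEXT, Gender TEXT,
    Department TEXT, Salary_In_lacs INTEGER)""")
rows = [
    (2365, "Rajesh Kulkarni", "Male", "Finance", 650000),
    (3398, "Pratibha Joshi", "Female", "Admin", 650000),
    (7465, "Shushil Roy", "Male", "Admin", 500000),
    (7500, "Shubhojit Das", "Male", "Finance", 500000),
    (7699, "Priya Sane", "Female", "Finance", 550000),
]
conn.executemany("INSERT INTO Employee VALUES (?,?,?,?,?)", rows)

# Because the schema is fixed and known in advance, querying is trivial
finance_names = [name for (name,) in conn.execute(
    "SELECT Employee_Name FROM Employee "
    "WHERE Department = 'Finance' ORDER BY Employee_ID")]
print(finance_names)  # ['Rajesh Kulkarni', 'Shubhojit Das', 'Priya Sane']
```

The well-known-in-advance format is exactly what breaks down at Big Data scale, which is what the slide's zettabyte remark is pointing at.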
  • 5. Unstructured Any data with unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays, organizations have a wealth of data available to them but, unfortunately, they don't know how to derive value from it, since this data is in its raw or unstructured format. Examples Of Unstructured Data: The output returned by 'Google Search'.
  • 6. Semi-structured Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file. Examples Of Semi-structured Data: Personal data stored in an XML file- <rec><name>PrashantRao</name><sex>Male</sex><age>35</age></rec> <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec> <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec> <rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec> <rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
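The XML records above can be processed even though no table schema exists, because each record carries its own field tags. A minimal sketch using Python's standard xml.etree module (the <rec> records are taken verbatim from the slide; the <recs> root element is added here only so the snippet is a well-formed XML document):

```python
import xml.etree.ElementTree as ET

# The slide's semi-structured records, wrapped in a single root element
xml_data = """<recs>
<rec><name>PrashantRao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
</recs>"""

root = ET.fromstring(xml_data)
# No DBMS schema needed: the structure is self-describing per record
people = [(r.findtext("name"), int(r.findtext("age"))) for r in root]
print(people[0])  # ('PrashantRao', 35)
```

This is the sense in which semi-structured data is "structured in form" yet schema-free: the tags impose structure, but nothing outside the document defines it.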
  • 7. Characteristics Of Big Data (i) Volume – The name Big Data itself is related to an enormous size. The size of data plays a very crucial role in determining the value that can be derived from it. Whether particular data can actually be considered Big Data also depends on its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data. (ii) Variety – The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.
  • 8. (iii) Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet demands determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous. (iv) Variability – This refers to the inconsistency which the data can show at times, hampering the process of handling and managing the data effectively. (v) Value – Value is the most important aspect of Big Data. Though the potential value of big data is huge, it is all well and good to have access to big data only if we can turn it into value; otherwise it is useless. It is very costly to implement the IT infrastructure needed to store big data, and businesses will require a return on that investment.
  • 9. Big Data Examples: Applications of Big Data in Real Life Big Data has totally changed and revolutionized the way businesses and organizations work. In this blog, we will go deep into the major Big Data applications in various sectors and industries and learn how these sectors are being benefitted by these applications.
  • 10. Big Data in Education Industry Education industry is flooding with huge amounts of data related to students, faculty, courses, results, and what not. Now, we have realized that proper study and analysis of this data can provide insights which can be used to improve the operational effectiveness and working of educational institutes.
  • 11. Big Data in Healthcare Industry Healthcare is yet another industry which is bound to generate a huge amount of data. Following are some of the ways in which big data has contributed to healthcare: Big data reduces the cost of treatment, since there are fewer chances of having to perform unnecessary diagnoses.
  • 12. It helps in predicting outbreaks of epidemics and also in deciding what preventive measures could be taken to minimize the effects of the same. It helps avoid preventable diseases by detecting them in early stages. It prevents them from getting any worse which in turn makes their treatment easy and effective. Patients can be provided with evidence-based medicine which is identified and prescribed after doing research on past medical results.
  • 13. Big Data in Government Sector Governments, be it of any country, come face to face with a very huge amount of data on an almost daily basis. The reason is that they have to keep track of various records and databases regarding their citizens, their growth, energy resources, geographical surveys, and much more. All this data contributes to big data. The proper study and analysis of this data hence helps governments in endless ways.
  • 14. A few of them are as follows: Welfare Schemes •In making faster and informed decisions regarding various political programs •To identify areas that are in immediate need of attention •To stay up to date in the field of agriculture by keeping track of all existing land and livestock •To overcome national challenges such as unemployment, terrorism, energy resources exploration, and much more. Cyber Security •Big Data is widely used for fraud detection. •It is also used in catching tax evaders.
  • 15. Big Data in Media and Entertainment Industry With people having access to various digital gadgets, the generation of large amounts of data is inevitable, and this is the main cause of the rise of big data in the media and entertainment industry.
  • 16. Other than this, social media platforms are another way in which a huge amount of data is being generated. Businesses in the media and entertainment industry have realized the importance of this data and have been able to benefit from it for their growth. Some of the benefits extracted from big data in the media and entertainment industry are given below: Predicting the interests of audiences; Optimized or on-demand scheduling of media streams in digital media distribution platforms; Getting insights from customer reviews; Effective targeting of advertisements.
  • 17. Big Data in Weather Patterns There are weather sensors and satellites deployed all around the globe. A huge amount of data is collected from them, and then this data is used to monitor the weather and environmental conditions. All of the data collected from these sensors and satellites contribute to big data and can be used in different ways such as: In weather forecasting •To study global warming •In understanding the patterns of natural disasters •To make necessary preparations in the case of crises •To predict the availability of usable water around the world
  • 18. Big Data in Transportation Industry Since the rise of big data, it has been used in various ways to make transportation more efficient and easy. Following are some of the areas where big data contributes to transportation. Route planning: Big data can be used to understand and estimate users' needs on different routes and on multiple modes of transportation, and route planning can then reduce their wait time. Congestion management and traffic control: Using big data, real-time estimation of congestion and traffic patterns is now possible; for example, people are using Google Maps to locate the least traffic-prone routes. Safety level of traffic: Using real-time processing of big data and predictive analysis to identify accident-prone areas can help reduce accidents and increase the safety level of traffic.
  • 19. Big Data in Banking Sector The amount of data in the banking sector is skyrocketing every second. According to a GDC prognosis, this data is estimated to grow 700 percent by the end of the next year. Proper study and analysis of this data can help detect any and all illegal activities that are being carried out, such as: Misuse of credit/debit cards; Venture credit hazard treatment; Business clarity; Customer statistics alteration; Money laundering; Risk mitigation.
  • 74. Design of Hadoop Distributed File System (HDFS) • Master-Slave design • Master Node – A single NameNode for managing metadata • Slave Nodes – Multiple DataNodes for storing data • Other – A Secondary NameNode that periodically checkpoints the NameNode's metadata (often described as a backup, though it is not a hot standby)
  • 75. HDFS Architecture The NameNode keeps the metadata: file names, block locations, and the directory tree. The DataNodes provide storage for blocks of data and exchange heartbeats, commands, and data with the NameNode. A Secondary NameNode checkpoints the metadata, and clients contact the NameNode for metadata before reading or writing blocks on the DataNodes.
  • 76. HDFS File Blocks A file is split into blocks (B1, B2, B3, B4), and each block is replicated on several DataNodes. What happens if node(s) fail? Replication of blocks provides fault tolerance: as long as one replica of each block survives, the file remains readable.
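The fault-tolerance idea above can be sketched in plain Python: place every block on several nodes (3 here, standing in for HDFS's default replication factor) and check whether the file survives node failures. Node and block names are illustrative, not real HDFS identifiers:

```python
import itertools

REPLICATION = 3
nodes = ["node1", "node2", "node3", "node4"]
blocks = ["B1", "B2", "B3", "B4"]

# Round-robin placement: each block gets REPLICATION distinct nodes
ring = itertools.cycle(nodes)
placement = {b: {next(ring) for _ in range(REPLICATION)} for b in blocks}

def file_readable(failed_nodes):
    """The file is readable iff every block still has a live replica."""
    return all(replicas - failed_nodes for replicas in placement.values())

# With 3 replicas, any single node failure is survivable
print(file_readable({"node1"}))
# Losing 3 of 4 nodes can wipe out all replicas of some block
print(file_readable({"node1", "node2", "node3"}))
```

This is the trade-off HDFS makes: replication multiplies storage cost by the replication factor in exchange for tolerating (replication − 1) simultaneous node failures per block.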
  • 78. MapReduce Paradigm Map and Reduce are based on functional programming; a job flows from Input through Map and Reduce to Output. Map: apply a function to all the elements of a list. list1 = [1,2,3,4,5]; square x = x * x; list2 = map square list1; print list2 -> [1,4,9,16,25]. Reduce: combine all the elements of a list into a summary value. list1 = [1,2,3,4,5]; A = reduce (+) list1; print A -> 15.
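The two functional-programming snippets above translate directly into Python's built-in map and functools.reduce:

```python
from functools import reduce
from operator import add

list1 = [1, 2, 3, 4, 5]

# Map: apply a function to every element of the list
list2 = list(map(lambda x: x * x, list1))
print(list2)   # [1, 4, 9, 16, 25]

# Reduce: combine all elements into a single summary value
total = reduce(add, list1)
print(total)   # 15
```

MapReduce scales these same two primitives across a cluster: map runs independently on each data partition, and reduce combines the partial results.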
  • 79. MapReduce Word Count Example A file is split across nodes, and each node runs a Map task over its split; for the text "I am Sam Sam I am", the mappers emit (I,1) (am,1) (Sam,1) (I,1) (am,1) (Sam,1). A shuffle & sort phase then groups the pairs by key, and Reduce tasks sum the counts per word, producing (I,2) (am,2) (Sam,2).
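The three phases above can be sketched in plain Python, with the map, shuffle & sort, and reduce steps made explicit (the input text is the slide's example; in a real cluster each phase would run distributed across nodes):

```python
from collections import defaultdict

text = "I am Sam Sam I am"

# Map phase: emit a (word, 1) pair for every word
mapped = [(word, 1) for word in text.split()]

# Shuffle & sort phase: group the emitted values by key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: sum the grouped counts for each word
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)   # {'I': 2, 'am': 2, 'Sam': 2}
```

The PySpark word-count example later in the deck (flatMap, map, reduceByKey) is this exact pipeline expressed in Spark's API.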
  • 89. SPARK Outline • Introduction to Apache Hadoop and Spark for developing applications • Components of Hadoop, HDFS, MapReduce and HBase • Capabilities of Spark and the differences from a typical MapReduce solution • Some Spark use cases for data analysis
  • 90. Cloud and Distributed Computing • Another major trend is the pervasiveness of cloud-based storage and computational resources – For processing these big datasets • Cloud characteristics – Provide a scalable standard environment – On-demand computing – Pay as you need – Dynamically scalable – Cheaper
  • 91. One Solution is Apache Spark • A new general framework which solves many of the shortcomings of MapReduce • It is capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN, HBase, S3, … • Has many other workflows, i.e. join, filter, flatMap, distinct, groupByKey, reduceByKey, sortByKey, collect, count, first… – (around 30 efficient distributed operations) • In-memory caching of data (for iterative, graph, and machine learning algorithms, etc.) • Native Scala, Java, Python, and R support • Supports interactive shells for exploratory data analysis • The Spark API is extremely simple to use • Developed at AMPLab UC Berkeley, now by Databricks.com
  • 92. Spark Uses Memory instead of Disk Hadoop uses disk for data sharing: each iteration reads its input from HDFS and writes its result back to HDFS before the next iteration starts. Spark uses in-memory data sharing: after an initial HDFS read, intermediate results stay in memory across iterations.
  • 93. Sort competition (Daytona Gray sort benchmark: 100 TB of data, 1 trillion records) Hadoop MR record (2013) vs. Spark record (2014): data size 102.5 TB vs. 100 TB; elapsed time 72 min vs. 23 min; nodes 2100 vs. 206; cores 50400 physical vs. 6592 virtualized; cluster disk throughput 3150 GB/s (est.) vs. 618 GB/s; network: dedicated data center, 10 Gbps vs. virtualized (EC2), 10 Gbps; sort rate 1.42 TB/min vs. 4.27 TB/min; sort rate/node 0.67 GB/min vs. 20.7 GB/min. Spark was 3x faster with 1/10 the nodes. http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  • 94. Apache Spark Apache Spark supports data analysis, machine learning, graphs, streaming data, etc. It can read/write from a range of data sources and allows development in multiple languages. Components: Spark Core, Spark Streaming, MLlib, GraphX, ML Pipelines, Spark SQL, DataFrames. Languages: Scala, Java, Python, R, SQL. Data sources: Hadoop HDFS, HBase, Hive, Amazon S3, streaming sources, JSON, MySQL, and HPC-style file systems (GlusterFS, Lustre).
  • 95. Resilient Distributed Datasets (RDDs) • RDDs (Resilient Distributed Datasets) are data containers • All the different processing components in Spark share the same abstraction, the RDD • Because applications share the RDD abstraction, you can mix different kinds of transformations to create new RDDs • Created by parallelizing a collection or reading a file • Fault tolerant
  • 96. DataFrames & SparkSQL • DataFrames (DFs) are distributed datasets organized in named columns • Similar to a relational database table, a Python pandas DataFrame, or R's data.table – Immutable once constructed – Track lineage – Enable distributed computations • How to construct DataFrames – Read from file(s) – Transform an existing DF (Spark or pandas) – Parallelize a Python collection (list) – Apply transformations and actions
  • 97. DataFrame example # Create a new DataFrame that contains only "students" students = users.filter(users.age < 21) # Alternatively, using pandas-like syntax students = users[users.age < 21] # Count the number of student users by gender students.groupBy("gender").count() # Join students with another DataFrame called logs students.join(logs, logs.userId == users.userId, "left_outer")
  • 98. RDDs vs. DataFrames • RDDs provide a low-level interface into Spark • DataFrames have a schema • DataFrames are cached and optimized by Spark • DataFrames are built on top of the RDDs and the core Spark API, and the schema lets Spark's optimizer improve performance over equivalent hand-written RDD code
  • 99. Spark Operations Transformations (create a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, intersection, flatMap, union, join, cogroup, cross, mapValues. Actions (return results to the driver program): collect, first, reduce, take, count, takeOrdered, takeSample, countByKey, save, lookup, foreach.
  • 100. Directed Acyclic Graphs (DAG) DAGs track dependencies (also known as lineage): nodes are RDDs and arrows are transformations.
  • 101. Narrow vs. Wide transformations Narrow (e.g. map): each output partition depends on a single input partition, so no data movement is needed. Wide (e.g. groupByKey): an output partition depends on many input partitions – pairs such as (A,1) and (A,2) from different partitions must be brought together into (A,[1,2]), which requires a shuffle.
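The narrow/wide distinction can be sketched in plain Python: a narrow transformation touches each input partition independently, while a wide one routes every (key, value) pair through a hash partitioner to its target partition, which is the data movement a shuffle represents. The partition contents and counts here are illustrative:

```python
from collections import defaultdict

# Two input partitions of (key, value) pairs
partitions = [[("A", 1), ("B", 5)], [("A", 2), ("C", 7)]]

NUM_OUT = 2

# Narrow (map): each output partition reads exactly one input partition
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide (groupByKey): every pair is routed by hash(key), so one output
# partition may gather data from many input partitions (the shuffle)
buckets = [defaultdict(list) for _ in range(NUM_OUT)]
for part in partitions:
    for k, v in part:
        buckets[hash(k) % NUM_OUT][k].append(v)

grouped = {k: vs for bucket in buckets for k, vs in bucket.items()}
print(grouped["A"])   # [1, 2] — gathered from both input partitions
```

Because wide transformations force this cross-partition movement, minimizing shuffles (e.g. preferring reduceByKey over groupByKey) is a standard Spark performance practice.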
  • 102. Actions • What is an action – The final stage of the workflow – Triggers the execution of the DAG – Returns the results to the driver – Or writes the data to HDFS or to a file
  • 103. Spark Workflow FlatMap Map groupbyKey Spark Context Driver Program Collect
  • 104. Python RDD API Examples # Word count text_file = sc.textFile("hdfs://usr/godil/text/book.txt") counts = text_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b) counts.saveAsTextFile("hdfs://usr/godil/output/wordCount.txt") # Logistic Regression # Every record of this DataFrame contains the label and features represented by a vector. df = sqlContext.createDataFrame(data, ["label", "features"]) # Set parameters for the algorithm. Here, we limit the number of iterations to 10. lr = LogisticRegression(maxIter=10) # Fit the model to the data. model = lr.fit(df) # Given a dataset, predict each point's label, and show the results. model.transform(df).show() Examples from http://spark.apache.org/
  • 105. RDD Persistence and Removal • RDD Persistence – RDD.persist() – Storage level: • MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, DISK_ONLY,……. • RDD Removal – RDD.unpersist()
  • 106. Broadcast Variables and Accumulators (Shared Variables ) • Broadcast variables allow the programmer to keep a read-only variable cached on each node, rather than sending a copy of it with tasks >broadcastV1 = sc.broadcast([1, 2, 3,4,5,6]) >broadcastV1.value [1,2,3,4,5,6] • Accumulators are variables that are only “added” to through an associative operation and can be efficiently supported in parallel accum = sc.accumulator(0) accum.add(x) accum.value
  • 107. Spark’s Main Use Cases • Streaming Data • Machine Learning • Interactive Analysis • Data Warehousing • Batch Processing • Exploratory Data Analysis • Graph Data Analysis • Spatial (GIS) Data Analysis • And many more
  • 108. Spark Use Cases • Fingerprint Matching – Developed a Spark based fingerprint minutia detection and fingerprint matching code • Twitter Sentiment Analysis – Developed a Spark based Sentiment Analysis code for a Twitter dataset
  • 109. Spark in the Real World (I) • Uber – the online taxi company gathers terabytes of event data from its mobile users every day – Uses Kafka, Spark Streaming, and HDFS to build a continuous ETL pipeline – Converts raw unstructured event data into structured data as it is collected – Uses it further for more complex analytics and optimization of operations • Pinterest – Uses a Spark ETL pipeline – Leverages Spark Streaming to gain immediate insight into how users all over the world are engaging with Pins—in real time – Can make more relevant recommendations as people navigate the site – Recommends related Pins – Determines which products to buy, or destinations to visit
  • 110. Spark in the Real World (II) Here are a few other real-world use cases: • Conviva – 4 million video feeds per month – This streaming video company is second only to YouTube – Uses Spark to reduce customer churn by optimizing video streams and managing live video traffic – Maintains a consistently smooth, high-quality viewing experience • Capital One – Uses Spark and data science algorithms to understand customers in a better way – Develops the next generation of financial products and services – Finds attributes and patterns of increased probability of fraud • Netflix – Leverages Spark for insights into user viewing habits and then recommends movies to them – User data is also used for content creation
  • 111. Spark: when not to use • Even though Spark is versatile, that doesn't mean Spark's in-memory capabilities are the best fit for all use cases: – For many simple use cases Apache MapReduce and Hive might be a more appropriate choice – Spark was not designed as a multi-user environment – Spark users must ensure that the memory they have is sufficient for their dataset – Adding more users adds complications, since the users will have to coordinate memory usage to run code
  • 112. HPC and Big Data Convergence • Clouds and supercomputers are collections of computers networked together in a datacenter • Clouds have different networking, I/O, CPU, and cost trade-offs than supercomputers • Cloud workloads are data oriented vs. computation oriented and are less closely coupled than supercomputer workloads • The principles of parallel computing are the same on both • Apache Hadoop and Spark vs. Open MPI
  • 113. HPC and Big Data K-Means example MPI definitely outpaces Hadoop, but Hadoop can be boosted using a hybrid approach of other technologies that blend HPC and big data, including Spark and HARP. Dr. Geoffrey Fox, Indiana University. (http://arxiv.org/pdf/1403.1528.pdf)
  • 114. Conclusion • Hadoop (HDFS, MapReduce) – Provides an easy solution for processing of Big Data – Brings a paradigm shift in programming distributed system • Spark – Has extended MapReduce for in memory computations – for streaming, interactive, iterative and machine learning tasks • Changing the World – Made data processing cheaper and more efficient and scalable – Is the foundation of many other tools and software