SlideShare ist ein Scribd-Unternehmen logo
1 von 39
‹#›© Cloudera, Inc. All rights reserved.
Mirko Kämpf | 2015
Apache Spark:
Next Generation Data
Processing for Hadoop
‹#›© Cloudera, Inc. All rights reserved.
Agenda
• The Data Science Process (DSP)
- Why or when to use Spark
• The role of: Apache Hadoop and Apache Spark
- History & Hadoop Ecosystem
• Apache Spark: Overview and Concepts
• Practical Tips
‹#›© Cloudera, Inc. All rights reserved.
The Data Science Process
Application of Big-Data-Technology
Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
‹#›© Cloudera, Inc. All rights reserved.
Huge Data Sets in Science
Application of Big-Data-Technology
Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
‹#›© Cloudera, Inc. All rights reserved.
“Spark offers tools for Data Science
and components for Data
Products.”
—How can Apache Spark fit into my world?
‹#›© Cloudera, Inc. All rights reserved.
Should I use Apache Spark?
• If all my data fits into Excel-Spreadsheets?
• If I have a special purpose application to work with?
• If my current system is just a bit to slow?
‹#›© Cloudera, Inc. All rights reserved.
Should I use Apache Spark?
• If all my data fits into Excel-Spreadsheets?
• If I have a special purpose application to work with?
• If my current system is just a bit to slow?
• Just export as CSV / JSON and use a DataFrame to join with other DS.
Why not?
‹#›© Cloudera, Inc. All rights reserved.
Should I use Apache Spark?
• If all my data fits into Excel-Spreadsheets?
• If I have a special purpose application to work with?
• If my current system is just a bit to slow?
• Just export as CSV / JSON and use a DataFrame to join with other DS.
• Think about additional analysis methods! Maybe it is already built into Apache
Spark!
Why not?
‹#›© Cloudera, Inc. All rights reserved.
Should I use Apache Spark?
• If all my data fits into Excel-Spreadsheets?
• If I have a special purpose application to work with?
• If my current system is just a bit to slow?
• Just export as CSV / JSON and use a DataFrame to join with other DS.
• Think about additional analysis methods! Maybe it is build into Spark.
• OK, Spark will probably not help to speed up your system, but maybe you can
offload data to Hadoop, which releases some resources.
Why not?
‹#›© Cloudera, Inc. All rights reserved.
“Spark offers fast in memory processing on
huge distributed and even on heterogeneous
datasets.”
—What type of data fits into Spark?
‹#›© Cloudera, Inc. All rights reserved.
History of Spark
Spark is really young, but has a very
active community!
‹#›© Cloudera, Inc. All rights reserved.
Timeline: Spark Adoption
‹#›© Cloudera, Inc. All rights reserved.
Apache Spark:
Overview & Concepts
‹#›© Cloudera, Inc. All rights reserved.
Hadoop Ecosystem incl. Apache Spark
Spark can be an entry point to your Big Data world …
‹#›© Cloudera, Inc. All rights reserved.
“Apache Spark is distributed on top of Hadoop
and brings parallel processing
to powerful workstations.”
—Do I need a Hadoop cluster to work with Apache Spark?
‹#›© Cloudera, Inc. All rights reserved.
Spark vs. MapReduce
‹#›© Cloudera, Inc. All rights reserved.
How to interact with Spark?
‹#›© Cloudera, Inc. All rights reserved.
Spark Components
‹#›© Cloudera, Inc. All rights reserved.
‹#›© Cloudera, Inc. All rights reserved.
MLLib: GraphX:
Basic statistics
summary statistics, correlations, stratified sampling,
hypothesis testing, random data generation
Classification and regression
linear models (SVMs, logistic / linear regression)
naive Bayes, decision trees
ensembles of trees (Random Forests / Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means, Gaussian mixture, power iteration clustering (PIC)
latent Dirichlet allocation (LDA), streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
…
PageRank
Connected Components
Triangle Counting
Pregel API
‹#›© Cloudera, Inc. All rights reserved.
How to use your code in Spark?
A. Interactively, by loading it into the spark-shell.
B. Contribute to existing Spark projects.
C. Create your module and use it in a spark-shell session.
D. Build a data-product which uses Apache Spark.
For simple and reliable usage of Java classes
and complete third-party libraries, we define
a Spark Module as a self-contained artifact
created by Maven. This module can easily
be shared by multiple users via repositories.
http://blog.cloudera.com/blog/2015/03/how-to-build-re-usable-spark-programs-using-spark-shell-and-maven/
‹#›© Cloudera, Inc. All rights reserved.
Apache Spark:
Overview & Concepts
‹#›© Cloudera, Inc. All rights reserved.
Spark Context
‹#›© Cloudera, Inc. All rights reserved.
RDDs and DataFrames
‹#›© Cloudera, Inc. All rights reserved.
Creation of RDDs
‹#›© Cloudera, Inc. All rights reserved.
Datatypes in RDDs
‹#›© Cloudera, Inc. All rights reserved.
‹#›© Cloudera, Inc. All rights reserved.
‹#›© Cloudera, Inc. All rights reserved.
Spark in a Cluster
‹#›© Cloudera, Inc. All rights reserved.
Spark in a Cluster
‹#›© Cloudera, Inc. All rights reserved.
‹#›© Cloudera, Inc. All rights reserved.
‹#›© Cloudera, Inc. All rights reserved.
DStream: The heart of Spark Streaming
‹#›© Cloudera, Inc. All rights reserved.
“Efficient hardware utilization, caching,
simple APIs, and access to a variety of data
in Hadoop is key to success.”
—What makes Spark so different, compared to core MapReduce?
‹#›© Cloudera, Inc. All rights reserved.
Practical Tips
‹#›© Cloudera, Inc. All rights reserved.
Development Techniques
• Build your tools and analysis procedures in small cycles.
• Test all phases of your work and document carefully.
• Document what you expect! => Requirements management …
• Collect what you get! => Operational logs …
• Reuse well tested components and modularize your analysis scripts.
• Learn „state of the art“ tools and share your work!
‹#›© Cloudera, Inc. All rights reserved.
Data Management
• Think about typical access patterns:
• random access to each record or field?
• access to entire groups of records?
• variable size or fixed size sets?
• „full table scan“
• OPTIMIZE FOR YOUR DOMINANT ACCESS PATTERN!
• Select efficient storage formats: Avro, Parquet
• Index your data in SOLR for random access and data exploration
• Indexing can be done by just a few clicks in HUE …
‹#›© Cloudera, Inc. All rights reserved.
Collecting Sensor Data with Spark Streaming …
• Spark Streaming works on fixed time slices only (in current version, 1.5)
• Use the original time stamp?
• Requires additional storage and bandwidth
• Original system clock defines resolution
• Use „Spark-Time“ or a local time reference:
• You may lose information!
• You have a limited resolution, defined by batch size.
‹#›© Cloudera, Inc. All rights reserved.
Thank you !
Enjoy Apache Spark and all your data …

Weitere ähnliche Inhalte

Was ist angesagt?

Data Science with Spark & Zeppelin
Data Science with Spark & ZeppelinData Science with Spark & Zeppelin
Data Science with Spark & ZeppelinVinay Shukla
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...Databricks
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Databricks
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
 
Spark in the Enterprise - 2 Years Later by Alan Saldich
Spark in the Enterprise - 2 Years Later by Alan SaldichSpark in the Enterprise - 2 Years Later by Alan Saldich
Spark in the Enterprise - 2 Years Later by Alan SaldichSpark Summit
 
Meeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop ClustersMeeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop ClustersDataWorks Summit/Hadoop Summit
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...Databricks
 
Impala use case @ Zoosk
Impala use case @ ZooskImpala use case @ Zoosk
Impala use case @ ZooskCloudera, Inc.
 
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaSpark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaDatabricks
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkItai Yaffe
 
Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!Progress
 
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Cloudera, Inc.
 
H2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudH2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudSri Ambati
 
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDownscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDatabricks
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsairisData
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbenchRan Wei
 
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsTimothy Spann
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleHelena Edelson
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 

Was ist angesagt? (20)

Data Science with Spark & Zeppelin
Data Science with Spark & ZeppelinData Science with Spark & Zeppelin
Data Science with Spark & Zeppelin
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
Spark in the Enterprise - 2 Years Later by Alan Saldich
Spark in the Enterprise - 2 Years Later by Alan SaldichSpark in the Enterprise - 2 Years Later by Alan Saldich
Spark in the Enterprise - 2 Years Later by Alan Saldich
 
Meeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop ClustersMeeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop Clusters
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
 
Impala use case @ Zoosk
Impala use case @ ZooskImpala use case @ Zoosk
Impala use case @ Zoosk
 
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaSpark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using Spark
 
Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!
 
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
 
H2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudH2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks Cloud
 
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDownscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analytics
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbench
 
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 

Andere mochten auch

Andere mochten auch (9)

Web 2.0 y sus usos
Web 2.0 y sus usosWeb 2.0 y sus usos
Web 2.0 y sus usos
 
Petra Costa em Santos
Petra Costa em SantosPetra Costa em Santos
Petra Costa em Santos
 
edited publication list_Nicole_L_McNiven
edited publication list_Nicole_L_McNivenedited publication list_Nicole_L_McNiven
edited publication list_Nicole_L_McNiven
 
Bbn media kit - 08 - 2016
Bbn   media kit - 08 - 2016Bbn   media kit - 08 - 2016
Bbn media kit - 08 - 2016
 
3-й этап Минской Городской Лиги Каратэ сезона 2016-2017.
3-й этап Минской Городской Лиги Каратэ сезона 2016-2017.3-й этап Минской Городской Лиги Каратэ сезона 2016-2017.
3-й этап Минской Городской Лиги Каратэ сезона 2016-2017.
 
Dev ops, noops or hypeops - Networkshop44
Dev ops, noops or hypeops -  Networkshop44Dev ops, noops or hypeops -  Networkshop44
Dev ops, noops or hypeops - Networkshop44
 
Pres1 ppp
Pres1 pppPres1 ppp
Pres1 ppp
 
Campus network refresh - Networkshop44
Campus network refresh -  Networkshop44Campus network refresh -  Networkshop44
Campus network refresh - Networkshop44
 
Genre presentation
Genre presentationGenre presentation
Genre presentation
 

Ähnlich wie Apache Spark in Scientific Applciations

Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSWJason Hubbard
 
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15MLconf
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConfQubole
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Cloudera, Inc.
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Agile Testing Alliance
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 

Ähnlich wie Apache Spark in Scientific Applciations (20)

Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
Spark_Part 1
Spark_Part 1Spark_Part 1
Spark_Part 1
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSW
 
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr

 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Spark 101
Spark 101Spark 101
Spark 101
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Apache spark
Apache sparkApache spark
Apache spark
 

Mehr von Dr. Mirko Kämpf

Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming PlatformDr. Mirko Kämpf
 
IoT meets AI in the Clouds
IoT meets AI in the CloudsIoT meets AI in the Clouds
IoT meets AI in the CloudsDr. Mirko Kämpf
 
Improving computer vision models at scale (Strata Data NYC)
Improving computer vision models at scale  (Strata Data NYC)Improving computer vision models at scale  (Strata Data NYC)
Improving computer vision models at scale (Strata Data NYC)Dr. Mirko Kämpf
 
Improving computer vision models at scale presentation
Improving computer vision models at scale presentationImproving computer vision models at scale presentation
Improving computer vision models at scale presentationDr. Mirko Kämpf
 
Enterprise Metadata Integration
Enterprise Metadata IntegrationEnterprise Metadata Integration
Enterprise Metadata IntegrationDr. Mirko Kämpf
 
PCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningPCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningDr. Mirko Kämpf
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapDr. Mirko Kämpf
 
From Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on ScaleFrom Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on ScaleDr. Mirko Kämpf
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsDr. Mirko Kämpf
 
DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4Dr. Mirko Kämpf
 
Information Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation OptimizationInformation Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation OptimizationDr. Mirko Kämpf
 
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"Dr. Mirko Kämpf
 

Mehr von Dr. Mirko Kämpf (12)

Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming Platform
 
IoT meets AI in the Clouds
IoT meets AI in the CloudsIoT meets AI in the Clouds
IoT meets AI in the Clouds
 
Improving computer vision models at scale (Strata Data NYC)
Improving computer vision models at scale  (Strata Data NYC)Improving computer vision models at scale  (Strata Data NYC)
Improving computer vision models at scale (Strata Data NYC)
 
Improving computer vision models at scale presentation
Improving computer vision models at scale presentationImproving computer vision models at scale presentation
Improving computer vision models at scale presentation
 
Enterprise Metadata Integration
Enterprise Metadata IntegrationEnterprise Metadata Integration
Enterprise Metadata Integration
 
PCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningPCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System Tuning
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road map
 
From Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on ScaleFrom Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on Scale
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4
 
Information Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation OptimizationInformation Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation Optimization
 
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
 

Kürzlich hochgeladen

Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLkantirani197
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...ssuser79fe74
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Silpa
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Monika Rani
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxRizalinePalanog2
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Joonhun Lee
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 

Kürzlich hochgeladen (20)

Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 

Apache Spark in Scientific Applciations

  • 1. ‹#›© Cloudera, Inc. All rights reserved. Mirko Kämpf | 2015 Apache Spark: Next Generation Data Processing for Hadoop
  • 2. ‹#›© Cloudera, Inc. All rights reserved. Agenda • The Data Science Process (DSP) - Why or when to use Spark • The role of: Apache Hadoop and Apache Spark - History & Hadoop Ecosystem • Apache Spark: Overview and Concepts • Practical Tips
  • 3. ‹#›© Cloudera, Inc. All rights reserved. The Data Science Process Application of Big-Data-Technology Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
  • 4. ‹#›© Cloudera, Inc. All rights reserved. Huge Data Sets in Science Application of Big-Data-Technology Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
  • 5. ‹#›© Cloudera, Inc. All rights reserved. “Spark offers tools for Data Science and components for Data Products.” —How can Apache Spark fit into my world?
  • 6. ‹#›© Cloudera, Inc. All rights reserved. Should I use Apache Spark? • If all my data fits into Excel-Spreadsheets? • If I have a special purpose application to work with? • If my current system is just a bit to slow?
  • 7. ‹#›© Cloudera, Inc. All rights reserved. Should I use Apache Spark? • If all my data fits into Excel-Spreadsheets? • If I have a special purpose application to work with? • If my current system is just a bit to slow? • Just export as CSV / JSON and use a DataFrame to join with other DS. Why not?
  • 8. ‹#›© Cloudera, Inc. All rights reserved. Should I use Apache Spark? • If all my data fits into Excel-Spreadsheets? • If I have a special purpose application to work with? • If my current system is just a bit to slow? • Just export as CSV / JSON and use a DataFrame to join with other DS. • Think about additional analysis methods! Maybe it is already built into Apache Spark! Why not?
  • 9. ‹#›© Cloudera, Inc. All rights reserved. Should I use Apache Spark? • If all my data fits into Excel-Spreadsheets? • If I have a special purpose application to work with? • If my current system is just a bit to slow? • Just export as CSV / JSON and use a DataFrame to join with other DS. • Think about additional analysis methods! Maybe it is build into Spark. • OK, Spark will probably not help to speed up your system, but maybe you can offload data to Hadoop, which releases some resources. Why not?
  • 10. ‹#›© Cloudera, Inc. All rights reserved. “Spark offers fast in memory processing on huge distributed and even on heterogeneous datasets.” —What type of data fits into Spark?
  • 11. ‹#›© Cloudera, Inc. All rights reserved. History of Spark Spark is really young, but has a very active community!
  • 12. ‹#›© Cloudera, Inc. All rights reserved. Timeline: Spark Adoption
  • 13. ‹#›© Cloudera, Inc. All rights reserved. Apache Spark: Overview & Concepts
  • 14. ‹#›© Cloudera, Inc. All rights reserved. Hadoop Ecosystem incl. Apache Spark Spark can be an entry point to your Big Data world …
  • 15. ‹#›© Cloudera, Inc. All rights reserved. “Apache Spark is distributed on top of Hadoop and brings parallel processing to powerful workstations.” —Do I need a Hadoop cluster to work with Apache Spark?
  • 16. ‹#›© Cloudera, Inc. All rights reserved. Spark vs. MapReduce
  • 17. ‹#›© Cloudera, Inc. All rights reserved. How to interact with Spark?
  • 18. ‹#›© Cloudera, Inc. All rights reserved. Spark Components
  • 19. ‹#›© Cloudera, Inc. All rights reserved.
  • 20. ‹#›© Cloudera, Inc. All rights reserved. MLLib: GraphX: Basic statistics summary statistics, correlations, stratified sampling, hypothesis testing, random data generation Classification and regression linear models (SVMs, logistic / linear regression) naive Bayes, decision trees ensembles of trees (Random Forests / Gradient-Boosted Trees) isotonic regression Collaborative filtering alternating least squares (ALS) Clustering k-means, Gaussian mixture, power iteration clustering (PIC) latent Dirichlet allocation (LDA), streaming k-means Dimensionality reduction singular value decomposition (SVD) principal component analysis (PCA) … PageRank Connected Components Triangle Counting Pregel API
  • 21. ‹#›© Cloudera, Inc. All rights reserved. How to use your code in Spark? A. Interactively, by loading it into the spark-shell. B. Contribute to existing Spark projects. C. Create your module and use it in a spark-shell session. D. Build a data-product which uses Apache Spark. For simple and reliable usage of Java classes and complete third-party libraries, we define a Spark Module as a self-contained artifact created by Maven. This module can easily be shared by multiple users via repositories. http://blog.cloudera.com/blog/2015/03/how-to-build-re-usable-spark-programs-using-spark-shell-and-maven/
  • 22. ‹#›© Cloudera, Inc. All rights reserved. Apache Spark: Overview & Concepts
  • 23. ‹#›© Cloudera, Inc. All rights reserved. Spark Context
  • 24. ‹#›© Cloudera, Inc. All rights reserved. RDDs and DataFrames
  • 25. ‹#›© Cloudera, Inc. All rights reserved. Creation of RDDs
  • 26. ‹#›© Cloudera, Inc. All rights reserved. Datatypes in RDDs
  • 27. ‹#›© Cloudera, Inc. All rights reserved.
  • 28. ‹#›© Cloudera, Inc. All rights reserved.
  • 29. ‹#›© Cloudera, Inc. All rights reserved. Spark in a Cluster
  • 30. ‹#›© Cloudera, Inc. All rights reserved. Spark in a Cluster
  • 31. ‹#›© Cloudera, Inc. All rights reserved.
  • 32. ‹#›© Cloudera, Inc. All rights reserved.
  • 33. ‹#›© Cloudera, Inc. All rights reserved. DStream: The heart of Spark Streaming
  • 34. ‹#›© Cloudera, Inc. All rights reserved. “Efficient hardware utilization, caching, simple APIs, and access to a variety of data in Hadoop is key to success.” —What makes Spark so different, compared to core MapReduce?
  • 35. ‹#›© Cloudera, Inc. All rights reserved. Practical Tips
  • 36. ‹#›© Cloudera, Inc. All rights reserved. Development Techniques • Build your tools and analysis procedures in small cycles. • Test all phases of your work and document carefully. • Document what you expect! => Requirements management … • Collect what you get! => Operational logs … • Reuse well tested components and modularize your analysis scripts. • Learn „state of the art“ tools and share your work!
  • 37. ‹#›© Cloudera, Inc. All rights reserved. Data Management • Think about typical access patterns: • random access to each record or field? • access to entire groups of records? • variable size or fixed size sets? • „full table scan“ • OPTIMIZE FOR YOUR DOMINANT ACCESS PATTERN! • Select efficient storage formats: Avro, Parquet • Index your data in SOLR for random access and data exploration • Indexing can be done by just a few clicks in HUE …
  • 38. ‹#›© Cloudera, Inc. All rights reserved. Collecting Sensor Data with Spark Streaming … • Spark Streaming works on fixed time slices only (in current version, 1.5) • Use the original time stamp? • Requires additional storage and bandwidth • Original system clock defines resolution • Use „Spark-Time“ or a local time reference: • You may lose information! • You have a limited resolution, defined by batch size.
  • 39. ‹#›© Cloudera, Inc. All rights reserved. Thank you ! Enjoy Apache Spark and all your data …