Page1 © Hortonworks Inc. 2014
Introduction to Big Data Analytics using
Apache Spark on HDInsight on Azure
(SaaS) and/or HDP on Azure (PaaS)
Hortonworks. We do Hadoop.
Alex Zeltov

@azeltov
Page2 © Hortonworks Inc. 2014
In this workshop
• Introduction to HDP and Spark
• Build a Data analytics application:
- Spark Programming: Scala, Python, R
- Core Spark: working with RDDs, DataFrames
- Spark SQL: structured data access
• Conclusion and Q/A
Page3 © Hortonworks Inc. 2014
Introduction to HDP and Spark
http://hortonworks.com/hadoop/spark/
Page4 © Hortonworks Inc. 2014
What is Spark?
• Spark is
– an open-source software solution that performs rapid calculations
on in-memory datasets
- Open Source [Apache hosted & licensed]
• Free to download and use in production
• Developed by a community of developers
- Spark supports well-known languages such as Scala, Python, R, and Java
- Spark SQL: Seamlessly mix SQL queries with Spark programs
- In-memory datasets
• RDD (Resilient Distributed Dataset) is the basic abstraction Spark is built on
• Resilient – lost partitions can be recreated on the fly from known state
• Distributed – the dataset is partitioned across multiple nodes for
increased scalability and parallelism
Page5 © Hortonworks Inc. 2014
Spark is certified as YARN Ready and is a part of HDP.
Hortonworks Data Platform 2.4
[Architecture diagram, summarized:]
YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System)
Batch, interactive & real-time data access: MapReduce, Apache Hive, Apache Pig, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm, ISV engines
Governance & integration: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka, Apache Atlas
Operations: Apache Ambari, Apache ZooKeeper, Apache Oozie, Cloudbreak
Security: Apache Ranger, Apache Knox, Apache Atlas, HDFS Encryption
Deployment choice: Linux, Windows, on-premises, cloud
Page6 © Hortonworks Inc. 2014
Spark Components
Spark allows you to do data processing, ETL, machine learning,
stream processing, SQL querying from one framework
Page7 © Hortonworks Inc. 2014
Ease of Use
• Write applications quickly in Java, Scala, Python, R.
• Spark offers over 80 high-level operators that make it easy to
build parallel apps. And you can use it interactively from the
Scala, Python and R shells.
Page8 © Hortonworks Inc. 2014
Generality
• Combine SQL, streaming, and complex analytics.
• Spark powers a stack of libraries including SQL and DataFrames, MLlib for
machine learning, GraphX, and Spark Streaming. You can combine these
libraries seamlessly in the same application.
Runs Everywhere:
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access
diverse data sources including HDFS, Cassandra, HBase, S3, WASB
• https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-zeppelin-notebook-jupyter-spark-sql/
Page9 © Hortonworks Inc. 2014
Emerging Spark Patterns
• Spark as query federation engine
 Bring data from multiple sources to join/query in Spark
• Use multiple Spark libraries together
 Common to see Core, ML & Sql used together
• Use Spark with various Hadoop ecosystem projects
 Use Spark & Hive together
 Spark & HBase together
Page10 © Hortonworks Inc. 2014
Why We Love Spark at Hortonworks
• Elegant Developer APIs – DataFrames, Machine Learning, and SQL
• Made for Data Science – all apps need to get predictive at scale and at fine granularity
• Democratize Machine Learning – Spark is doing for ML on Hadoop what Hive did for SQL on Hadoop
• Community – broad developer, customer and partner interest
• Realize Value of Data Operating System – a key tool in the Hadoop toolbox
[Diagram: Spark Core Engine with Scala, Java, Python and R APIs; Spark SQL, Spark Streaming, MLlib and GraphX libraries; running on YARN over HDFS across N nodes]
Page11 © Hortonworks Inc. 2014
More Data Sources APIs
9/04/2016
Page12 © Hortonworks Inc. 2014
Spark Motivation
• MapReduce – involves lots of disk I/O (Disk I/O is very slow)
• Spark – Keep more data in memory
Page13 © Hortonworks Inc. 2014
What is Hadoop?
Apache Hadoop is an open-source software framework
written in Java for distributed storage and distributed
processing of very large data sets on computer clusters built
from commodity hardware.
The core of Apache Hadoop consists of a storage part, the
Hadoop Distributed File System (HDFS), and a processing
part, MapReduce.
Page14 © Hortonworks Inc. 2014
Interacting with Spark
Page15 © Hortonworks Inc. 2014
Interacting with Spark
• Spark’s interactive REPL shell (in Python or Scala)
• Web-based Notebooks:
• Zeppelin: A web-based notebook that enables interactive data
analytics.
• Jupyter: Evolved from the IPython Project
• SparkNotebook: forked from the scala-notebook
• RStudio: for SparkR; Zeppelin support coming soon
https://community.hortonworks.com/articles/25558/running-sparkr-in-rstudio-using-hdp-24.html
Page16 © Hortonworks Inc. 2014
Apache Zeppelin
• A web-based notebook that enables interactive data
analytics.
• Multiple language backend
• Multi-purpose Notebook is the place for all your
needs
 Data Ingestion
 Data Discovery
 Data Analytics
 Data Visualization
 Collaboration
Page17 © Hortonworks Inc. 2014
Zeppelin- Multiple language backend
Scala(with Apache Spark), Python(with Apache Spark), SparkSQL, Hive, Markdown and Shell.
Page18 © Hortonworks Inc. 2014
Zeppelin – Dependency Management
• Load libraries recursively from Maven repository
• Load libraries from local filesystem
%dep
// add maven repository
z.addRepo("RepoName").url("RepoURL")
// add artifact from filesystem
z.load("/path/to.jar")
// add artifact from maven repository, with no dependency
z.load("groupId:artifactId:version").excludeAll()
Page19 © Hortonworks Inc. 2014
Community Plugins
• 100+ connectors
http://spark-packages.org/
Page20 © Hortonworks Inc. 2014
Programming Spark
Page21 © Hortonworks Inc. 2014
How Does Spark Work?
• RDD
• Your data is loaded in parallel into structured collections
• Actions
• Manipulate the state of the working model by forming new RDDs
and performing calculations upon them
• Persistence
• Long-term storage of an RDD’s state
Page22 © Hortonworks Inc. 2014
Resilient Distributed Datasets
• The primary abstraction in Spark
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on collection of elements in parallel
• You construct RDDs
» by parallelizing existing collections (lists)
» by transforming existing RDDs
» from files in HDFS or any other storage system
Page23 © Hortonworks Inc. 2014
RDDs
• Programmer specifies the number of partitions for an RDD
(a default value is used if unspecified)
[Diagram: an RDD of 25 items split into 5 partitions across three workers, each running a Spark executor – more partitions = more parallelism]
Page24 © Hortonworks Inc. 2014
RDDs
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• A transformed RDD is executed when an action runs on it
• Persist (cache) RDDs in memory or on disk
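Spark's lazy evaluation can be mimicked in plain Python with generators — a loose, single-machine analogy rather than Spark code: nothing is computed until a terminal operation (the analogue of an action) consumes the pipeline.

```python
# Lazy "transformations": generator expressions build a pipeline
# without computing anything yet (like map/filter on an RDD).
data = range(1, 11)
mapped = (x * x for x in data)            # not computed yet
filtered = (x for x in mapped if x > 20)  # still not computed

# The "action" finally forces evaluation (like collect()):
result = list(filtered)
print(result)  # [25, 36, 49, 64, 81, 100]
```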
Page25 © Hortonworks Inc. 2014
Example RDD Transformations
•map(func)
•filter(func)
•distinct()
• Each creates a new dataset from an existing one
• The new dataset is not computed until an action is performed (lazy)
• Each element in the RDD is passed to the target function and the
results form the new RDD
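As a loose plain-Python analogy (Spark applies the same callable to each partition's elements in parallel; these sample values are made up):

```python
# Plain-Python analogues of the three transformations:
nums = [1, 2, 2, 3, 3, 3]

mapped   = list(map(lambda x: x * 10, nums))    # map(func)
filtered = list(filter(lambda x: x > 1, nums))  # filter(func)
distinct = sorted(set(nums))                    # distinct()

print(mapped)    # [10, 20, 20, 30, 30, 30]
print(filtered)  # [2, 2, 3, 3, 3]
print(distinct)  # [1, 2, 3]
```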
Page26 © Hortonworks Inc. 2014
Example Action Operations
•count()
•reduce(func)
•collect()
•take(n)
• Each either:
• returns a value to the driver program, or
• exports state to an external system
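The same four actions, as loose plain-Python analogies over a made-up list (in Spark these pull results back to the driver from the cluster):

```python
from functools import reduce

nums = [3, 1, 4, 1, 5]

count   = len(nums)                         # count()
total   = reduce(lambda a, b: a + b, nums)  # reduce(func)
all_out = list(nums)                        # collect() returns everything to the driver
first2  = nums[:2]                          # take(2)

print(count, total, first2)  # 5 14 [3, 1]
```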
Page27 © Hortonworks Inc. 2014
Example Persistence Operations
•persist() -- takes storage-level options
•cache() -- shorthand for the single in-memory option
• Stores RDD values:
• in memory (what doesn't fit is recalculated when necessary)
• replication is an option for in-memory storage
• on disk
• or blended
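The point of persist()/cache() is paying the computation cost once. A rough single-machine sketch (the counter and values are invented): materialize the result of a lazy pipeline so later "actions" reuse it instead of recomputing.

```python
calls = 0
def expensive(x):
    # Stand-in for a costly transformation; counts invocations.
    global calls
    calls += 1
    return x * x

pipeline = lambda: (expensive(x) for x in range(5))

# Without caching, each action recomputes the pipeline:
sum(pipeline()); sum(pipeline())
print(calls)  # 10 -- every element computed twice

# "Caching": materialize once, reuse many times:
cached = list(pipeline())   # analogous to rdd.cache() plus a first action
calls = 0
sum(cached); max(cached)
print(calls)  # 0 -- no recomputation
```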
Page28 © Hortonworks Inc. 2014
Spark Applications
Are a definition in code of
• RDD creation
• Actions
• Persistence
Results in the creation of a DAG (Directed Acyclic Graph) [workflow]
• Each DAG is compiled into stages
• Each Stage is executed as a series of Tasks
• Each Task operates in parallel on assigned partitions
Page29 © Hortonworks Inc. 2014
Spark Context
• A Spark program first creates a SparkContext object
• Tells Spark how and where to access a cluster
• Use SparkContext to create RDDs
• SparkContext, SQLContext, ZeppelinContext:
• are automatically created and exposed as the variables 'sc', 'sqlContext' and
'z', respectively, in both the Scala and Python environments in Zeppelin
• IPython and standalone programs must use a constructor to create a new SparkContext
Note: the Scala and Python environments share the same SparkContext, SQLContext
and ZeppelinContext instances.
Page30 © Hortonworks Inc. 2014
1. Resilient Distributed Dataset [RDD] Graph
val v = sc.textFile("hdfs://…some-hdfs-data")   // RDD[String]

v.flatMap(line => line.split(" "))              // RDD[String]
 .map(word => (word, 1))                        // RDD[(String, Int)]
 .reduceByKey(_ + _, 3)                         // RDD[(String, Int)]
 .collect()                                     // Array[(String, Int)]

Stages: textFile → flatMap → map → reduceByKey → collect
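The same word-count data flow, sketched with plain Python built-ins (Spark would distribute each step across partitions; the sample lines are invented):

```python
from collections import Counter

lines = ["to be or not to be", "to do or not to do"]

# flatMap: split each line into words and flatten the results
words = [w for line in lines for w in line.split(" ")]
# map + reduceByKey: pair each word with 1, then sum counts by key
counts = Counter(words)

print(counts["to"])  # 4
print(counts["be"])  # 2
```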
Page31 © Hortonworks Inc. 2014
Processing A File in Scala
//Load the file:
val file = sc.textFile("hdfs://…/user/DAW/littlelog.csv")
//Trim away any empty rows:
val fltr = file.filter(_.length > 0)
//Print out the remaining rows:
fltr.foreach(println)
Page32 © Hortonworks Inc. 2014
Looking at the State in the Machine
//run debug command to inspect RDD:
scala> fltr.toDebugString
//simplified output:
res1: String =
FilteredRDD[2] at filter at <console>:14
MappedRDD[1] at textFile at <console>:12
HadoopRDD[0] at textFile at <console>:12
Page33 © Hortonworks Inc. 2014
A Word on Anonymous Functions
Scala programmers make great use of anonymous functions, as can
be seen in the code:

flatMap( line => line.split(" ") )

Here "line" is the argument to the function, and "line.split(" ")" is the
body of the function.
Page34 © Hortonworks Inc. 2014
Scala Functions Come In a Variety of Styles
flatMap( line => line.split(" ") )           // argument (type inferred), then body
flatMap( (line: String) => line.split(" ") ) // argument (explicit type), then body
flatMap( _.split(" ") )                      // no argument declared; placeholder instead

The body of the third form uses the placeholder _, which allows for exactly one use of
one argument for each _ present. _ essentially means 'whatever you pass me'.
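Python's anonymous functions are less varied than Scala's, but the same ideas carry over — hypothetical Python equivalents of the three styles:

```python
line = "Hi Mom"

# Scala: flatMap(line => line.split(" "))  -- type inferred
f1 = lambda line: line.split(" ")

# Scala: flatMap((line: String) => line.split(" "))  -- explicit type;
# Python lambdas cannot carry annotations, so use a named def:
def f2(line: str) -> list:
    return line.split(" ")

# Scala's placeholder _ has no direct Python equivalent; an unbound
# method reference is the closest idiom:
f3 = str.split

print(f1(line), f2(line), f3(line))
```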
Page35 © Hortonworks Inc. 2014
And Finally – the Formal ‘def’
def myFunc(line: String): Array[String] = {
  return line.split(",")
}

//and now that it has a name:
myFunc("Hi Mom, I'm home.").foreach(println)

Here "line: String" is the argument to the function, "Array[String]" is the
return type, and "line.split(",")" is the body of the function.
Page36 © Hortonworks Inc. 2014
LAB: Spark RDD & Data Frames Demo –
Philly Crime Data Set
Page37 © Hortonworks Inc. 2014
Spark DataFrames
Page38 © Hortonworks Inc. 2014
What are DataFrames?
• A distributed collection of data organized into columns
• Equivalent to tables in databases, or DataFrames in R/Python
• Much richer optimization than any other DataFrame implementation
• Can be constructed from a wide variety of sources and APIs
Why DataFrames?
• Greater accessibility
• Declarative rather than imperative
• Catalyst Optimizer
Page39 © Hortonworks Inc. 2014
Writing a DataFrame
val df = sqlContext.jsonFile("/tmp/people.json")
df.show()
df.printSchema()
df.select("First Name").show()
df.select("First Name", "Age").show()
df.filter(df("age") > 40).show()
df.groupBy("age").count().show()
Page40 © Hortonworks Inc. 2014
Querying RDD Using SQL
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val schemaString = "name age"  // column names for the schema (assumed)
val schema = StructType(schemaString.split(" ").map(fieldName =>
  StructField(fieldName, StringType, true)))
val people = sc.textFile("/tmp/people.txt")
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")
val results = sqlContext.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)
Page41 © Hortonworks Inc. 2014
Querying RDD Using SQL
// SQL statements can be run directly on RDDs
val teenagers =
  sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support
// normal RDD operations:
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

// Language-integrated queries (a la LINQ)
val teens = people.where('age >= 13).where('age <= 19).select('name)
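The register-a-temp-table-then-query pattern resembles loading rows into an in-memory SQL table. Here it is sketched with Python's stdlib sqlite3 as a stand-in for Spark SQL (the table, column names, and rows are illustrative, not from the deck):

```python
import sqlite3

rows = [("Michael", 29), ("Andy", 30), ("Justin", 19)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

# The analogue of sqlContext.sql("SELECT name FROM people WHERE ...")
teenagers = conn.execute(
    "SELECT name FROM people WHERE age >= 13 AND age <= 19").fetchall()
print(teenagers)  # [('Justin',)]
```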
Page42 © Hortonworks Inc. 2014
Dataframes for Apache Spark
[Chart: time to aggregate 10 million integer pairs, in seconds – DataFrame SQL, DataFrame R, DataFrame Python, and DataFrame Scala finish in roughly the same time, while RDD Python and RDD Scala take longer]
DataFrames can be significantly faster than RDDs. And they
perform the same, regardless of language.
Page43 © Hortonworks Inc. 2014
Dataframes – Transformations & Actions
Transformations: filter, select, drop, join
Actions: count, collect, show, take
Transformations contribute to the query plan, but
nothing is executed until an action is called.
Page44 © Hortonworks Inc. 2014
LAB: DataFrames
http://sandbox.hortonworks.com:8081/#/notebook/2B4B7EWY7
http://sandbox.hortonworks.com:8081/#/notebook/2B5RMG4AM
DataFrames + SQL
DataFrames JSON
Page45 © Hortonworks Inc. 2014
DataFrames and JDBC
val jdbc_attendees = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://localhost:3306/db1?user=root&password=xxx",
  "dbtable" -> "attendees"))
jdbc_attendees.show()
jdbc_attendees.count()
jdbc_attendees.registerTempTable("jdbc_attendees")
val countall = sqlContext.sql("select count(*) from jdbc_attendees")
countall.map(t => "Records count is " + t(0)).collect().foreach(println)
Page46 © Hortonworks Inc. 2014
Code ‘select count’
Equivalent SQL statement:
SELECT count(*) FROM pagecounts WHERE state = 'FL'
Scala statement:
val file = sc.textFile("hdfs://…/log.txt")
val numFL = file.filter(line => line.contains("fl")).count()
scala> println(numFL)
1. Load the file as an RDD
2. Filter the lines, eliminating any that do not contain "fl"
3. Count the lines that remain
4. Print the value of the counted lines containing 'fl'
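The same filter-then-count, as plain Python over an in-memory list of lines (illustration only; the sample lines are made up):

```python
lines = ["miami,fl,2014", "austin,tx,2014", "tampa,fl,2015"]

# The analogue of file.filter(line => line.contains("fl")).count()
num_fl = sum(1 for line in lines if "fl" in line)
print(num_fl)  # 2
```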
Page47 © Hortonworks Inc. 2014
Spark SQL
Page48 © Hortonworks Inc. 2014
Platform APIs
• Joining Data from Different
Sources
• Access Data using DataFrames /
SQL
Page49 © Hortonworks Inc. 2014
LAB: JDBC and 3rd party packages
http://sandbox.hortonworks.com:8081/#/notebook/2B2P8RE82
Page50 © Hortonworks Inc. 2014
What About Integration With Hive?
scala> val hiveCTX = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveCTX.hql("SHOW TABLES").collect().foreach(println)
…
[omniture]
[omniturelogs]
[orc_table]
[raw_products]
[raw_users]
…
Page51 © Hortonworks Inc. 2014
More Integration With Hive:
scala> hiveCTX.hql("DESCRIBE raw_users").collect().foreach(println)
[swid,string,null]
[birth_date,string,null]
[gender_cd,string,null]
scala> hiveCTX.hql("SELECT * FROM raw_users WHERE gender_cd='F' LIMIT
5").collect().foreach(println)
[0001BDD9-EABF-4D0D-81BD-D9EABFCD0D7D,8-Apr-84,F]
[00071AA7-86D2-4EB9-871A-A786D27EB9BA,7-Feb-88,F]
[00071B7D-31AF-4D85-871B-7D31AFFD852E,22-Oct-64,F]
[000F36E5-9891-4098-9B69-CEE78483B653,24-Mar-85,F]
[00102F3F-061C-4212-9F91-1254F9D6E39F,1-Nov-91,F]
Page52 © Hortonworks Inc. 2014
LAB: HIVE ORC
http://sandbox.hortonworks.com:8081/#/notebook/2B6KUW16Z
Page53 © Hortonworks Inc. 2014
Spark Streaming
Page54 © Hortonworks Inc. 2014
MicroBatch Spark Streams
Page55 © Hortonworks Inc. 2014
Physical Execution
Page56 © Hortonworks Inc. 2014
Spark Streaming 101
• Spark has significant library support for streaming applications
val ssc = new StreamingContext(sc, Seconds(5))
val tweetStream = TwitterUtils.createStream(ssc, Some(auth))
• Allows combining streaming with batch/ETL, SQL & ML
• Reads data from HDFS, Flume, Kafka, Twitter, ZeroMQ & custom sources
• Chops the input data stream into batches
• Spark processes the batches & publishes results in batches
• The fundamental unit is the Discretized Stream (DStream)
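Micro-batching itself is simple to sketch: collect whatever arrived during the batch interval, hand the whole batch to the batch engine, repeat. A toy, non-distributed sketch — batching by count rather than by time interval for a deterministic demo, with invented events:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Chop an event stream into fixed-size batches, the way Spark
    Streaming chops a stream into DStream batches by time interval."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

events = ["e%d" % i for i in range(7)]
batches = list(micro_batches(events, 3))
print(batches)  # [['e0', 'e1', 'e2'], ['e3', 'e4', 'e5'], ['e6']]
```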
Page57 © Hortonworks Inc. 2014
Spark MLLib
Page58 © Hortonworks Inc. 2014
Spark MLlib – Algorithms Offered
• Classification: logistic regression, linear SVM,
– naïve Bayes, least squares, classification tree
• Regression: generalized linear models (GLMs),
– regression tree
• Collaborative filtering: alternating least squares (ALS),
– non-negative matrix factorization (NMF)
• Clustering: k-means
• Decomposition: SVD, PCA
• Optimization: stochastic gradient descent, L-BFGS
Page59 © Hortonworks Inc. 2014
ML - Pipelines
• New algorithms: KMeans [SPARK-7879], Naive Bayes [SPARK-8600],
Bisecting KMeans [SPARK-6517], Multi-layer Perceptron (ANN) [SPARK-2352],
Weighting for Linear Models [SPARK-7685]
• New transformers (close to parity with scikit-learn):
CountVectorizer [SPARK-8703], PCA [SPARK-8664], DCT [SPARK-8471],
N-Grams [SPARK-8455]
• Calling into single-machine solvers (coming soon as a package)
Page60 © Hortonworks Inc. 2014
Twitter Language Classifier
Goal: connect to the real-time Twitter stream and print only
those tweets whose language matches our chosen language.
Main issue: how do we detect the language at run time?
Solution: build a language-classifier model offline capable of
detecting the language of a tweet (MLlib). Then apply it to the real-time
Twitter stream and do the filtering (Spark Streaming).
Page61 © Hortonworks Inc. 2014
Conclusion and Q&A
Page62 © Hortonworks Inc. 2014
Learn More Spark + Hadoop Perfect Together
HDP Spark General Info:
http://hortonworks.com/hadoop/spark/
Learn more about our Focus on Spark:
http://hortonworks.com/hadoop/spark/#section_6
Get the HDP Spark 1.5.1 Tech Preview:
http://hortonworks.com/hadoop/spark/#section_5
Get started with Spark and Zeppelin and download the Sandbox:
http://hortonworks.com/sandbox
Try these tutorials:
http://hortonworks.com/hadoop/spark/#tutorials
http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/
Learn more about GeoSpatial Spark processing with Magellan:
http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/

データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」
データ活用を推進する「Pivotal HDB(Apache HAWQ(ホーク))」
 
Introduction to Hortonworks Data Cloud for AWS
Introduction to Hortonworks Data Cloud for AWSIntroduction to Hortonworks Data Cloud for AWS
Introduction to Hortonworks Data Cloud for AWS
 

Ähnlich wie Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsights on Azure (SaaS) and/or HDP on Azure(PaaS

Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitSaptak Sen
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitDataWorks Summit
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with sparkHortonworks
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark Hortonworks
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Dataconomy Media
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark exampleShidrokhGoudarzi1
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Rajan Kanitkar
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
Spark and Hadoop Technology
Spark and Hadoop Technology Spark and Hadoop Technology
Spark and Hadoop Technology Avinash Gautam
 

Ähnlich wie Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsights on Azure (SaaS) and/or HDP on Azure(PaaS (20)

Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Spark and Hadoop Technology
Spark and Hadoop Technology Spark and Hadoop Technology
Spark and Hadoop Technology
 
Apache spark
Apache sparkApache spark
Apache spark
 

Kürzlich hochgeladen

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Kürzlich hochgeladen (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsights on Azure (SaaS) and/or HDP on Azure (PaaS)

  • 1. Page1 © Hortonworks Inc. 2014 Introduction to Big Data Analytics using Apache Spark on HDInsights on Azure (SaaS) and/or HDP on Azure(PaaS) Hortonworks. We do Hadoop. Alex Zeltov @azeltov
  • 2. Page2 © Hortonworks Inc. 2014 In this workshop • Introduction to HDP and Spark • Build a Data analytics application: - Spark Programming: Scala, Python, R - Core Spark: working with RDDs, DataFrames - Spark SQL: structured data access • Conclusion and Q/A
  • 3. Page3 © Hortonworks Inc. 2014 Introduction to HDP and Spark http://hortonworks.com/hadoop/spark/
  • 4. Page4 © Hortonworks Inc. 2014 What is Spark? • Spark is – an open-source software solution that performs rapid calculations on in-memory datasets - Open Source [Apache hosted & licensed] • Free to download and use in production • Developed by a community of developers - Spark supports well-known languages such as: Scala, Python, R, Java - Spark SQL: Seamlessly mix SQL queries with Spark programs - In-memory datasets • RDD (Resilient Distributed Dataset) is the basis for what Spark enables • Resilient – the models can be recreated on the fly from known state • Distributed – the dataset is often partitioned across multiple nodes for increased scalability and parallelism
  • 5. Page5 © Hortonworks Inc. 2014 Spark is certified as YARN Ready and is a part of HDP. [Stack diagram: Hortonworks Data Platform 2.4. GOVERNANCE: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka, Apache Atlas. BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: MapReduce, Apache Hive, Apache Pig, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm, ISV Engines, on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System). OPERATIONS: Apache Ambari, Apache ZooKeeper, Apache Oozie, Cloudbreak. SECURITY: Apache Ranger, Apache Knox, Apache Atlas, HDFS Encryption. Deployment Choice: Linux, Windows, On-premises, Cloud.]
  • 6. Page6 © Hortonworks Inc. 2014 Spark Components Spark allows you to do data processing, ETL, machine learning, stream processing, SQL querying from one framework
  • 7. Page7 © Hortonworks Inc. 2014 Ease of Use • Write applications quickly in Java, Scala, Python, R. • Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python and R shells.
  • 8. Page8 © Hortonworks Inc. 2014 Generality • Combine SQL, streaming, and complex analytics. • Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application. Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3, WASB • https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-zeppelin-notebook-jupyter-spark-sql/
  • 9. Page9 © Hortonworks Inc. 2014 Emerging Spark Patterns • Spark as query federation engine  Bring data from multiple sources to join/query in Spark • Use multiple Spark libraries together  Common to see Core, ML & Sql used together • Use Spark with various Hadoop ecosystem projects  Use Spark & Hive together  Spark & HBase together
  • 10. Page10 © Hortonworks Inc. 2014 Why We Love Spark at Hortonworks. Elegant Developer APIs: DataFrames, Machine Learning, and SQL. Made for Data Science: all apps need to get predictive at scale and fine granularity. Democratize Machine Learning: Spark is doing for ML on Hadoop what Hive did for SQL on Hadoop. Community: broad developer, customer and partner interest. Realize Value of Data Operating System: a key tool in the Hadoop toolbox. [Stack diagram: Scala, Java, Python and R APIs over the Spark Core Engine (Spark SQL, Spark Streaming, MLlib, GraphX), running on YARN and HDFS.]
  • 11. Page11 © Hortonworks Inc. 2014 More Data Sources APIs 9/04/2016
  • 12. Page12 © Hortonworks Inc. 2014 Spark Motivation • MapReduce – involves lots of disk I/O (Disk I/O is very slow) • Spark – Keep more data in memory
  • 13. Page13 © Hortonworks Inc. 2014 What is Hadoop? Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. The core of Apache Hadoop consists of a storage part Hadoop Distributed File System (HDFS) and a processing part (MapReduce).
  • 14. Page14 © Hortonworks Inc. 2014 Interacting with Spark
  • 15. Page15 © Hortonworks Inc. 2014 Interacting with Spark • Spark’s interactive REPL shell (in Python or Scala) • Web-based Notebooks: • Zeppelin: A web-based notebook that enables interactive data analytics. • Jupyter: Evolved from the IPython Project • SparkNotebook: forked from the scala-notebook • RStudio: for Spark R , Zeppelin support coming soon https://community.hortonworks.com/articles/25558/running-sparkr-in-rstudio-using-hdp-24.html
  • 16. Page16 © Hortonworks Inc. 2014 Apache Zeppelin • A web-based notebook that enables interactive data analytics. • Multiple language backend • Multi-purpose Notebook is the place for all your needs  Data Ingestion  Data Discovery  Data Analytics  Data Visualization  Collaboration
  • 17. Page17 © Hortonworks Inc. 2014 Zeppelin- Multiple language backend Scala(with Apache Spark), Python(with Apache Spark), SparkSQL, Hive, Markdown and Shell.
  • 18. Page18 © Hortonworks Inc. 2014 Zeppelin – Dependency Management • Load libraries recursively from Maven repository • Load libraries from local filesystem • %dep • // add maven repository • z.addRepo("RepoName").url("RepoURL") • // add artifact from filesystem • z.load("/path/to.jar") • // add artifact from maven repository, with no dependency • z.load("groupId:artifactId:version").excludeAll()
  • 19. Page19 © Hortonworks Inc. 2014 Community Plugins • 100+ connectors http://spark-packages.org/
  • 20. Page20 © Hortonworks Inc. 2014 Programming Spark
  • 21. Page21 © Hortonworks Inc. 2014 How Does Spark Work? • RDD • Your data is loaded in parallel into structured collections • Actions • Manipulate the state of the working model by forming new RDDs and performing calculations upon them • Persistence • Long-term storage of an RDD’s state
  • 22. Page22 © Hortonworks Inc. 2014 Resilient Distributed Datasets • The primary abstraction in Spark » Immutable once constructed » Track lineage information to efficiently recompute lost data » Enable operations on collection of elements in parallel • You construct RDDs » by parallelizing existing collections (lists) » by transforming an existing RDDs » from files in HDFS or any other storage system
  • 23. Page23 © Hortonworks Inc. 2014 item-1 item-2 item-3 item-4 item-5 item-6 item-7 item-8 item-9 item-10 item-11 item-12 item-13 item-14 item-15 item-16 item-17 item-18 item-19 item-20 item-21 item-22 item-23 item-24 item-25 more partitions = more parallelism Worker Spark executor Worker Spark executor Worker Spark executor RDDs • Programmer specifies number of partitions for an RDD (Default value used if unspecified) RDD split into 5 partitions
  • 24. Page24 © Hortonworks Inc. 2014 RDDs • Two types of operations: transformations and actions • Transformations are lazy (not computed immediately) • Transformed RDD is executed when an action runs on it • Persist (cache) RDDs in memory or disk
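The lazy behavior described on this slide can be imitated in plain Python, without Spark, using a generator: no work happens when the "transformation" is defined, only when a terminal operation (the analogue of an action) consumes it. This is a local sketch only; tracked_map, log and the sample data are invented for illustration and are not Spark API.

```python
# Plain-Python analogy for Spark's lazy transformations vs. eager actions.
log = []

def tracked_map(fn, items):
    # Generator: the body runs only when the pipeline is consumed.
    for x in items:
        log.append("map(%d)" % x)
        yield fn(x)

data = [1, 2, 3, 4]
pipeline = tracked_map(lambda x: x * 10, data)  # "transformation": nothing runs yet
assert log == []                                # lazy: no work done so far

result = list(pipeline)                         # "action": forces evaluation
assert result == [10, 20, 30, 40]
assert len(log) == 4                            # the work happened only now
```

Real Spark RDDs behave the same way at the API level, but additionally partition the data and distribute the work across executors.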
  • 25. Page25 © Hortonworks Inc. 2014 Example RDD Transformations •map(func) •filter(func) •distinct(func) • All create a new DataSet from an existing one • Do not create the DataSet until an action is performed (Lazy) • Each element in an RDD is passed to the target function and the result forms a new RDD
  • 26. Page26 © Hortonworks Inc. 2014 Example Action Operations •count() •reduce(func) •collect() •take() • Either: • Returns a value to the driver program • Exports state to external system
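As a rough single-machine analogy (plain Python, not the Spark API), the actions listed above correspond to familiar eager operations on a local collection; the variable names below are invented for illustration.

```python
from functools import reduce

rdd_like = [3, 1, 4, 1, 5]

count   = len(rdd_like)                         # like count()
total   = reduce(lambda a, b: a + b, rdd_like)  # like reduce(func)
all_out = list(rdd_like)                        # like collect(): pull everything back
first2  = rdd_like[:2]                          # like take(2)

assert (count, total, first2) == (5, 14, [3, 1])
```

The key difference is that Spark actions trigger distributed computation and return (or export) a result to the driver program.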
  • 27. Page27 © Hortonworks Inc. 2014 Example Persistence Operations •persist() -- takes options •cache() -- only one option: in-memory • Stores RDD Values • in memory (what doesn’t fit is recalculated when necessary) • Replication is an option for in-memory • to disk • blended
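A loose local analogy for persist()/cache() (an assumed illustration, not Spark code) is memoization: keep a computed result in memory so a second use does not recompute it, much as a cached RDD is recomputed only for partitions that no longer fit.

```python
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=None)          # analogue of rdd.cache(): keep results in memory
def expensive(x):
    calls["n"] += 1               # count how often real computation happens
    return x * x

first  = [expensive(i) for i in range(3)]   # computed once...
second = [expensive(i) for i in range(3)]   # ...then served from the cache
assert first == second == [0, 1, 4]
assert calls["n"] == 3            # no recomputation on the second pass
```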
  • 28. Page28 © Hortonworks Inc. 2014 Spark Applications Are a definition in code of • RDD creation • Actions • Persistence Results in the creation of a DAG (Directed Acyclic Graph) [workflow] • Each DAG is compiled into stages • Each Stage is executed as a series of Tasks • Each Task operates in parallel on assigned partitions
  • 29. Page29 © Hortonworks Inc. 2014 Spark Context • A Spark program first creates a SparkContext object • Tells Spark how and where to access a cluster • Use SparkContext to create RDDs • SparkContext, SQLContext, ZeppelinContext: • are automatically created and exposed as variable names 'sc', 'sqlContext' and 'z', respectively, both in scala and python environments using Zeppelin • iPython and programs must use a constructor to create a new SparkContext Note: that scala / python environment shares the same SparkContext, SQLContext, ZeppelinContext instance.
  • 30. Page30 © Hortonworks Inc. 2014 1. Resilient Distributed Dataset [RDD] Graph val v = sc.textFile("hdfs://…some-hdfs-data") mapmap reduceByKey collecttextFile v.flatMap(line=>line.split(" ")) .map(word=>(word, 1))) .reduceByKey(_ + _, 3) .collect() RDD[String] RDD[List[String]] RDD[(String, Int)] Array[(String, Int)] RDD[(String, Int)]
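The flatMap / map / reduceByKey word-count flow on this slide can be traced with ordinary Python collections. This is a local sketch only: it ignores partitioning and the shuffle that reduceByKey performs on a cluster, and the sample lines are invented.

```python
from collections import defaultdict

lines = ["to be or", "not to be"]

# flatMap(line => line.split(" ")): one flat list of words
words = [w for line in lines for w in line.split(" ")]

# map(word => (word, 1)): pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey(_ + _): sum the counts per word
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

assert counts["to"] == 2 and counts["be"] == 2 and counts["or"] == 1
```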
  • 31. Page31 © Hortonworks Inc. 2014 Processing A File in Scala //Load the file: val file = sc.textFile("hdfs://…/user/DAW/littlelog.csv") //Trim away any empty rows: val fltr = file.filter(_.length > 0) //Print out the remaining rows: fltr.foreach(println)
  • 32. Page32 © Hortonworks Inc. 2014 Looking at the State in the Machine //run debug command to inspect RDD: scala> fltr.toDebugString //simplified output: res1: String = FilteredRDD[2] at filter at <console>:14 MappedRDD[1] at textFile at <console>:12 HadoopRDD[0] at textFile at <console>:12
  • 33. Page33 © Hortonworks Inc. 2014 A Word on Anonymous Functions Scala programmers make great use of anonymous functions, as can be seen in the code: flatMap( line => line.split(" ") ) Here line is the argument to the function and line.split(" ") is its body.
  • 34. Page34 © Hortonworks Inc. 2014 Scala Functions Come In a Variety of Styles flatMap( line => line.split(" ") ) argument with inferred type, followed by the body. flatMap((line:String) => line.split(" ")) argument with an explicit type, followed by the body. flatMap(_.split(" ")) no argument declared; the placeholder _ stands in for it, allowing exactly one use of one argument for each _ present. _ essentially means ‘whatever you pass me’.
  • 35. Page35 © Hortonworks Inc. 2014 And Finally – the Formal ‘def’ def myFunc(line:String): Array[String]={ return line.split(",") } //and now that it has a name: myFunc("Hi Mom, I’m home.").foreach(println) Here line:String is the argument, Array[String] is the return type, and line.split(",") is the body of the function.
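Python, the other language used throughout this workshop, offers the same spectrum from anonymous to named functions. A side-by-side sketch of the Scala styles above (illustrative, not taken from the deck):

```python
line = "Hi Mom, I'm home."

# Anonymous function, like Scala's (line => line.split(","))
split_anon = (lambda s: s.split(","))(line)

# Named function, like Scala's formal 'def'
def my_func(s):
    return s.split(",")

# Both produce the same result
assert split_anon == my_func(line) == ["Hi Mom", " I'm home."]
```

There is no direct Python analogue of Scala's _ placeholder style, which is one reason anonymous functions read more tersely in Scala.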
  • 36. Page36 © Hortonworks Inc. 2014 LAB: Spark RDD & Data Frames Demo – Philly Crime Data Set
  • 37. Page37 © Hortonworks Inc. 2014 Spark DataFrames
  • 38. Page38 © Hortonworks Inc. 2014 What are DataFrames? • Distributed Collection of Data organized in Columns • Equivalent to Tables in Databases or DataFrame in R/Python • Much richer optimization than any other implementation of DF • Can be constructed from a wide variety of sources and APIs Why DataFrames? • Greater accessibility • Declarative rather than imperative • Catalyst Optimizer
  • 39. Page39 © Hortonworks Inc. 2014 Working with a DataFrame val df = sqlContext.jsonFile("/tmp/people.json") df.show() df.printSchema() df.select("First Name").show() df.select("First Name","Age").show() df.filter(df("age")>40).show() df.groupBy("age").count().show()
  • 40. Page40 © Hortonworks Inc. 2014 Querying RDD Using SQL import org.apache.spark.sql.Row import org.apache.spark.sql.types.{StructType,StructField,StringType} val schemaString = "name age" val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true))) val sqlContext = new org.apache.spark.sql.SQLContext(sc) val people = sc.textFile("/tmp/people.txt") val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim)) val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema) peopleDataFrame.registerTempTable("people") val results = sqlContext.sql("SELECT name FROM people") results.map(t => "Name: " + t(0)).collect().foreach(println)
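To see the register-then-query pattern without a Spark cluster, the same idea can be sketched with Python's built-in sqlite3 module. This is only an analogy for registerTempTable plus sqlContext.sql, not the PySpark API, and the table contents are invented sample data.

```python
import sqlite3

# In-memory database standing in for the temp-table registry
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Michael", 29), ("Andy", 30), ("Justin", 19)])

# Analogue of: sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers = [row[0] for row in
             conn.execute("SELECT name FROM people WHERE age BETWEEN 13 AND 19")]
assert teenagers == ["Justin"]
```

In Spark the query plan is optimized by Catalyst and executed across the cluster, but the programming model (register a structured dataset under a name, then query it with SQL) is the same.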
  • 41. Page41 © Hortonworks Inc. 2014 Querying RDD Using SQL // SQL statements can be run directly against registered tables val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") // The results of SQL queries are DataFrames (SchemaRDDs before Spark 1.3) and support // normal RDD operations: val nameList = teenagers.map(t => "Name: " + t(0)).collect() // Language-integrated queries (a la LINQ) val teenagers = people.where('age >= 13).where('age <= 19).select('name)
  • 42. Page42 © Hortonworks Inc. 2014 DataFrames for Apache Spark [Chart: time to aggregate 10 million integer pairs, in seconds, for SQL/R/Python/Scala DataFrames vs. Python/Scala RDDs] DataFrames can be significantly faster than RDDs, and they perform the same regardless of language.
  • 43. Page43 © Hortonworks Inc. 2014 DataFrames – Transformations & Actions Transformations: filter, select, drop, join. Actions: count, collect, show, take. Transformations contribute to the query plan, but nothing is executed until an action is called.
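Spark itself is needed to see the query plan, but Scala's lazy views give a rough analogy for the transformation/action split: building the pipeline does no work until a terminal call forces it (plain Scala below, not the Spark API):

```scala
var evaluated = 0

// Like a transformation: this builds a lazy pipeline, nothing runs yet
val pipeline = (1 to 10).view.filter { n => evaluated += 1; n % 2 == 0 }
assert(evaluated == 0)  // no elements have been touched so far

// Like an action: materializing the result forces the whole chain to run
val evens = pipeline.toList
assert(evens == List(2, 4, 6, 8, 10))
assert(evaluated == 10) // the predicate ran only now, once per element
```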
  • 44. Page44 © Hortonworks Inc. 2014 LAB: DataFrames http://sandbox.hortonworks.com:8081/#/notebook/2B4B7EWY7 http://sandbox.hortonworks.com:8081/#/notebook/2B5RMG4AM DataFrames + SQL DataFrames JSON
  • 45. Page45 © Hortonworks Inc. 2014 DataFrames and JDBC val jdbc_attendees = sqlContext.load("jdbc", Map("url" -> "jdbc:mysql://localhost:3306/db1?user=root&password=xxx", "dbtable" -> "attendees")) jdbc_attendees.show() jdbc_attendees.count() jdbc_attendees.registerTempTable("jdbc_attendees") val countall = sqlContext.sql("select count(*) from jdbc_attendees") countall.map(t => "Records count is " + t(0)).collect().foreach(println)
  • 46. Page46 © Hortonworks Inc. 2014 Code 'select count' Equivalent SQL statement: SELECT count(*) FROM pagecounts WHERE state = 'fl' Scala statement: val file = sc.textFile("hdfs://…/log.txt") val numFL = file.filter(line => line.contains("fl")).count() scala> println(numFL) 1. Load the file as an RDD 2. Filter the lines, eliminating any that do not contain "fl" 3. Count the lines that remain 4. Print the value of the counted lines containing 'fl'
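The same filter-then-count shape can be tried without a cluster, since Scala collections share most RDD method names (the log lines below are invented sample data; a real job would read them with sc.textFile):

```scala
// Stand-in for an RDD of log lines
val logLines = List(
  "2014-01-01 fl miami 200",
  "2014-01-02 ny albany 404",
  "2014-01-03 fl tampa 200"
)

// Keep only lines containing "fl", then count the survivors
// (plain collections use .size where RDDs use .count())
val numFL = logLines.filter(line => line.contains("fl")).size
println(numFL)  // 2
```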
  • 47. Page47 © Hortonworks Inc. 2014 Spark SQL
  • 48. Page48 © Hortonworks Inc. 2014 Platform APIs • Joining Data from Different Sources • Access Data using DataFrames / SQL
  • 49. Page49 © Hortonworks Inc. 2014 LAB: JDBC and 3rd party packages http://sandbox.hortonworks.com:8081/#/notebook/2B2P8RE82
  • 50. Page50 © Hortonworks Inc. 2014 What About Integration With Hive? scala> val hiveCTX = new org.apache.spark.sql.hive.HiveContext(sc) scala> hiveCTX.hql("SHOW TABLES").collect().foreach(println) … [omniture] [omniturelogs] [orc_table] [raw_products] [raw_users] …
  • 51. Page51 © Hortonworks Inc. 2014 More Integration With Hive: scala> hiveCTX.hql("DESCRIBE raw_users").collect().foreach(println) [swid,string,null] [birth_date,string,null] [gender_cd,string,null] scala> hiveCTX.hql("SELECT * FROM raw_users WHERE gender_cd='F' LIMIT 5").collect().foreach(println) [0001BDD9-EABF-4D0D-81BD-D9EABFCD0D7D,8-Apr-84,F] [00071AA7-86D2-4EB9-871A-A786D27EB9BA,7-Feb-88,F] [00071B7D-31AF-4D85-871B-7D31AFFD852E,22-Oct-64,F] [000F36E5-9891-4098-9B69-CEE78483B653,24-Mar-85,F] [00102F3F-061C-4212-9F91-1254F9D6E39F,1-Nov-91,F]
  • 52. Page52 © Hortonworks Inc. 2014 LAB: HIVE ORC http://sandbox.hortonworks.com:8081/#/notebook/2B6KUW16Z
  • 53. Page53 © Hortonworks Inc. 2014 Spark Streaming
  • 54. Page54 © Hortonworks Inc. 2014 MicroBatch Spark Streams
  • 55. Page55 © Hortonworks Inc. 2014 Physical Execution
  • 56. Page56 © Hortonworks Inc. 2014 Spark Streaming 101 • Spark has significant library support for streaming applications val ssc = new StreamingContext(sc, Seconds(5)) val tweetStream = TwitterUtils.createStream(ssc, Some(auth)) • Allows combining streaming with batch/ETL, SQL & ML • Reads data from HDFS, Flume, Kafka, Twitter, ZeroMQ & custom sources • Chops the input data stream into batches • Spark processes the batches and publishes results in batches • The fundamental unit is the Discretized Stream (DStream)
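The "chop the stream into batches" idea can be sketched with a plain Scala collection: group incoming events into fixed-size micro-batches and run a small batch computation on each. This is a conceptual model only; real DStreams batch by time (e.g. Seconds(5)), not by count:

```scala
// Events as they would arrive from a socket, Kafka topic, etc.
val events = (1 to 10).toList

// Chop the stream into micro-batches of 3 events each
val microBatches = events.grouped(3).toList

// Spark runs a small batch job per micro-batch; here each "job" is a sum
val perBatchResults = microBatches.map(_.sum)
println(perBatchResults)  // List(6, 15, 24, 10)
```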
  • 57. Page57 © Hortonworks Inc. 2014 Spark MLLib
  • 58. Page58 © Hortonworks Inc. 2014 Spark MLlib – Algorithms Offered • Classification: logistic regression, linear SVM, naïve Bayes, least squares, classification tree • Regression: generalized linear models (GLMs), regression tree • Collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF) • Clustering: k-means • Decomposition: SVD, PCA • Optimization: stochastic gradient descent, L-BFGS
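To make one entry concrete, here is a toy one-dimensional k-means (Lloyd's iteration) in plain Scala. MLlib's KMeans runs the same assign/recompute loop, distributed over an RDD of vectors; the data points and starting centroids below are invented for illustration:

```scala
// Assign each point to its nearest centroid, recompute centroids, repeat
def kmeans(points: Seq[Double], centroids: Seq[Double], iters: Int): Seq[Double] =
  if (iters == 0) centroids
  else {
    val clusters = points.groupBy(p => centroids.minBy(c => math.abs(p - c)))
    val updated  = centroids.map(c =>
      clusters.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
    kmeans(points, updated, iters - 1)
  }

// Two obvious clusters, around 1.0 and 9.0
val data    = Seq(1.0, 1.2, 0.8, 9.0, 9.5, 8.5)
val centers = kmeans(data, Seq(0.0, 10.0), iters = 10)
assert(math.abs(centers(0) - 1.0) < 1e-6 && math.abs(centers(1) - 9.0) < 1e-6)
```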
  • 59. Page59 © Hortonworks Inc. 2014 ML – Pipelines • New algorithms: KMeans [SPARK-7879], Naive Bayes [SPARK-8600], Bisecting KMeans [SPARK-6517], Multi-layer Perceptron (ANN) [SPARK-2352], Weighting for Linear Models [SPARK-7685] • New transformers (close to parity with scikit-learn): CountVectorizer [SPARK-8703], PCA [SPARK-8664], DCT [SPARK-8471], N-Grams [SPARK-8455] • Calling into single-machine solvers (coming soon as a package)
  • 60. Page60 © Hortonworks Inc. 2014 Twitter Language Classifier Goal: connect to the real-time Twitter stream and print only those tweets whose language matches our chosen language. Main issue: how to detect the language at run time? Solution: build a language-classifier model offline that can detect the language of a tweet (MLlib), then apply it to the real-time Twitter stream and do the filtering (Spark Streaming).
  • 61. Page61 © Hortonworks Inc. 2014 Conclusion and Q&A
  • 62. Page62 © Hortonworks Inc. 2014 Learn More Spark + Hadoop Perfect Together HDP Spark General Info: http://hortonworks.com/hadoop/spark/ Learn more about our Focus on Spark: http://hortonworks.com/hadoop/spark/#section_6 Get the HDP Spark 1.5.1 Tech Preview: http://hortonworks.com/hadoop/spark/#section_5 Get started with Spark and Zeppelin and download the Sandbox: http://hortonworks.com/sandbox Try these tutorials: http://hortonworks.com/hadoop/spark/#tutorials http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/ Learn more about GeoSpatial Spark processing with Magellan: http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/