APACHE SPARK & ELASTICSEARCH
Holden Karau
Reducing duplicated code and saving on network overhead
Who am I?
Holden Karau
● Software Engineer @ Databricks
● I’ve worked with Elasticsearch before
● I prefer she/her for pronouns
● Author of a book on Spark and co-writing another*
● github https://github.com/holdenk
● e-mail holden@databricks.com
● @holdenkarau
*Which is why I might be sleepy today.
What is Spark & Elasticsearch
Spark
● Apache Spark™ is a fast and general engine for large-scale data processing.
● http://spark.apache.org/
Elasticsearch
● Elasticsearch is a real-time distributed search and
analytics engine.
● http://www.elasticsearch.org/
Talk overview
Goal: understand how to work with ES & Hadoop
● Spark & Spark streaming let us re-use indexing code
● It's a bit ugly right now…
● Demo* with tweets & top hash tags per region
● We can customize the ES connector to write to the shard based on partition
● This is an early version of the talk (feedback welcome!)
Assumptions:
● Familiar(ish) with Elasticsearch or at least Solr
● Can read Scala
*Demo gods willing.
Why should you care?
Small differences between off-line and on-line
Spot the difference picture from http://en.wikipedia.org/wiki/Spot_the_difference#mediaviewer/File:Spot_the_difference.png
Leads to
Fireworks photo by Hailey Toft
Cat picture from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Let's start with the on-line pipeline
val ssc = new StreamingContext(master, "IndexTweetsLive", Seconds(1))
// Set up the system properties for twitter
System.setProperty("twitter4j.oauth.consumerKey", cK)
System.setProperty("twitter4j.oauth.consumerSecret", cS)
System.setProperty("twitter4j.oauth.accessToken", aT)
System.setProperty("twitter4j.oauth.accessTokenSecret", ats)
val tweets = TwitterUtils.createStream(ssc, None)
Let's get ready to write the data into Elasticsearch
Photo by Cloned Milkmen
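import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileOutputCommitter, FileOutputFormat, JobConf}
import org.apache.spark.SparkContext
import org.elasticsearch.hadoop.cfg.ConfigurationOptions

def setupEsOnSparkContext(sc: SparkContext) = {
  val jobConf = new JobConf(sc.hadoopConfiguration)
  jobConf.set("mapred.output.format.class",
    "org.elasticsearch.hadoop.mr.EsOutputFormat")
  jobConf.setOutputCommitter(classOf[FileOutputCommitter])
  jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, "twitter/tweet")
  // Dummy output path; the data goes to Elasticsearch, not to a file
  FileOutputFormat.setOutputPath(jobConf, new Path("-"))
  jobConf
}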
Add a schema
curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '
{
  "tweet" : {
    "properties" : {
      "message" : {"type" : "string"},
      "hashTags" : {"type" : "string"},
      "location" : {"type" : "geo_point"}
    }
  }
}
'
Let's format our tweets
def prepareTweets(tweet: twitter4j.Status) = {
  // Note: getGeoLocation() returns null for tweets without coordinates
  val loc = tweet.getGeoLocation()
  val lat = loc.getLatitude()
  val lon = loc.getLongitude()
  val hashTags = tweet.getHashtagEntities().map(_.getText())
  val fields = HashMap(
    "docid" -> tweet.getId().toString,
    "message" -> tweet.getText(),
    "hashTags" -> hashTags.mkString(" "),
    "location" -> s"$lat,$lon"
  )
  // Convert to HadoopWritable types
  mapToOutput(fields)
}
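The formatting step can be tried without twitter4j or a cluster. The sketch below is a minimal stand-in, assuming a hypothetical `SimpleTweet` case class in place of `twitter4j.Status` (the class and the `TweetFormatting` object are illustrative names, not part of the talk's code). It builds the same four fields and leaves out `location` when a tweet has no coordinates. The conversion to Hadoop `Writable` types is omitted.

```scala
object TweetFormatting {
  // Hypothetical stand-in for twitter4j.Status; not part of the talk's code.
  final case class SimpleTweet(
      id: Long,
      text: String,
      hashTags: Seq[String],
      location: Option[(Double, Double)]) // (lat, lon); None when not geotagged

  // Build the document fields that match the twitter/tweet mapping.
  def prepareTweet(tweet: SimpleTweet): Map[String, String] = {
    val base = Map(
      "docid" -> tweet.id.toString,
      "message" -> tweet.text,
      "hashTags" -> tweet.hashTags.mkString(" "))
    // A geo_point can be supplied as a "lat,lon" string.
    tweet.location match {
      case Some((lat, lon)) => base + ("location" -> s"$lat,$lon")
      case None => base
    }
  }
}
```

Modelling the location as an `Option` is a design choice of this sketch: it makes the "no coordinates" case explicit instead of relying on a null check.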
And save them...
tweets.foreachRDD{(tweetRDD, time) =>
  val sc = tweetRDD.context
  // The jobConf isn't serializable so we create it here
  val jobConf = SharedESConfig.setupEsOnSparkContext(sc, esResource, Some(esNodes))
  // Convert our tweets to something that can be indexed
  val tweetsAsMap = tweetRDD.map(SharedIndex.prepareTweets)
  tweetsAsMap.saveAsHadoopDataset(jobConf)
}
Now let’s query them!
{"query" : {
  "filtered" : {
    "query" : {
      "match_all" : {}
    },
    "filter" : {
      "geo_distance" : {
        "distance" : "${dist}km",
        "location" : {
          "lat" : "${lat}",
          "lon" : "${lon}"
        }
      }
    }
  }
}}
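The `${dist}`, `${lat}` and `${lon}` placeholders suggest the query is assembled with Scala string interpolation. A minimal sketch of such a builder follows; the `GeoQuery` object and `build` function are hypothetical names. It assumes a top-level `query` key around the `filtered` query (the query DSL form that, per the elasticsearch-hadoop configuration docs, `es.query` takes) and emits `lat` and `lon` as JSON numbers rather than strings.

```scala
object GeoQuery {
  // Builds a filtered geo_distance query as a JSON string.
  // lat/lon are written as JSON numbers; dist is in kilometres.
  def build(lat: Double, lon: Double, dist: Double): String =
    s"""{"query": {"filtered": {
       |  "query": {"match_all": {}},
       |  "filter": {"geo_distance": {
       |    "distance": "${dist}km",
       |    "location": {"lat": $lat, "lon": $lon}
       |  }}
       |}}}""".stripMargin
}
```

The resulting string is what would be handed to `jobConf.set("es.query", ...)` before reading from Elasticsearch.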
Now let’s find the hash tags :)
jobConf.set("es.query", query)
val currentTweets = sc.hadoopRDD(jobConf,
  classOf[EsInputFormat[Object, MapWritable]],
  classOf[Object], classOf[MapWritable])
val tweets = currentTweets.map{ case (key, value) =>
  SharedIndex.mapWritableToInput(value) }
val hashTags = tweets.flatMap{t =>
  t.getOrElse("hashTags", "").split(" ")
}
println(hashTags.countByValue())
oh wait :(
Sad panda by Jose Antonio Tovar
Now let's find some common words…
// Extract the top words
val words = tweets.flatMap{tweet =>
  tweet.flatMap{elem =>
    elem._2 match {
      case null => Nil
      case _ => elem._2.split(" ")
    }
  }
}
val wordCounts = words.countByValue()
println("------")
wordCounts.foreach{ case(key, value) => println(key + ":" + value) }
println("------")
object WordCountOrdering extends Ordering[(String, Int)] {
  def compare(a: (String, Int), b: (String, Int)) = {
    b._2 compare a._2
  }
}
val wc = words.map(x => (x, 1)).reduceByKey((x,y) => x+y)
val topWords = wc.takeOrdered(40)(WordCountOrdering)
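The same count-then-take-top logic can be checked locally on plain Scala collections before running it on an RDD. This is a sketch under that assumption, not the talk's Spark code; `TopWords`, `byCountDesc` and `topWords` are illustrative names. `groupBy` plays the role of `reduceByKey`, and sorting with a descending-count ordering followed by `take(n)` plays the role of `takeOrdered(n)`.

```scala
object TopWords {
  // Order (word, count) pairs by descending count, like WordCountOrdering.
  val byCountDesc: Ordering[(String, Int)] =
    Ordering.by[(String, Int), Int](_._2).reverse

  // Split messages on spaces, count each word, and return the n most frequent.
  def topWords(messages: Seq[String], n: Int): Seq[(String, Int)] = {
    val counts = messages
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
    counts.toSeq.sorted(byCountDesc).take(n)
  }
}
```

For example, `TopWords.topWords(Seq("a b a", "b a c"), 2)` counts `a` three times and `b` twice, so those two pairs come back in that order.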
ok, fine, the “top” words
NYC words
I,144
a,83
the,76
to,75
you,66
my,56
and,52
me,47
that,39
of,38
in,38
on,37
so,36
like,35
is,32
this,30
was,29
it,28
with,27
for,25
I'm,24
i,23
but,22
just,22
at,21
be,21
are,20
don't,19
have,18
lol,17
out,17
your,16
love,16
up,16
all,16
her,15
when,14
not,13
it's,13
SF words
I,18
the,16
to,15
you,9
a,9
my,8
me,8
for,8
at,7
of,7
want,6
Just,6
in,6
is,6
this,5
got,5
she,5
when,5
was,4
so,4
what,4
&,4
he,4
your,4
as,4
they,4
it,4
@,4
get,4
and,4
are,4
say,3
w/,3
do,3
dont,3
going,3
fuck,3
i,3
know,3
or,3
Indexing Part 2
(electric boogaloo)
Writing directly to a node with the correct shard saves us network overhead
Screen shot of elasticsearch-head http://mobz.github.io/elasticsearch-head/
Slight hack time
We clone the connector and update EsOutputFormat.java [see https://github.com/holdenk/elasticsearch-hadoop ]
private int detectCurrentInstance(Configuration conf) {
  if (sparkInstance != null) {
    if (log.isDebugEnabled()) {
      log.debug(String.format("Using Spark partition info [%d]", sparkInstance, uri));
    }
    return sparkInstance;
  }
  ….
}
public org.apache.hadoop.mapred.RecordWriter getRecordWriter(FileSystem ignored,
    JobConf job, String name, Progressable progress) {
  EsRecordWriter writer = new EsRecordWriter(job, progress);
  // This is a special hack for Spark which sets the name as "part-[partitionnumber]" so if our
  // jobconf asks for it we use this partition number as the shard number.
  if (HadoopCfgUtils.useSparkPartition(job) && name.startsWith("part-")) {
    writer.setSparkInstance(Integer.valueOf(name.substring(5)));
  }
  return writer;
}
So what does that give us?
Spark sets the file name to part-[partition number]
If we have the same partitioner, we write directly to the node holding the shard
Likely the best place to use this is in re-indexing data
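The name-to-partition step can be sketched in Scala as follows. This is an illustration of the parsing only, with hypothetical names (`SparkPartitionName`, `partitionOf`); unlike the `Integer.valueOf` call in the patched connector, it returns `None` for a malformed name instead of throwing.

```scala
object SparkPartitionName {
  // Spark names each output part-<partition number>, e.g. "part-00003".
  // Returns the partition number, or None when the name has another shape.
  def partitionOf(name: String): Option[Int] =
    if (name.startsWith("part-")) scala.util.Try(name.substring(5).toInt).toOption
    else None
}
```

With this, `SparkPartitionName.partitionOf("part-00003")` yields `Some(3)`, the number the patched connector would treat as the shard number.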
Re-index all the things*
// Read in our data set
val currentTweets = sc.hadoopRDD(jobConf,
  classOf[EsInputFormat[Object, MapWritable]],
  classOf[Object], classOf[MapWritable])
val tweets = currentTweets.map{ case (key, value) =>
  SharedIndex.mapWritableToInput(value) }
// Fetch them from twitter
val t4jt = tweets.flatMap{ tweet =>
  val twitter = TwitterFactory.getSingleton()
  val tweetID = tweet.getOrElse("docid", "")
  Option(twitter.showStatus(tweetID.toLong))
}
t4jt.map(SharedIndex.prepareTweets)
  .saveAsHadoopDataset(jobConf)
*Until you hit your twitter rate limit…. oops
Cat photo from https://www.flickr.com/photos/deerwooduk/579761138/in/photolist-4GCc4z-4GCbAV-6Ls27-34evHS-5UBnJv-TeqMG-4iNNn5-4w7s61-6GMLYS-6H5QWY-6aJLUT-tqfrf-6mJ1Lr-84kGX-6mJ1GB-vVqN6-dY8aj5-y3jK-7C7P8Z-azEtd/
“Useful” links
● Feedback: holden@databricks.com
● Customized ES connector*: https://github.com/holdenk/elasticsearch-hadoop
● Demo code: https://github.com/holdenk/elasticsearchspark
● Elasticsearch: http://www.elasticsearch.org/
● Spark: http://spark.apache.org/
● Spark streaming: http://spark.apache.org/streaming/

2014 spark with elastic search

  • 1. APACHE SPARK & ELASTICSEARCH Holden Karau Reducing duplicated code and saving on network overhead
  • 2. Who am I? Holden Karau
      ● Software Engineer @ Databricks
      ● I’ve worked with Elasticsearch before
      ● I prefer she/her for pronouns
      ● Author of a book on Spark and co-writing another*
      ● github https://github.com/holdenk
      ● e-mail holden@databricks.com
      ● @holdenkarau
      *Which is why I might be sleepy today.
  • 3. What is Spark & Elasticsearch
      Spark
      ● Apache Spark™ is a fast and general engine for large-scale data processing.
      ● http://spark.apache.org/
      Elasticsearch
      ● Elasticsearch is a real-time distributed search and analytics engine.
      ● http://www.elasticsearch.org/
  • 4. Talk overview
      Goal: understand how to work with ES & Hadoop
      ● Spark & Spark Streaming let us re-use indexing code
      ● It’s a bit ugly right now...
      ● Demo* with tweets & top hash tags per region
      ● We can customize the ES connector to write to the shard based on partition
      ● This is an early version of the talk (feedback welcome!)
      Assumptions:
      ● Familiar(ish) with Elasticsearch or at least Solr
      ● Can read Scala
      *Demo gods willing.
  • 5. Why should you care? Small differences between off-line and on-line
      Spot the difference picture from http://en.wikipedia.org/wiki/Spot_the_difference#mediaviewer/File:Spot_the_difference.png
  • 6. Leads to... (fireworks photo by Hailey Toft)
  • 7. Cat picture from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
  • 8. Let’s start with the on-line pipeline
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.twitter.TwitterUtils

      // One-second batches of live tweets
      val ssc = new StreamingContext(master, "IndexTweetsLive", Seconds(1))
      // Set up the system properties for twitter
      System.setProperty("twitter4j.oauth.consumerKey", cK)
      System.setProperty("twitter4j.oauth.consumerSecret", cS)
      System.setProperty("twitter4j.oauth.accessToken", aT)
      System.setProperty("twitter4j.oauth.accessTokenSecret", ats)
      val tweets = TwitterUtils.createStream(ssc, None)
  • 9. Let’s get ready to write the data into Elasticsearch (photo by Cloned Milkmen)
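  • 10. Let’s get ready to write the data into Elasticsearch
      import org.apache.hadoop.fs.Path
      import org.apache.hadoop.mapred.{FileOutputCommitter, FileOutputFormat, JobConf}
      import org.apache.spark.SparkContext
      import org.elasticsearch.hadoop.cfg.ConfigurationOptions

      def setupEsOnSparkContext(sc: SparkContext) = {
        val jobConf = new JobConf(sc.hadoopConfiguration)
        // Send the job's output to Elasticsearch instead of files
        jobConf.set("mapred.output.format.class",
          "org.elasticsearch.hadoop.mr.EsOutputFormat")
        jobConf.setOutputCommitter(classOf[FileOutputCommitter])
        // Index and type to write to
        jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, "twitter/tweet")
        // Placeholder path: the Hadoop API requires one even though nothing is written to it
        FileOutputFormat.setOutputPath(jobConf, new Path("-"))
        jobConf
      }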
  • 11. Add a schema
      curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '
      {
        "tweet" : {
          "properties" : {
            "message" : {"type" : "string"},
            "hashTags" : {"type" : "string"},
            "location" : {"type" : "geo_point"}
          }
        }
      }'
  • 12. Let’s format our tweets
      def prepareTweets(tweet: twitter4j.Status) = {
        val loc = tweet.getGeoLocation()
        val lat = loc.getLatitude()
        val lon = loc.getLongitude()
        val hashTags = tweet.getHashtagEntities().map(_.getText())
        val fields = HashMap(
          "docid" -> tweet.getId().toString,
          "message" -> tweet.getText(),
          "hashTags" -> hashTags.mkString(" "),
          "location" -> s"$lat,$lon"
        )
        // Convert to HadoopWritable types
        mapToOutput(fields)
      }
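The formatting step can be tried without the Twitter library. The following is only a sketch: `SimpleTweet` is a hypothetical stand-in for `twitter4j.Status`, and it assumes that a tweet without coordinates should simply omit the `location` field, which the slide does not state.

```scala
// Hypothetical stand-in for twitter4j.Status; not part of the talk's code.
case class SimpleTweet(id: Long, text: String,
                       geo: Option[(Double, Double)], hashTags: Seq[String])

object PrepareTweetsSketch {
  // Build the field map to index. The location field is only added
  // when the tweet actually carries coordinates (an assumption of this sketch).
  def prepareTweet(tweet: SimpleTweet): Map[String, String] = {
    val base = Map(
      "docid" -> tweet.id.toString,
      "message" -> tweet.text,
      "hashTags" -> tweet.hashTags.mkString(" ")
    )
    tweet.geo match {
      case Some((lat, lon)) => base + ("location" -> s"$lat,$lon")
      case None => base
    }
  }
}
```

The "lat,lon" string form matches what the slide writes into the `geo_point` field declared in the mapping.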
  • 13. And save them...
      tweets.foreachRDD{(tweetRDD, time) =>
        val sc = tweetRDD.context
        // The jobConf isn’t serializable so we create it here
        val jobConf = SharedESConfig.setupEsOnSparkContext(sc, esResource, Some(esNodes))
        // Convert our tweets to something that can be indexed
        val tweetsAsMap = tweetRDD.map(SharedIndex.prepareTweets)
        tweetsAsMap.saveAsHadoopDataset(jobConf)
      }
  • 14. Now let’s query them!
      {"filtered" : {
        "query" : { "match_all" : {} },
        "filter" : {"geo_distance" : {
          "distance" : "${dist}km",
          "location" : { "lat" : "${lat}", "lon" : "${lon}" }
        }}
      }}
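The `${dist}`, `${lat}` and `${lon}` placeholders suggest the query is built with Scala string interpolation. A minimal sketch of that idea follows; the object and function names are illustrative and not taken from the demo code.

```scala
object GeoQuerySketch {
  // Build the filtered geo_distance query from the slide as a JSON string.
  // lat, lon and distKm are interpolated into the same positions as on the slide.
  def geoQuery(lat: Double, lon: Double, distKm: Double): String =
    s"""{"filtered" : {
       |  "query" : { "match_all" : {} },
       |  "filter" : { "geo_distance" : {
       |    "distance" : "${distKm}km",
       |    "location" : { "lat" : "$lat", "lon" : "$lon" }
       |  }}
       |}}""".stripMargin
}
```

The resulting string is what the next slide passes to the connector through the `es.query` setting.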
  • 15. Now let’s find the hash tags :)
      jobConf.set("es.query", query)
      val currentTweets = sc.hadoopRDD(jobConf,
        classOf[EsInputFormat[Object, MapWritable]],
        classOf[Object], classOf[MapWritable])
      val tweets = currentTweets.map{ case (key, value) =>
        SharedIndex.mapWritableToInput(value) }
      val hashTags = tweets.flatMap{t =>
        t.getOrElse("hashTags", "").split(" ")
      }
      println(hashTags.countByValue())
  • 16. oh wait :( Sad panda by Jose Antonio Tovar
  • 17. Now let’s find some common words...
      // Extract the top words
      val words = tweets.flatMap{tweet =>
        tweet.flatMap{elem =>
          elem._2 match {
            case null => Nil
            case _ => elem._2.split(" ")
          }}}
      val wordCounts = words.countByValue()
      println("------")
      wordCounts.foreach{ case(key, value) => println(key + ":" + value) }
      println("------")
  • 18. OK, fine, the “top” words
      // Order (word, count) pairs by descending count
      object WordCountOrdering extends Ordering[(String, Int)] {
        def compare(a: (String, Int), b: (String, Int)) = {
          b._2 compare a._2
        }
      }
      val wc = words.map(x => (x, 1)).reduceByKey((x,y) => x+y)
      val topWords = wc.takeOrdered(40)(WordCountOrdering)
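To see how the reversed ordering picks the most frequent words, here is a sketch of the same logic on plain Scala collections rather than RDDs. `topWords` is an illustrative helper, and `groupBy` stands in for Spark's `reduceByKey`.

```scala
object TopWordsSketch {
  // Same ordering as the slide: larger counts sort first.
  object WordCountOrdering extends Ordering[(String, Int)] {
    def compare(a: (String, Int), b: (String, Int)): Int = b._2 compare a._2
  }

  // Count words in local collections and keep the n most frequent.
  def topWords(lines: Seq[String], n: Int): Seq[(String, Int)] = {
    val counts = lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toSeq
    counts.sorted(WordCountOrdering).take(n)
  }
}
```

Because `compare` swaps its arguments, the "smallest" elements under this ordering are the highest counts, which is why `takeOrdered(40)` on the slide returns the top 40 words.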
  • 21. Indexing Part 2 (electric boogaloo) Writing directly to a node with the correct shard saves us network overhead Screen shot of elasticsearch-head http://mobz.github.io/elasticsearch-head/
  • 22. Slight hack time
      We clone the connector and update EsOutputFormat.java [see https://github.com/holdenk/elasticsearch-hadoop ]
      private int detectCurrentInstance(Configuration conf) {
        if (sparkInstance != null) {
          if (log.isDebugEnabled()) {
            log.debug(String.format("Using Spark partition info [%d]", sparkInstance, uri));
          }
          return sparkInstance;
        }
        ….
      }
  • 23. Slight hack time
      We clone the connector and update EsOutputFormat.java [see https://github.com/holdenk/elasticsearch-hadoop ]
      public org.apache.hadoop.mapred.RecordWriter getRecordWriter(FileSystem ignored,
          JobConf job, String name, Progressable progress) {
        EsRecordWriter writer = new EsRecordWriter(job, progress);
        // This is a special hack for Spark which sets the name as "part-[partitionnumber]" so if our
        // jobconf asks for it we use this partition number as the shard number.
        if (HadoopCfgUtils.useSparkPartition(job) && name.startsWith("part-")) {
          writer.setSparkInstance(Integer.valueOf(name.substring(5)));
        }
        return writer;
      }
  • 24. So what does that give us?
      ● Spark sets the file name to part-[partition number]
      ● If we have the same partitioner, we write directly to the node holding the correct shard
      ● Likely the best place to use this is in re-indexing data
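The name-parsing step of the hack can be sketched on its own in Scala. `partitionFromName` is a hypothetical helper, not part of the connector; it mirrors the `startsWith("part-")` and `substring(5)` logic above but returns `None` instead of throwing on an unexpected name.

```scala
import scala.util.Try

object PartitionNameSketch {
  // Spark names each output "part-[partition number]", e.g. "part-00003".
  // Returns the partition number, or None if the name has another shape.
  def partitionFromName(name: String): Option[Int] =
    if (name.startsWith("part-")) Try(name.substring(5).toInt).toOption
    else None
}
```

For example, `PartitionNameSketch.partitionFromName("part-00003")` yields `Some(3)`, the number the modified connector would treat as the shard number.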
  • 25. Re-index all the things*
      // Read in our data set
      val currentTweets = sc.hadoopRDD(jobConf,
        classOf[EsInputFormat[Object, MapWritable]],
        classOf[Object], classOf[MapWritable])
      // Fetch them from twitter
      val t4jt = tweets.flatMap{ tweet =>
        val twitter = TwitterFactory.getSingleton()
        val tweetID = tweet.getOrElse("docid", "")
        Option(twitter.showStatus(tweetID.toLong))
      }
      t4jt.map(SharedIndex.prepareTweets)
        .saveAsHadoopDataset(jobConf)
      *Until you hit your Twitter rate limit... oops
  • 26. Cat photo from https://www.flickr.com/photos/deerwooduk/579761138/in/photolist-4GCc4z-4GCbAV-6Ls27-34evHS-5UBnJv-TeqMG-4iNNn5-4w7s61-6GMLYS-6H5QWY-6aJLUT-tqfrf-6mJ1Lr-84kGX-6mJ1GB-vVqN6-dY8aj5-y3jK-7C7P8Z-azEtd/
  • 27. “Useful” links
      ● Feedback: holden@databricks.com
      ● Customized ES connector*: https://github.com/holdenk/elasticsearch-hadoop
      ● Demo code: https://github.com/holdenk/elasticsearchspark
      ● Elasticsearch: http://www.elasticsearch.org/
      ● Spark: http://spark.apache.org/
      ● Spark streaming: http://spark.apache.org/streaming/