Building an ETL pipeline for
Elasticsearch using Spark
@2014 eXelate Inc. Confidential and Proprietary
Itai Yaffe, Big Data Infrastructure Developer
December 2015
Agenda
• About eXelate
• About the team
• eXelate’s architecture overview
• The need
• The problem
• Why Elasticsearch and how do we use it?
• Loading the data
• Re-designing the loading process
• Additional improvements
• To summarize
About eXelate, a Nielsen company
• Founded in 2007
• Acquired by Nielsen in March 2015
• A leader in the Ad Tech industry
• Provides data and software services through:
• eXchange (2 billion users)
• maX DMP (data management platform)
Our numbers
• ~10 billion events per day
• ~150TB of data per day
• Hybrid cloud infrastructure
• 4 Data Centers
• Amazon Web Services
About the team
• The BDI (Big Data Infrastructure) team is in charge
of shipping, transforming and loading eXelate’s data
into various data stores, making it ready to be
queried efficiently
• For the last year and a half, we’ve been transitioning
our legacy systems to modern, scale-out
infrastructure (Spark, Kafka, etc.)
About me
• Dealing with Big Data challenges for the last 3.5 years,
using:
• Cassandra
• Spark
• Elasticsearch
• And others…
• Joined eXelate in May 2014
• Previously: OpTier, Mamram
• LinkedIn: https://www.linkedin.com/in/itaiy
• Email: itai.yaffe@nielsen.com
eXelate’s architecture overview
[Architecture diagram: incoming HTTP requests, serving (frontend servers), ETL processes, DB, DWH, DMP applications (SaaS)]
The need
• From the data perspective:
• ETL – collect raw data and load it into
Elasticsearch periodically
• Tens of millions of events per day
• Data is already labeled
• Query - allow ad hoc calculations based on the
stored data
• Mainly counting unique users related to a specific
campaign in conjunction with
geographic/demographic data limited by date range
• The number of permutations is huge, so results can’t be
pre-calculated and real-time queries are a must!
The problem
• We chose Elasticsearch as the data store (details to
follow)
• But… the ETL process was far from optimal
• Also affected query performance
Why Elasticsearch?
• Originally designed as a text search engine
• Today it has advanced real-time analytics
capabilities
• Distributed, scalable and highly available
How do we use Elasticsearch?
• We rely heavily on its counting capabilities
• Splitting the data into separate indices based on a
few criteria (e.g. TTL, tags vs. segments)
• Each user (i.e. device) is stored as a document with
many nested documents
How do we use Elasticsearch?
{
  "_index": "sample",
  "_type": "user",
  "_id": "0c31644ad41e32c819be29ba16e14300",
  "_version": 4,
  "_score": 1,
  "_source": {
    "events": [
      {
        "event_time": "2014-01-18",
        "segments": [
          { "segment": "female" },
          { "segment": "Airplane tickets" }
        ]
      },
      {
        "event_time": "2014-02-19",
        "segments": [
          { "segment": "female" },
          { "segment": "Hotel reservations" }
        ]
      }
    ]
  }
}
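As a rough illustration of the kind of count this document model supports, the sketch below applies the same rule in memory on plain Python dicts: a user is counted once if a single event both falls in the date range and carries the segment. It does not call Elasticsearch; the function name `count_unique_users` and the IDs `user-a` and `user-b` are made up for the example.

```python
from datetime import date


def count_unique_users(docs, segment, start, end):
    """Count users having at least one event in [start, end] that carries `segment`."""
    matched = set()
    for doc in docs:
        for event in doc["_source"]["events"]:
            day = date.fromisoformat(event["event_time"])
            in_range = start <= day <= end
            has_segment = any(s["segment"] == segment for s in event["segments"])
            # Both conditions must hold for the same event (nested-document semantics).
            if in_range and has_segment:
                matched.add(doc["_id"])
                break
    return len(matched)


# Hypothetical sample documents shaped like the one above.
docs = [
    {"_id": "user-a", "_source": {"events": [
        {"event_time": "2014-01-18",
         "segments": [{"segment": "female"}, {"segment": "Airplane tickets"}]},
        {"event_time": "2014-02-19",
         "segments": [{"segment": "female"}, {"segment": "Hotel reservations"}]},
    ]}},
    {"_id": "user-b", "_source": {"events": [
        {"event_time": "2014-03-01",
         "segments": [{"segment": "Hotel reservations"}]},
    ]}},
]

february = count_unique_users(docs, "Hotel reservations", date(2014, 2, 1), date(2014, 2, 28))
january = count_unique_users(docs, "Hotel reservations", date(2014, 1, 1), date(2014, 1, 31))
print(february, january)
```

Matching per event rather than per user is the reason for nesting: a user with "Hotel reservations" in February and an unrelated event in January should not be counted for January.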
Loading the data
Standalone Java loader application
• Runs every few minutes
• Parses the log files
• For each user we encountered:
• Queries Elasticsearch to get the user’s document
• Merges the new data into the document on the
client-side
• Bulk-indexes documents into Elasticsearch
OK, so what’s the problem?
• Multiple updates per user per day
• Updates in Elasticsearch are expensive (basically delete +
insert)
• Merges are done on the client-side
• Involves redundant queries
• Leads to degradation of query performance
• Not scalable or highly available
Re-designing the loading process
• Batch processing once a day during off-hours
• Daily dedup leads to ~75% fewer update operations in
Elasticsearch
• Using Spark as our processing framework
• Distributed, scalable and highly available
• Unified framework for batch, streaming, machine
learning, etc.
• Using an update script
• Merges are done on the server-side
Elasticsearch update script
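import groovy.json.JsonSlurper;

// param1: JSON array whose first element is the new event; ttl: TTL for the document
added = false;
def slurper = new JsonSlurper();
def result = slurper.parseText(param1);
ctx._ttl = ttl;

// Look for an existing event with the same date as the new one
ctx._source.events.each() { item ->
    if (item.event_time == result[0].event_time) {
        // Collect the segments already stored for that date
        def segmentMap = [:];
        item.segments.each() {
            segmentMap.put(it.segment, it.segment)
        };
        // Append only the segments that are not stored yet
        result[0].segments.each {
            if (!segmentMap[it.segment]) {
                item.segments += it
            }
        };
        added = true;
    }
};

// No event for that date yet: append the new event as-is
if (!added) {
    ctx._source.events += result
}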
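To make the merge rule easier to follow outside Elasticsearch, here is the same logic sketched in plain Python on dicts. It is an illustration only, not the script that runs on the cluster: `merge_events` is a hypothetical name, the TTL assignment is left out, and the sample data is invented.

```python
import copy


def merge_events(source, incoming):
    """Merge `incoming` (a list whose first item is the new event) into source["events"]."""
    new_event = incoming[0]
    added = False
    for item in source["events"]:
        if item["event_time"] == new_event["event_time"]:
            # Segments already stored for that date.
            known = {s["segment"] for s in item["segments"]}
            for seg in new_event["segments"]:
                if seg["segment"] not in known:
                    item["segments"].append(seg)
            added = True
    if not added:
        # No event for that date yet: append the incoming list as-is.
        source["events"].extend(copy.deepcopy(incoming))
    return source


source = {"events": [{"event_time": "2014-01-18", "segments": [{"segment": "female"}]}]}

# Same date: only the segment that is not stored yet gets appended.
merge_events(source, [{"event_time": "2014-01-18",
                       "segments": [{"segment": "female"}, {"segment": "Airplane tickets"}]}])
# New date: the whole event is appended.
merge_events(source, [{"event_time": "2014-02-19",
                       "segments": [{"segment": "Hotel reservations"}]}])
print(source)
```

After both calls the document holds two events, which is the shape of the sample document shown earlier (minus the second "female" segment, which this sample input does not send).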
Re-designing the loading process
[Diagram: AWS S3, AWS Data Pipeline, AWS EMR, AWS SNS]
Zoom-in
• Log files are compressed (.gz) CSVs
• Once a day:
• Files are copied and uncompressed into the EMR cluster
using S3DistCp
• The Spark application:
• Groups events by user and builds JSON documents,
which include an inline update script
• Writes the JSON documents back to S3
• A separate Scala application reads the documents from S3 and
bulk-indexes them into Elasticsearch
• Notifications are sent via SNS
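The grouping step can be sketched in a single process as follows. This is a stand-in for what the Spark application does at scale, not its actual code: the three-column CSV layout (`user_id,event_time,segment`), the output document shape and the name `build_user_documents` are all assumptions, and the inline update script is left out.

```python
import csv
import gzip
import io
import json
from collections import defaultdict


def build_user_documents(csv_lines):
    """Group (user_id, event_time, segment) rows into one JSON document per user."""
    per_user = defaultdict(lambda: defaultdict(set))
    for user_id, event_time, segment in csv.reader(csv_lines):
        per_user[user_id][event_time].add(segment)  # the set drops repeated rows
    documents = []
    for user_id in sorted(per_user):
        events = [
            {"event_time": day, "segments": [{"segment": s} for s in sorted(segments)]}
            for day, segments in sorted(per_user[user_id].items())
        ]
        documents.append(json.dumps({"id": user_id, "events": events}))
    return documents


# A tiny gzip-compressed CSV built in memory, standing in for a log file.
raw = (
    "u1,2014-01-18,female\n"
    "u1,2014-01-18,Airplane tickets\n"
    "u1,2014-01-18,female\n"
    "u2,2014-01-18,male\n"
)
compressed = gzip.compress(raw.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8", newline="") as lines:
    documents = build_user_documents(lines)

for document in documents:
    print(document)
```

Four input rows collapse into two documents here, which is the effect the daily dedup relies on: one update per user per day instead of one per log line.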
We discovered it wasn’t enough…
• Redundant moving parts
• Excessive network traffic
• Still not scalable enough
Elasticsearch-Spark plug-in to the rescue…
[Diagram: AWS S3, AWS Data Pipeline, AWS EMR, Elasticsearch-Spark plug-in, AWS SNS]
Deep-dive
• Bulk-indexing directly from Spark using the
elasticsearch-hadoop plug-in for Spark:
// Save created RDD records to a file
documentsRdd.saveAsTextFile(outputPath)
Is now:
import org.elasticsearch.hadoop.cfg.ConfigurationOptions
import org.elasticsearch.spark._  // adds saveJsonToEs to RDDs
// Save created RDD records directly to Elasticsearch
documentsRdd.saveJsonToEs(configData.documentResource,
  scala.collection.Map(ConfigurationOptions.ES_MAPPING_ID ->
    configData.documentIdFieldName))
• Storing the update script on the server-side (Elasticsearch)
Better…
• Single component for both processing and indexing
• Elastically scalable
• Out-of-the-box error handling and fault-tolerance
• Spark-level (e.g. spark.task.maxFailures)
• Plug-in level (e.g.
ConfigurationOptions.ES_BATCH_WRITE_RETRY_COUNT/WAIT)
• Less network traffic (update script is stored in
Elasticsearch)
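The plug-in-level retry settings amount to "retry a rejected bulk write a limited number of times, waiting between attempts". The sketch below shows that idea generically. It is not the plug-in's implementation: `write_with_retries`, `make_flaky_bulk_write` and the default values are hypothetical, and the fake writer only simulates rejections.

```python
import time


def write_with_retries(bulk_write, batch, retry_count=3, retry_wait_seconds=0.01):
    """Call bulk_write(batch), retrying only the rejected entries up to retry_count times.

    Returns the number of attempts made; raises RuntimeError if entries remain.
    """
    pending = list(batch)
    attempts = 0
    while True:
        pending = bulk_write(pending)  # returns the entries that were rejected
        attempts += 1
        if not pending:
            return attempts
        if attempts > retry_count:
            raise RuntimeError(
                f"Could not write all entries [{len(pending)}/{len(batch)}]"
            )
        time.sleep(retry_wait_seconds)


def make_flaky_bulk_write(rejections):
    """Build a fake writer that rejects the last entry of the batch `rejections` times."""
    state = {"left": rejections}

    def flaky_bulk_write(entries):
        if state["left"] > 0:
            state["left"] -= 1
            return entries[-1:]
        return []

    return flaky_bulk_write


attempts = write_with_retries(make_flaky_bulk_write(2), ["doc-1", "doc-2", "doc-3"])
print(attempts)
```

A task that exhausts its write retries fails, at which point the Spark-level setting decides how many times the task itself is re-run.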
… But
• Number of deleted documents continually grows
• Also affects query performance
• Elasticsearch itself becomes the bottleneck
• org.elasticsearch.hadoop.EsHadoopException: Could not
write all entries [5/1047872] (maybe ES was
overloaded?). Bailing out...
• [INFO ][index.engine ] [NODE_NAME]
[INDEX_NAME][7] now throttling indexing:
numMergesInFlight=6, maxNumMerges=5
Expunging deleted documents
• Theoretically not a “best practice” but necessary
when doing significant bulk-indexing
• Done through the optimize API
• curl -XPOST 'http://localhost:9200/_optimize?only_expunge_deletes'
• curl -XPOST 'http://localhost:9200/_optimize?max_num_segments=5'
• A heavy operation (time, CPU, I/O)
Improving indexing performance
• Set index.refresh_interval to -1
• Set indices.store.throttle.type to none
• Properly set the retry-related configuration
properties (e.g. spark.task.maxFailures)
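A minimal sketch of how the two Elasticsearch settings could be applied before a load, assuming the index-settings and cluster-settings endpoints of the Elasticsearch 1.x line in use at the time. The code only builds the requests and does not send them; the index name `sample` and the helper name `bulk_load_settings` are placeholders.

```python
import json


def bulk_load_settings(index_name):
    """Return (method, path, body) tuples to apply before a heavy bulk-indexing run."""
    return [
        # Index-level setting: disable periodic refresh while loading.
        ("PUT", f"/{index_name}/_settings",
         json.dumps({"index": {"refresh_interval": "-1"}})),
        # Cluster-level setting: disable store throttling (transient, so it does
        # not survive a full cluster restart).
        ("PUT", "/_cluster/settings",
         json.dumps({"transient": {"indices.store.throttle.type": "none"}})),
    ]


requests = bulk_load_settings("sample")
for method, path, body in requests:
    print(method, path, body)
```

Both changes trade query freshness and merge safety margins for indexing speed, so they would presumably be reverted once the daily load finishes.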
What’s next?
• Further improve indexing performance, e.g.:
• Reduce excessive concurrency on Elasticsearch nodes by
limiting Spark’s maximum concurrent tasks
• Bulk-index objects rather than JSON documents to avoid
excessive parsing
• Better monitoring (e.g. using Spark Accumulators)
To summarize
• We use:
• S3 to store (raw) labeled data
• Spark on EMR to process the data
• Elasticsearch-hadoop plug-in for bulk-indexing
• Data Pipeline to manage the flow
• Elasticsearch for real-time analytics
To summarize
• Updates are expensive – consider daily dedup
• Avoid excessive querying and network traffic -
perform merges on the server-side
• Use an update script
• Store it on your Elasticsearch cluster
• Make sure your loading process is scalable and
fault-tolerant – use Spark
• Reduce # of moving parts
• Index the data directly using elasticsearch-hadoop plug-in
To summarize
• Improve indexing performance – properly configure
your cluster before indexing
• Avoid excessive disk usage – optimize your indices
• Can also help query performance
• Making the processing phase elastically scalable (i.e.
using Spark) doesn’t mean the whole ETL flow is
elastically scalable
• Elasticsearch becomes the new bottleneck…
Questions?
Also - we’re hiring!
http://exelate.com/about-us/careers/
• DevOps team leader
• Senior frontend developers
• Senior Java developers
Thank you
Itai Yaffe
Keep an eye on…
• S3 limitations:
• The penalty involved in moving files
• File partitioning and hash prefixes
* *
©2011 eXelate Inc. Confidential and Proprietary

Weitere ähnliche Inhalte

Was ist angesagt?

ETL VS ELT.pdf
ETL VS ELT.pdfETL VS ELT.pdf
ETL VS ELT.pdfBOSupport
 
Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data MiningIffat Firozy
 
From Conventional Machine Learning to Deep Learning and Beyond.pptx
From Conventional Machine Learning to Deep Learning and Beyond.pptxFrom Conventional Machine Learning to Deep Learning and Beyond.pptx
From Conventional Machine Learning to Deep Learning and Beyond.pptxChun-Hao Chang
 
Churn Analysis in Telecom Industry
Churn Analysis in Telecom IndustryChurn Analysis in Telecom Industry
Churn Analysis in Telecom IndustrySatyam Barsaiyan
 
Predicting Bank Customer Churn Using Classification
Predicting Bank Customer Churn Using ClassificationPredicting Bank Customer Churn Using Classification
Predicting Bank Customer Churn Using ClassificationVishva Abeyrathne
 
Customer-Centric Data Management for Better Customer Experiences
Customer-Centric Data Management for Better Customer ExperiencesCustomer-Centric Data Management for Better Customer Experiences
Customer-Centric Data Management for Better Customer ExperiencesInformatica
 
How Dell Used Neo4j Graph Database to Redesign Their Pricing-as-a-Service Pla...
How Dell Used Neo4j Graph Database to Redesign Their Pricing-as-a-Service Pla...How Dell Used Neo4j Graph Database to Redesign Their Pricing-as-a-Service Pla...
How Dell Used Neo4j Graph Database to Redesign Their Pricing-as-a-Service Pla...Neo4j
 
Introduction to Knowledge Graphs and Semantic AI
Introduction to Knowledge Graphs and Semantic AIIntroduction to Knowledge Graphs and Semantic AI
Introduction to Knowledge Graphs and Semantic AISemantic Web Company
 
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyRohit Dubey
 
MIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_BhatiaMIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_BhatiaRahul Bhatia
 
Information Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachInformation Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachAIRCC Publishing Corporation
 
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...Sri Ambati
 
How Data is Driving AI Innovation
How Data is Driving AI InnovationHow Data is Driving AI Innovation
How Data is Driving AI InnovationMatt Turner
 
Heart Failure Prediction Model Using ANN.pptx
Heart Failure Prediction Model Using ANN.pptxHeart Failure Prediction Model Using ANN.pptx
Heart Failure Prediction Model Using ANN.pptxJatinSinghSagoi
 
OSINT-_-nouveau-paradigme-du-renseignement.-.pdf
OSINT-_-nouveau-paradigme-du-renseignement.-.pdfOSINT-_-nouveau-paradigme-du-renseignement.-.pdf
OSINT-_-nouveau-paradigme-du-renseignement.-.pdfPoirosint
 
3 - Finding similar items
3 - Finding similar items3 - Finding similar items
3 - Finding similar itemsViet-Trung TRAN
 

Was ist angesagt? (20)

ETL VS ELT.pdf
ETL VS ELT.pdfETL VS ELT.pdf
ETL VS ELT.pdf
 
Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data Mining
 
Hadoop
HadoopHadoop
Hadoop
 
From Conventional Machine Learning to Deep Learning and Beyond.pptx
From Conventional Machine Learning to Deep Learning and Beyond.pptxFrom Conventional Machine Learning to Deep Learning and Beyond.pptx
From Conventional Machine Learning to Deep Learning and Beyond.pptx
 
Churn Analysis in Telecom Industry
Churn Analysis in Telecom IndustryChurn Analysis in Telecom Industry
Churn Analysis in Telecom Industry
 
Predicting Bank Customer Churn Using Classification
Predicting Bank Customer Churn Using ClassificationPredicting Bank Customer Churn Using Classification
Predicting Bank Customer Churn Using Classification
 
Customer-Centric Data Management for Better Customer Experiences
Customer-Centric Data Management for Better Customer ExperiencesCustomer-Centric Data Management for Better Customer Experiences
Customer-Centric Data Management for Better Customer Experiences
 
How Dell Used Neo4j Graph Database to Redesign Their Pricing-as-a-Service Pla...
How Dell Used Neo4j Graph Database to Redesign Their Pricing-as-a-Service Pla...How Dell Used Neo4j Graph Database to Redesign Their Pricing-as-a-Service Pla...
How Dell Used Neo4j Graph Database to Redesign Their Pricing-as-a-Service Pla...
 
Prediction of housing price
Prediction of housing pricePrediction of housing price
Prediction of housing price
 
Introduction to Knowledge Graphs and Semantic AI
Introduction to Knowledge Graphs and Semantic AIIntroduction to Knowledge Graphs and Semantic AI
Introduction to Knowledge Graphs and Semantic AI
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit Dubey
 
MIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_BhatiaMIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_Bhatia
 
Information Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachInformation Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis Approach
 
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
 
How Data is Driving AI Innovation
How Data is Driving AI InnovationHow Data is Driving AI Innovation
How Data is Driving AI Innovation
 
Chapitre i-intro
Chapitre i-introChapitre i-intro
Chapitre i-intro
 
Heart Failure Prediction Model Using ANN.pptx
Heart Failure Prediction Model Using ANN.pptxHeart Failure Prediction Model Using ANN.pptx
Heart Failure Prediction Model Using ANN.pptx
 
OSINT-_-nouveau-paradigme-du-renseignement.-.pdf
OSINT-_-nouveau-paradigme-du-renseignement.-.pdfOSINT-_-nouveau-paradigme-du-renseignement.-.pdf
OSINT-_-nouveau-paradigme-du-renseignement.-.pdf
 
3 - Finding similar items
3 - Finding similar items3 - Finding similar items
3 - Finding similar items
 

Andere mochten auch

Elasticsearch and Spark
Elasticsearch and SparkElasticsearch and Spark
Elasticsearch and SparkAudible, Inc.
 
2014 spark with elastic search
2014   spark with elastic search2014   spark with elastic search
2014 spark with elastic searchHenry Saputra
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
 
ElasticSearch : Architecture et Développement
ElasticSearch : Architecture et DéveloppementElasticSearch : Architecture et Développement
ElasticSearch : Architecture et DéveloppementMohamed hedi Abidi
 
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataCombine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataHortonworks
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman
 
Tirer le meilleur de ses données avec ElasticSearch
Tirer le meilleur de ses données avec ElasticSearchTirer le meilleur de ses données avec ElasticSearch
Tirer le meilleur de ses données avec ElasticSearchSéven Le Mesle
 
Large scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloudLarge scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloudDataWorks Summit
 
ElasticSearch on AWS
ElasticSearch on AWSElasticSearch on AWS
ElasticSearch on AWSPhilipp Garbe
 
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Chris Fregly
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchAbhishek Andhavarapu
 
Scaling real-time search and analytics with Elasticsearch
Scaling real-time search and analytics with ElasticsearchScaling real-time search and analytics with Elasticsearch
Scaling real-time search and analytics with Elasticsearchclintongormley
 
Spark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug GrallSpark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug GrallSpark Summit
 
Apache Flume
Apache FlumeApache Flume
Apache FlumeGetInData
 
Real Time search using Spark and Elasticsearch
Real Time search using Spark and ElasticsearchReal Time search using Spark and Elasticsearch
Real Time search using Spark and ElasticsearchSigmoid
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexasArvind Prabhakar
 
Elasticsearch in 15 minutes
Elasticsearch in 15 minutesElasticsearch in 15 minutes
Elasticsearch in 15 minutesDavid Pilato
 
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...sparktc
 
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache FlumeFeb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache FlumeYahoo Developer Network
 

Andere mochten auch (20)

Elasticsearch and Spark
Elasticsearch and SparkElasticsearch and Spark
Elasticsearch and Spark
 
2014 spark with elastic search
2014   spark with elastic search2014   spark with elastic search
2014 spark with elastic search
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
ElasticSearch : Architecture et Développement
ElasticSearch : Architecture et DéveloppementElasticSearch : Architecture et Développement
ElasticSearch : Architecture et Développement
 
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataCombine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Tirer le meilleur de ses données avec ElasticSearch
Tirer le meilleur de ses données avec ElasticSearchTirer le meilleur de ses données avec ElasticSearch
Tirer le meilleur de ses données avec ElasticSearch
 
Large scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloudLarge scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloud
 
ElasticSearch on AWS
ElasticSearch on AWSElasticSearch on AWS
ElasticSearch on AWS
 
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and Elasticsearch
 
Scaling real-time search and analytics with Elasticsearch
Scaling real-time search and analytics with ElasticsearchScaling real-time search and analytics with Elasticsearch
Scaling real-time search and analytics with Elasticsearch
 
Spark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug GrallSpark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug Grall
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Real Time search using Spark and Elasticsearch
Real Time search using Spark and ElasticsearchReal Time search using Spark and Elasticsearch
Real Time search using Spark and Elasticsearch
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexas
 
Elasticsearch in 15 minutes
Elasticsearch in 15 minutesElasticsearch in 15 minutes
Elasticsearch in 15 minutes
 
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
 
Introducing Akka
Introducing AkkaIntroducing Akka
Introducing Akka
 
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache FlumeFeb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
 

Ähnlich wie Building an ETL pipeline for Elasticsearch using Spark

Elastic on a Hyper-Converged Infrastructure for Operational Log Analytics
Elastic on a Hyper-Converged Infrastructure for Operational Log AnalyticsElastic on a Hyper-Converged Infrastructure for Operational Log Analytics
Elastic on a Hyper-Converged Infrastructure for Operational Log AnalyticsElasticsearch
 
Exadata Smart Scan - What is so smart about it?
Exadata Smart Scan  - What is so smart about it?Exadata Smart Scan  - What is so smart about it?
Exadata Smart Scan - What is so smart about it?Uwe Hesse
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Optimize with Open Source
Optimize with Open SourceOptimize with Open Source
Optimize with Open SourceEDB
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017AWS Chicago
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Databricks
 
Дмитрий Попович "How to build a data warehouse?"
Дмитрий Попович "How to build a data warehouse?"Дмитрий Попович "How to build a data warehouse?"
Дмитрий Попович "How to build a data warehouse?"Fwdays
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon RedshiftAmazon Web Services
 
Achieving cyber mission assurance with near real-time impact
Achieving cyber mission assurance with near real-time impactAchieving cyber mission assurance with near real-time impact
Achieving cyber mission assurance with near real-time impactElasticsearch
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
Introducing DataStax Enterprise 4.7
Introducing DataStax Enterprise 4.7Introducing DataStax Enterprise 4.7
Introducing DataStax Enterprise 4.7DataStax
 
Using ELK Explore Defect Data
Using ELK Explore Defect DataUsing ELK Explore Defect Data
Using ELK Explore Defect Dataatf117
 
UsingELKExploreDefectData
UsingELKExploreDefectDataUsingELKExploreDefectData
UsingELKExploreDefectDataYabin Xu
 
Introducing the eDB360 Tool
Introducing the eDB360 ToolIntroducing the eDB360 Tool
Introducing the eDB360 ToolCarlos Sierra
 
Getting Started with Elasticsearch
Getting Started with ElasticsearchGetting Started with Elasticsearch
Getting Started with ElasticsearchAlibaba Cloud
 
Ten tools for ten big data areas 01 informatica
Ten tools for ten big data areas 01 informatica Ten tools for ten big data areas 01 informatica
Ten tools for ten big data areas 01 informatica Will Du
 

Ähnlich wie Building an ETL pipeline for Elasticsearch using Spark (20)

Elastic on a Hyper-Converged Infrastructure for Operational Log Analytics
Elastic on a Hyper-Converged Infrastructure for Operational Log AnalyticsElastic on a Hyper-Converged Infrastructure for Operational Log Analytics
Elastic on a Hyper-Converged Infrastructure for Operational Log Analytics
 
Exadata Smart Scan - What is so smart about it?
Exadata Smart Scan  - What is so smart about it?Exadata Smart Scan  - What is so smart about it?
Exadata Smart Scan - What is so smart about it?
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Optimize with Open Source
Optimize with Open SourceOptimize with Open Source
Optimize with Open Source
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
ABD217_From Batch to Streaming
ABD217_From Batch to StreamingABD217_From Batch to Streaming
ABD217_From Batch to Streaming
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
 
Дмитрий Попович "How to build a data warehouse?"
Дмитрий Попович "How to build a data warehouse?"Дмитрий Попович "How to build a data warehouse?"
Дмитрий Попович "How to build a data warehouse?"
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
Achieving cyber mission assurance with near real-time impact
Achieving cyber mission assurance with near real-time impactAchieving cyber mission assurance with near real-time impact
Achieving cyber mission assurance with near real-time impact
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Introducing DataStax Enterprise 4.7
Introducing DataStax Enterprise 4.7Introducing DataStax Enterprise 4.7
Introducing DataStax Enterprise 4.7
 
Using ELK Explore Defect Data
Using ELK Explore Defect DataUsing ELK Explore Defect Data
Using ELK Explore Defect Data
 
UsingELKExploreDefectData
UsingELKExploreDefectDataUsingELKExploreDefectData
UsingELKExploreDefectData
 
Introducing the eDB360 Tool
Introducing the eDB360 ToolIntroducing the eDB360 Tool
Introducing the eDB360 Tool
 
Getting Started with Elasticsearch
Getting Started with ElasticsearchGetting Started with Elasticsearch
Getting Started with Elasticsearch
 
Ten tools for ten big data areas 01 informatica
Ten tools for ten big data areas 01 informatica Ten tools for ten big data areas 01 informatica
Ten tools for ten big data areas 01 informatica
 
Elastic Data Warehousing
Elastic Data WarehousingElastic Data Warehousing
Elastic Data Warehousing
 





Building an ETL pipeline for Elasticsearch using Spark

  • 1. Building an ETL pipeline for Elasticsearch using Spark • Itai Yaffe, Big Data Infrastructure Developer • December 2015 • @2014 eXelate Inc. Confidential and Proprietary
  • 2. Agenda • About eXelate • About the team • eXelate’s architecture overview • The need • The problem • Why Elasticsearch and how do we use it? • Loading the data • Re-designing the loading process • Additional improvements • To summarize
  • 3. About eXelate, a Nielsen company • Founded in 2007 • Acquired by Nielsen in March 2015 • A leader in the Ad Tech industry • Provides data and software services through: • eXchange (2 billion users) • maX DMP (data management platform)
  • 4. Our numbers • ~10 billion events per day • ~150TB of data per day • Hybrid cloud infrastructure • 4 Data Centers • Amazon Web Services
  • 5. About the team • The BDI (Big Data Infrastructure) team is in charge of shipping, transforming and loading eXelate’s data into various data stores, making it ready to be queried efficiently • For the last year and a half, we’ve been transitioning our legacy systems to modern, scale-out infrastructure (Spark, Kafka, etc.)
  • 6. About me • Dealing with Big Data challenges for the last 3.5 years, using: • Cassandra • Spark • Elasticsearch • And others… • Joined eXelate in May 2014 • Previously: OpTier, Mamram • LinkedIn: https://www.linkedin.com/in/itaiy • Email: itai.yaffe@nielsen.com
  • 7. eXelate’s architecture overview • Diagram labels: Incoming HTTP requests, Serving (frontend servers), ETL (three instances), DMP applications (SaaS), DB, DWH
  • 8. The need
  • 9. The need
  • 10. The need • From the data perspective: • ETL – collect raw data and load it into Elasticsearch periodically • Tens of millions of events per day • Data is already labeled • Query – allow ad hoc calculations based on the stored data • Mainly counting unique users related to a specific campaign, in conjunction with geographic/demographic data, limited by date range • The number of permutations is huge, so results can’t be pre-calculated and real-time queries are a must!
  • 11. The problem • We chose Elasticsearch as the data store (details to follow) • But… the ETL process was far from optimal • It also affected query performance
  • 12. Why Elasticsearch? • Originally designed as a text search engine • Today it has advanced real-time analytics capabilities • Distributed, scalable and highly available
  • 13. How do we use Elasticsearch? • We rely heavily on its counting capabilities • We split the data into separate indices based on a few criteria (e.g. TTL, tags vs. segments) • Each user (i.e. device) is stored as a document with many nested documents
  • 14. How do we use Elasticsearch? • Sample user document:
        {
          "_index": "sample",
          "_type": "user",
          "_id": "0c31644ad41e32c819be29ba16e14300",
          "_version": 4,
          "_score": 1,
          "_source": {
            "events": [
              {
                "event_time": "2014-01-18",
                "segments": [
                  { "segment": "female" },
                  { "segment": "Airplane tickets" }
                ]
              },
              {
                "event_time": "2014-02-19",
                "segments": [
                  { "segment": "female" },
                  { "segment": "Hotel reservations" }
                ]
              }
            ]
          }
        }
  • 15. Loading the data
  • 16. Standalone Java loader application • Runs every few minutes • Parses the log files • For each user encountered: • Queries Elasticsearch to get the user’s document • Merges the new data into the document on the client side • Bulk-indexes documents into Elasticsearch
  • 17. OK, so what’s the problem? • Multiple updates per user per day • Updates in Elasticsearch are expensive (basically delete + insert) • Merges are done on the client side • Involves redundant queries • Leads to degradation of query performance • Not scalable or highly available
  • 18. Re-designing the loading process • Batch processing once a day during off-hours • Daily dedup leads to ~75% fewer update operations in Elasticsearch • Using Spark as our processing framework • Distributed, scalable and highly available • Unified framework for batch, streaming, machine learning, etc. • Using an update script • Merges are done on the server side
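The daily dedup step can be pictured as collapsing all of a day's events for one user into a single per-user entry before anything is sent to Elasticsearch. The following is a minimal sketch in plain Java, not the original Spark job: the `Event` class, its field names and `groupByUser` are hypothetical, and it assumes each event carries a user ID, an event date and one segment label.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class DailyDedup {
    // Hypothetical event shape: one labeled event for one user (device).
    static final class Event {
        final String userId;
        final String eventTime;
        final String segment;

        Event(String userId, String eventTime, String segment) {
            this.userId = userId;
            this.eventTime = eventTime;
            this.segment = segment;
        }
    }

    // Collapse a batch of events into one entry per user:
    // user ID -> event date -> distinct segment names.
    static Map<String, Map<String, Set<String>>> groupByUser(List<Event> events) {
        Map<String, Map<String, Set<String>>> byUser = new TreeMap<>();
        for (Event e : events) {
            byUser.computeIfAbsent(e.userId, k -> new TreeMap<>())
                  .computeIfAbsent(e.eventTime, k -> new TreeSet<>())
                  .add(e.segment);
        }
        return byUser;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event("u1", "2014-01-18", "female"));
        events.add(new Event("u1", "2014-01-18", "Airplane tickets"));
        events.add(new Event("u1", "2014-01-18", "female")); // repeated during the day
        events.add(new Event("u2", "2014-01-18", "Hotel reservations"));

        Map<String, Map<String, Set<String>>> docs = groupByUser(events);
        // One update per user instead of one per incoming event.
        System.out.println(events.size() + " events, " + docs.size() + " updates");
    }
}
```

In a Spark job the same grouping would be expressed over a distributed collection keyed by user; the sketch only shows why one pass per day yields at most one update per user.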
  • 19. Elasticsearch update script
        import groovy.json.JsonSlurper;
        added=false;
        def slurper = new JsonSlurper();
        // param1 is a JSON string holding the new event(s); ttl is passed in as a script parameter
        def result = slurper.parseText(param1);
        ctx._ttl = ttl;
        ctx._source.events.each() { item->
          // Same date already stored: add only the segments that are not there yet
          if (item.event_time == result[0].event_time) {
            def segmentMap = [:];
            item.segments.each() { segmentMap.put(it.segment,it.segment) };
            result[0].segments.each{ if(!segmentMap[it.segment]){ item.segments += it } };
            added=true;
          }
        };
        // No stored event with this date: append the new event(s)
        if(!added) { ctx._source.events += result }
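To make the merge rule in the update script easier to follow, here is a sketch of the same logic in plain Java. It is an illustration only, not code that runs inside Elasticsearch: the `Event` class and `merge` method are hypothetical names, and it assumes a single incoming event per update (the script compares against `result[0]`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EventMerge {
    // One entry of the document's "events" array: a date plus its segment names.
    static final class Event {
        final String eventTime;
        final List<String> segments;

        Event(String eventTime, List<String> segments) {
            this.eventTime = eventTime;
            this.segments = new ArrayList<>(segments);
        }
    }

    // Add unseen segments to the stored event with the same date,
    // or append the incoming event when no stored date matches.
    static void merge(List<Event> stored, Event incoming) {
        boolean added = false;
        for (Event item : stored) {
            if (item.eventTime.equals(incoming.eventTime)) {
                Set<String> seen = new HashSet<>(item.segments);
                for (String segment : incoming.segments) {
                    if (!seen.contains(segment)) {
                        item.segments.add(segment);
                    }
                }
                added = true;
            }
        }
        if (!added) {
            stored.add(incoming);
        }
    }

    public static void main(String[] args) {
        List<Event> stored = new ArrayList<>();
        stored.add(new Event("2014-01-18", Arrays.asList("female", "Airplane tickets")));

        // Same date: only the new segment is added.
        merge(stored, new Event("2014-01-18", Arrays.asList("female", "Car rentals")));
        // New date: the whole event is appended.
        merge(stored, new Event("2014-02-19", Arrays.asList("female", "Hotel reservations")));

        for (Event e : stored) {
            System.out.println(e.eventTime + " " + e.segments);
        }
    }
}
```

Running this merge on the server side is what removes the read-modify-write round trip that the standalone loader performed for every user.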
  • 20. Re-designing the loading process • Diagram components: AWS S3, AWS Data Pipeline, AWS EMR, AWS SNS
  • 21. Zoom-in • Log files are compressed (.gz) CSVs • Once a day: • Files are copied and uncompressed into the EMR cluster using S3DistCp • The Spark application: • Groups events by user and builds JSON documents, which include an inline update script • Writes the JSON documents back to S3 • The Scala application reads the documents from S3 and bulk-indexes them into Elasticsearch • Notifications are sent via SNS
  • 22. We discovered it wasn’t enough… • Redundant moving parts • Excessive network traffic • Still not scalable enough
  • 23. Elasticsearch-Spark plug-in to the rescue… • Diagram components: AWS S3, AWS Data Pipeline, AWS EMR, Elasticsearch-Spark plug-in, AWS SNS
  • 24. Deep-dive • Bulk-indexing directly from Spark using the elasticsearch-hadoop plug-in for Spark:
        // Save created RDD records to a file
        documentsRdd.saveAsTextFile(outputPath)
    Is now:
        // Save created RDD records directly to Elasticsearch
        documentsRdd.saveJsonToEs(configData.documentResource,
          scala.collection.Map(ConfigurationOptions.ES_MAPPING_ID -> configData.documentIdFieldName))
    • Storing the update script on the server side (Elasticsearch)
  • 25. Better… • Single component for both processing and indexing • Elastically scalable • Out-of-the-box error handling and fault tolerance • Spark level (e.g. spark.task.maxFailures) • Plug-in level (e.g. ConfigurationOptions.ES_BATCH_WRITE_RETRY_COUNT / WAIT) • Less network traffic (the update script is stored in Elasticsearch)
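The plug-in-level retry settings amount to a bounded number of retries with a pause between attempts. The sketch below shows that general pattern in plain Java; it is not the plug-in's implementation (which retries at the level of bulk entries), and `writeWithRetry`, the `BooleanSupplier` standing in for a bulk call, and the parameter names are all hypothetical.

```java
import java.util.function.BooleanSupplier;

public class BulkRetry {
    // Attempt a bulk write up to (1 + retryCount) times, sleeping waitMillis
    // between attempts. Returns the attempt number that succeeded.
    static int writeWithRetry(BooleanSupplier bulkWrite, int retryCount, long waitMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= retryCount + 1; attempt++) {
            if (bulkWrite.getAsBoolean()) {
                return attempt;
            }
            if (attempt <= retryCount) {
                Thread.sleep(waitMillis); // give the cluster time to recover
            }
        }
        throw new IllegalStateException(
            "Bulk write still failing after " + (retryCount + 1) + " attempts");
    }

    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // Simulated bulk call that is rejected twice, then accepted.
        BooleanSupplier flakyWrite = () -> ++calls[0] >= 3;
        int attempt = writeWithRetry(flakyWrite, 3, 10L);
        System.out.println("Succeeded on attempt " + attempt);
    }
}
```

When the retries are exhausted the task fails, at which point the Spark-level setting (spark.task.maxFailures) decides whether the task is rescheduled.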
  • 26. … But • The number of deleted documents continually grows • This also affects query performance • Elasticsearch itself becomes the bottleneck • org.elasticsearch.hadoop.EsHadoopException: Could not write all entries [5/1047872] (maybe ES was overloaded?). Bailing out... • [INFO ][index.engine ] [NODE_NAME] [INDEX_NAME][7] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
  • 27. Expunging deleted documents • Theoretically not a “best practice” but necessary when doing significant bulk-indexing • Done through the optimize API • curl -XPOST http://localhost:9200/_optimize?only_expunge_deletes • curl -XPOST http://localhost:9200/_optimize?max_num_segments=5 • A heavy operation (time, CPU, I/O)
  • 28. Improving indexing performance • Set index.refresh_interval to -1 • Set indices.store.throttle.type to none • Properly set the retry-related configuration properties (e.g. spark.task.maxFailures)
  • 29. What’s next? • Further improve indexing performance, e.g.: • Reduce excessive concurrency on Elasticsearch nodes by limiting Spark’s maximum concurrent tasks • Bulk-index objects rather than JSON documents to avoid excessive parsing • Better monitoring (e.g. using Spark Accumulators)
  • 30. To summarize • We use: • S3 to store (raw) labeled data • Spark on EMR to process the data • Elasticsearch-hadoop plug-in for bulk-indexing • Data Pipeline to manage the flow • Elasticsearch for real-time analytics
  • 31. To summarize • Updates are expensive – consider daily dedup • Avoid excessive querying and network traffic – perform merges on the server side • Use an update script • Store it on your Elasticsearch cluster • Make sure your loading process is scalable and fault-tolerant – use Spark • Reduce the number of moving parts • Index the data directly using the elasticsearch-hadoop plug-in
  • 32. To summarize • Improve indexing performance – properly configure your cluster before indexing • Avoid excessive disk usage – optimize your indices • Can also help query performance • Making the processing phase elastically scalable (i.e. using Spark) doesn’t mean the whole ETL flow is elastically scalable • Elasticsearch becomes the new bottleneck…
  • 33. Questions? Also – we’re hiring! http://exelate.com/about-us/careers/ • DevOps team leader • Senior frontend developers • Senior Java developers
  • 34. Thank you • Itai Yaffe
  • 35. Keep an eye on… • S3 limitations: • The penalty involved in moving files • File partitioning and hash prefixes