Batch Indexing & Near Real Time,
keeping things fast
Marc Sturlese
Software engineer @ Trovit
Thursday, 2 May 2013
About me...
• Marc Sturlese – @sturlese
• Software engineer @Trovit. R&D focused
• Responsible for search and scalability
Agenda
• Who we are
• Batch architecture. Hadoop & Hive
• Near real time architecture. Storm & stuff
• Putting it all together
• Alternatives and Future directions
• Questions
Who we are
Trovit, a search engine for classifieds
Batch Layer
• Hadoop based
• Documents are crunched by a pipeline of MR
jobs
• Hive to save stats of each phase
Batch Layer
Pipeline overview
[Diagram: Incoming data, external data and the t – 1 output feed the Hadoop cluster. Phases: Ad Processor, Diff, Matching, Expiration, Deduplication, Indexing. Outputs: Hive stats and Lucene indexes for deployment.]
Batch Layer
The good things!
• Index always built from scratch. Small number of
big segments
• Multicast deployment sends indexes to
all slaves at the same time
• Backups convenient on HDFS
Batch Layer
That was cool but...
• Not even close to real time
• Crunching documents in batch means waiting until
everything is processed. This can take a few hours
• We want to show the user fresher results!
Near real time Layer
Storm and stuff to the rescue
Near real time Layer
Storm properties
• Distributed real time computation system
• Fault tolerance
• Horizontal scalability
• Low latency
• Reliability
Near real time Layer
Storm in action
[Diagram: Sources (XML feeds, Kafka partitions) feed the Storm topology. An XML spout and Kafka spouts send to the Doc Manager bolt (shuffle grouping), which sends to the Indexer bolt (field grouping). The Indexer bolt writes to the Solr prod replicas (slaves).]
Near real time Layer
Storm in action
• Spouts just read and send
• Doc Manager Bolt processes and classifies
• Indexer Bolt adds documents to Solr
• Logic replicated from the batch layer, with a different implementation
• Careful not to overload Solr slaves...
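The three roles above can be sketched in a few lines. This is a minimal model in plain Python, not Storm's API: generators stand in for spouts and bolts, a dict stands in for Solr, and the record fields (`vertical`, `country`, `text`) and the core naming scheme are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    core: str   # hypothetical target Solr core, e.g. "cars_us"
    body: str


def spout(raw_records):
    """Spout: just read and send, no processing."""
    for record in raw_records:
        yield record


def doc_manager_bolt(records):
    """Doc Manager bolt: process and classify each record."""
    for rec in records:
        core = f"{rec['vertical']}_{rec['country']}".lower()
        yield Doc(doc_id=rec["id"], core=core, body=rec["text"].strip())


def indexer_bolt(docs, solr):
    """Indexer bolt: add each document to its core (a dict stands in for Solr)."""
    for doc in docs:
        solr.setdefault(doc.core, []).append(doc.doc_id)
    return solr


raw = [
    {"id": "1", "vertical": "Cars", "country": "US", "text": " red sedan "},
    {"id": "2", "vertical": "Jobs", "country": "BR", "text": "java dev"},
]
solr = indexer_bolt(doc_manager_bolt(spout(raw)), {})
print(solr)  # {'cars_us': ['1'], 'jobs_br': ['2']}
```

Keeping the spout free of logic means all processing cost sits in the bolts, which is where parallelism would be tuned.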
Near real time Layer
Storm in action. But...
• Now Solr has to handle both user queries and Storm
inserts
• Field grouping on Indexer Bolt for politeness
• Small bulks to reduce insert requests
• Committing on many cores, same host, same
time can be painful
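A rough sketch of how these three measures could fit together. It is plain Python rather than Storm or ZooKeeper code: a stable hash of the core name models field grouping, a per-core buffer models small bulks, and a `threading.Lock` per host stands in for a distributed lock. The task count, bulk size, core names and host names are all assumptions.

```python
import threading
import zlib
from collections import defaultdict

NUM_INDEXER_TASKS = 4   # assumed parallelism of the indexer bolt
BULK_SIZE = 3           # assumed number of docs per insert request


def task_for(core: str) -> int:
    """Field grouping: a stable hash of the core name picks the task."""
    return zlib.crc32(core.encode("utf-8")) % NUM_INDEXER_TASKS


class IndexerTask:
    def __init__(self, host_locks):
        self.buffers = defaultdict(list)   # core -> pending doc ids
        self.host_locks = host_locks       # host -> lock (stand-in for ZooKeeper)
        self.requests = []                 # (core, bulk) insert requests sent
        self.commits = []                  # (core, host) commits issued

    def add(self, core, doc_id):
        """Buffer the doc and send one request per full bulk."""
        self.buffers[core].append(doc_id)
        if len(self.buffers[core]) >= BULK_SIZE:
            self.requests.append((core, self.buffers.pop(core)))

    def commit(self, core, host):
        """Allow only one commit at a time on a given host."""
        with self.host_locks[host]:
            self.commits.append((core, host))


host_locks = defaultdict(threading.Lock)
tasks = [IndexerTask(host_locks) for _ in range(NUM_INDEXER_TASKS)]
for i in range(7):
    tasks[task_for("cars_us")].add("cars_us", f"doc{i}")

task = tasks[task_for("cars_us")]
task.commit("cars_us", "slave1")
print(len(task.requests), task.buffers["cars_us"])  # 2 ['doc6']
```

Because every document for a core lands on the same task, bulks fill up faster and each core sees inserts from a single writer.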
Near real time Layer
Storm in action - Committing
[Diagram: Indexer bolts (Cars US, Jobs BR) go through a ZooKeeper locker before committing to the cores hosted on Slave 1 ... Slave N: Real estate UK R1, Cars US R1, Cars US R2, Jobs BR R1, Jobs BR R2, Real estate ES R1.]
Near real time Layer
Storm in action
• Adding documents now is fast
• Keep number of segments small
• Avoid merges on big segments
• Just add new docs (no deletes or updates)
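The add-only rule can be illustrated with a small filter. This is a sketch under assumptions: a Python set stands in for whatever store tracks known documents (the deck's architecture diagram mentions HBase doc info and an "Exists?" check), and the ids are made up.

```python
def filter_new_docs(incoming, known_ids):
    """Return only docs whose id has not been seen, and record them as seen."""
    fresh = []
    for doc in incoming:
        if doc["id"] in known_ids:
            continue  # update of an existing doc: left to the batch layer
        known_ids.add(doc["id"])
        fresh.append(doc)
    return fresh


known = {"ad-1", "ad-2"}  # stand-in for the doc info store
incoming = [{"id": "ad-2"}, {"id": "ad-3"}, {"id": "ad-3"}, {"id": "ad-4"}]
to_index = filter_new_docs(incoming, known)
print([d["id"] for d in to_index])  # ['ad-3', 'ad-4']
```

Skipping updates and deletes keeps the near real time path append-only, so the big segments built by the batch layer are never touched.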
Mixed Architecture
Putting it all together
[Diagram: Sources (XML feeds, Kafka partitions) feed the Storm topology, which checks HBase doc info ("Exists?") and bulk-adds documents to the Solr prod replicas (slaves). The MR pipeline and ZooKeeper (zk) also appear in the picture.]
Mixed Architecture
Swapping indexes
• NRT docs might not be contained in the new
batch index (they can be fresher than the batch
index being built)
• This can lead to inconsistencies...
Mixed Architecture
Swapping indexes. Time jumps!
Mixed Architecture
Swapping indexes
[Diagram elements: HBase; XML feeds t, t+1, t+2; batch indexer with Pipeline t and Pipeline t+1; NRT indexer; Slave t and Slave t+1.]
Mixed Architecture
Swapping indexes
[Diagram elements: HBase; XML feeds t, t+1, t+2; batch indexer with Pipeline t and Pipeline t+1; NRT indexer with NRT t+1 and NRT t+2; Slave t and Slave t+1.]
Mixed Architecture
Swapping indexes
• NRT-indexed docs must be kept in temporary
storage
• Fetch missing docs from the storage and add
them before the next deploy
• This avoids time jumps
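The reconciliation step might look roughly like this. It is a sketch, not the deck's implementation: dicts stand in for the new batch index and for the temporary NRT storage, and all ids and values are invented.

```python
def docs_missing_from_batch(nrt_store, batch_index_ids):
    """Docs indexed by the NRT layer that the new batch index lacks."""
    return {doc_id: doc for doc_id, doc in nrt_store.items()
            if doc_id not in batch_index_ids}


def prepare_deploy(batch_index, nrt_store):
    """Add the missing NRT docs to the batch index before it is deployed."""
    missing = docs_missing_from_batch(nrt_store, batch_index.keys())
    merged = dict(batch_index)
    merged.update(missing)
    return merged


batch_index = {"ad-1": "v1", "ad-2": "v1"}                    # built by the pipeline
nrt_store = {"ad-2": "nrt", "ad-3": "nrt", "ad-4": "nrt"}     # temporary storage
new_index = prepare_deploy(batch_index, nrt_store)
print(sorted(new_index))  # ['ad-1', 'ad-2', 'ad-3', 'ad-4']
```

Documents already in the batch index keep their batch version, and only the ones the pipeline has not yet seen are carried over, so users do not see fresh ads vanish after a swap.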
Mixed Architecture
Storm and Hadoop
• Near real time inserts, low latency
• Hadoop handles deletes and updates. No rush
on those
• No merges on big segments so optimal query
response times
• Tolerant to human errors
• Temporary loss of accuracy on the NRT layer
Alternatives
SolrCloud - Why not?
• Good for the vast majority of use cases
• Oriented to incremental inserts/updates/deletes.
You pay for segment merges to get real time
• We need to deploy full indexes fast (faster than rsync
or HTTP replication)
• Now full deploy easier with aliases
Future directions
Lucene real time feature
• Lets you see docs in the index before they are
committed
• Good but not a must right now for the use case
• Very easy to integrate into the current
architecture
Questions?
Thanks for your attention!
Marc Sturlese
marc@trovit.com
Lucene/Solr Revolution 2013, San Diego, May 1 2013
