Hadoop use case: A scalable
vertical search engine	
Iván de Prado Alonso, Datasalt Co-founder	
Twitter: @ivanprado
Content	

§  The problem	
§  The obvious solution	
§  When the obvious solution fails…	
§  … Hadoop comes to the rescue	
§  Advantages & disadvantages	
§  Improvements
What is a vertical search engine?
[Diagram: Provider 1 and Provider 2 each supply a Feed to the Vertical Search Engine, which handles Searches.]
Some of them
The “obvious” architecture	
The first thing that comes to your mind
[Diagram: Feed → Download & Process (Does it exist? Has it changed?) → Insert/update into the Database and into the Lucene/Solr Index → Search Page]
How it works
                               	
§  Feed download	
§  For every record in the feed
   •  Check for existence in the DB	
   •  If it exists and has changed, update	
      ª The DB	
      ª The Index	
   •  If it doesn’t exist, insert into	
      ª The DB	
      ª The Index
How it works (II)
                              	
§  The Database is used for	
   •  Checking whether a record already exists (avoiding duplicates)
   •  Managing the data with the convenience of SQL
§  Lucene/Solr is used for	
   •    Quick searches	
   •    Searching by structured fields	
   •    Free-text searches	
   •    Faceting
But if things go well…
[Diagram: the slide fills up with dozens of feeds]
Huge jam!
“Swiss army knife of the 21st century”
Media Guardian Innovation Awards
http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
Hadoop	
“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model”
From the Hadoop homepage
File System	

§  Distributed File System (HDFS)	
  •  Cluster of nodes exposing their storage
     capacity	
  •  Big blocks: 64 MB
  •  Fault tolerant (replication)
  •  Storage of big files
MapReduce	
§  Two functions (Map and Reduce)
   •  Map(k, v) : [z, w]*
   •  Reduce(z, w*) : [u, v]*
§  Example: word count
   •  Map(document, null) -> [word, 1]*
   •  Reduce(word, 1*) -> [word, total]
§  MapReduce & SQL
   •  SELECT word, count(*) FROM words GROUP BY word (words being a hypothetical table with one row per word occurrence)
§  Distributed execution on a cluster	
§  Horizontal scalability
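
A minimal single-process sketch of the word count example, to make the Map and Reduce signatures concrete. It only simulates the model in memory; the helper `run_mapreduce` is hypothetical and is not part of the Hadoop API.

```python
from collections import defaultdict


def map_fn(document, _value):
    """Map(document, null) -> [word, 1]*"""
    for word in document.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce(word, 1*) -> [word, total]"""
    yield word, sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Run map, group the pairs by key (the shuffle), then run reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


word_counts = dict(
    run_mapreduce([("to be or not to be", None), ("to do", None)], map_fn, reduce_fn)
)
print(word_counts)
```

On a cluster the grouping step is what gets distributed: each reducer receives all the values for a subset of the keys.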
OK, that’s cool, but… how does it solve my problem?
Because…	

§  Hadoop is not a Database	
§  Hadoop “apparently” only
    processes data	
§  Hadoop does not allow “lookups”	

     Hadoop is a paradigm shift that is difficult to assimilate
Architecture
Philosophy	
§  Always reprocess everything. EVERYTHING!
§  Why?
     •  More bug tolerant
     •  More flexible
     •  More efficient. E.g.:
       ª    With a 7200 RPM HD
               –  Random IOPS: 100
               –  Sequential read/write: 40 MB/s
               –  Hypothesis: 5 KB record size
       ª    … it is faster to rewrite all the data than to perform random updates when more than 1.25% of the records have changed.
               –  1 GB, 200,000 records
                     »  Sequential writing: 25 s
                     »  Random writing: 33 min!
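
The figures above can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 1000 MB, 1 MB = 1000 KB) and exactly one random I/O operation per updated record; both are simplifying assumptions, not statements from the slides.

```python
RANDOM_IOPS = 100      # random operations per second (7200 RPM disk)
SEQ_MB_PER_S = 40      # sequential throughput in MB/s
RECORD_KB = 5          # hypothesised record size
DATASET_MB = 1000      # 1 GB, decimal units (assumption)

records = DATASET_MB * 1000 // RECORD_KB       # number of records in the data set
full_rewrite_s = DATASET_MB / SEQ_MB_PER_S     # rewrite everything sequentially
random_all_s = records / RANDOM_IOPS           # update every record at random

# Random updates stop paying off once they take longer than a full rewrite.
break_even_records = full_rewrite_s * RANDOM_IOPS
break_even_fraction = break_even_records / records

print(records, full_rewrite_s, random_all_s / 60, break_even_fraction)
```

With these inputs a full sequential rewrite takes 25 s, updating every record at random takes about 33 minutes, and the two approaches break even at 2,500 changed records, that is, 1.25% of the data set.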
Fetcher
                                   	
    Feeds are downloaded and stored in HDFS.

§  MapReduce
   •  Input: [feed_url, null]*
   •  Mapper: identity
   •  Reducer(feed_url, null*)
       ª  Download the feed_url and store it in an HDFS folder
[Diagram: three Reducer Tasks writing to HDFS]
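
A rough single-process sketch of the Fetcher job: an identity map, a shuffle that spreads the feed URLs over the reducers, and reducers that do the downloading. Everything here is illustrative: `download` is a function supplied by the caller, the `storage` dictionary stands in for HDFS, and the `/feeds/...` path layout is invented for the example.

```python
import zlib


def partition(feed_url, num_reducers):
    """Assign a feed URL to a reducer with a stable hash."""
    return zlib.crc32(feed_url.encode("utf-8")) % num_reducers


def run_fetcher(feed_urls, num_reducers, download):
    # Identity map + shuffle: each URL ends up at exactly one reducer.
    assignments = {reducer_id: [] for reducer_id in range(num_reducers)}
    for url in feed_urls:
        assignments[partition(url, num_reducers)].append(url)

    # Each reducer downloads its own URLs and stores the content.
    storage = {}  # stands in for an HDFS folder (assumption)
    for reducer_id, urls in assignments.items():
        for url in urls:
            name = format(zlib.crc32(url.encode("utf-8")), "08x")
            storage[f"/feeds/reducer-{reducer_id}/{name}"] = download(url)
    return storage


def fake_download(url):
    """Placeholder for a real HTTP download."""
    return f"content of {url}"


stored = run_fetcher(
    ["http://a.example/feed", "http://b.example/feed", "http://c.example/feed"],
    num_reducers=3,
    download=fake_download,
)
print(sorted(stored))
```

Doing the download in the reducers means the number of reducer tasks controls how many feeds are fetched in parallel.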
Processor
                          	
    Feeds are parsed, converted into documents and deduplicated
§  MapReduce
  •  Input: [feed_path, null]*
  •  Map(feed_path, null) : [id, document]*
     ª The feed is parsed and converted into documents
  •  Reduce(id, [document]*) : [id, document]
     ª Receives the list of documents sharing an id and keeps the most recent one (deduplication)
     ª A unique, global identifier is required (idProvider + idInternal)
  •  Output: [id, document]*
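
A small in-memory sketch of the Processor job. It assumes every document carries a `timestamp` field that decides which copy is the most recent; the slides do not say how recency is determined, and the field names and the `provider:internal_id` format are hypothetical.

```python
from collections import defaultdict


def parse_feed(provider_id, raw_items):
    """Map step: turn the items of one feed into (id, document) pairs."""
    for item in raw_items:
        # Global identifier = idProvider + idInternal.
        doc_id = f"{provider_id}:{item['internal_id']}"
        yield doc_id, {"id": doc_id, "timestamp": item["timestamp"], "price": item["price"]}


def deduplicate(doc_id, documents):
    """Reduce step: keep only the most recent document for an id."""
    return doc_id, max(documents, key=lambda doc: doc["timestamp"])


def run_processor(feeds):
    grouped = defaultdict(list)
    for provider_id, raw_items in feeds.items():
        for doc_id, document in parse_feed(provider_id, raw_items):
            grouped[doc_id].append(document)
    return dict(deduplicate(doc_id, docs) for doc_id, docs in grouped.items())


documents = run_processor({
    "provider1": [
        {"internal_id": "42", "timestamp": 1, "price": 100},
        {"internal_id": "42", "timestamp": 2, "price": 95},
        {"internal_id": "43", "timestamp": 1, "price": 10},
    ],
    "provider2": [{"internal_id": "42", "timestamp": 1, "price": 7}],
})
print(documents)
```

Because the provider is part of the identifier, the two providers' items with internal id 42 stay separate, while the duplicate inside provider1 collapses to its newest version.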
Processor (II)
                              	

§  Possible problem:
   •  Very large feeds
      ª Does not scale, as a single task will deal with the full feed.
§  Solution
   •  Write a custom InputFormat that divides the feed into smaller pieces.
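
The idea behind such an InputFormat can be sketched without Hadoop. The example assumes a feed with one record per line; real feeds in other formats (XML, for instance) need a different boundary rule. The feed is cut into fixed-size byte ranges, and each reader aligns itself to record boundaries so that every record is read exactly once. The function names are hypothetical.

```python
def compute_splits(total_size, split_size):
    """Cut a file of total_size bytes into [start, end) byte ranges."""
    return [
        (start, min(start + split_size, total_size))
        for start in range(0, total_size, split_size)
    ]


def read_split(data, start, end):
    """Return the records whose first byte lies in [start, end)."""
    pos = start
    if start > 0:
        # Skip the record that began in the previous split, unless a new
        # record starts exactly at `start` (previous byte is a newline).
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        # A record may run past `end`; it still belongs to this split.
        records.append(data[pos:newline])
        pos = newline + 1
    return records


FEED = b"a,1\nbb,2\nccc,3\ndddd,4\n"
splits = compute_splits(len(FEED), split_size=5)
all_records = [record for start, end in splits for record in read_split(FEED, start, end)]
print(splits, all_records)
```

Each split can then be handed to a different map task, so a single huge feed no longer ends up in one task.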
Serialization	

§  Writables	
   •  Native Hadoop Serialization	
   •  Low level API	
   •  Basic types: IntWritable, Text, etc.	
§  Others	
   •  Thrift, Avro, Protostuff	
   •  Backwards compatibility
Indexer
                             	
[Diagram: each Reducer Task builds one index shard (Shard 1, 2, 3); every shard is hot-swapped into the matching shard of the production Solr, which serves the web servers.]
Indexer (II)
                                    	
§  SOLR-1301	
   •    https://issues.apache.org/jira/browse/SOLR-1301	
   •    SolrOutputFormat	
   •    1 index per reducer	
   •    A custom Partitioner can be used to control where to
        place each document	
§  Another option
   •  Writing your own indexing code
         ª  By creating a custom output format
         ª  By indexing at the reducer level. In each reduce call:
             –  Open an index
             –  Write all incoming records
             –  Close the index
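
The open / write / close pattern at the reducer level, together with a partitioner that decides which shard receives each document, might look like the sketch below. `InMemoryIndex` is a stand-in for a real Lucene/Solr index writer, and partitioning by a `country` field is just one possible choice; neither is taken from SOLR-1301.

```python
class InMemoryIndex:
    """Toy replacement for an index writer (assumption, not the Solr API)."""

    def __init__(self):
        self.documents = []
        self.closed = False

    def add(self, document):
        if self.closed:
            raise ValueError("index is already closed")
        self.documents.append(document)

    def close(self):
        self.closed = True


def partition_by_country(document, shard_of_country):
    """Custom partitioner: choose the shard from a document field."""
    return shard_of_country[document["country"]]


def build_shards(documents, shard_of_country):
    # Shuffle: group the documents by the shard chosen by the partitioner.
    buckets = {}
    for document in documents:
        shard_id = partition_by_country(document, shard_of_country)
        buckets.setdefault(shard_id, []).append(document)

    # One reduce call per shard: open an index, write everything, close it.
    shards = {}
    for shard_id, shard_documents in buckets.items():
        index = InMemoryIndex()
        for document in shard_documents:
            index.add(document)
        index.close()
        shards[shard_id] = index
    return shards


shards = build_shards(
    [
        {"id": "1", "country": "es"},
        {"id": "2", "country": "uk"},
        {"id": "3", "country": "es"},
    ],
    shard_of_country={"es": 0, "uk": 1},
)
print({shard_id: [d["id"] for d in index.documents] for shard_id, index in shards.items()})
```

Since each reducer produces a complete, closed index, the result can be copied to the serving machines and swapped in as a unit.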
Search & Partitioning	
§  Different partitioning schemes
   •  Horizontal
      ª Each search involves all shards
   •  Vertical: by ad type, country, etc.
      ª Searches can be restricted to the shard involved

§  Solr for index serving. Possibilities:
      ª Non-federated Solr
          –  Only for vertical partitioning
      ª Distributed Solr
      ª SolrCloud
Reconciliation	

[Diagram: documents "From Fetcher" and the "Last execution file" feed the Reconciliation step, which produces reconciled documents for the next steps.]

§  How to record changes?
    •    Changes in price, features, etc.
    •    MapReduce:
           ª    Input: [id, document]*
                     –  From the last execution
                     –  From the current processing
           ª    Map: identity
           ª    Reduce(id, [document]*) : [id, document]
                     –    Documents are grouped by ID, so new and old documents come together.
                     –    New and old documents are compared.
                     –    The relevant information is stored in the new document (e.g., the old price)
                     –    Only the new document is emitted.
§  This is the closest thing in Hadoop to a DB
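
A minimal sketch of the reconciliation reduce step, assuming that each document is tagged with its origin (`"old"` for the last execution, `"new"` for the current run) and that the price is the only field being tracked. Dropping documents that no longer appear in the current run is also an assumption of this sketch.

```python
from collections import defaultdict


def reconcile(tagged_documents):
    """Reduce step for one id: compare old and new, emit only the new one."""
    old = next((doc for origin, doc in tagged_documents if origin == "old"), None)
    new = next((doc for origin, doc in tagged_documents if origin == "new"), None)
    if new is None:
        return None  # assumption: the document disappeared from the feeds
    result = dict(new)
    if old is not None and old["price"] != new["price"]:
        result["old_price"] = old["price"]  # keep the relevant history
    return result


def run_reconciliation(last_execution, current_run):
    grouped = defaultdict(list)
    for document in last_execution:
        grouped[document["id"]].append(("old", document))
    for document in current_run:
        grouped[document["id"]].append(("new", document))
    output = {}
    for doc_id, tagged in grouped.items():
        reconciled = reconcile(tagged)
        if reconciled is not None:
            output[doc_id] = reconciled
    return output


reconciled = run_reconciliation(
    last_execution=[{"id": "a", "price": 100}, {"id": "b", "price": 20}, {"id": "c", "price": 5}],
    current_run=[{"id": "a", "price": 90}, {"id": "b", "price": 20}, {"id": "d", "price": 1}],
)
print(reconciled)
```

The output of one run becomes the "last execution" input of the next, which is how state is carried between batches without a database.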
Advantages of the architecture	
§  Horizontal Scalability	
   •  If properly programmed	
§  High tolerance to failures and bugs	
   •  Everything is always reprocessed
§  Flexible
   •  It is easy to make big changes
§  High decoupling
   •  Indexes are the only point of interaction between the back-end and the front-end
   •  Web servers can keep running even if the back-end is broken.
Disadvantages
                        	

§  Batch processing
  •  No real-time or near-real-time updates
  •  Update cycles of hours
§  Completely different programming paradigm
  •  Steep learning curve
Improvements
                           	
§  System for images	
§  Fuzzy duplicate detection
§  Plasam:
   •  Mixing this architecture with a by-pass system that provides near-real-time updates to the front-end indexes
      ª  Implementing a by-pass to the Solr instances
      ª  System for ensuring data consistency
          –  Without jumping back in time
   •  It combines the advantages of the proposed architecture with near-real-time updates
   •  Datasalt has a prototype ready
Thanks!	
Ivan de Prado, 	
ivan@datasalt.com	
@ivanprado

Weitere ähnliche Inhalte

Andere mochten auch

Splout SQL - Web latency SQL views for Hadoop
Splout SQL - Web latency SQL views for HadoopSplout SQL - Web latency SQL views for Hadoop
Splout SQL - Web latency SQL views for Hadoopdatasalt
 
Datasalt - BBVA case study - extracting value from credit card transactions
Datasalt - BBVA case study - extracting value from credit card transactionsDatasalt - BBVA case study - extracting value from credit card transactions
Datasalt - BBVA case study - extracting value from credit card transactionsdatasalt
 
Tuple map reduce: beyond classic mapreduce
Tuple map reduce: beyond classic mapreduceTuple map reduce: beyond classic mapreduce
Tuple map reduce: beyond classic mapreducedatasalt
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyonddatasalt
 
Photo Contests 2012
Photo Contests 2012Photo Contests 2012
Photo Contests 2012Mihex
 
The Spirit of Barnabas
The Spirit of Barnabas The Spirit of Barnabas
The Spirit of Barnabas Wy Harris
 
Driving Profits in the Downturn, Using Data to Improve Website Performance an...
Driving Profits in the Downturn, Using Data to Improve Website Performance an...Driving Profits in the Downturn, Using Data to Improve Website Performance an...
Driving Profits in the Downturn, Using Data to Improve Website Performance an...Marisa Gallagher
 
Sandwich Art
Sandwich ArtSandwich Art
Sandwich ArtMihex
 
Day 2 recycle grey water
Day 2 recycle grey waterDay 2 recycle grey water
Day 2 recycle grey watervigyanashram
 
Day 3 recycle grey water
Day 3  recycle grey waterDay 3  recycle grey water
Day 3 recycle grey watervigyanashram
 
The Pursuit of Busyness
The Pursuit of BusynessThe Pursuit of Busyness
The Pursuit of BusynessDevesh Pandey
 
Buñay llangari hector
Buñay llangari hectorBuñay llangari hector
Buñay llangari hectortoli976
 
Lua chon thuy san nhom 3 -11 a10-2010
Lua chon thuy san  nhom 3 -11 a10-2010Lua chon thuy san  nhom 3 -11 a10-2010
Lua chon thuy san nhom 3 -11 a10-2010Thuy AI Tran Thi
 
OPORTUNIDAD!! CASA REMODELADA ESTRENE YA!!
OPORTUNIDAD!!  CASA REMODELADA ESTRENE YA!!OPORTUNIDAD!!  CASA REMODELADA ESTRENE YA!!
OPORTUNIDAD!! CASA REMODELADA ESTRENE YA!!Blanca Flores
 
Switching On the Growth Engine in Your Small Consulting Practice
Switching On the Growth Engine in Your Small Consulting PracticeSwitching On the Growth Engine in Your Small Consulting Practice
Switching On the Growth Engine in Your Small Consulting PracticeGYK Antler
 
Internettrendsv1 150526193103-lva1-app6892
Internettrendsv1 150526193103-lva1-app6892Internettrendsv1 150526193103-lva1-app6892
Internettrendsv1 150526193103-lva1-app6892Benjamin Crucq
 

Andere mochten auch (20)

Splout SQL - Web latency SQL views for Hadoop
Splout SQL - Web latency SQL views for HadoopSplout SQL - Web latency SQL views for Hadoop
Splout SQL - Web latency SQL views for Hadoop
 
Datasalt - BBVA case study - extracting value from credit card transactions
Datasalt - BBVA case study - extracting value from credit card transactionsDatasalt - BBVA case study - extracting value from credit card transactions
Datasalt - BBVA case study - extracting value from credit card transactions
 
Tuple map reduce: beyond classic mapreduce
Tuple map reduce: beyond classic mapreduceTuple map reduce: beyond classic mapreduce
Tuple map reduce: beyond classic mapreduce
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
 
Photo Contests 2012
Photo Contests 2012Photo Contests 2012
Photo Contests 2012
 
The Spirit of Barnabas
The Spirit of Barnabas The Spirit of Barnabas
The Spirit of Barnabas
 
Driving Profits in the Downturn, Using Data to Improve Website Performance an...
Driving Profits in the Downturn, Using Data to Improve Website Performance an...Driving Profits in the Downturn, Using Data to Improve Website Performance an...
Driving Profits in the Downturn, Using Data to Improve Website Performance an...
 
Sandwich Art
Sandwich ArtSandwich Art
Sandwich Art
 
Day 2 recycle grey water
Day 2 recycle grey waterDay 2 recycle grey water
Day 2 recycle grey water
 
Day 3 recycle grey water
Day 3  recycle grey waterDay 3  recycle grey water
Day 3 recycle grey water
 
NajboljaMamaNaSvetu.com
NajboljaMamaNaSvetu.com NajboljaMamaNaSvetu.com
NajboljaMamaNaSvetu.com
 
Rethinking the mobile web
Rethinking the mobile webRethinking the mobile web
Rethinking the mobile web
 
The Pursuit of Busyness
The Pursuit of BusynessThe Pursuit of Busyness
The Pursuit of Busyness
 
Buñay llangari hector
Buñay llangari hectorBuñay llangari hector
Buñay llangari hector
 
Workbook sesion13
Workbook sesion13Workbook sesion13
Workbook sesion13
 
Wedge
WedgeWedge
Wedge
 
Lua chon thuy san nhom 3 -11 a10-2010
Lua chon thuy san  nhom 3 -11 a10-2010Lua chon thuy san  nhom 3 -11 a10-2010
Lua chon thuy san nhom 3 -11 a10-2010
 
OPORTUNIDAD!! CASA REMODELADA ESTRENE YA!!
OPORTUNIDAD!!  CASA REMODELADA ESTRENE YA!!OPORTUNIDAD!!  CASA REMODELADA ESTRENE YA!!
OPORTUNIDAD!! CASA REMODELADA ESTRENE YA!!
 
Switching On the Growth Engine in Your Small Consulting Practice
Switching On the Growth Engine in Your Small Consulting PracticeSwitching On the Growth Engine in Your Small Consulting Practice
Switching On the Growth Engine in Your Small Consulting Practice
 
Internettrendsv1 150526193103-lva1-app6892
Internettrendsv1 150526193103-lva1-app6892Internettrendsv1 150526193103-lva1-app6892
Internettrendsv1 150526193103-lva1-app6892
 

Ähnlich wie Scalable vertical search engine with hadoop

SUPPORTING QUERYING ON MULTI-MILLION EVENTS PER SECOND from Structure:Data 2012
SUPPORTING QUERYING ON MULTI-MILLION EVENTS PER SECOND from Structure:Data 2012SUPPORTING QUERYING ON MULTI-MILLION EVENTS PER SECOND from Structure:Data 2012
SUPPORTING QUERYING ON MULTI-MILLION EVENTS PER SECOND from Structure:Data 2012Gigaom
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
To Host, Or Not To Host?
To Host, Or Not To Host?To Host, Or Not To Host?
To Host, Or Not To Host?Atlassian
 
Ceph LISA'12 Presentation
Ceph LISA'12 PresentationCeph LISA'12 Presentation
Ceph LISA'12 PresentationCeph Community
 
Inside Wordnik's Architecture
Inside Wordnik's ArchitectureInside Wordnik's Architecture
Inside Wordnik's ArchitectureTony Tam
 
Searching conversations with hadoop
Searching conversations with hadoopSearching conversations with hadoop
Searching conversations with hadoopDataWorks Summit
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Datacwensel
 
RSS כיצד להשתמש ב
RSS כיצד להשתמש ב RSS כיצד להשתמש ב
RSS כיצד להשתמש ב reballattoun
 
Distributed Stream Processing on Fluentd / #fluentd
Distributed Stream Processing on Fluentd / #fluentdDistributed Stream Processing on Fluentd / #fluentd
Distributed Stream Processing on Fluentd / #fluentdSATOSHI TAGOMORI
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopGeorge Ang
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013Nathan Bijnens
 
One Fish, Two Fish, Red Fish, Dru-Fish - BADCamp Presentation on Conference ...
One Fish, Two Fish,  Red Fish, Dru-Fish - BADCamp Presentation on Conference ...One Fish, Two Fish,  Red Fish, Dru-Fish - BADCamp Presentation on Conference ...
One Fish, Two Fish, Red Fish, Dru-Fish - BADCamp Presentation on Conference ...Dustin Boeger
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaMark Kerzner
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructureelliando dias
 
Distributed computing the Google way
Distributed computing the Google wayDistributed computing the Google way
Distributed computing the Google wayEduard Hildebrandt
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...Yahoo Developer Network
 

Ähnlich wie Scalable vertical search engine with hadoop (20)

SUPPORTING QUERYING ON MULTI-MILLION EVENTS PER SECOND from Structure:Data 2012
SUPPORTING QUERYING ON MULTI-MILLION EVENTS PER SECOND from Structure:Data 2012SUPPORTING QUERYING ON MULTI-MILLION EVENTS PER SECOND from Structure:Data 2012
SUPPORTING QUERYING ON MULTI-MILLION EVENTS PER SECOND from Structure:Data 2012
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
To Host, Or Not To Host?
To Host, Or Not To Host?To Host, Or Not To Host?
To Host, Or Not To Host?
 
Ceph LISA'12 Presentation
Ceph LISA'12 PresentationCeph LISA'12 Presentation
Ceph LISA'12 Presentation
 
Hadoop
HadoopHadoop
Hadoop
 
CloudOpen - 08/29/2012
CloudOpen - 08/29/2012CloudOpen - 08/29/2012
CloudOpen - 08/29/2012
 
Inside Wordnik's Architecture
Inside Wordnik's ArchitectureInside Wordnik's Architecture
Inside Wordnik's Architecture
 
Searching conversations with hadoop
Searching conversations with hadoopSearching conversations with hadoop
Searching conversations with hadoop
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
RSS כיצד להשתמש ב
RSS כיצד להשתמש ב RSS כיצד להשתמש ב
RSS כיצד להשתמש ב
 
Distributed Stream Processing on Fluentd / #fluentd
Distributed Stream Processing on Fluentd / #fluentdDistributed Stream Processing on Fluentd / #fluentd
Distributed Stream Processing on Fluentd / #fluentd
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using Hadoop
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013
 
One Fish, Two Fish, Red Fish, Dru-Fish - BADCamp Presentation on Conference ...
One Fish, Two Fish,  Red Fish, Dru-Fish - BADCamp Presentation on Conference ...One Fish, Two Fish,  Red Fish, Dru-Fish - BADCamp Presentation on Conference ...
One Fish, Two Fish, Red Fish, Dru-Fish - BADCamp Presentation on Conference ...
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructure
 
Distributed computing the Google way
Distributed computing the Google wayDistributed computing the Google way
Distributed computing the Google way
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 

Kürzlich hochgeladen

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

Kürzlich hochgeladen (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...

Scalable vertical search engine with Hadoop

  • 1. Hadoop use case: A scalable vertical search engine Iván de Prado Alonso, Datasalt Co-founder Twitter: @ivanprado
  • 2. Content
      • The problem
      • The obvious solution
      • When the obvious solution fails…
      • … Hadoop comes to the rescue
      • Advantages & disadvantages
      • Improvements
  • 3. What is a vertical search engine? [Diagram: Provider 1 and Provider 2 each send a feed to the vertical search engine, which receives the searches.]
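  • 5. The “obvious” architecture: the first thing that comes to your mind. [Diagram: a download-and-process step reads each feed, asks the database “Does it exist? Has it changed?”, and inserts or updates the record in both the database and the Lucene/Solr index, which serves the search page.]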
  • 6. How it works
      • Feed download
      • For every record in the feed
          • Check for existence in the DB
          • If it exists and has changed, update
              • The DB
              • The index
          • If it doesn’t exist, insert into
              • The DB
              • The index
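The per-record loop above can be sketched as follows. This is a minimal illustration, not the original implementation: plain dictionaries stand in for the database and the Lucene/Solr index, and the `id` and `price` fields are hypothetical.

```python
def ingest(feed, db, index):
    """Apply one feed to the database and the index, record by record."""
    for record in feed:
        key = record["id"]                 # hypothetical unique key
        existing = db.get(key)             # one lookup per record
        if existing is None or existing != record:
            # New record, or existing record that has changed:
            # write it to both stores.
            db[key] = record
            index[key] = record
        # Unchanged records are left alone.


db, index = {}, {}  # stand-ins for the SQL database and the search index
ingest([{"id": "p1-1", "price": 100}], db, index)
ingest([{"id": "p1-1", "price": 90}, {"id": "p1-2", "price": 50}], db, index)
```

Every record costs at least one lookup against the database, which is the part that becomes expensive as the number of feeds grows.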
  • 7. How it works (II)
      • The database is used for
          • Checking for record existence (avoiding duplicates)
          • Managing the data with the convenience of SQL
      • Lucene/Solr is used for
          • Quick searches
          • Searching by structured fields
          • Free-text searches
          • Faceting
  • 8. But if things go well… [Diagram: a large number of feeds.]
  • 10. “Swiss army knife of the 21st century” Media Guardian Innovation Awards http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
  • 11. Hadoop “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model” From Hadoop homepage
  • 12. File System
      • Distributed File System (HDFS)
          • Cluster of nodes exposing their storage capacity
          • Big blocks: 64 MB
          • Fault tolerant (replication)
          • Storage of big files
  • 13. MapReduce
      • Two functions (Map and Reduce)
          • Map(k, v) : [z, w]*
          • Reduce(z, w*) : [u, v]*
      • Example: word count
          • Map([document, null]) -> [word, 1]*
          • Reduce(word, 1*) -> [word, total]
      • MapReduce & SQL
          • SELECT word, count(*) GROUP BY word
      • Distributed execution on a cluster
      • Horizontal scalability
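The word-count example can be simulated in a single process to show how the two functions fit together. This is only a sketch of the programming model: `run_mapreduce` is a hypothetical helper that groups map output by key in memory, whereas Hadoop performs that grouping across the cluster.

```python
from collections import defaultdict


def map_fn(document, _value):
    """Map([document, null]) -> [word, 1]*"""
    for word in document.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce(word, 1*) -> [word, total]"""
    yield word, sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Run mapper, group its output by key, then run reducer per key."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


word_counts = dict(
    run_mapreduce([("to be or not to be", None)], map_fn, reduce_fn)
)
```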
  • 14. OK, that’s cool, but… how does it solve my problem?
  • 15. Because…
      • Hadoop is not a database
      • Hadoop “apparently” only processes data
      • Hadoop does not allow “lookups”
      • Hadoop is a paradigm shift that is difficult to assimilate
  • 17. Philosophy
      • Always reprocess everything. EVERYTHING!
      • Why?
          • More bug tolerant
          • More flexible
          • More efficient. E.g.:
              • With a 7200 RPM hard disk
                  • Random IOPS: 100
                  • Sequential read/write: 40 MB/s
                  • Hypothesis: 5 KB record size
              • … it is faster to rewrite all the data than to perform random updates when more than 1.25% of the records have changed.
                  • 1 GB, 200,000 records
                      • Sequential writing: 25 s
                      • Random writing: 33 min!
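The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes, 5 KB = 5,000 bytes) and one random I/O operation per updated record; both are assumptions made here so that the numbers match the slide.

```python
RANDOM_IOPS = 100                  # random operations per second
SEQUENTIAL_BYTES_PER_S = 40e6      # 40 MB/s sequential throughput
RECORD_BYTES = 5e3                 # 5 KB per record (assumed decimal)
TOTAL_BYTES = 1e9                  # 1 GB data set (assumed decimal)

records = TOTAL_BYTES / RECORD_BYTES                    # number of records
sequential_s = TOTAL_BYTES / SEQUENTIAL_BYTES_PER_S     # full rewrite time
random_all_s = records / RANDOM_IOPS                    # update every record
random_all_min = random_all_s / 60

# Random updates take longer than a full rewrite once the number of
# updated records exceeds sequential_s * RANDOM_IOPS.
break_even_records = sequential_s * RANDOM_IOPS
break_even_fraction = break_even_records / records
```

Under these assumptions a full sequential rewrite takes 25 s, updating every record at random takes about 33 minutes, and the break-even point is 2,500 records, that is 1.25% of the data set.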
  • 18. Fetcher: feeds are downloaded and stored in HDFS. [Diagram: several reducer tasks writing to HDFS.]
      • MapReduce
          • Input: [feed_url, null]*
          • Mapper: identity
          • Reducer(feed_url, null*)
              • Download the feed_url and store it in an HDFS folder
  • 19. Processor: feeds are parsed, converted into documents and deduplicated.
      • MapReduce
          • Input: [feed_path, null]*
          • Map(feed_path, null) : [id, document]*
              • The feed is parsed and converted into documents
          • Reducer(id, [document]*) : [id, document]
              • Receives a list of documents and keeps the most recent one (deduplication)
              • A unique, global identifier is required (idProvider + idInternal)
          • Output: [id, document]*
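The deduplication step can be sketched as a map that emits a global identifier and a reduce that keeps the newest document per identifier. The field names (`internal_id`, `timestamp`, `price`) and the `provider:internal` key format are hypothetical; the slide only states that the identifier combines idProvider and idInternal and that the most recent document wins.

```python
from collections import defaultdict


def map_feed(provider_id, feed_records):
    """Emit (global_id, document) for every record of a parsed feed."""
    for record in feed_records:
        global_id = f"{provider_id}:{record['internal_id']}"
        yield global_id, record


def reduce_latest(global_id, documents):
    """Keep only the most recent document for this identifier."""
    return global_id, max(documents, key=lambda d: d["timestamp"])


feeds = {
    "provider1": [
        {"internal_id": "42", "timestamp": 1, "price": 100},
        {"internal_id": "42", "timestamp": 2, "price": 95},
    ],
    "provider2": [{"internal_id": "42", "timestamp": 1, "price": 70}],
}

# Group map output by key, as the framework would between map and reduce.
grouped = defaultdict(list)
for provider_id, records in feeds.items():
    for key, document in map_feed(provider_id, records):
        grouped[key].append(document)

deduplicated = dict(reduce_latest(k, docs) for k, docs in grouped.items())
```

Including the provider in the key keeps two providers that reuse the same internal ID from colliding.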
  • 20. Processor (II)
      • Possible problem:
          • Very large feeds
              • They do not scale, as one task will deal with the full feed.
      • Solution
          • Write a custom InputFormat that divides the feed into smaller pieces.
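The idea behind such an input format can be illustrated outside Hadoop. The sketch below assumes a line-oriented feed, which the slides do not specify, and assigns each line to the byte range in which its first byte falls, so that several tasks can read one feed without losing or duplicating records. `read_split` is a hypothetical helper, not a Hadoop API.

```python
def read_split(data: bytes, start: int, end: int):
    """Yield the lines whose first byte lies in the byte range [start, end)."""
    if start == 0:
        pos = 0
    else:
        # Find the first line that begins at or after `start`.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return
        pos = newline + 1
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:            # last line without trailing newline
            yield data[pos:]
            return
        yield data[pos:newline]      # may read past `end` to finish the line
        pos = newline + 1


data = b"a,1\nbb,2\nccc,3\ndddd,4\n"
split_size = 8
splits = [(s, min(s + split_size, len(data)))
          for s in range(0, len(data), split_size)]

# Each split could be processed by a different task.
all_lines = [line for s, e in splits for line in read_split(data, s, e)]
```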
  • 21. Serialization
      • Writables
          • Native Hadoop serialization
          • Low-level API
          • Basic types: IntWritable, Text, etc.
      • Others
          • Thrift, Avro, Protostuff
          • Backwards compatibility
  • 22. Indexer [Diagram: three reducer tasks each build one index shard (shards 1 to 3); each shard is hot-swapped into the production Solr behind the web servers.]
  • 23. Indexer (II)
      • SOLR-1301
          • https://issues.apache.org/jira/browse/SOLR-1301
          • SolrOutputFormat
          • 1 index per reducer
          • A custom Partitioner can be used to control where to place each document
      • Another option
          • Writing your own indexing code
              • By creating a custom output format
              • By indexing at the reducer level. In each reduce call:
                  • Open an index
                  • Write all incoming records
                  • Close the index
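The role of the partitioner can be sketched as a function from document ID to reducer number, with each reducer then building one shard. This is an illustration in plain Python, not the SOLR-1301 code: the shard count, the document IDs and the use of a CRC32 hash are assumptions, and dictionaries stand in for the indexes.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # assumed: one index shard per reducer


def partition(doc_id, num_reducers=NUM_REDUCERS):
    """Choose the reducer (and therefore the shard) for a document.

    A stable hash is used because Python's built-in hash() of a string
    changes between processes.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_reducers


def build_shards(documents):
    """Group documents by partition; each group becomes one index shard."""
    shards = defaultdict(dict)
    for doc_id, document in documents:
        shards[partition(doc_id)][doc_id] = document
    return shards


documents = [(f"provider1:{i}", {"title": f"ad {i}"}) for i in range(10)]
shards = build_shards(documents)
```

Replacing the hash with a lookup on a document field such as country would give the vertical partitioning discussed on the next slide.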
  • 24. Search & Partitioning
      • Different partitioning schemas
          • Horizontal
              • Each search involves all shards
          • Vertical: by ad type, country, etc.
              • Searches can be restricted to the involved shard
      • Solr for index serving. Possibilities:
          • Non-federated Solr
              • Only for vertical partitioning
          • Distributed Solr
          • SolrCloud
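The difference between the two schemas shows up at query time. In the sketch below, lists of dictionaries stand in for Solr shards and the sample ads are invented for illustration: a vertically partitioned search touches one shard, while a search that cannot be restricted must query every shard and merge the results.

```python
def search_shard(shard, term):
    """Naive substring match standing in for a Solr query on one shard."""
    return [doc for doc in shard if term in doc["title"]]


def search_vertical(shards, country, term):
    """Vertical partitioning: only the shard for `country` is queried."""
    return search_shard(shards[country], term)


def search_all_shards(shards, term):
    """Horizontal partitioning: query every shard, then merge by price."""
    hits = []
    for shard in shards.values():
        hits.extend(search_shard(shard, term))
    return sorted(hits, key=lambda doc: doc["price"])


# Hypothetical shards, partitioned by country.
shards = {
    "es": [
        {"id": "es-1", "title": "flat in Madrid", "price": 900},
        {"id": "es-2", "title": "house in Sevilla", "price": 1200},
    ],
    "uk": [{"id": "uk-1", "title": "flat in Leeds", "price": 700}],
}

spanish_flats = search_vertical(shards, "es", "flat")
all_flats = search_all_shards(shards, "flat")
```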
  • 25. Reconciliation [Diagram labels: From Fetcher, Reconciliation, Next steps, Reconciled documents, Last execution file.]
      • How to track changes?
          • Changes in price, features, etc.
          • MapReduce:
              • Input: [id, document]*
                  • From the last execution
                  • From the current processing
              • Map: identity
              • Reduce(id, [document]*) : [id, document]
                  • Documents are grouped by ID. New and old documents come together.
                  • New and old documents are compared.
                  • The relevant information is stored in the new document (e.g., the old price)
                  • Only the new document is emitted.
      • This is the closest thing in Hadoop to a DB
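The reduce step of the reconciliation can be sketched as follows. Tagging each input as "old" or "new", the `price` and `old_price` fields, and dropping documents that are missing from the current run are all assumptions made for this illustration; the slide only says that old and new documents meet in the reducer and that only the new one is emitted.

```python
from collections import defaultdict


def reconcile(tagged_documents):
    """Compare the old and new versions of one document ID."""
    old = next((d for tag, d in tagged_documents if tag == "old"), None)
    new = next((d for tag, d in tagged_documents if tag == "new"), None)
    if new is None:
        return None  # assumption: not in the current run, so not emitted
    result = dict(new)
    if old is not None and old["price"] != new["price"]:
        result["old_price"] = old["price"]  # carry over relevant history
    return result


previous_run = [("a", {"price": 100}), ("b", {"price": 50})]
current_run = [("a", {"price": 90}), ("c", {"price": 70})]

# Identity map plus grouping by ID, as the framework would do.
grouped = defaultdict(list)
for doc_id, document in previous_run:
    grouped[doc_id].append(("old", document))
for doc_id, document in current_run:
    grouped[doc_id].append(("new", document))

reconciled = {}
for doc_id, tagged in grouped.items():
    result = reconcile(tagged)
    if result is not None:
        reconciled[doc_id] = result
```

The output of one run becomes the "last execution" input of the next, which is how state is carried forward without lookups.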
  • 26. Advantages of the architecture
      • Horizontal scalability
          • If properly programmed
      • High tolerance to failures and bugs
          • Everything is always reprocessed
      • Flexible
          • It is easy to make big changes
      • High decoupling
          • Indexes are the only point of interaction between the back-end and the front-end
          • Web servers can keep running even if the back-end is broken.
  • 27. Disadvantages
      • Batch processing
          • No real-time or “near” real-time
          • Update cycles of hours
      • Completely different programming paradigm
          • Steep learning curve
  • 28. Improvements
      • System for images
      • Fuzzy duplicate detection
      • Plasam:
          • Mixing this architecture with a by-pass system that provides near-real-time updates to the front-end indexes
              • Implementing a by-pass to the Solr instances
              • System for ensuring data consistency
                  • Without jumping back in time
          • That combines the advantages of the proposed architecture with near-real-time updates
          • Datasalt has a prototype ready
  • 29. Thanks! Ivan de Prado, ivan@datasalt.com @ivanprado