SlideShare ist ein Scribd-Unternehmen logo
1 von 16
Downloaden Sie, um offline zu lesen
How to build a small
distributed search
engine using open
source software
Building a distributed search engine
Search engine subsytems:
●

Page database

●

List of the pages to retrieve

●

Pages retrieval and save

●

Page content parsing

●

Full-text indexing of the contents

●

Graph database of the links for ranking
Building a distributed search engine

Open Source Software
•

Apache Hadoop
•
•
•

•

MapReduce
HDFS
HBase

Apache Lucene
Building a distributed search engine

HDFS
Hadoop Distributed File System
Building a distributed search engine

HDFS – Assumptions and goals
●

Hardware failure

●

Big data

●

Write once / read many

●

Moving computation, not data
Building a distributed search engine
Building a distributed search engine
Building a distributed search engine

Lucene
Building a distributed search engine
Lucene - Inverse Indexing
Term

Doc Id

Weight

JUG
301
198
120

0.97
0.65
0.43

301
278
451
103
763

0.94
0.15
0.87
0.45
0.77

Lugano
Building a distributed search engine
Lucene - Indexing main classes


IndexWriter



Directory



Analyzer



Document



Field
Building a distributed search engine
Lucene - Searching main classes


IndexSearcher



Collector



Query



TopDocs



ScoreDoc
Building a distributed search engine

Lucene - Analyzers






StopWords

”the book is on the table” → [book, table]
Stemming

[paint, paints, painted, …] → paint
Synonims

[cat, feline] → cat
Building a distributed search engine

Lucene - Search options


Fields





Wildcards





Title: JUG
body: ”JUG Lugano”
J?G → [JUG, JAG, ...]
J*G →[JUG, JEEG, JUNG, …]

Fuzzy (basata su vocabolario)


JUG~[n] → [MUG, JAG, …]
Building a distributed search engine

Lucene - Search options


Range





Boost





JUG^5 Lugano
”JUG Lugano”^5

Proximity




Year: [2002 TO 2012]
Name: {Alberto TO Andrea}

”JUG Lugano”~5

Boolean and existance


AND, OR, NOT, (), +, -
Building a distributed search engine

HDFS - Lucene Integration


File copy from/to HDFS



Patch IndexWriter/Director
IndexWriter/Directory



Rewrite of IndexWriter on RAM



Lucene 4
Building a distributed search engine

And now...
Hands on!

Weitere ähnliche Inhalte

Was ist angesagt?

MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...HBaseCon
 
Getting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsightGetting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsightNilesh Gule
 
Hadoop online trainings
Hadoop online trainingsHadoop online trainings
Hadoop online trainingsGeek Trainings
 
Elasticsearch tuning
Elasticsearch tuningElasticsearch tuning
Elasticsearch tuningNIKHIL DUBEY
 
Moving from CKAN to Dataverse setting the plan for 2020
Moving from CKAN to Dataverse setting the plan for 2020Moving from CKAN to Dataverse setting the plan for 2020
Moving from CKAN to Dataverse setting the plan for 2020ILRI
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudCAMMS
 
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Rukmani Gopalan
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesHBaseCon
 
When to Use MongoDB
When to Use MongoDBWhen to Use MongoDB
When to Use MongoDBMongoDB
 
MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB
 
Effective Searching by Dominik Kornas
Effective Searching by Dominik KornasEffective Searching by Dominik Kornas
Effective Searching by Dominik KornasAEM HUB
 
Intro to Apache Hadoop
Intro to Apache HadoopIntro to Apache Hadoop
Intro to Apache HadoopSufi Nawaz
 

Was ist angesagt? (20)

Hadoop
HadoopHadoop
Hadoop
 
MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business Insights
 
10 mongo db
10 mongo db10 mongo db
10 mongo db
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
 
Getting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsightGetting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsight
 
Hadoop online trainings
Hadoop online trainingsHadoop online trainings
Hadoop online trainings
 
Google hacks himesh
Google hacks himeshGoogle hacks himesh
Google hacks himesh
 
Elasticsearch tuning
Elasticsearch tuningElasticsearch tuning
Elasticsearch tuning
 
Moving from CKAN to Dataverse setting the plan for 2020
Moving from CKAN to Dataverse setting the plan for 2020Moving from CKAN to Dataverse setting the plan for 2020
Moving from CKAN to Dataverse setting the plan for 2020
 
Hadoop-BigData
Hadoop-BigDataHadoop-BigData
Hadoop-BigData
 
Big Data and Hadoop Training in Chandigarh
Big Data and Hadoop Training in ChandigarhBig Data and Hadoop Training in Chandigarh
Big Data and Hadoop Training in Chandigarh
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in Cloud
 
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and Visualization
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application Archetypes
 
When to Use MongoDB
When to Use MongoDBWhen to Use MongoDB
When to Use MongoDB
 
MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business Insights
 
Jinchao demo
Jinchao demoJinchao demo
Jinchao demo
 
Effective Searching by Dominik Kornas
Effective Searching by Dominik KornasEffective Searching by Dominik Kornas
Effective Searching by Dominik Kornas
 
Intro to Apache Hadoop
Intro to Apache HadoopIntro to Apache Hadoop
Intro to Apache Hadoop
 

Andere mochten auch

探索 Everything 背后的技术
探索 Everything 背后的技术探索 Everything 背后的技术
探索 Everything 背后的技术yiwenshengmei
 
Xapian vs sphinx
Xapian vs sphinxXapian vs sphinx
Xapian vs sphinxpanjunyong
 
Oracle Text 12c New Features
Oracle Text 12c New FeaturesOracle Text 12c New Features
Oracle Text 12c New FeaturesUlrike Schwinn
 
Comparing open source search engines
Comparing open source search enginesComparing open source search engines
Comparing open source search enginesRichard Boulton
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXAndrea Iacono
 
Oracle Application Express
Oracle Application ExpressOracle Application Express
Oracle Application ExpressHBoone
 
Oracle Text in APEX
Oracle Text in APEXOracle Text in APEX
Oracle Text in APEXScott Wesley
 
Oracle application express ppt
Oracle application express pptOracle application express ppt
Oracle application express pptAbhinaw Kumar
 
Production Support
Production SupportProduction Support
Production Supportr_shanki
 
SEO Proposal Sample
SEO Proposal SampleSEO Proposal Sample
SEO Proposal SampleIan Lurie
 
Scalable vertical search engine with hadoop
Scalable vertical search engine with hadoopScalable vertical search engine with hadoop
Scalable vertical search engine with hadoopdatasalt
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheLeslie Samuel
 

Andere mochten auch (15)

探索 Everything 背后的技术
探索 Everything 背后的技术探索 Everything 背后的技术
探索 Everything 背后的技术
 
Xapian vs sphinx
Xapian vs sphinxXapian vs sphinx
Xapian vs sphinx
 
Oracle Text 12c New Features
Oracle Text 12c New FeaturesOracle Text 12c New Features
Oracle Text 12c New Features
 
Comparing open source search engines
Comparing open source search enginesComparing open source search engines
Comparing open source search engines
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphX
 
document1-2 FINAL-FINALLL
document1-2 FINAL-FINALLLdocument1-2 FINAL-FINALLL
document1-2 FINAL-FINALLL
 
Oracle Application Express
Oracle Application ExpressOracle Application Express
Oracle Application Express
 
Oracle Text in APEX
Oracle Text in APEXOracle Text in APEX
Oracle Text in APEX
 
Oracle application express ppt
Oracle application express pptOracle application express ppt
Oracle application express ppt
 
Production Support
Production SupportProduction Support
Production Support
 
Amazon search test case document
Amazon search test case documentAmazon search test case document
Amazon search test case document
 
SEO Proposal Sample
SEO Proposal SampleSEO Proposal Sample
SEO Proposal Sample
 
IC Engine PPt
IC Engine PPtIC Engine PPt
IC Engine PPt
 
Scalable vertical search engine with hadoop
Scalable vertical search engine with hadoopScalable vertical search engine with hadoop
Scalable vertical search engine with hadoop
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your Niche
 

Ähnlich wie How to build_a_search_engine

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchSteve Watt
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
Indexing with solr search server and hadoop framework
Indexing with solr search server and hadoop frameworkIndexing with solr search server and hadoop framework
Indexing with solr search server and hadoop frameworkkeval dalasaniya
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneSwapnil & Patil
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete informationbhargavi804095
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Cortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeCortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeMSAdvAnalytics
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Imviplav
 
Basic System Design Geliyoo Search Engine
Basic System Design Geliyoo Search EngineBasic System Design Geliyoo Search Engine
Basic System Design Geliyoo Search EngineXtremcoin and Geliyoo
 

Ähnlich wie How to build_a_search_engine (20)

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
Big data in Azure
Big data in AzureBig data in Azure
Big data in Azure
 
Apache hive1
Apache hive1Apache hive1
Apache hive1
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
 
Indexing with solr search server and hadoop framework
Indexing with solr search server and hadoop frameworkIndexing with solr search server and hadoop framework
Indexing with solr search server and hadoop framework
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
Apache drill
Apache drillApache drill
Apache drill
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Cortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeCortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data Lake
 
BIGDATA ppts
BIGDATA pptsBIGDATA ppts
BIGDATA ppts
 
Hadoop
HadoopHadoop
Hadoop
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
 
Basic System Design Geliyoo Search Engine
Basic System Design Geliyoo Search EngineBasic System Design Geliyoo Search Engine
Basic System Design Geliyoo Search Engine
 

Kürzlich hochgeladen

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 

Kürzlich hochgeladen (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 

How to build_a_search_engine

  • 1. How to build a small distributed search engine using open source software
  • 2. Building a distributed search engine Search engine subsytems: ● Page database ● List of the pages to retrieve ● Pages retrieval and save ● Page content parsing ● Full-text indexing of the contents ● Graph database of the links for ranking
  • 3. Building a distributed search engine Open Source Software • Apache Hadoop • • • • MapReduce HDFS HBase Apache Lucene
  • 4. Building a distributed search engine HDFS Hadoop Distributed File System
  • 5. Building a distributed search engine HDFS – Assumptions and goals ● Hardware failure ● Big data ● Write once / read many ● Moving computation, not data
  • 6. Building a distributed search engine
  • 7. Building a distributed search engine
  • 8. Building a distributed search engine Lucene
  • 9. Building a distributed search engine Lucene - Inverse Indexing Term Doc Id Weight JUG 301 198 120 0.97 0.65 0.43 301 278 451 103 763 0.94 0.15 0.87 0.45 0.77 Lugano
  • 10. Building a distributed search engine Lucene - Indexing main classes  IndexWriter  Directory  Analyzer  Document  Field
  • 11. Building a distributed search engine Lucene - Searching main classes  IndexSearcher  Collector  Query  TopDocs  ScoreDoc
  • 12. Building a distributed search engine Lucene - Analyzers    StopWords  ”the book is on the table” → [book, table] Stemming  [paint, paints, painted, …] → paint Synonims  [cat, feline] → cat
  • 13. Building a distributed search engine Lucene - Search options  Fields    Wildcards    Title: JUG body: ”JUG Lugano” J?G → [JUG, JAG, ...] J*G →[JUG, JEEG, JUNG, …] Fuzzy (basata su vocabolario)  JUG~[n] → [MUG, JAG, …]
  • 14. Building a distributed search engine Lucene - Search options  Range    Boost    JUG^5 Lugano ”JUG Lugano”^5 Proximity   Year: [2002 TO 2012] Name: {Alberto TO Andrea} ”JUG Lugano”~5 Boolean and existance  AND, OR, NOT, (), +, -
  • 15. Building a distributed search engine HDFS - Lucene Integration  File copy from/to HDFS  Patch IndexWriter/Director IndexWriter/Directory  Rewrite of IndexWriter on RAM  Lucene 4
  • 16. Building a distributed search engine And now... Hands on!