SlideShare ist ein Scribd-Unternehmen logo
1 von 23
Building a scalable distributed  WWW search engine  … NOT in Perl! Presented by Alex Chudnovsky (http://www.majectic12.co.uk)  at Birmingham Perl Mongers User Group  (http://birmingham.pm.org) V1.0 27/07/05
Contents ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
History (of my work in area of information retrieval) ,[object Object],[object Object],[object Object]
Goals ,[object Object],[object Object],[object Object]
Architecture ,[object Object],[object Object],[object Object],[object Object]
Data collection (crawling) Base Issues URLs to crawl and receives compressed pages Distributed c rawlers – receive lists of URLs to crawl, crawl them and send back compressed data. In the future will do distributed indexing Note: this stage is optional if you already have data to index, ie list of products with their descriptions
Crawler screenshot 1
Crawler screenshot 2
Crawler screenshot 3
Crawler screenshot 4
Crawler screenshot 5
Current Stats Source:  http://www.majestic12.co.uk/projects/dsearch/stats.php  as of 27/07/05
Indexing Indexing is a process of turning words into numbers and creating inverted index. Data barrel Doc #0: Birmingham Perl Mongers Doc #1: Birmingham City Doc #2: Perl City Lexicon (maps words to their numeric WordIDs) Birmingham – 0 Perl  – 1 Mongers  – 2 City  – 3  Inverted Index (Each of the WordID has list of  (ideally sorted) DocIDs) 0  -> 0, 1 1  -> 0, 2 2  -> 0, 3  -> 1, 2 Note: if you use database then it make sense to have clustered index on WordID
Merging Individual indexed barrels Single searchable index Note: this stage is not necessary if just one barrel is used as there will be no need to remap all Ids from local to their global equivalents.
Searching Searching is a process of finding documents that contain words from search query Doc #0: Birmingham Perl Mongers Doc #1: Birmingham City Doc #2: Perl City Lexicon (maps words to their numeric WordIDs) Birmingham – 0 Perl  – 1 Mongers  – 2 City  – 3  Inverted Index (lists DocIDs for each of the WordID) 0  -> 0, 1 1  -> 0, 2 2  -> 0, 3  -> 1, 2 Note: if you use database then it make sense to cluster on WordID Search query:  “Birmingham Perl” WordIDs: 0, 1 Intersection of DocIDs present in both lists (implementation of boolean AND logic): Not matched! 2 n/a Not matched! n/a 1 Matched! 0 0 Result 1 (Perl) 0 (Brum)
Search engine screenshot 1
Search engine screenshot 2
Implementation ,[object Object],[object Object],[object Object]
Why not Perl? (using C # instead) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Conclusions ,[object Object],[object Object],[object Object],[object Object],[object Object]
Credits ,[object Object],[object Object],* Volunteers running crawler and who crawled at least 1 mln URLs as of 27/07/05
Recommended reading ,[object Object],[object Object]
Join! Join the project  (unmetered broadband required!):  majestic12.co.uk Your name could be here!

Weitere ähnliche Inhalte

Was ist angesagt?

EuroPython 2020 - Speak python with devices
EuroPython 2020 - Speak python with devicesEuroPython 2020 - Speak python with devices
EuroPython 2020 - Speak python with devicesHua Chu
 
Improving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixImproving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixStitch Fix Algorithms
 
Documenting an API written in Django Rest Framework
Documenting an API written in Django Rest FrameworkDocumenting an API written in Django Rest Framework
Documenting an API written in Django Rest Frameworksmirolo
 
Designing a generic Python Search Engine API - BarCampLondon 8
Designing a generic Python Search Engine API - BarCampLondon 8Designing a generic Python Search Engine API - BarCampLondon 8
Designing a generic Python Search Engine API - BarCampLondon 8Richard Boulton
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with RBarbara Fusinska
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16MLconf
 
Data Pipelining Across AWS and GCP
Data Pipelining Across AWS and GCPData Pipelining Across AWS and GCP
Data Pipelining Across AWS and GCPJiangjun Huang
 
Week 2-after
Week 2-afterWeek 2-after
Week 2-afterjnand
 
Get Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor DataGet Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor DataCrate.io
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with RBarbara Fusinska
 
An Introduction to MongoDB Compass
An Introduction to MongoDB CompassAn Introduction to MongoDB Compass
An Introduction to MongoDB CompassMongoDB
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with RBarbara Fusinska
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataVivian S. Zhang
 
Congressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jCongressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jWilliam Lyon
 
Intro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariIntro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariKarissa Rae McKelvey
 
Final Presentation IRT - Jingxuan Wei V1.2
Final Presentation  IRT - Jingxuan Wei V1.2Final Presentation  IRT - Jingxuan Wei V1.2
Final Presentation IRT - Jingxuan Wei V1.2JINGXUAN WEI
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with RBarbara Fusinska
 

Was ist angesagt? (20)

EuroPython 2020 - Speak python with devices
EuroPython 2020 - Speak python with devicesEuroPython 2020 - Speak python with devices
EuroPython 2020 - Speak python with devices
 
SFrame
SFrameSFrame
SFrame
 
Improving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixImproving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch Fix
 
Documenting an API written in Django Rest Framework
Documenting an API written in Django Rest FrameworkDocumenting an API written in Django Rest Framework
Documenting an API written in Django Rest Framework
 
Designing a generic Python Search Engine API - BarCampLondon 8
Designing a generic Python Search Engine API - BarCampLondon 8Designing a generic Python Search Engine API - BarCampLondon 8
Designing a generic Python Search Engine API - BarCampLondon 8
 
Optimizing Spark
Optimizing SparkOptimizing Spark
Optimizing Spark
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
 
Data Pipelining Across AWS and GCP
Data Pipelining Across AWS and GCPData Pipelining Across AWS and GCP
Data Pipelining Across AWS and GCP
 
Week 2-after
Week 2-afterWeek 2-after
Week 2-after
 
Get Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor DataGet Started with CrateDB: Sensor Data
Get Started with CrateDB: Sensor Data
 
Sphinx
SphinxSphinx
Sphinx
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
 
An Introduction to MongoDB Compass
An Introduction to MongoDB CompassAn Introduction to MongoDB Compass
An Introduction to MongoDB Compass
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big Data
 
Congressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jCongressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4j
 
Intro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariIntro to Python Data Analysis in Wakari
Intro to Python Data Analysis in Wakari
 
Final Presentation IRT - Jingxuan Wei V1.2
Final Presentation  IRT - Jingxuan Wei V1.2Final Presentation  IRT - Jingxuan Wei V1.2
Final Presentation IRT - Jingxuan Wei V1.2
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
 

Andere mochten auch

C1 Introducere Sistem1
C1 Introducere Sistem1C1 Introducere Sistem1
C1 Introducere Sistem1antropologie
 
Specificatii Eseu Fotografic Observatie Antropologie 2009
Specificatii Eseu Fotografic Observatie Antropologie 2009Specificatii Eseu Fotografic Observatie Antropologie 2009
Specificatii Eseu Fotografic Observatie Antropologie 2009antropologie
 
C3 EvoluţIa Sistemului1
C3 EvoluţIa Sistemului1C3 EvoluţIa Sistemului1
C3 EvoluţIa Sistemului1antropologie
 
Nätverket 24 timmarswebben
Nätverket 24 timmarswebbenNätverket 24 timmarswebben
Nätverket 24 timmarswebbenBjörn Hagström
 
C3 4 EvoluţIa Sistemului1
C3 4 EvoluţIa Sistemului1C3 4 EvoluţIa Sistemului1
C3 4 EvoluţIa Sistemului1antropologie
 

Andere mochten auch (9)

C1 Introducere Sistem1
C1 Introducere Sistem1C1 Introducere Sistem1
C1 Introducere Sistem1
 
Kick off presentation
Kick off presentationKick off presentation
Kick off presentation
 
Specificatii Eseu Fotografic Observatie Antropologie 2009
Specificatii Eseu Fotografic Observatie Antropologie 2009Specificatii Eseu Fotografic Observatie Antropologie 2009
Specificatii Eseu Fotografic Observatie Antropologie 2009
 
Page Rank
Page RankPage Rank
Page Rank
 
C3 EvoluţIa Sistemului1
C3 EvoluţIa Sistemului1C3 EvoluţIa Sistemului1
C3 EvoluţIa Sistemului1
 
Nätverket 24 timmarswebben
Nätverket 24 timmarswebbenNätverket 24 timmarswebben
Nätverket 24 timmarswebben
 
Kathleen & nina
Kathleen & ninaKathleen & nina
Kathleen & nina
 
C3 4 EvoluţIa Sistemului1
C3 4 EvoluţIa Sistemului1C3 4 EvoluţIa Sistemului1
C3 4 EvoluţIa Sistemului1
 
Diviziuni Sociale
Diviziuni SocialeDiviziuni Sociale
Diviziuni Sociale
 

Ähnlich wie Www Search Engine But Not In Perl

MongoATL: How Sourceforge is Using MongoDB
MongoATL: How Sourceforge is Using MongoDBMongoATL: How Sourceforge is Using MongoDB
MongoATL: How Sourceforge is Using MongoDBRick Copeland
 
What's new in spark 2.0?
What's new in spark 2.0?What's new in spark 2.0?
What's new in spark 2.0?Örjan Lundberg
 
Accra MongoDB User Group
Accra MongoDB User GroupAccra MongoDB User Group
Accra MongoDB User GroupMongoDB
 
Page 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxPage 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxsmile790243
 
Beginning MEAN Stack
Beginning MEAN StackBeginning MEAN Stack
Beginning MEAN StackRob Davarnia
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearchMinsoo Jun
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8dallemang
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At CraigslistJeremy Zawodny
 
ALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilSunita Shrivastava
 
Techorama - Evolvable Application Development with MongoDB
Techorama  - Evolvable Application Development with MongoDBTechorama  - Evolvable Application Development with MongoDB
Techorama - Evolvable Application Development with MongoDBbwullems
 
Mongo db first steps with csharp
Mongo db first steps with csharpMongo db first steps with csharp
Mongo db first steps with csharpSerdar Buyuktemiz
 
Node Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js TutorialNode Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js TutorialPHP Support
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBRavi Teja
 
Dev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBDev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBMongoDB
 

Ähnlich wie Www Search Engine But Not In Perl (20)

MongoATL: How Sourceforge is Using MongoDB
MongoATL: How Sourceforge is Using MongoDBMongoATL: How Sourceforge is Using MongoDB
MongoATL: How Sourceforge is Using MongoDB
 
What's new in spark 2.0?
What's new in spark 2.0?What's new in spark 2.0?
What's new in spark 2.0?
 
Accra MongoDB User Group
Accra MongoDB User GroupAccra MongoDB User Group
Accra MongoDB User Group
 
Page 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxPage 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docx
 
Mongo db
Mongo dbMongo db
Mongo db
 
Beginning MEAN Stack
Beginning MEAN StackBeginning MEAN Stack
Beginning MEAN Stack
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearch
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At Craigslist
 
ALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch Council
 
Techorama - Evolvable Application Development with MongoDB
Techorama  - Evolvable Application Development with MongoDBTechorama  - Evolvable Application Development with MongoDB
Techorama - Evolvable Application Development with MongoDB
 
Mongo db first steps with csharp
Mongo db first steps with csharpMongo db first steps with csharp
Mongo db first steps with csharp
 
MongoDB Basics
MongoDB BasicsMongoDB Basics
MongoDB Basics
 
Uma SunilKumar Resume
Uma SunilKumar ResumeUma SunilKumar Resume
Uma SunilKumar Resume
 
Node Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js TutorialNode Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js Tutorial
 
Reume IT
Reume ITReume IT
Reume IT
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Dev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBDev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDB
 
Search Approach - ES, GraphDB
Search Approach - ES, GraphDBSearch Approach - ES, GraphDB
Search Approach - ES, GraphDB
 
NoSql Databases
NoSql DatabasesNoSql Databases
NoSql Databases
 

Kürzlich hochgeladen

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 

Kürzlich hochgeladen (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Www Search Engine But Not In Perl

  • 1. Building a scalable distributed WWW search engine … NOT in Perl! Presented by Alex Chudnovsky (http://www.majectic12.co.uk) at Birmingham Perl Mongers User Group (http://birmingham.pm.org) V1.0 27/07/05
  • 2.
  • 3.
  • 4.
  • 5.
  • 6. Data collection (crawling) Base Issues URLs to crawl and receives compressed pages Distributed c rawlers – receive lists of URLs to crawl, crawl them and send back compressed data. In the future will do distributed indexing Note: this stage is optional if you already have data to index, ie list of products with their descriptions
  • 12. Current Stats Source: http://www.majestic12.co.uk/projects/dsearch/stats.php as of 27/07/05
  • 13. Indexing Indexing is a process of turning words into numbers and creating inverted index. Data barrel Doc #0: Birmingham Perl Mongers Doc #1: Birmingham City Doc #2: Perl City Lexicon (maps words to their numeric WordIDs) Birmingham – 0 Perl – 1 Mongers – 2 City – 3 Inverted Index (Each of the WordID has list of (ideally sorted) DocIDs) 0 -> 0, 1 1 -> 0, 2 2 -> 0, 3 -> 1, 2 Note: if you use database then it make sense to have clustered index on WordID
  • 14. Merging Individual indexed barrels Single searchable index Note: this stage is not necessary if just one barrel is used as there will be no need to remap all Ids from local to their global equivalents.
  • 15. Searching Searching is a process of finding documents that contain words from search query Doc #0: Birmingham Perl Mongers Doc #1: Birmingham City Doc #2: Perl City Lexicon (maps words to their numeric WordIDs) Birmingham – 0 Perl – 1 Mongers – 2 City – 3 Inverted Index (lists DocIDs for each of the WordID) 0 -> 0, 1 1 -> 0, 2 2 -> 0, 3 -> 1, 2 Note: if you use database then it make sense to cluster on WordID Search query: “Birmingham Perl” WordIDs: 0, 1 Intersection of DocIDs present in both lists (implementation of boolean AND logic): Not matched! 2 n/a Not matched! n/a 1 Matched! 0 0 Result 1 (Perl) 0 (Brum)
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23. Join! Join the project (unmetered broadband required!): majestic12.co.uk Your name could be here!