SlideShare ist ein Scribd-Unternehmen logo
1 von 13
Downloaden Sie, um offline zu lesen
MinHashing:
Fast Similarity
Search
@jsuchal @rubyslava #30
@synopsitv
Problem
Users that like this also like...
Jaccard Similarity
movie1 = {user1, user2, user3, user4}
movie2 = {user3, user4, user5, user6}
J(A, B) = |A ∩ B| / |A ∪ B|
J(m1, m2) = |{user3, user4}| / |{user1 ... user6}|
= 2 / 6
= 0.33
Jaccard Similarity
movie1 = {user1, user2, user3, user4} 30K oops!
movie2 = {user3, user4, user5, user6}
2M oops!
J(A, B) = |A ∩ B| / |A ∪ B|
J(m1, m2) = |{user3, user4}| / |{user1 ... user6}|
= 2 / 6
= 0.33
MinHashing
Key idea:
What is the probability that two sets have the
same minimum value?
MinHashing
Key idea:
What is the probability that two sets have the
same minimum value?
P( s(A) = s(B) ) = J(A, B)
MinHashing
def calculate_minhash(hash_function, set)
minhash = Infinity
set.each do |item|
value = hash_function.call(item)
minhash = value if value < minhash
end
minhash
end
def generate_signature(set)
@hash_functions.map do |hash_function|
calculate_minhash(hash_function, set)
end
end
MinHashing
def similarity(signature1, signature2)
matches = 0
signature1.each_with_index do |minhash, idx|
matches += 1 if minhash == signature2[idx]
end
matches / signature1.size.to_f
end
Demo
http://j.mp/194KB7X
Top-k search
● fast top-k search = fulltext search
○ elasticsearch.org
● curl -XGET localhost:
9200/movies/_search? q=likes.
signature:(123 456 789)
Features
● easily updatable on insertion
○ calculate minhash of new element
○ update existing set minhash if new minhash is lower
● tunable precision at query time!
○ use bigger/smaller part of precalculated signature
Extensions
● Weighting
○ not all items in set have same weight
● Locality sensitive hashing
○ shingling
○ boosting high similarity matching
○ e.g. near duplicate detection
Resources
● Finding similar items using minhashing
http://www.toao.com/posts/finding-similar-items-key-store-minhashing.html
● Mining of Massive Datasets
http://infolab.stanford.edu/~ullman/mmds.html

Weitere ähnliche Inhalte

Mehr von Jano Suchal

Bonetics: Mastering Puppet Workshop
Bonetics: Mastering Puppet WorkshopBonetics: Mastering Puppet Workshop
Bonetics: Mastering Puppet WorkshopJano Suchal
 
Peter Mihalik: Puppet
Peter Mihalik: PuppetPeter Mihalik: Puppet
Peter Mihalik: PuppetJano Suchal
 
Tomáš Čorej: Configuration management & CFEngine3
Tomáš Čorej: Configuration management & CFEngine3Tomáš Čorej: Configuration management & CFEngine3
Tomáš Čorej: Configuration management & CFEngine3Jano Suchal
 
Ako si vybrať programovací jazyk a framework?
Ako si vybrať programovací jazyk a framework?Ako si vybrať programovací jazyk a framework?
Ako si vybrať programovací jazyk a framework?Jano Suchal
 
SQL: Query optimization in practice
SQL: Query optimization in practiceSQL: Query optimization in practice
SQL: Query optimization in practiceJano Suchal
 
Garelic: Google Analytics as App Performance monitoring
Garelic: Google Analytics as App Performance monitoringGarelic: Google Analytics as App Performance monitoring
Garelic: Google Analytics as App Performance monitoringJano Suchal
 
Miroslav Šimulčík: Temporálne databázy
Miroslav Šimulčík: Temporálne databázyMiroslav Šimulčík: Temporálne databázy
Miroslav Šimulčík: Temporálne databázyJano Suchal
 
Vojtech Rinik: Internship v USA - moje skúsenosti
Vojtech Rinik: Internship v USA - moje skúsenostiVojtech Rinik: Internship v USA - moje skúsenosti
Vojtech Rinik: Internship v USA - moje skúsenostiJano Suchal
 
Profiling and monitoring ruby & rails applications
Profiling and monitoring ruby & rails applicationsProfiling and monitoring ruby & rails applications
Profiling and monitoring ruby & rails applicationsJano Suchal
 
Aký programovací jazyk a framework si vybrať a prečo?
Aký programovací jazyk a framework si vybrať a prečo?Aký programovací jazyk a framework si vybrať a prečo?
Aký programovací jazyk a framework si vybrať a prečo?Jano Suchal
 
Petr Joachim: Redis na Super.cz
Petr Joachim: Redis na Super.czPetr Joachim: Redis na Super.cz
Petr Joachim: Redis na Super.czJano Suchal
 
Metaprogramovanie #1
Metaprogramovanie #1Metaprogramovanie #1
Metaprogramovanie #1Jano Suchal
 
PostgreSQL: Advanced features in practice
PostgreSQL: Advanced features in practicePostgreSQL: Advanced features in practice
PostgreSQL: Advanced features in practiceJano Suchal
 
elasticsearch - advanced features in practice
elasticsearch - advanced features in practiceelasticsearch - advanced features in practice
elasticsearch - advanced features in practiceJano Suchal
 
Postobjektové programovanie v Ruby
Postobjektové programovanie v RubyPostobjektové programovanie v Ruby
Postobjektové programovanie v RubyJano Suchal
 
Odporúčacie systémy a služba sme.sk čo čítať
Odporúčacie systémy a služba sme.sk čo čítaťOdporúčacie systémy a služba sme.sk čo čítať
Odporúčacie systémy a služba sme.sk čo čítaťJano Suchal
 
sme.sk čočítať ontožíur-2010
sme.sk čočítať ontožíur-2010sme.sk čočítať ontožíur-2010
sme.sk čočítať ontožíur-2010Jano Suchal
 

Mehr von Jano Suchal (18)

Bonetics: Mastering Puppet Workshop
Bonetics: Mastering Puppet WorkshopBonetics: Mastering Puppet Workshop
Bonetics: Mastering Puppet Workshop
 
Peter Mihalik: Puppet
Peter Mihalik: PuppetPeter Mihalik: Puppet
Peter Mihalik: Puppet
 
Tomáš Čorej: Configuration management & CFEngine3
Tomáš Čorej: Configuration management & CFEngine3Tomáš Čorej: Configuration management & CFEngine3
Tomáš Čorej: Configuration management & CFEngine3
 
Ako si vybrať programovací jazyk a framework?
Ako si vybrať programovací jazyk a framework?Ako si vybrať programovací jazyk a framework?
Ako si vybrať programovací jazyk a framework?
 
SQL: Query optimization in practice
SQL: Query optimization in practiceSQL: Query optimization in practice
SQL: Query optimization in practice
 
Garelic: Google Analytics as App Performance monitoring
Garelic: Google Analytics as App Performance monitoringGarelic: Google Analytics as App Performance monitoring
Garelic: Google Analytics as App Performance monitoring
 
Miroslav Šimulčík: Temporálne databázy
Miroslav Šimulčík: Temporálne databázyMiroslav Šimulčík: Temporálne databázy
Miroslav Šimulčík: Temporálne databázy
 
Vojtech Rinik: Internship v USA - moje skúsenosti
Vojtech Rinik: Internship v USA - moje skúsenostiVojtech Rinik: Internship v USA - moje skúsenosti
Vojtech Rinik: Internship v USA - moje skúsenosti
 
Profiling and monitoring ruby & rails applications
Profiling and monitoring ruby & rails applicationsProfiling and monitoring ruby & rails applications
Profiling and monitoring ruby & rails applications
 
Aký programovací jazyk a framework si vybrať a prečo?
Aký programovací jazyk a framework si vybrať a prečo?Aký programovací jazyk a framework si vybrať a prečo?
Aký programovací jazyk a framework si vybrať a prečo?
 
Čo po GAMČI?
Čo po GAMČI?Čo po GAMČI?
Čo po GAMČI?
 
Petr Joachim: Redis na Super.cz
Petr Joachim: Redis na Super.czPetr Joachim: Redis na Super.cz
Petr Joachim: Redis na Super.cz
 
Metaprogramovanie #1
Metaprogramovanie #1Metaprogramovanie #1
Metaprogramovanie #1
 
PostgreSQL: Advanced features in practice
PostgreSQL: Advanced features in practicePostgreSQL: Advanced features in practice
PostgreSQL: Advanced features in practice
 
elasticsearch - advanced features in practice
elasticsearch - advanced features in practiceelasticsearch - advanced features in practice
elasticsearch - advanced features in practice
 
Postobjektové programovanie v Ruby
Postobjektové programovanie v RubyPostobjektové programovanie v Ruby
Postobjektové programovanie v Ruby
 
Odporúčacie systémy a služba sme.sk čo čítať
Odporúčacie systémy a služba sme.sk čo čítaťOdporúčacie systémy a služba sme.sk čo čítať
Odporúčacie systémy a služba sme.sk čo čítať
 
sme.sk čočítať ontožíur-2010
sme.sk čočítať ontožíur-2010sme.sk čočítať ontožíur-2010
sme.sk čočítať ontožíur-2010
 

Kürzlich hochgeladen

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Kürzlich hochgeladen (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

MinHashing: Fast Similarity Search