Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL


Max-kernel search: How to search for just about anything? Nearest neighbor search is a well studied and widely used task in computer science and is quite pervasive in everyday applications. While search is not synonymous with learning, search is a crucial tool for the most nonparametric form of learning. Nearest neighbor search can directly be used for all kinds of learning tasks — classification, regression, density estimation, outlier detection. Search is also the computational bottleneck in various other learning tasks such as clustering and dimensionality reduction. Key to nearest neighbor search is the notion of “near”-ness or similarity. Mercer kernels form a class of general nonlinear similarity functions and are widely used in machine learning. They can define a notion of similarity between pairs of objects of any arbitrary type and have been successfully applied to a wide variety of object types — fixed-length data, images, text, time series, graphs. I will present a technique to do nearest neighbor search with this class of similarity functions provably efficiently, hence facilitating faster learning for larger data.

Technology

Max-kernel search
How to search for just about anything?
Parikshit Ram

Similarity search
● Set of objects R
● Query q
● Similarity function

Drug discovery
3
http://fineartamerica.com

Similarity search is ubiquitous
● Machine learning
● Computer vision
● Theory
● Databases
● Information retrieval
● Web application
● Collaborative filtering
● Scientific computing
5

Search-based classification
6
k-nearest-neighbor classification/regression
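As a rough illustration of search-based classification, the sketch below labels a query by majority vote among its k nearest reference points. The Euclidean distance, the toy 2-D points, and the helper name `knn_classify` are assumptions made for this example; they are not taken from the talk.

```python
from collections import Counter
import math

def knn_classify(query, points, labels, k=3):
    """Label `query` by majority vote among its k nearest reference points."""
    # Rank reference points by Euclidean distance to the query (assumed metric).
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups of viewers.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
labels = ["RomCom fan"] * 3 + ["Kids movie fanatic"] * 3

print(knn_classify((0.3, 0.3), points, labels))
print(knn_classify((4.9, 5.0), points, labels))
```

The only learning step is the search itself, which is why the cost of the search dominates.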

Search-based classification
7
“RomCom fan”

Search-based classification
7
“Kids movie fanatic”

Search-based ML
Advantage
● nonparametric - lets the data speak
● no need to train complex models
Key ingredient
● notion of similarity (domain/data-specific)
Main challenge: efficiency
● Sheer size of the data
● Varied data types
10

Properties of similarity functions
● symmetry
● self-similarity
The dissimilarity is the size of the set-theoretic difference
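The set-difference example appears to illustrate a dissimilarity that is not symmetric. The slide figure is not part of the transcript, so the sets below are made up to reproduce the two sizes shown (3 and 1); the function name is hypothetical.

```python
def set_difference_dissimilarity(a, b):
    """Number of elements of `a` that are missing from `b`."""
    return len(set(a) - set(b))

# Hypothetical sets chosen so the two directions disagree.
a = {"w", "x", "y", "z"}
b = {"z", "v"}

print(set_difference_dissimilarity(a, b))  # 3
print(set_difference_dissimilarity(b, a))  # 1
```

Swapping the arguments changes the value, so this dissimilarity fails the symmetry property.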

● Metrics: used everywhere
● Bregman divergences: widely used for distributions
● Mercer kernels: widely used in ML for a variety of objects and problems
● ???: not quite explored in search or ML

Breadth of Kernel Functions
Objects         Kernel Functions
Images          linear, polynomial, Gaussian, pyramid match
Documents       cosine
Sequences       p-spectrum kernel, alignment score
Trees           subtree, syntactic, partial tree
Graphs          random walk
Time series     cross-correlation, dynamic time-warping
Natural Lang.   convolution, decomposition, lexical semantic
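To make a few of the fixed-length kernels from the table concrete, here is a minimal sketch using their standard textbook forms. The parameter names (`degree`, `offset`, `bandwidth`) and default values are assumptions for illustration, not settings from the talk.

```python
import math

def linear_kernel(x, y):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, offset=1.0):
    """(<x, y> + offset) ** degree."""
    return (linear_kernel(x, y) + offset) ** degree

def gaussian_kernel(x, y, bandwidth=1.0):
    """exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def cosine_kernel(x, y):
    """Inner product of the two vectors after scaling each to unit length."""
    return linear_kernel(x, y) / math.sqrt(linear_kernel(x, x) * linear_kernel(y, y))

x, y = (1.0, 2.0), (3.0, 4.0)
for kernel in (linear_kernel, polynomial_kernel, gaussian_kernel, cosine_kernel):
    print(kernel.__name__, kernel(x, y))
```

Each function is symmetric in its arguments, and each takes two objects and returns a single similarity score, which is all a max-kernel search needs.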

What is a Kernel Function?
In words
A pairwise symmetric function
● Correlation in a richer but hidden feature space
● Cannot access the hidden space
Object space
Hidden space
Hidden mapping
14
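In symbols, the standard definition of a Mercer kernel matches the description above: the kernel value is an inner product after a hidden mapping. The symbols \varphi (hidden mapping) and \mathcal{H} (hidden space) are notation chosen here, not taken from the slides.

```latex
K(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal{H}},
\qquad K(x, y) = K(y, x)
```

Only K is ever evaluated; \varphi(x) itself is never computed, which is why the hidden space cannot be accessed directly.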

Max-kernel Search
Find the object in R most similar to q
with respect to a kernel
15
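Stated as code, max-kernel search returns the object in R that maximizes K(q, p). The sketch below is the brute-force baseline that evaluates the kernel against every object; the toy data, the linear kernel, and the function name `max_kernel_search` are assumptions for illustration.

```python
def max_kernel_search(query, references, kernel):
    """Return (index, value) of the reference object with the largest kernel value."""
    best_index, best_value = -1, float("-inf")
    for i, p in enumerate(references):
        value = kernel(query, p)
        if value > best_value:
            best_index, best_value = i, value
    return best_index, best_value

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

# Hypothetical reference set R and query q.
R = [(1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
q = (0.5, 1.0)

print(max_kernel_search(q, R, linear_kernel))  # (2, 2.5)
```

This costs one kernel evaluation per object in R for every query, which is the cost an index tries to avoid.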

Existing methods
● Brute-force (parallel/distributed)
○ Domain-specific optimizations
● Coerce data to use metrics
○ Only approximate
No standard search tools!
16

Understanding kernels
If two objects are similar to each other,
then they are about equally similar to the query q
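The formulas on this slide are not in the transcript. One standard way to make the statement precise, assuming a Mercer kernel K with hidden mapping \varphi (notation chosen here), uses the kernel-induced distance and the Cauchy-Schwarz inequality:

```latex
d_K(p, p') = \lVert \varphi(p) - \varphi(p') \rVert
           = \sqrt{K(p, p) + K(p', p') - 2K(p, p')}

|K(q, p) - K(q, p')|
  = |\langle \varphi(q), \varphi(p) - \varphi(p') \rangle|
  \le \sqrt{K(q, q)} \; d_K(p, p')
```

So when d_K(p, p') is small, K(q, p) and K(q, p') cannot differ by much, and every quantity in the bound is computed from kernel evaluations alone.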

Indexing our collection
Cover Tree (BKL 2006)
Multi-resolution index in O(n log n) time

How to Search with this Index?
[Figure: query q and indexed objects p, p', p'']
Safely ignore a large chunk (potentially millions)
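The pruning idea can be sketched without a full cover tree. The example below swaps in a much simpler index, a flat partition of R into groups around given centers, and skips a whole group whenever an upper bound on its kernel values cannot beat the best value found so far. The bound K(q, c) + radius * sqrt(K(q, q)) follows from the Cauchy-Schwarz inequality for Mercer kernels. All names, the linear kernel, and the toy data are assumptions for illustration; this is not the talk's implementation.

```python
import math
import random

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def kernel_distance(x, y, kernel):
    """Distance in the hidden feature space, computed from kernel values only."""
    sq = kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors

def build_groups(references, centers, kernel):
    """Assign each reference to its closest center and record each group's radius."""
    groups = [{"center": c, "members": [], "radius": 0.0} for c in centers]
    for i, p in enumerate(references):
        dists = [kernel_distance(p, c, kernel) for c in centers]
        j = dists.index(min(dists))
        groups[j]["members"].append(i)
        groups[j]["radius"] = max(groups[j]["radius"], dists[j])
    return groups

def pruned_max_kernel_search(query, references, groups, kernel):
    """Exact max-kernel search that skips groups whose bound cannot win."""
    q_norm = math.sqrt(kernel(query, query))
    # Upper bound on K(query, p) for every member p of a group.
    bounded = [(kernel(query, g["center"]) + g["radius"] * q_norm, g)
               for g in groups if g["members"]]
    bounded.sort(key=lambda item: item[0], reverse=True)
    best_index, best_value, evaluated = -1, float("-inf"), 0
    for bound, g in bounded:
        if bound <= best_value:
            break  # this group and all remaining ones can be safely ignored
        for i in g["members"]:
            evaluated += 1
            value = kernel(query, references[i])
            if value > best_value:
                best_index, best_value = i, value
    return best_index, best_value, evaluated

# Hypothetical data: two tight clusters of 50 points each.
random.seed(0)
centers = [(10.0, 10.0), (-10.0, -10.0)]
R = [(cx + random.uniform(-0.5, 0.5), cy + random.uniform(-0.5, 0.5))
     for cx, cy in centers for _ in range(50)]
groups = build_groups(R, centers, linear_kernel)
q = (1.0, 1.0)

index, value, evaluated = pruned_max_kernel_search(q, R, groups, linear_kernel)
print(index, value, evaluated, "of", len(R), "member evaluations")
```

A cover tree applies the same test recursively at many resolutions, so the ignored chunk can be far larger than one group of a flat partition.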

Results: Efficiency
● Widely applicable algorithm
● Performance is data/kernel-dependent
[Plot: improvement, ranging from 10x to 10000x]

Results: Sublinear Query Time
[Plot: improvement vs. object set size]
Bigger data implies bigger efficiency gains

Can We Prove it?
What Makes Search Hard?
Thm.
For a set R of n objects, the query time depends on
● the expansion constant
○ determined by the distribution of the data
● the directional concentration constant
○ determined by the distribution of a kernel-induced transformation of the data
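For reference, the expansion constant as it is usually defined in the cover tree literature is given below. Whether the talk uses exactly this form is an assumption, and B_R and d are notation chosen here for the ball within R and the underlying distance.

```latex
B_R(p, r) = \{ x \in R : d(p, x) \le r \}

c = \min \{ c' \ge 2 : |B_R(p, 2r)| \le c' \, |B_R(p, r)|
      \ \text{for all } p \in R,\ r > 0 \}
```

A small c means that doubling a ball's radius never captures many more points, which is the kind of data distribution on which tree-based search prunes well.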

Endnote
● Search is an essential tool for ML
● Exploring different types of similarity functions increases the applicability and quality of search
● Kernels are widely applicable similarity functions
○ now we have provably fast max-kernel search
Code/tutorial for Fast Exact Max-Kernel Search: http://www.mlpack.org (version 1.0.5), Ryan R. Curtin
Email: pari@skytree.net


1. Max-kernel search How to search for just about anything? Parikshit Ram

2. Similarity search q ● Set of objects ● Query R ● Similarity function 1

3. Finding similar images 2

4. Drug discovery 3 http://fineartamerica.com

5. Movie recommendations 4

6. Similarity search is ubiquitous ● Machine learning ● Computer vision ● Theory ● Databases ● Information retrieval ● Web application ● Collaborative filtering ● Scientific computing 5

7. Search-based classification 6

8. Search-based classification 6 ?

9. Search-based classification 6 k-nearest-neighbor classification/regression

10. Search-based classification 7 “RomCom fan”

11. Search-based classification 7 “Kids movie fanatic”

12. Search-based outlier detection 8

13. 9

14. Search-based ML Advantage ● nonparametric - lets the data speak ● no need to train complex models Key ingredient ● notion of similarity (domain/data-specific) Main challenge: efficiency ● Sheer size of the data ● Varied data types 10

15. Properties of similarity functions 11 ● symmetry OR

16. 11 3 1 The dissimilarity is the size of the set-theoretic difference

17. Properties of similarity functions 11 ● symmetry ● self-similarity OR OR

18. 11 We do not really care about this.

19. Properties of similarity functions 11 ● symmetry ● self-similarity OR OR

20. 12

21. 12

22. 12 Metrics used everywhere

23. 12 Metrics used everywhere

24. 12 Bregman divergences widely used for distributions Mercer kernels widely used in ML for variety of objects and problems ??? not quite explored in search or ML Metrics used everywhere

25. Breadth of Kernel Functions Objects Kernel Functions Images linear, polynomial, Gaussian, Pyramid match Documents cosine Sequences p-spectrum kernel, alignment score Trees subtree, syntactic, partial tree Graphs random walk Time series cross-correlation, dynamic time-warping Natural Lang. convolution, decomposition, lexical semantic 13

26. What is a Kernel Function? In words A pairwise symmetric function ● Correlation in a richer but hidden feature space ● Cannot access the hidden space Object space Hidden space Hidden mapping 14

27. Max-kernel Search Find the object in R most similar to q with respect to a kernel 15

28. Existing methods ● Brute-force (parallel/distributed) ○ Domain-specific optimizations ● Coerce data to use metrics ○ Only approximate No standard search tools! 16

29. Understanding kernels If two objects equally similar to each other then they are equally similar to the query q 17

30. IF 17 Understanding kernels THEN

31. 18 Indexing our collection

32. 18 Indexing our collection

33. Multi-resolution index in O( n log n ) time p 18 Indexing our collection Cover Tree (BKL 2006)

34. How to Search with this Index? 19 q p

35. How to Search with this Index? 19 q p p' p''

36. How to Search with this Index? q p p'' p' 19

37. How to Search with this Index? q p p'' p' 19

38. How to Search with this Index? q p p'' p' Safely ignore a large chunk (potentially millions) 19

39. Results: Efficiency Improvement 20

40. Results: Efficiency 10000x ● Widely applicable algorithm ● Performance data/kernel-dependent 10x Improvement 20

41. Results: Sublinear Query Time Improvement Object set size Bigger data implies bigger efficiency gains 21

42. Can We Prove it? What Makes Search Hard? Thm. For a set R of n objects, the query time is ● expansion constant ○ the distribution of the data ● directional concentration constant ○ the distribution of a kernel-induced transformation of the data 22

43. Endnote ● Search is an essential tool for ML ● Exploring different types of similarity functions increases the applicability and quality of search ● Kernels are widely applicable similarity functions ○ now we have provably fast max kernel search Code/tutorial for Fast Exact Max-Kernel Search 23 version 1.0.5 http://www.mlpack.org Ryan R. Curtin Email: pari@skytree.net