Random 
Manhattan Indexing 
Behrang Q. Zadeh and Siegfried Handschuh 
National University of Ireland, Galway, Ireland 
University of Passau, Lower Bavaria, Germany
Processing Natural Language Text 
• For computers, natural language text is simply a sequence of 
bits and bytes.
Processing Natural Language Text 
• For computers, natural language text is simply a sequence of 
bits and bytes. 
Text 
Preprocessing 
Computable Model
Vector Space Model (VSM) 
• Vector spaces are one of the models employed for text 
processing. 
• A mathematical structure ⟨V, ℝ, +, ·⟩ that satisfies certain 
axioms. 
• Text elements are converted to real-valued vectors. 
• Each dimension of the vector represents some information 
about text elements.
VSMs: an Example 
• In a text-based information retrieval task: 
• Each dimension of the VSM represents an index term t. 
• Documents are represented by vectors. 
Salton et al. (1975)
VSMs: an Example 
Salton et al. (1975) 

[Plot: d1 and d2 in the (t1, t2) plane] 

     t1  t2 
d1    4   4 
d2    1   2
VSMs: an Example 
• In a text-based information retrieval task: 
• Each dimension of the VSM represents an index term t. 
• Documents are represented by vectors. 
• Queries are treated as pseudo-documents. 
Salton et al. (1975)
VSMs: an Example 
[Plot: d1, d2 and the query q in the (t1, t2) plane] 

     t1  t2 
d1    4   4 
d2    1   2 
q     3   1 

Salton et al. (1975)
VSMs: an Example 
• In a text-based information retrieval task: 
• Each dimension of the VSM represents an index term t. 
• Documents are represented by vectors. 
• Queries are treated as pseudo-documents. 
• Using a norm structure, a notion of distance is defined 
and used to assess the similarity between vectors, i.e. 
documents and queries. 
Salton et al. (1975)
VSMs: an Example 
The L2 or Euclidean norm: ‖x‖₂ = √(Σᵢ xᵢ²) 

Salton et al. (1975) 

[Plot: d1, d2 and q in the (t1, t2) plane] 

     t1  t2 
d1    4   4 
d2    1   2 
q     3   1 

dist₂(d1, q) = ‖d1 − q‖₂ = √((4 − 3)² + (4 − 1)²) = √10 
dist₂(d2, q) = ‖d2 − q‖₂ = √((1 − 3)² + (2 − 1)²) = √5
VSMs: an Example 
The L1 (Manhattan) norm: ‖x‖₁ = Σᵢ |xᵢ| 

[Plot: d1, d2 and q in the (t1, t2) plane] 

     t1  t2 
d1    4   4 
d2    1   2 
q     3   1 

dist₁(d1, q) = ‖d1 − q‖₁ = |4 − 3| + |4 − 1| = 4 
dist₁(d2, q) = ‖d2 − q‖₁ = |1 − 3| + |2 − 1| = 3
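The two worked distance computations above can be reproduced in a few lines of Python (a sketch using the slides' toy vectors d1, d2 and q):

```python
import math

d1, d2, q = (4, 4), (1, 2), (3, 1)  # the document and query vectors above

def l2(u, v):
    # Euclidean (L2) distance: square root of the summed squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def l1(u, v):
    # Manhattan (L1) distance: sum of the absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(u, v))

print(l2(d1, q), l2(d2, q))  # √10 ≈ 3.162, √5 ≈ 2.236
print(l1(d1, q), l1(d2, q))  # 4 3
```

Under both norms d2 is the closer document here, but the two norms need not agree in general.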
The dimensionality barrier 
• Due to the Zipfian distribution of index terms, the 
number of index terms escalates when new 
documents are added. 
• Adding new index terms requires adding additional 
dimensions to the vector space. 
[Plot: term frequency vs. index terms, showing a Zipfian distribution]
The dimensionality barrier 
• Due to the Zipfian distribution of index terms, the 
number of index terms escalates when new 
documents are added. 
• Adding new index terms requires adding additional 
dimensions to the vector space. 
• Therefore, VSMs are extremely high-dimensional 
and sparse: 
“THE CURSE OF DIMENSIONALITY”
The dimensionality barrier
Overcoming the dimensionality 
barrier
Overcoming the dimensionality 
barrier 
• Dimensionality Reduction Techniques are employed 
to resolve the problem.
Overcoming the dimensionality 
barrier 
• Dimensionality Reduction Techniques are employed 
to resolve the problem. 
Dimensionality 
Reduction 
Selection Process 
(based on a heuristic) 
Transformation-based methods
Overcoming the dimensionality 
barrier 
• Dimensionality Reduction Techniques are employed 
to resolve the problem. 
Dimensionality 
Reduction 
Selection Process 
(based on a heuristic) 
Transformation-based methods 
Matrix Factorization Random Projections
Overcoming the dimensionality 
barrier 
• Dimensionality Reduction Techniques are employed 
to resolve the problem. 
Dimensionality 
Reduction 
Selection Process 
(based on a heuristic) 
Transformation-based methods 
Matrix Factorization Random Projections 
Truncated Singular Value Decomposition 
(Latent Semantic Analysis) 
Random Indexing
Limitations of Matrix Factorization 
• Computation of the Singular Value Decomposition 
(SVD) is process-intensive, i.e. O(mn²). 
• Every time a VSM is updated, SVD must be 
recomputed: 
˟ Not suitable for big-text data analytics. 
˟ Not suitable for frequently updated text data, e.g. blog and 
Twitter analysis.
Limitations of Matrix Factorization 
• Computation of the Singular Value Decomposition 
(SVD) is process-intensive, i.e. O(mn²). 
• Every time a VSM is updated, SVD must be 
recomputed: 
˟ Not suitable for big-text data analytics. 
˟ Not suitable for frequently updated text data, e.g. blog and 
Twitter analysis. 
Solution: Random projection methods skip the 
computation of transformations (eigenvectors).
Random Projections: Random Indexing 
• The VSM dimensionality is decided independently of the 
number of index terms, i.e. the dimensionality of the 
VSM is fixed. 
• Algorithm: 
• Assign each index term to a randomly generated “index vector”. 
• Most elements of index vectors are 0 and only a few +1 and -1, e.g. 
rt1 = (0, 0, 1, 0,-1), rt2 = (1, 0, -1, 0, 0), etc. 
• Assign each document to an empty “context vector”. 
• Construct the VSM incrementally by accumulating index vectors into 
context vectors.
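The algorithm above can be sketched in a few lines (a minimal sketch; the dimension, number of non-zero entries, and toy documents are illustrative, not values from the slides):

```python
import random

_rng = random.Random(0)  # fixed seed so the sketch is reproducible

def index_vector(dim=5, nonzeros=2):
    # Sparse ternary index vector: mostly 0s, a few randomly placed +1/-1s
    v = [0] * dim
    for pos in _rng.sample(range(dim), nonzeros):
        v[pos] = _rng.choice((1, -1))
    return v

def build_vsm(documents, dim=5):
    index_vectors = {}  # one fixed random index vector per index term
    vsm = {}            # one context vector per document
    for doc_id, terms in documents.items():
        ctx = [0] * dim  # each document starts with an empty context vector
        for term in terms:
            if term not in index_vectors:
                index_vectors[term] = index_vector(dim)
            # accumulate the term's index vector into the context vector
            ctx = [c + r for c, r in zip(ctx, index_vectors[term])]
        vsm[doc_id] = ctx
    return vsm

docs = {"d1": ["munich", "capital", "largest", "city", "german", "state", "bavaria"],
        "d2": ["munich", "beautiful", "historical", "city"]}
vsm = build_vsm(docs)
```

Adding a new document (or a new term) never changes the dimension of the space, which is the property the next slides illustrate.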
Random Indexing (Example) 
Munich is the capital 
and largest city of the 
German state of 
Bavaria. 
Document 1
Random Indexing (Example) 
Munich is the capital 
and largest city of the 
German state of 
Bavaria. 
Munich 
Capital 
German 
Bavaria 
Largest 
City 
State 
Document 1 
Index Terms
Random Indexing (Example) 
Munich is the capital 
and largest city of the 
German state of 
Bavaria. 
Munich: (1,‐1,0,0,0) 
Capital: (0,0,1,0,‐1) 
German: (1,0,0,‐1,0) 
Bavaria: (0,0,0,1,‐1) 
Largest: (0,1,0,‐1,0) 
City: (1,0,‐1,0,0) 
State: (1,0,0,0,‐1) 
Document 1 
Index Terms 
D1 = (0, 0, 0, 0, 0)
Random Indexing (Example) 
Munich is the capital 
and largest city of the 
German state of 
Bavaria. 
Munich: (1,‐1,0,0,0) 
Capital: (0,0,1,0,‐1) 
German: (1,0,0,‐1,0) 
Bavaria: (0,0,0,1,‐1) 
Largest: (0,1,0,‐1,0) 
City: (1,0,‐1,0,0) 
State: (1,0,0,0,‐1) 
Document 1 
Index Terms 
D1 = (4, 0, 0, ‐1, ‐3) 
Munich ( 1 ,‐1 ,0 ,0 ,0 ) 
Capital ( 0 ,0 ,1 ,0 ,‐1 ) 
German ( 1 ,0 ,0 ,‐1 ,0 ) 
Bavaria ( 0 ,0 ,0 ,1 ,‐1 ) 
Largest ( 0 ,1 ,0 ,‐1 ,0 ) 
City ( 1 ,0 ,‐1 ,0 ,0 ) 
State ( 1 ,0 ,0 ,0 ,‐1 ) 
D1 = ( 4 ,0 , 0 , ‐1 , ‐3 ) 
Random Indexing (Example) 
Munich is the capital 
and largest city of the 
German state of 
Bavaria. 
Munich: (1,‐1,0,0,0) 
Capital: (0,0,1,0,‐1) 
German: (1,0,0,‐1,0) 
Bavaria: (0,0,0,1,‐1) 
Largest: (0,1,0,‐1,0) 
City: (1,0,‐1,0,0) 
State: (1,0,0,0,‐1) 
Document 1 
Index Terms 
D1 = (4, 0, 0, ‐1, ‐3) 
Munich is a beautiful, 
historical city. 
Document 2
Random Indexing (Example) 
Munich is the capital 
and largest city of the 
German state of 
Bavaria. 
Munich: (1,‐1,0,0,0) 
Capital: (0,0,1,0,‐1) 
German: (1,0,0,‐1,0) 
Bavaria: (0,0,0,1,‐1) 
Largest: (0,1,0,‐1,0) 
City: (1,0,‐1,0,0) 
State: (1,0,0,0,‐1) 
Beautiful: (0,0,1,‐1,0) 
Historical: (0,1,0,0,‐1) 
Document 1 
Index Terms 
D1 = (4, 0, 0, ‐1, ‐3) 
Munich is a beautiful, 
historical city. 
Document 2 
D2 = (0, 0, 0, 0, 0)
Random Indexing (Example) 
Munich is the capital 
and largest city of the 
German state of 
Bavaria. 
Munich: (1,‐1,0,0,0) 
Capital: (0,0,1,0,‐1) 
German: (1,0,0,‐1,0) 
Bavaria: (0,0,0,1,‐1) 
Largest: (0,1,0,‐1,0) 
City: (1,0,‐1,0,0) 
State: (1,0,0,0,‐1) 
Beautiful: (0,0,1,‐1,0) 
Historical: (0,1,0,0,‐1) 
Document 1 
Index Terms 
D1 = (4, 0, 0, ‐1, ‐3) 
Munich is a beautiful, 
historical city. 
Document 2 
D2 = (2,0,0, ‐1,‐1)
Random Indexing (Example) 
Munich is the capital 
and largest city of the 
German state of 
Bavaria. 
Munich: (1,‐1,0,0,0) 
Capital: (0,0,1,0,‐1) 
German: (1,0,0,‐1,0) 
Bavaria: (0,0,0,1,‐1) 
Largest: (0,1,0,‐1,0) 
City: (1,0,‐1,0,0) 
State: (1,0,0,0,‐1) 
Beautiful: (0,0,1,‐1,0) 
Historical: (0,1,0,0,‐1) 
.... 
Document 1 
Index Terms 
D1 = (4, 0, 0, ‐1, ‐3) 
Munich is a beautiful, 
historical city. 
Document 2 
D2 = (2,0,0, ‐1,‐1) 
Document 
Document 
Document 
Document 
FIXED DIMENSION
Limitation of Random Indexing 
• Random Indexing employs a Gaussian Random 
Projection. 
• A Gaussian Random Projection can only be used for 
the estimation of Euclidean distances. 
[Plots: d1, d2 and q in the (t1, t2) plane, before and after projection]
Random Manhattan Indexing 
• In some applications, we may want to estimate the L1 
(Manhattan) distance between vectors. 
• Random Manhattan Indexing (RMI) can be used to 
estimate the Manhattan distances. 
[Plots: d1, d2 and q in the (t1, t2) plane, before and after projection]
The RMI method 
• RMI employs Cauchy Random Projections. 
• RMI is an incremental technique (similar to RI): 
• Each index term is assigned an index vector. 
• Index vectors are high-dimensional and randomly 
generated with the following distribution: 
        +1/U₁   with probability s/2 
rᵢⱼ =    0      with probability 1 − s 
        −1/U₂   with probability s/2 

U₁ and U₂ are independent uniform random variables in (0, 1). 
Indyk (2000), Li (2008)
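Sampling index vectors from this distribution can be sketched as follows (the dimension and sparsity values are illustrative):

```python
import random

def rmi_index_vector(dim, s, rng=random.Random(42)):
    # Each entry is +1/U1 with probability s/2, -1/U2 with probability s/2,
    # and 0 with probability 1 - s, where U1, U2 are independent uniforms
    # on (0, 1) -- a sparse, Cauchy-like random projection column.
    def u():
        x = rng.random()
        return x if x > 0.0 else u()  # reject the (measure-zero) 0.0
    v = []
    for _ in range(dim):
        p = rng.random()
        if p < s / 2:
            v.append(1.0 / u())
        elif p < s:
            v.append(-1.0 / u())
        else:
            v.append(0.0)
    return v

vec = rmi_index_vector(dim=400, s=0.02)  # illustrative parameter values
```

Because 1/U with U uniform on (0, 1) has heavy (Cauchy-like) tails, occasional large entries such as the −9.3 in the example slides are expected.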
The RMI method 
• Assign documents to context vectors. 
• Create context vectors incrementally. 
• Estimate the Manhattan distance between vectors 
using the following (non-linear) equation: 

dist₁(u, v) ≈ ( ∏ᵢ₌₁ᵐ |uᵢ − vᵢ| )^(1/m) 

m is the dimension of the RMI-constructed VSM
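Reading the slide's equation as Li's geometric-mean estimator (one of several nonlinear estimators for Cauchy random projections; Li (2008) also gives sample-median and maximum-likelihood variants, and an unbiasing correction factor omitted here), a minimal sketch:

```python
import math

def l1_estimate(u, v):
    # Geometric mean of |u_i - v_i| over the m dimensions of the
    # RMI-constructed VSM, computed in log space for numerical stability.
    # A linear (arithmetic-mean) estimator would not work: Cauchy
    # variables have no finite mean. Assumes u_i != v_i in every
    # coordinate (a zero difference would make the product degenerate).
    m = len(u)
    return math.exp(sum(math.log(abs(a - b)) for a, b in zip(u, v)) / m)

# two toy 2-dimensional context vectors; coordinate gaps 2 and 8
print(l1_estimate((1.0, 2.0), (3.0, 10.0)))  # geometric mean ≈ 4.0
```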
RMI (Example) 
Munich is the capital 
and largest city of the 
German state of 
Bavaria. 
Munich: (0,0,1.0,‐2.1,0) 
Capital: (0,1.49,0,‐1.2,0) 
German: (0,0,0,1.9,‐1.8) 
Bavaria: (1.3,0,0,0,‐3.9) 
Largest: (0,0,1.6,‐2.1,0) 
City: (0,2.5,0,0,‐2.3) 
State: (0,1.9,‐9.3,0,0) 
Document 1 
Index Terms 
D1 = (1.3,5.89,‐6.7,‐3.5,‐8)
RMI (Example) 
Munich is the capital 
and largest city of the 
German state of 
Bavaria. 
Munich: (0,0,1.0,‐2.1,0) 
Capital: (0,1.49,0,‐1.2,0) 
German: (0,0,0,1.9,‐1.8) 
Bavaria: (1.3,0,0,0,‐3.9) 
Largest: (0,0,1.6,‐2.1,0) 
City: (0,2.5,0,0,‐2.3) 
State: (0,1.9,‐9.3,0,0) 
Document 1 
Index Terms 
D1 = (1.3,5.89,‐6.7,‐3.5,‐8) 
Use the non-linear estimator to assess similarities: 

dist₁(u, v) ≈ ( ∏ᵢ₌₁ᵐ |uᵢ − vᵢ| )^(1/m)
RMI’s parameters 
• The dimension of the RMI-constructed VSM. 
• Number of non-zero elements in index vectors.
RMI’s parameters 
• The dimension of the RMI-constructed VSM: 
• It is independent of the number of index terms. 
• It is decided by the desired success probability and the maximum 
expected distortion in distances. 
• For a fixed set of documents, a larger dimension results in 
less distortion. 
• According to our experiments, m = 400 is suitable for most 
applications.
RMI’s parameters 
• The dimension of the RMI-constructed VSM. 
• Number of non-zero elements in index vectors: 
• It is decided by the number of index terms and the 
sparsity of the VSM at its original high dimension. 
• We suggest s = 1/√(βn), where β is the sparseness of the 
original high-dimensional VSM and n is the number of 
index terms. 
• β is often considered to be around 0.0001 to 0.01.
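Plugging in illustrative values (β = 0.001, and n = 2.53 million index terms as in the INEX-Wikipedia experiment; the formula follows the reconstruction above):

```python
import math

beta = 0.001      # sparseness of the original high-dimensional VSM (illustrative)
n = 2_530_000     # number of index terms (INEX-Wikipedia 2009 dimensionality)

s = 1.0 / math.sqrt(beta * n)  # probability of a non-zero index-vector entry
print(round(s, 4))  # ≈ 0.0199
```

So in this setting only about 2% of index-vector entries would be non-zero.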
Experimental Results 
• We designed an experiment that shows the ability of 
RMI to preserve L1 distances: 
• A VSM is first constructed from the INEX-Wikipedia 
2009 collection at its original high dimension 
(dimensionality of 2.53 million). 
• We choose a list of 1000 random articles from 
the corpus. 
• The L1 distance of each document to the rest of the 
documents in the list is calculated. 
• Documents are sorted by the calculated L1 distances. 
• The result is 1000 lists of 999 sorted documents.
Experimental Results 
[Diagram: each reference document in the set of 1000 random 
documents yields a sorted list of the other 999 documents — 
1000 lists in total]
Experimental Results 
• Construct the VSM using RMI method and repeat 
the sorting process. 
• Use different dimensionalities. 
• Use different numbers of non-zero elements. 
• Compare the sorted lists obtained from the VSM 
constructed at the original high dimension and the 
RMI-constructed VSM at reduced dimensionality. 
• Spearman's rank correlation (ρ) is used for comparison. 
• EXPECTATION: similar sorted lists from both 
VSMs, i.e. ρ = 1.
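The comparison step can be sketched with a small stdlib implementation of Spearman's ρ (toy score lists here; the actual experiment compares 1000 lists of 999 documents each):

```python
def spearman(xs, ys):
    # Spearman's rank correlation for lists without ties:
    # rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    # where d_i is the rank difference of item i under the two scorings.
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# exact L1 distances vs. estimated distances that induce the same ordering
print(spearman([0.1, 0.5, 0.9], [1.0, 2.0, 3.0]))  # 1.0
```

Identical orderings give ρ = 1, the outcome expected if RMI preserves the L1 distances.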
Experimental Results 
• Observed correlation: 
red triangles: dimension = 200 
blue diamonds: dimension = 400 
black squares: dimension = 800 
For the baseline, we report ρ = 0.1375 for lists of documents sorted randomly.
Experimental Results 
• Experiments support the earlier claim that RI does not 
preserve the L1 distance: 
[Plot: correlation curves for RMI (non-linear estimator), RI, and 
RI with the standard L1 distance]
Discussion 
• We proposed an incremental, scalable method for 
the construction of L1-normed VSMs. 
• We showed the ability of the method to preserve the 
distances between documents. 
• Is it possible to avoid floating-point calculations? 
• What is the performance of RMI within the context 
of Information Retrieval benchmarks?
THANKS FOR YOUR ATTENTION! 
Random Manhattan Indexing 
Behrang Q. Zadeh and Siegfried Handschuh 
National University of Ireland, Galway, Ireland 
University of Passau, Lower Bavaria, Germany 
http://atmykitchen.info/

Weitere ähnliche Inhalte

Ähnlich wie Random Manhattan Indexing

Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Satyam Saxena
 
Chapter 4 IR Models.pdf
Chapter 4 IR Models.pdfChapter 4 IR Models.pdf
Chapter 4 IR Models.pdfHabtamu100
 
Statistics in real life engineering
Statistics in real life engineeringStatistics in real life engineering
Statistics in real life engineeringMD TOUFIQ HASAN ANIK
 
The wise doc_trans presentation
The wise doc_trans presentationThe wise doc_trans presentation
The wise doc_trans presentationRaouf KESKES
 
LinearAlgebra_2016updatedFromwiki.ppt
LinearAlgebra_2016updatedFromwiki.pptLinearAlgebra_2016updatedFromwiki.ppt
LinearAlgebra_2016updatedFromwiki.pptAruneshAdarsh
 
LinearAlgebra_2016updatedFromwiki.ppt
LinearAlgebra_2016updatedFromwiki.pptLinearAlgebra_2016updatedFromwiki.ppt
LinearAlgebra_2016updatedFromwiki.pptHumayilZia
 
Sketch algorithms
Sketch algorithmsSketch algorithms
Sketch algorithmsSimon Belak
 
Session 1 and 2.pptx
Session 1 and 2.pptxSession 1 and 2.pptx
Session 1 and 2.pptxAkshitMGoel
 
Introduction to basic statistics
Introduction to basic statisticsIntroduction to basic statistics
Introduction to basic statisticsIBM
 
wk1a-basicstats (2).ppt
wk1a-basicstats (2).pptwk1a-basicstats (2).ppt
wk1a-basicstats (2).pptssuser0be977
 
wk1a-basicstats.ppt
wk1a-basicstats.pptwk1a-basicstats.ppt
wk1a-basicstats.pptssuser0be977
 
Simple semantics in topic detection and tracking
Simple semantics in topic detection and trackingSimple semantics in topic detection and tracking
Simple semantics in topic detection and trackingGeorge Ang
 
SQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and StatisticsSQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and StatisticsJen Stirrup
 

Ähnlich wie Random Manhattan Indexing (20)

Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Deep Learning Bangalore meet up
Deep Learning Bangalore meet up
 
DLBLR talk
DLBLR talkDLBLR talk
DLBLR talk
 
Visual Techniques
Visual TechniquesVisual Techniques
Visual Techniques
 
Chapter 4 IR Models.pdf
Chapter 4 IR Models.pdfChapter 4 IR Models.pdf
Chapter 4 IR Models.pdf
 
Week_3_Lecture.pdf
Week_3_Lecture.pdfWeek_3_Lecture.pdf
Week_3_Lecture.pdf
 
Text features
Text featuresText features
Text features
 
SVD.ppt
SVD.pptSVD.ppt
SVD.ppt
 
Statistics in real life engineering
Statistics in real life engineeringStatistics in real life engineering
Statistics in real life engineering
 
The wise doc_trans presentation
The wise doc_trans presentationThe wise doc_trans presentation
The wise doc_trans presentation
 
Week_2_Lecture.pdf
Week_2_Lecture.pdfWeek_2_Lecture.pdf
Week_2_Lecture.pdf
 
LinearAlgebra_2016updatedFromwiki.ppt
LinearAlgebra_2016updatedFromwiki.pptLinearAlgebra_2016updatedFromwiki.ppt
LinearAlgebra_2016updatedFromwiki.ppt
 
LinearAlgebra_2016updatedFromwiki.ppt
LinearAlgebra_2016updatedFromwiki.pptLinearAlgebra_2016updatedFromwiki.ppt
LinearAlgebra_2016updatedFromwiki.ppt
 
Sketch algorithms
Sketch algorithmsSketch algorithms
Sketch algorithms
 
Session 1 and 2.pptx
Session 1 and 2.pptxSession 1 and 2.pptx
Session 1 and 2.pptx
 
Introduction to basic statistics
Introduction to basic statisticsIntroduction to basic statistics
Introduction to basic statistics
 
wk1a-basicstats (2).ppt
wk1a-basicstats (2).pptwk1a-basicstats (2).ppt
wk1a-basicstats (2).ppt
 
wk1a-basicstats.ppt
wk1a-basicstats.pptwk1a-basicstats.ppt
wk1a-basicstats.ppt
 
Time series.ppt
Time series.pptTime series.ppt
Time series.ppt
 
Simple semantics in topic detection and tracking
Simple semantics in topic detection and trackingSimple semantics in topic detection and tracking
Simple semantics in topic detection and tracking
 
SQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and StatisticsSQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and Statistics
 

Mehr von net2-project

Vector spaces for information extraction - Random Projection Example
Vector spaces for information extraction - Random Projection ExampleVector spaces for information extraction - Random Projection Example
Vector spaces for information extraction - Random Projection Examplenet2-project
 
Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...
Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...
Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...net2-project
 
Tailoring Temporal Description Logics for Reasoning over Temporal Conceptual ...
Tailoring Temporal Description Logics for Reasoning over Temporal Conceptual ...Tailoring Temporal Description Logics for Reasoning over Temporal Conceptual ...
Tailoring Temporal Description Logics for Reasoning over Temporal Conceptual ...net2-project
 
Managing Social Communities
Managing Social CommunitiesManaging Social Communities
Managing Social Communitiesnet2-project
 
Data Exchange over RDF
Data Exchange over RDFData Exchange over RDF
Data Exchange over RDFnet2-project
 
Exchanging More than Complete Data
Exchanging More than Complete DataExchanging More than Complete Data
Exchanging More than Complete Datanet2-project
 
Exchanging More than Complete Data
Exchanging More than Complete DataExchanging More than Complete Data
Exchanging More than Complete Datanet2-project
 
Answer-set programming
Answer-set programmingAnswer-set programming
Answer-set programmingnet2-project
 
Evolving web, evolving search
Evolving web, evolving searchEvolving web, evolving search
Evolving web, evolving searchnet2-project
 
SPARQL1.1 Tutorial, given in UChile by Axel Polleres (DERI)
SPARQL1.1 Tutorial, given in UChile by Axel Polleres (DERI)SPARQL1.1 Tutorial, given in UChile by Axel Polleres (DERI)
SPARQL1.1 Tutorial, given in UChile by Axel Polleres (DERI)net2-project
 

Mehr von net2-project (10)

Vector spaces for information extraction - Random Projection Example
Vector spaces for information extraction - Random Projection ExampleVector spaces for information extraction - Random Projection Example
Vector spaces for information extraction - Random Projection Example
 
Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...
Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...
Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy f...
 
Tailoring Temporal Description Logics for Reasoning over Temporal Conceptual ...
Tailoring Temporal Description Logics for Reasoning over Temporal Conceptual ...Tailoring Temporal Description Logics for Reasoning over Temporal Conceptual ...
Tailoring Temporal Description Logics for Reasoning over Temporal Conceptual ...
 
Managing Social Communities
Managing Social CommunitiesManaging Social Communities
Managing Social Communities
 
Data Exchange over RDF
Data Exchange over RDFData Exchange over RDF
Data Exchange over RDF
 
Exchanging More than Complete Data
Exchanging More than Complete DataExchanging More than Complete Data
Exchanging More than Complete Data
 
Exchanging More than Complete Data
Exchanging More than Complete DataExchanging More than Complete Data
Exchanging More than Complete Data
 
Answer-set programming
Answer-set programmingAnswer-set programming
Answer-set programming
 
Evolving web, evolving search
Evolving web, evolving searchEvolving web, evolving search
Evolving web, evolving search
 
SPARQL1.1 Tutorial, given in UChile by Axel Polleres (DERI)
SPARQL1.1 Tutorial, given in UChile by Axel Polleres (DERI)SPARQL1.1 Tutorial, given in UChile by Axel Polleres (DERI)
SPARQL1.1 Tutorial, given in UChile by Axel Polleres (DERI)
 

Kürzlich hochgeladen

MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 

Kürzlich hochgeladen (20)

MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 

Random Manhattan Indexing

  • 1. Random Manhattan Indexing Behrang Q. Zadeh and Siegfried Handschuh National University of Ireland, Galway, Ireland University of Passau, Lower Bavaria, Germany
  • 2. Processing Natural Language Text • For Computers, natural language text is simply a sequence of bits and bytes.
  • 3. Processing Natural Language Text • For Computers, natural language text is simply a sequence of bits and bytes. Text Preprocessing Computable Model
  • 4. Vector Space Model (VSM) • Vector spaces are one of the models employed for text processing. • A Mathematical Structure ۦܸ, Թ, ൅, ·ۧ that satisfy certain axioms. • Text elements are converted to real-valued vectors. • Each dimension of the vector represents some information about text elements.
  • 5. VSMs: an Example • In a text-based information retrieval task: • Each dimension of the VSM represents an index term t. • Documents are represented by vectors. Salton et. al. (1975)
  • 6. VSMs: an Example Salton et. al. (1975) t1 d2 d1 t2 t1 t2 4 1 4 2 d1 d2
  • 7. VSMs: an Example • In a text-based information retrieval task: • Each dimension of the VSM represents an index term t. • Documents are represented by vectors. • Queries are treated as pseudo-documents. Salton et. al. (1975)
  • 8. VSMs: an Example t1 d2 d1 t2 t1 t2 4 1 4 2 d1 d2 q q 3 1 Salton et. al. (1975)
  • 9. VSMs: an Example • In a text-based information retrieval task: • Each dimension of the VSM represents an index term t. • Documents are represented by vectors. • Queries are treated as pseudo-documents. • Using a norm structure, a notion of distance is defined and used to assess the similarity between vectors, i.e. documents and queries. Salton et. al. (1975)
  • 10. VSMs: an Example The L2 or Euclidean norm, 2 i Salton et. al. (1975) t1 d2 d1 t2 ଶ ௜ t1 t2 4 1 4 2 d1 d2 q q 3 1 ݀݅ݏݐ2ሺ݀1, ݍሻ ൌ ݀1െݍ 2ൌ ሺ4 െ 3ሻଶ൅ሺ4 െ 1ሻଶ ൌ 10 ݀݅ݏݐ2ሺ݀2, ݍሻ ൌ ݀2െݍ 2ൌ ሺ1 െ 3ሻଶ൅ሺ2 െ 1ሻଶ ൌ 5
  • 11. VSMs: an Example The L1 norm, 1 ௜ i t1 d2 d1 t1 t2 ݀݅ݏݐ1ሺ݀1, ݍሻ ൌ ݀1െݍ 1ൌ 4 െ 3 ൅ 4 െ 1 ൌ 4 ݀݅ݏݐ1ሺ݀2, ݍሻ ൌ ݀2െݍ 1ൌ 1 െ 3 ൅ 2 െ 1 ൌ 3 t2 4 1 4 2 d1 d2 q q 3 1
  • 12. The dimensionality barrier • Due to the Zipfian distribution of index terms, the number of index terms escalates when new documents are added. • Adding new index terms requires adding additional dimensions to the vector space. index terms frequency
  • 13. The dimensionality barrier • Due to the Zipfian distribution of index terms, the number of index terms escalates when new documents are added. • Adding new index terms requires adding additional dimensions to the vector space. • Therefore, VSMs are extremely high-dimensional and sparse: “THE CURSE OF DIMENSIONALITY”
  • 16. Overcoming the dimensionality barrier • Dimensionality Reduction Techniques are employed to resolve the problem.
  • 17. Overcoming the dimensionality barrier • Dimensionality Reduction Techniques are employed to resolve the problem. Dimensionality Reduction Selection Process (based on a heuristic) Transformation-based methods
  • 18. Overcoming the dimensionality barrier • Dimensionality Reduction Techniques are employed to resolve the problem. Dimensionality Reduction Selection Process (based on a heuristic) Transformation-based methods Matrix Factorization Random Projections
  • 19. Overcoming the dimensionality barrier • Dimensionality Reduction Techniques are employed to resolve the problem. Dimensionality Reduction Selection Process (based on a heuristic) Transformation-based methods Matrix Factorization Random Projections Truncated Singular Value Decomposition (Latent Semantic Analysis) Random Indexing
  • 20. Limitations of Matrix Factorization • Computation of the Singular Value Decomposition (SVD) is process-intensive, i.e. O(mn²). • Every time the VSM is updated, the SVD must be recomputed: ˟ Not suitable for big-text data analytics. ˟ Not suitable for frequently updated text data, e.g. blog or Twitter analysis.
  • 21. Limitations of Matrix Factorization • Computation of the Singular Value Decomposition (SVD) is process-intensive, i.e. O(mn²). • Every time the VSM is updated, the SVD must be recomputed: ˟ Not suitable for big-text data analytics. ˟ Not suitable for frequently updated text data, e.g. blog or Twitter analysis. Solution: Random projection methods skip the computation of transformations (eigenvectors).
  • 22. Random Projections: Random Indexing • The VSM dimensionality is decided independently of the number of index terms, i.e. the dimensionality of the VSM is fixed. • Algorithm: • Assign each index term to a randomly generated “index vector”. • Most elements of index vectors are 0 and only a few are +1 and −1, e.g. rt1 = (0, 0, 1, 0, −1), rt2 = (1, 0, −1, 0, 0), etc. • Assign each document to an empty “context vector”. • Construct the VSM incrementally by accumulating index vectors into context vectors.
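The algorithm above fits in a few lines of Python. A toy sketch (the dimensionality m and the number of non-zero entries are illustrative parameters, not the values used in the paper):

```python
import random

def index_vector(m, nonzeros):
    """Sparse ternary index vector: mostly 0, with a few +1/-1 entries
    placed at random positions (half +1, half -1)."""
    v = [0] * m
    for k, pos in enumerate(random.sample(range(m), nonzeros)):
        v[pos] = 1 if k < nonzeros // 2 else -1
    return v

def build_vsm(docs, m=5, nonzeros=2):
    """Incrementally accumulate index vectors into per-document context vectors."""
    index_vectors = {}  # each index term gets one fixed random index vector
    context = {}
    for doc_id, terms in docs.items():
        ctx = [0] * m  # empty context vector
        for term in terms:
            iv = index_vectors.setdefault(term, index_vector(m, nonzeros))
            ctx = [c + x for c, x in zip(ctx, iv)]
        context[doc_id] = ctx
    return context
```

New documents and new index terms can be added at any time without changing m, which is the point of the method.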
  • 23. Random Indexing (Example) Munich is the capital and largest city of the German state of Bavaria. Document 1
  • 24. Random Indexing (Example) Munich is the capital and largest city of the German state of Bavaria. Munich Capital German Bavaria Largest City State Document 1 Index Terms
  • 25. Random Indexing (Example) Munich is the capital and largest city of the German state of Bavaria. Munich: (1,‐1,0,0,0) Capital: (0,0,1,0,‐1) German: (1,0,0,‐1,0) Bavaria: (0,0,0,1,‐1) Largest: (0,1,0,‐1,0) City: (1,0,‐1,0,0) State: (1,0,0,0,‐1) Document 1 Index Terms D1 = (0, 0, 0, 0, 0)
  • 26. Random Indexing (Example) Munich is the capital and largest city of the German state of Bavaria. Munich: (1,‐1,0,0,0) Capital: (0,0,1,0,‐1) German: (1,0,0,‐1,0) Bavaria: (0,0,0,1,‐1) Largest: (0,1,0,‐1,0) City: (1,0,‐1,0,0) State: (1,0,0,0,‐1) Document 1 Index Terms D1 = Munich + Capital + German + Bavaria + Largest + City + State = (4, 0, 0, ‐1, ‐3)
  • 27. Random Indexing (Example) Munich is the capital and largest city of the German state of Bavaria. Munich: (1,‐1,0,0,0) Capital: (0,0,1,0,‐1) German: (1,0,0,‐1,0) Bavaria: (0,0,0,1,‐1) Largest: (0,1,0,‐1,0) City: (1,0,‐1,0,0) State: (1,0,0,0,‐1) Document 1 Index Terms D1 = (4, 0, 0, ‐1, ‐3) Munich is a beautiful, historical city. Document 2
  • 28. Random Indexing (Example) Munich is the capital and largest city of the German state of Bavaria. Munich: (1,‐1,0,0,0) Capital: (0,0,1,0,‐1) German: (1,0,0,‐1,0) Bavaria: (0,0,0,1,‐1) Largest: (0,1,0,‐1,0) City: (1,0,‐1,0,0) State: (1,0,0,0,‐1) Beautiful: (0,0,1,‐1,0) Historical: (0,1,0,0,‐1) Document 1 Index Terms D1 = (4, 0, 0, ‐1, ‐3) Munich is a beautiful, historical city. Document 2 D2 = (0, 0, 0, 0, 0)
  • 29. Random Indexing (Example) Munich is the capital and largest city of the German state of Bavaria. Munich: (1,‐1,0,0,0) Capital: (0,0,1,0,‐1) German: (1,0,0,‐1,0) Bavaria: (0,0,0,1,‐1) Largest: (0,1,0,‐1,0) City: (1,0,‐1,0,0) State: (1,0,0,0,‐1) Beautiful: (0,0,1,‐1,0) Historical: (0,1,0,0,‐1) Document 1 Index Terms D1 = (4, 0, 0, ‐1, ‐3) Munich is a beautiful, historical city. Document 2 D2 = (2,0,0, ‐1,‐1)
  • 30. Random Indexing (Example) Munich is the capital and largest city of the German state of Bavaria. Munich: (1,‐1,0,0,0) Capital: (0,0,1,0,‐1) German: (1,0,0,‐1,0) Bavaria: (0,0,0,1,‐1) Largest: (0,1,0,‐1,0) City: (1,0,‐1,0,0) State: (1,0,0,0,‐1) Beautiful: (0,0,1,‐1,0) Historical: (0,1,0,0,‐1) .... Document 1 Index Terms D1 = (4, 0, 0, ‐1, ‐3) Munich is a beautiful, historical city. Document 2 D2 = (2,0,0, ‐1,‐1) Document Document Document Document FIXED DIMENSION
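The running example can be reproduced directly: summing the slides' fixed index vectors yields the same context vectors.

```python
# Index vectors exactly as in the slides' running example.
index_vectors = {
    "Munich":     (1, -1, 0, 0, 0),
    "Capital":    (0, 0, 1, 0, -1),
    "German":     (1, 0, 0, -1, 0),
    "Bavaria":    (0, 0, 0, 1, -1),
    "Largest":    (0, 1, 0, -1, 0),
    "City":       (1, 0, -1, 0, 0),
    "State":      (1, 0, 0, 0, -1),
    "Beautiful":  (0, 0, 1, -1, 0),
    "Historical": (0, 1, 0, 0, -1),
}

def context_vector(terms):
    """Sum the index vectors of a document's index terms."""
    ctx = [0] * 5
    for term in terms:
        ctx = [c + x for c, x in zip(ctx, index_vectors[term])]
    return ctx

D1 = context_vector(["Munich", "Capital", "Largest", "City", "German", "State", "Bavaria"])
D2 = context_vector(["Munich", "Beautiful", "Historical", "City"])
print(D1)  # [4, 0, 0, -1, -3]
print(D2)  # [2, 0, 0, -1, -1]
```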
  • 31. Limitation of Random Indexing • Random Indexing employs a Gaussian random projection. • A Gaussian random projection can only be used for the estimation of Euclidean distances. [plots: d1, d2 and query q over t1 and t2, two panels]
  • 32. Random Manhattan Indexing • In some applications we may want to estimate the L1 (Manhattan) distance between vectors. • Random Manhattan Indexing (RMI) can be used to estimate Manhattan distances. [plots: d1, d2 and query q over t1 and t2, two panels]
  • 33. The RMI method • RMI employs Cauchy random projections. • RMI is an incremental technique (similar to RI): • Each index term is assigned to an index vector. • Index vectors are high-dimensional and randomly generated with the following distribution: rij = 1/U1 with probability s/2, 0 with probability 1 − s, −1/U2 with probability s/2, where U1 and U2 are independent uniform random variables in (0, 1). Indyk (2000), Li (2008)
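Sampling an entry under this distribution can be sketched as follows (using 1 − U so that the uniform variate is strictly positive and the division is safe):

```python
import random

def rmi_entry(s):
    """One RMI index-vector entry: 1/U1 with probability s/2,
    -1/U2 with probability s/2, and 0 with probability 1 - s,
    where U1, U2 are independent Uniform(0, 1) variates."""
    r = random.random()
    if r < s / 2:
        return 1.0 / (1.0 - random.random())   # 1 - U is uniform in (0, 1]
    if r < s:
        return -1.0 / (1.0 - random.random())
    return 0.0

def rmi_index_vector(m, s):
    """A sparse random index vector of dimension m."""
    return [rmi_entry(s) for _ in range(m)]
```

With small s, almost all entries are zero, so index vectors stay cheap to store; the non-zero entries are heavy-tailed (Cauchy-like), which is what makes L1 estimation possible.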
  • 34. The RMI method • Assign documents to context vectors. • Create context vectors incrementally. • Estimate the Manhattan distance between vectors using the following (non-linear) equation: dist1(u, v) ≈ (∏i=1..m |ui − vi|)^(1/m), where m is the dimension of the RMI-constructed VSM
  • 35. RMI (Example) Munich is the capital and largest city of the German state of Bavaria. Munich: (0,0,1.0,‐2.1,0) Capital: (0,1.49,0,‐1.2,0) German: (0,0,0,1.9,‐1.8) Bavaria: (1.3,0,0,0,‐3.9) Largest: (0,0,1.6,‐2.1,0) City: (0,2.5,0,0,‐2.3) State: (0,1.9,‐9.3,0,0) Document 1 Index Terms D1 = (1.3,5.89,‐6.7,‐3.5,‐8)
  • 36. RMI (Example) Munich is the capital and largest city of the German state of Bavaria. Munich: (0,0,1.0,‐2.1,0) Capital: (0,1.49,0,‐1.2,0) German: (0,0,0,1.9,‐1.8) Bavaria: (1.3,0,0,0,‐3.9) Largest: (0,0,1.6,‐2.1,0) City: (0,2.5,0,0,‐2.3) State: (0,1.9,‐9.3,0,0) Document 1 Index Terms D1 = (1.3,5.89,‐6.7,‐3.5,‐8) Use the non-linear estimator to assess similarities: dist1(u, v) ≈ (∏i=1..m |ui − vi|)^(1/m)
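A sketch of the estimator, assuming the geometric-mean form of Li (2008): the product of absolute coordinate differences, raised to the power 1/m. It is computed in log space for numerical stability and assumes no coordinate of the difference is exactly zero.

```python
import math

def l1_estimate(u, v):
    """Non-linear (geometric-mean) estimate of the original L1 distance
    from two m-dimensional projected context vectors:
        (prod_i |u_i - v_i|) ** (1/m).
    Works in log space to avoid overflow/underflow; assumes u_i != v_i."""
    m = len(u)
    return math.exp(sum(math.log(abs(a - b)) for a, b in zip(u, v)) / m)
```

For example, `l1_estimate(D1, Q)` for a query's context vector Q would approximate the Manhattan distance in the original high-dimensional space.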
  • 37. RMI’s parameters • The dimension of the RMI-constructed VSM. • Number of non-zero elements in index vectors.
  • 38. RMI’s parameters • The dimension of the RMI-constructed VSM: • It is independent of the number of index terms. • It is decided by the desired error probability and the maximum expected distortion in distances. • For a fixed set of documents, a larger dimension results in less distortion. • According to our experiments, m = 400 is suitable for most applications.
  • 39. RMI’s parameters • The dimension of the RMI-constructed VSM. • Number of non-zero elements in index vectors: • It is decided by the number of index terms and the sparsity of the VSM at its original high dimension. • We suggest s on the order of 1/(βn), where β is the sparseness of the original high-dimensional VSM and n is the number of index terms. • β is often considered to be around 0.0001 to 0.01.
  • 40. Experimental Results • We designed an experiment that shows the ability of RMI to preserve L1 distances: • A VSM is first constructed from the INEX-Wikipedia 2009 collection at its original high dimension (a dimensionality of 2.53 million). • We chose a list of 1000 random articles from the corpus. • The L1 distance from each document to the rest of the documents in the list is calculated. • Documents are sorted by the calculated L1 distances. • The result is 1000 lists of 999 sorted documents.
  • 41. Experimental Results [diagram: each of the 1000 random documents serves as a reference document, paired with a sorted list of the other 999 documents — 1000 lists in total]
  • 42. Experimental Results • Construct the VSM using the RMI method and repeat the sorting process: • Use different dimensionalities. • Use different numbers of non-zero elements. • Compare the sorted lists obtained from the VSM constructed at the original high dimension and the RMI-constructed VSM at reduced dimensionality. • Spearman's rank correlation (ρ) is used for comparison. • EXPECTATION: similar sorted lists from both VSMs, i.e. ρ = 1.
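Spearman's ρ between two orderings of the same documents can be computed with the standard tie-free formula; a minimal sketch:

```python
def spearman_rho(order_a, order_b):
    """Spearman's rank correlation between two orderings of the same items
    (no ties): rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the rank difference of item i between the two orderings."""
    rank_a = {doc: r for r, doc in enumerate(order_a)}
    rank_b = {doc: r for r, doc in enumerate(order_b)}
    n = len(order_a)
    d2 = sum((rank_a[doc] - rank_b[doc]) ** 2 for doc in order_a)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Identical orderings give ρ = 1 (the expectation above); a fully reversed ordering gives ρ = −1.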
  • 43. Experimental Results • Observed correlation: red triangles: dimension = 200; blue diamonds: dimension = 400; black squares: dimension = 800. For the baseline, we report ρ = 0.1375 for lists of documents sorted randomly.
  • 44. Experimental Results • Experiments support the earlier claim that RI does not preserve the L1 distance: [plots: RMI vs. RI with the standard L1 distance vs. RI with the non-linear estimator]
  • 45. Discussion • We proposed an incremental, scalable method for the construction of L1-normed VSMs. • We showed the ability of the method to preserve the distances between documents. • Is it possible to avoid floating-point calculations? • What is the performance of RMI within the context of information retrieval benchmarks?
  • 46. THANKS FOR YOUR ATTENTION! Random Manhattan Indexing Behrang Q. Zadeh and Siegfried Handschuh National University of Ireland, Galway, Ireland University of Passau, Lower Bavaria, Germany http://atmykitchen.info/