Automated Ranking of
          Database Query Results


Sanjay Agrawal, Surajit Chaudhuri, Gautam Das,
Aristides Gionis


                                      Presented By: Upa Gupta
Contents
   Introduction
   IDF Similarity
   QF Similarity
   Breaking Ties
   Implementation
       ITA Algorithm
   Conclusion
Introduction
   Databases use a Boolean query model
       E.g. SELECT * WHERE MFR_Country = “Germany”
        AND Type = “Sports” AND MFR = “Volkswagen”
   Problems with the Boolean model
       Empty answers
            An overly selective query returns an empty result set
       Many answers
            An overly general query returns too many results
Introduction
   Ranking database query results using IR
    techniques.
       Apply the TF-IDF concept to databases, based on
        the frequency of attribute values.
       TF-IDF must be extended to numeric domains
            The paper's IDF Similarity covers this
       Collect a WORKLOAD of past queries and use it for ranking.
            QF Similarity leverages workload information
Introduction

   The Many Answers problem is solved using Top-K
    query processing

   The Index-based Threshold Algorithm (ITA) is
    developed to exploit IDF/QF Similarity.
IDF Similarity
   What is the TF-IDF technique?
       Given a set of documents and a query,
        documents are ranked by the TF and IDF of
        the query words they contain.


   Adapting the IDF concept to a database
    containing only categorical attributes
    t = <t1, …, tm>: values of attribute A
    n: number of tuples in the database
IDF Similarity
   For every value t of attribute A:
       Frequency F(t) is the number of tuples with
        A = t
       IDF is calculated as:
                       IDF(t) = log(n/F(t))
       For a pair of values u and v in the domain of A:
               S(u,v) = IDF(u) if u = v, otherwise 0
       For tuple T and query Q over all the attributes
        (A1…Am), where tk and qk are the values of T and Q
        for attribute Ak:

               $$SIM(T,Q) = \sum_{k=1}^{m} S_k(t_k, q_k)$$
IDF Similarity
   Example:
    CAR_ID   MODEL      MFR           MFR_Country   Type
    1        SLR        Mercedes      Germany       Sports
    2        A6         Audi          Germany       Executive
    3        R8         Audi          Germany       Sports
    4        Gallardo   Lamborghini   Italy         Sports



        Query Q: SELECT * WHERE MFR_Country =
         “Germany” AND Type = “Sports” AND MFR =
         “Volkswagen”
IDF Similarity
n = 4
F(MFR_Country = Germany) = 3
IDF(MFR_Country = Germany) = log(n/F(MFR_Country = Germany))
                            = log(4/3) = 0.287   (natural logarithm)
Similarly,
   IDF(MFR_Country = Italy) = 1.38
   IDF(MFR = Audi) = 0.69
   IDF(MFR = Lamborghini) = 1.38
   IDF(MFR = Mercedes) = 1.38
   IDF(Type = Sports) = 0.287
   IDF(Type = Executive) = 1.38

Similarity of the 1st tuple with Q = SIM(T,Q)
   = S(Germany, Germany) + S(Sports, Sports) + S(Mercedes, Volkswagen)
   = IDF(MFR_Country = Germany) + IDF(Type = Sports) + 0
   = 0.287 + 0.287 + 0 = 0.574
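
The arithmetic on this slide can be checked with a minimal Python sketch. The names `cars`, `idf_table`, and `idf_sim` are illustrative assumptions, not from the paper, and the natural logarithm is assumed because it reproduces the slide's values.

```python
import math
from collections import Counter

# The four example tuples from the slide (CAR_ID and MODEL omitted).
cars = [
    {"MFR": "Mercedes", "MFR_Country": "Germany", "Type": "Sports"},
    {"MFR": "Audi", "MFR_Country": "Germany", "Type": "Executive"},
    {"MFR": "Audi", "MFR_Country": "Germany", "Type": "Sports"},
    {"MFR": "Lamborghini", "MFR_Country": "Italy", "Type": "Sports"},
]

def idf_table(rows, attr):
    """IDF(t) = log(n / F(t)) for every value t of one attribute."""
    n = len(rows)
    freq = Counter(row[attr] for row in rows)
    return {value: math.log(n / f) for value, f in freq.items()}

def idf_sim(row, query, idf):
    """SIM(T, Q): sum of IDF(q_k) over the attributes where t_k equals q_k."""
    return sum(idf[a].get(q, 0.0) for a, q in query.items() if row[a] == q)

idf = {a: idf_table(cars, a) for a in ("MFR", "MFR_Country", "Type")}
query = {"MFR_Country": "Germany", "Type": "Sports", "MFR": "Volkswagen"}
scores = [idf_sim(row, query, idf) for row in cars]
```

With these inputs, tuples 1 and 3 receive the same score (both match on country and type), which is the tie discussed on the later "Breaking Ties" slides.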
IDF Similarity
   Consider a numeric attribute in the DB, e.g. PRICE
   SIMPLE SOLUTION: discretize the data into ranges
   Consider two ranges: (0, 50) and (51, 100)
       Values 49 and 52 are then considered completely dissimilar.
   Frequency of a numeric value t of an attribute is defined as
    the sum of contributions to t from every value ti in the database:

         $$F(t) = \sum_{i=1}^{n} e^{-\frac{1}{2}\left(\frac{t_i - t}{h}\right)^2}$$

         IDF(t) = log(n/F(t))                h = bandwidth parameter
         S(t,q) = density at t of a Gaussian distribution centered at q,
          scaled by IDF(q):

         $$S(t,q) = e^{-\frac{1}{2}\left(\frac{t - q}{h}\right)^2} \, IDF(q)$$
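
A small sketch of the numeric case, assuming the Gaussian kernel with a negative exponent as described above. The `prices` list and the bandwidth `h = 5.0` are made-up values for illustration; the slide does not say how h is chosen.

```python
import math

def kernel(x, center, h):
    """Gaussian contribution of x to a point at `center` with bandwidth h."""
    return math.exp(-0.5 * ((x - center) / h) ** 2)

def numeric_idf(t, values, h):
    """IDF(t) = log(n / F(t)), with F(t) the sum of kernel contributions."""
    f = sum(kernel(ti, t, h) for ti in values)
    return math.log(len(values) / f)

def numeric_sim(t, q, values, h):
    """S(t, q) = kernel density at t centered at q, scaled by IDF(q)."""
    return kernel(t, q, h) * numeric_idf(q, values, h)

prices = [49, 52, 55, 90]  # hypothetical PRICE column
h = 5.0                    # hypothetical bandwidth
s_near = numeric_sim(49, 52, prices, h)  # 49 is close to the query value 52
s_far = numeric_sim(90, 52, prices, h)   # 90 is far from it
```

Unlike the discretized ranges, 49 and 52 are now treated as similar, while 90 scores close to zero.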
IDF Similarity
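   Consider the following query with an IN clause:
   SELECT * WHERE MFR_Country IN (“Germany”, “Italy”,
    “Japan”)
   Each attribute contributes its best match over the set Qk
    of values the query allows for it:

   $$SIM(T,Q) = \sum_{k=1}^{m} \max_{q \in Q_k} S_k(t_k, q)$$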
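
The maximum over the IN-clause values can be sketched as follows. The function `in_sim` and the helper `s` are hypothetical names; the IDF values are the ones computed for the example table, and a value that never occurs in the data (such as Japan) is assumed to contribute 0.

```python
import math

# IDF values for MFR_Country in the four-tuple example (n = 4).
idf = {"MFR_Country": {"Germany": math.log(4 / 3), "Italy": math.log(4)}}

def s(attr, t, q):
    """Exact-match IDF similarity; values absent from the data score 0 (assumption)."""
    return idf[attr].get(q, 0.0) if t == q else 0.0

def in_sim(row, query):
    """SIM(T, Q) where each attribute maps to a set of allowed values."""
    return sum(max(s(a, row[a], q) for q in allowed)
               for a, allowed in query.items())

query = {"MFR_Country": {"Germany", "Italy", "Japan"}}
italian = in_sim({"MFR_Country": "Italy"}, query)
german = in_sim({"MFR_Country": "Germany"}, query)
```

Here the Italian car scores higher than the German one, because Italy is the rarer value among the allowed ones.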
QF Similarity
   Problems with IDF:
       In a realtor database, more homes were built in
        recent years such as 2007 and 2008 than in
        1980 and 1981. Thus recent years
        have a small IDF. Yet newer homes are in higher
        demand.

       In a bookstore DB, demand for an author depends
        on factors other than the number of books the author has written
QF Similarity
   WORKLOAD: past queries
   The importance of an attribute value is determined
    by how frequently it occurs in the workload.
   As in the example above, queries requesting
    homes built in 2010 are more frequent than
    queries for the year 1981
QF Similarity
   For categorical data
       RQF(q) = raw frequency of occurrence of value q of
        attribute A in the query strings of the workload

       RQFMax = raw frequency of the most frequently occurring
        value in the workload

       Query frequency QF(q) = RQF(q)/RQFMax

       S(t, q) = QF(q) if q = t, otherwise 0
   QF resembles TF
QF Similarity
   Consider a workload containing the following
    values of attribute TYPE:

    {Sports, Executive, Luxury, Sports, Sports, Executive}

    QF(Executive) = RQF(Executive)/RQFMax
                  = 2/3        (RQFMax = RQF(Sports) = 3)
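
The same calculation as a short sketch; `query_frequency` is an illustrative name, and the workload is reduced to the list of TYPE values shown on the slide.

```python
from collections import Counter

def query_frequency(workload_values):
    """QF(q) = RQF(q) / RQFMax for every value seen in the workload."""
    rqf = Counter(workload_values)
    rqf_max = max(rqf.values())
    return {value: count / rqf_max for value, count in rqf.items()}

qf_type = query_frequency(
    ["Sports", "Executive", "Luxury", "Sports", "Sports", "Executive"]
)
```

The most frequently requested value (Sports) always gets QF = 1, and every other value is scaled relative to it.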
QF Similarity
   Similarity between pairs of different categorical
    attribute values can also be derived from the workload,
    e.g. to find S(Audi, Mercedes)

   The similarity coefficient between t and q in this case is
    the Jaccard coefficient scaled by the QF factor:
     S(t,q) = J(W(t),W(q)) * QF(q)
       W(t) = subset of queries in workload W in which
        categorical value t occurs in an IN clause
       For t = q, J = 1 and S(q,q) reduces to QF(q)
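
A sketch of the workload-derived similarity. It assumes the Jaccard coefficient is multiplied by QF(q), so that S(q, q) reduces to QF(q) as in the exact-match definition. The four-query `workload` is invented for illustration; each entry is the set of MFR values in one query's IN clause.

```python
from collections import Counter

# Hypothetical workload: the MFR values in each query's IN clause.
workload = [
    {"Audi", "Mercedes"},
    {"Audi", "Mercedes", "BMW"},
    {"Audi"},
    {"Lamborghini"},
]

def queries_with(value):
    """W(t): indices of the workload queries whose IN clause contains t."""
    return {i for i, values in enumerate(workload) if value in values}

def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|, taken as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

rqf = Counter(v for values in workload for v in values)
qf = {v: c / max(rqf.values()) for v, c in rqf.items()}

def workload_sim(t, q):
    """S(t, q) = J(W(t), W(q)) * QF(q)."""
    return jaccard(queries_with(t), queries_with(q)) * qf.get(q, 0.0)

s_audi_mercedes = workload_sim("Audi", "Mercedes")
```

In this toy workload Audi and Mercedes are often requested together, so they come out as similar, while Audi and Lamborghini never co-occur and score 0.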
QF-IDF



   For QF-IDF Similarity
     S(t,q) = QF(q) * IDF(q) when t = q, otherwise 0
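
As a brief illustration, the product can be evaluated with the TYPE values used in the earlier examples: QF from the six-query workload and IDF from the four-tuple table. The function name `qf_idf_sim` is an assumption.

```python
import math

qf_type = {"Sports": 1.0, "Executive": 2 / 3, "Luxury": 1 / 3}
idf_type = {"Sports": math.log(4 / 3), "Executive": math.log(4)}

def qf_idf_sim(t, q, qf, idf):
    """S(t, q) = QF(q) * IDF(q) if t equals q, otherwise 0."""
    if t != q:
        return 0.0
    return qf.get(q, 0.0) * idf.get(q, 0.0)

s_exec = qf_idf_sim("Executive", "Executive", qf_type, idf_type)
s_sports = qf_idf_sim("Sports", "Sports", qf_type, idf_type)
```

Executive is both rarer in the data and reasonably popular in the workload, so a match on Executive counts for more than a match on Sports.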
BREAKING TIES
   If SIM(T1, Q) = SIM(T2, Q)
           Which tuple should be ranked higher?
           QF and IDF partition the database into classes of tied tuples
    CAR_ID     MODEL      MFR           MFR_Country   Type
    1          SLR        Mercedes      Germany       Sports
    2          A6         Audi          Germany       Executive
    3          R8         Audi          Germany       Sports
    4          Gallardo   Lamborghini   Italy         Sports

           Q: SELECT * WHERE Type = “Sports” AND MFR_Country
            = “Germany”
Breaking Ties with QF
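   Determine weights of missing attribute values that
    reflect their “global importance” using the workload.

   $$\mathrm{Global\ Imp} = \sum_{k} \log(QF(t_k))$$
    where tk ranges over the tuple's values for the missing attributes
    (attributes not specified in the query)

   Missing attributes for Q: MFR and MODEL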
Breaking Ties with QF
   Consider a workload with the following values of MFR and
    MODEL
    MFR {Audi, Audi, Lamborghini, Mercedes, Lamborghini, Audi}
    MODEL {R8, A6, Gallardo, SLR, Gallardo, A6}
   QF(SLR) = 1/2 = 0.5        QF(Mercedes) = 1/3 = 0.33
   For tuple 1 (SLR, Mercedes, Germany, Sports):
   Global Imp = log(0.5) + log(0.33)
   Open question: Global Imp takes NEGATIVE values
    (QF ≤ 1, so every log term is ≤ 0)
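
The tie between tuples 1 and 3 can be worked through with a short sketch. The helper names are illustrative, and the natural logarithm is assumed. Note that a value never seen in the workload would have QF = 0 and an undefined logarithm; this sketch does not handle that case.

```python
import math
from collections import Counter

def query_frequency(values):
    """QF(q) = RQF(q) / RQFMax over one attribute's workload values."""
    rqf = Counter(values)
    rqf_max = max(rqf.values())
    return {v: c / rqf_max for v, c in rqf.items()}

qf_mfr = query_frequency(
    ["Audi", "Audi", "Lamborghini", "Mercedes", "Lamborghini", "Audi"])
qf_model = query_frequency(["R8", "A6", "Gallardo", "SLR", "Gallardo", "A6"])

def global_importance(row):
    """Sum of log(QF) over the attributes missing from the query (MFR, MODEL)."""
    return math.log(qf_mfr[row["MFR"]]) + math.log(qf_model[row["MODEL"]])

gi_slr = global_importance({"MODEL": "SLR", "MFR": "Mercedes"})  # tuple 1
gi_r8 = global_importance({"MODEL": "R8", "MFR": "Audi"})        # tuple 3
```

Both values are negative, but the ordering is still meaningful: the R8 is closer to zero because Audi is the most requested manufacturer, so under this scheme tuple 3 would be ranked above tuple 1.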
Breaking Ties with IDF
   Option 1: tuples whose missing attribute values have a large
    IDF (occur infrequently) are ranked higher
       Drawback: cars that are not popular are ranked higher


   Option 2: tuples whose missing attribute values have a small
    IDF are ranked higher
       Drawback: cars with a moonroof are ranked lower, although
        a moonroof is a desirable feature.
Implementation

   Pre-processing component



   Query-processing component
Implementation
   Pre-processing component

       Compute and store a representation of the similarity
        function (QF-IDF, QF, or IDF) in auxiliary database
        tables
Implementation
   Query-processing component
       Job: retrieve the Top-K results from the database

       ITA: combines Fagin’s Threshold Algorithm (TA)
        with the similarity function
            Sorted access: along any attribute Ak, TIDs of tuples
             are retrieved in order from an index.
            Random access: the entire tuple corresponding to a TID
             is retrieved.
ITA Algorithm
   Initialize Top-K Buffer to empty
   Repeat
   For each k = 1 to p
      TID = index of the next tuple, retrieved from ordered
        list Lk
      T = complete tuple retrieved for TID
      Compute the value of the ranking function for T
      If the rank of T is higher than the rank of the lowest-ranking tuple in
        the Top-K Buffer, then update the Top-K Buffer
      If the stopping condition has been reached, then exit
   End For
   Until the indexes of all tuples have been seen.
ITA Algorithm
Stopping Condition
   Hypothetical tuple – takes the current values a1,…, ap
  for A1,…, Ap, corresponding to the index seeks on
  L1,…, Lp, and the values qp+1,…, qm for the remaining
  columns directly from the query.
  Termination – the similarity of the hypothetical tuple
  to the query < the least similarity of any tuple
  in the Top-K buffer.
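
The loop and its stopping condition can be sketched in Python as below. This is a simplified in-memory illustration, not the paper's implementation: the ordered lists are built by sorting rather than read from database indexes, and the names `ita_top_k`, `rows`, `sim`, and `indexed` are assumptions. Attributes without a list contribute their best possible similarity, sim(q, q), to the hypothetical tuple.

```python
import heapq
import math
from collections import Counter

def ita_top_k(rows, query, sim, indexed, k):
    """rows: {tid: {attr: value}}; indexed: query attributes with ordered lists."""
    def score(row):
        return sum(sim(a, row[a], q) for a, q in query.items())

    # Ordered list L_a: TIDs by decreasing similarity of their value to q_a.
    lists = {a: sorted(rows, key=lambda tid, a=a: -sim(a, rows[tid][a], query[a]))
             for a in indexed}
    pos = {a: 0 for a in indexed}
    # Columns without a list take the query value itself (a perfect match).
    rest = sum(sim(a, q, q) for a, q in query.items() if a not in indexed)
    top, seen = [], set()  # top is a min-heap of (score, tid)
    while any(pos[a] < len(lists[a]) for a in indexed):
        for a in indexed:
            if pos[a] == len(lists[a]):
                continue
            tid = lists[a][pos[a]]          # sorted access
            pos[a] += 1
            if tid not in seen:
                seen.add(tid)
                s = score(rows[tid])        # random access to the full tuple
                if len(top) < k:
                    heapq.heappush(top, (s, tid))
                elif s > top[0][0]:
                    heapq.heapreplace(top, (s, tid))
            # Hypothetical tuple: the last value seen in each ordered list.
            bound = rest + sum(
                sim(b, rows[lists[b][pos[b] - 1]][b] if pos[b] else query[b], query[b])
                for b in indexed)
            if len(top) == k and bound < top[0][0]:
                return sorted(top, reverse=True)
    return sorted(top, reverse=True)

rows = {
    1: {"MFR_Country": "Germany", "Type": "Sports"},
    2: {"MFR_Country": "Germany", "Type": "Executive"},
    3: {"MFR_Country": "Germany", "Type": "Sports"},
    4: {"MFR_Country": "Italy", "Type": "Sports"},
}
idf = {a: {v: math.log(len(rows) / f)
           for v, f in Counter(r[a] for r in rows.values()).items()}
       for a in ("MFR_Country", "Type")}
sim = lambda a, t, q: idf[a].get(q, 0.0) if t == q else 0.0
query = {"Type": "Sports", "MFR_Country": "Germany"}
result = ita_top_k(rows, query, sim, ["Type", "MFR_Country"], k=2)
```

For this query the two German sports cars (tuples 1 and 3) end up in the Top-K buffer. The strict `<` in the termination test follows the slide's wording.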
ITA for Numeric columns
   Consider a query with the condition Ak = qk for a
    numeric column Ak.

   Two index scans are performed on Ak.
       The first retrieves TIDs with values ≥ qk in increasing order.
       The second retrieves TIDs with values < qk in decreasing order.

   We then pick TIDs from the merged stream.
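
One way to picture the merged stream is the generator below. It assumes the index is an in-memory list of (value, TID) pairs sorted by value, and that the merge always takes the entry closer to qk, which matches decreasing Gaussian similarity. The name `nearest_first` and the sample index are hypothetical.

```python
import bisect

def nearest_first(index, q):
    """Yield TIDs from a value-sorted index in order of increasing |value - q|."""
    values = [v for v, _ in index]
    hi = bisect.bisect_left(values, q)  # upward scan: first value >= q
    lo = hi - 1                         # downward scan: last value < q
    while lo >= 0 or hi < len(index):
        take_hi = lo < 0 or (
            hi < len(index) and index[hi][0] - q <= q - index[lo][0])
        if take_hi:
            yield index[hi][1]
            hi += 1
        else:
            yield index[lo][1]
            lo -= 1

index = [(10, "a"), (20, "b"), (24, "c"), (31, "d"), (50, "e")]
order = list(nearest_first(index, 25))
```

For qk = 25 the stream starts with the tuple at 24, then 20, then 31, so ITA sees the most similar numeric values first.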
Conclusion
   An automated ranking infrastructure for SQL
    databases.
   Extends TF-IDF based techniques from
    information retrieval to numeric and mixed
    data.
   Implements a ranking function that
    exploits Fagin’s TA
THANK YOU
