Part 1: Non Relational Databases
Part 2: Collaborative Filtering
          Simon Woodman
       [s.j.woodman@ncl.ac.uk]
Outline
•   Part 1: Non-Relational Databases (NoSQL)
     – Trends forcing change
     – NoSQL database types
     – Graph Databases (Neo4J)
     – Demo



•   Part 2: Making Recommendations
     – Background/example
     – Pearson Score
     – User based
     – Item based
Credit: http://ecogreenliving.net/
Trend 1: Data Size
[Figure: Digital information created, captured and replicated worldwide, in exabytes (y-axis 0 to 3000), by year from 2006 to 2012. Source: IDC 2009]
Trend 2: Connectedness
[Figure: Information connectivity rising over time (x-axis 1990 to 2020, spanning web 1.0, web 2.0 and “web 3.0”). In rising order of connectivity: text documents, hypertext, RSS, blogs, wikis, user-generated content, tagging, folksonomies, RDF, ontologies, Giant Global Graph (GGG).]
Source: http://nosql.mypopescu.com/post/342947902/presentation-graphs-neo4j-teh-awesome
Trend 3: Semi-structure
• “The great majority of the data out there is not structured and [there’s]
   no way in the world you can force people to structure it.” [1]


• Trend accelerated by the decentralization of content generation that is
   the hallmark of the age of participation (“web 2.0”)


• Evolving applications

    [1] Stefano Mazzocchi, Apache and MIT
Types of Databases

• Relational

• Key-Value Stores

• BigTable Clones

• Document Databases

• Graph Databases
Relational Databases
• Data Model: Normalised, multi-table with referential integrity
• Good for very static data
   – Payroll, accounts
   – Well understood
   – Not evolving
• SQL Queries (joins etc.)
• Good Tooling


• Examples: Oracle, MySQL, Postgres, …
Key-Value Stores
•       Data Model: (global) collection of K-V pairs
•       Massive Distributed HashMap
•       Partitioning and Replication usually ring based
           –      Load Balancer round robins the requests
           –      Hash(key) = partition
           –      Partition map maintains partition -> node mapping
           –      Quorum System (N, R, W), usually (3,2,2)


•       Scales Well (1000B rows)
•       How many apps need that?
           –      Google, Amazon, Facebook etc.
           –      <10 in the world

•       Examples: Dynomite, Voldemort, Tokyo

[http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf]
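The partitioning and quorum bullets above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how Dynomite, Voldemort or Dynamo are actually implemented: the partition count, the node names and the helper names (`partition_for`, `replicas_for`, `quorums_overlap`) are all hypothetical.

```python
import hashlib

NUM_PARTITIONS = 8  # assumed value, for illustration only
NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names

# Partition map: partition -> node, laid out around a ring.
PARTITION_MAP = {p: NODES[p % len(NODES)] for p in range(NUM_PARTITIONS)}


def partition_for(key):
    """Hash(key) = partition, using a stable hash so every client agrees."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def replicas_for(key, n=3):
    """Walk the ring from the key's partition and collect N distinct nodes."""
    start = partition_for(key)
    chosen = []
    for step in range(NUM_PARTITIONS):
        node = PARTITION_MAP[(start + step) % NUM_PARTITIONS]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen


def quorums_overlap(n, r, w):
    """If R + W > N, every read quorum shares a node with every write quorum."""
    return r + w > n
```

With the usual (N, R, W) = (3, 2, 2), `quorums_overlap(3, 2, 2)` holds, so a read of 2 replicas always touches at least one replica that took part in the latest successful write; (3, 1, 1) would not give that guarantee.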
BigTable Clones
•        Data model: single table, column families
•        Distributed storage of semi-structured data (column families)
•        Scale: “Petabyte range”
•        Supports MapReduce well




•        Examples: HBase, Hypertable


[http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf]
Document Databases
•   Inspired by Lotus Notes
•   Data model: collections of K-V collections
•   Document:
      –   Collection of K-V pairs (often JSON)
      –   Often versioned

•   Scales: Dependent on implementation


•   Can (potentially) store an entire 3-tier web app in the database
    (probably NOT the best architecture!)




•   Example: CouchDB, MongoDB
Graph Databases
•   Inspired by Euler & graph theory
•   Data model: nodes, relationships, K-V on both
•   Scale: 10B entities
•   SPARQL Queries


•   No O/R Impedance mismatch
•   Semi Structured & Evolving Schema




•   Example: AllegroGraph, VertexDB, Neo4j
Social Network Problem


• System stores people and friends

• Find all “friends of friends”
RDBMS Solution
•   SQL: single join to get friends


•   SELECT p.name, p2.name
     FROM people AS p, people AS p2, friends AS f
     WHERE p.id = 1 AND p.id = f.id1 AND p2.id = f.id2;



•   SQL: 2-3 joins or subqueries to get “friends of friends”


•   i.e. Not trivial and doesn’t scale
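To make the "2-3 joins or subqueries" point concrete, here is one possible friends-of-friends query, run against an in-memory SQLite database from Python. The `people` and `friends` tables and their columns follow the query on the slide; the sample rows (including the name "Anna") and the decision to store each friendship in one direction only are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friends (id1 INTEGER, id2 INTEGER);
    -- Hypothetical sample data; each friendship is stored in one direction.
    INSERT INTO people VALUES (1, 'Simon'), (2, 'Chris'), (3, 'Paul'), (4, 'Anna');
    INSERT INTO friends VALUES (1, 2), (2, 3), (2, 4), (1, 3);
""")

FRIENDS_OF_FRIENDS = """
    SELECT DISTINCT p3.name
    FROM friends AS f1
    JOIN friends AS f2 ON f2.id1 = f1.id2      -- second hop
    JOIN people  AS p3 ON p3.id  = f2.id2
    WHERE f1.id1 = :person
      AND f2.id2 <> :person                    -- exclude the person themselves
      AND f2.id2 NOT IN (                      -- exclude direct friends
          SELECT id2 FROM friends WHERE id1 = :person)
    ORDER BY p3.name
"""


def friends_of_friends(person_id):
    """Return names exactly two hops away from person_id."""
    rows = conn.execute(FRIENDS_OF_FRIENDS, {"person": person_id}).fetchall()
    return [name for (name,) in rows]
```

Each extra hop adds another self-join on `friends`, which is why deeper traversals become awkward to write in SQL.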
Graph DB Solution
• Graph Traversal



• pathExists(a,b)

  limit depth 2
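A depth-limited traversal of this kind can be sketched as a breadth-first search. This is plain Python over an adjacency list, not the Neo4j traversal API; the graph contents and the function names (`within_depth`, `path_exists`, `friends_of_friends`) are hypothetical.

```python
from collections import deque

# Hypothetical friendship graph as an undirected adjacency list.
FRIENDS = {
    "Simon": {"Chris", "Paul"},
    "Chris": {"Simon", "Paul", "Anna"},
    "Paul": {"Simon", "Chris"},
    "Anna": {"Chris", "Ben"},
    "Ben": {"Anna"},
}


def within_depth(graph, start, max_depth):
    """Breadth-first traversal mapping each reachable node to its hop count."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in graph.get(node, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    return depth


def path_exists(graph, a, b, max_depth=2):
    """True if b can be reached from a in at most max_depth hops."""
    return b in within_depth(graph, a, max_depth)


def friends_of_friends(graph, person):
    """Nodes whose shortest path from person is exactly two hops."""
    depth = within_depth(graph, person, 2)
    return {name for name, hops in depth.items() if hops == 2}
```

The traversal only visits nodes near the starting point, so its cost depends on the size of the neighbourhood rather than on the total number of people stored.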
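Neo4J Model
• Nodes
• Relationships (edges)
• Properties on Both
[Figure: three nodes numbered 1, 2 and 3. One node has properties name = “Simon”, job = “RA”; another has name = “Chris”. A relationship carries properties type = “KNOWS”, age = 4 years.]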
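The node/relationship/property model can be mimicked with two small Python classes. This is only an illustration of the data model, not the Neo4j API; the class names, the `neighbours` helper, the node ids and the assumption that the KNOWS relationship runs from Simon to Chris are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    end: Node
    rel_type: str
    properties: dict = field(default_factory=dict)


simon = Node(1, {"name": "Simon", "job": "RA"})
chris = Node(2, {"name": "Chris"})
# Assumption: the KNOWS edge links Simon to Chris; the slide does not say which nodes it joins.
knows = Relationship(simon, chris, "KNOWS", {"age": "4 years"})


def neighbours(node, relationships, rel_type):
    """Follow outgoing relationships of one type from a node."""
    return [r.end for r in relationships
            if r.start is node and r.rel_type == rel_type]
```

Because properties are plain key-value maps on both nodes and relationships, new attributes can be added to individual entities without a schema change.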
Live Demo!
Neo4J Model
• Transactions
• Reference Node
• Indexes (Apache Lucene)
• Visualisation
  – Neoclipse
  – The JIT
Neoclipse
Pros and Cons
• “Whiteboard friendly” – fits domain models better
• Scales up “enough”
• Evolve Schema
• Can represent semi-structured data
• Good Performance for graph/network traversals


• Lacks tool support
• Harder to write ad-hoc queries (SPARQL vs. SQL)
Important Reminders
• Other options exist apart from the Relational
  Database




• Fit the technology to the domain model, not the
  domain model to the technology
Questions?

• http://neo4j.org/



• Some material from
  http://nosql.mypopescu.com/post/342947902/presentation-graphs-neo4j-teh-awesome
Part 2: Collaborative Filtering



• Calculating Similarities

• User based filtering

• Item based filtering
Why?
•   Sell more items
•   Increase market share
•   Better targeted advertising


•   Up sell rather than new-sell


•   Make more £££


•   Not perfect
     – Bad recommendations
     – Inappropriate recommendations
It can go wrong
It will go wrong
Preference Data
• Movie Ratings: 5, 4, 3, 2, 1
• Online Shopping: Bought = 1, Didn’t Buy = 0
• Site Recommender: Like = 1, No vote = 0, Didn’t Like = -1
Recommending Items

• Step 1: Calculate similarities
  – either user-user or item-item

• Step 2: Predict scores for “unseen” items

• Step 3: Normalise and order
Example Data: Movie Reviews

          Shawshank     The    Lock     Love
          Redemption   Ghost   Stock   Actually   Titanic   Seven
  Simon       5          4       4        1          -        -
  Chris       1          3       4        5          4        -
  Paul        4          5       -        -          2        4

  (- = not rated)
Calculating Similarity
• Method 1: Euclidean Distance Score
• Compare Common Rankings
• n-dimensional preference space
• Score 0 – 1
• 1 = Identical
• 0 = Highly dissimilar
Calculating Euclidean Distance Score


• Done for each pair of people


• Difference in each axis
• Square
• Add them together
• Add 1 (avoids divide by zero)
• Square Root
• Invert
Chris and Simon


•   Difference in each axis (first two films: Shawshank Redemption, The Ghost)
     –   (5-1), (4-3) = 4, 1

•   Square
     –   16, 1

•   Add them together
     –   17

•   Add 1 (avoids divide by zero)
     –   = 18

•   Square Root
     –   = 4.24264069

•   Invert
     –   = 0.23570226
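The steps above can be written as a short Python function. This is a sketch under stated assumptions: the `RATINGS` dictionary is my reading of the example data table, and the `items` parameter is a hypothetical convenience for restricting the comparison to the two films used in the worked example. Other normalisations, such as 1 / (1 + distance), are also common.

```python
from math import sqrt

# Ratings as read from the example data table (an assumption about the
# flattened slide layout).
RATINGS = {
    "Simon": {"Shawshank Redemption": 5, "The Ghost": 4,
              "Lock Stock": 4, "Love Actually": 1},
    "Chris": {"Shawshank Redemption": 1, "The Ghost": 3,
              "Lock Stock": 4, "Love Actually": 5, "Titanic": 4},
    "Paul": {"Shawshank Redemption": 4, "The Ghost": 5,
             "Titanic": 2, "Seven": 4},
}


def euclidean_score(prefs, a, b, items=None):
    """1 / sqrt(1 + sum of squared differences) over commonly rated items."""
    common = [i for i in prefs[a] if i in prefs[b]]
    if items is not None:
        common = [i for i in common if i in items]
    if not common:
        return 0.0  # nothing in common, treat as dissimilar
    total = sum((prefs[a][i] - prefs[b][i]) ** 2 for i in common)
    return 1 / sqrt(1 + total)


# Reproduces the worked example, which uses two films only.
two_films = euclidean_score(
    RATINGS, "Simon", "Chris",
    items={"Shawshank Redemption", "The Ghost"},
)
# Using all four films that Simon and Chris have both rated.
all_common = euclidean_score(RATINGS, "Simon", "Chris")
```

`two_films` matches the 0.2357 on the slide. Including Lock Stock and Love Actually as well gives a lower score, because the two disagree strongly on Love Actually.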
Euclidean Distance Score


• Easy to calculate

• Bad for people who are similar but
  consistently rate higher/lower
Pearson Correlation Coefficient

• More Complicated
• Line of Best Fit between commonly rated items
• Deals with grade inflation




• Other measures
   – Jaccard Coefficient
   – Manhattan Distance
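A minimal sketch of the Pearson correlation over commonly rated items follows. The function name and the `demo` ratings are hypothetical; the data is constructed so that one person consistently rates one point higher than another, which is the "grade inflation" case.

```python
from math import sqrt


def pearson_score(prefs, a, b):
    """Pearson correlation of two people's ratings on commonly rated items."""
    common = [i for i in prefs[a] if i in prefs[b]]
    n = len(common)
    if n == 0:
        return 0.0
    xs = [prefs[a][i] for i in common]
    ys = [prefs[b][i] for i in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant ratings: correlation is undefined
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings: Bea always rates one point above Al,
# Cy rates in exactly the opposite order.
demo = {
    "Al": {"A": 1, "B": 2, "C": 3, "D": 4},
    "Bea": {"A": 2, "B": 3, "C": 4, "D": 5},
    "Cy": {"A": 4, "B": 3, "C": 2, "D": 1},
}
```

Al and Bea correlate perfectly even though their ratings never coincide, while Al and Cy are perfectly anti-correlated. Note that this score ranges from -1 to 1, unlike the 0 to 1 range of the distance score.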
User based Filtering
• Look at what similar people have liked that you
  haven’t seen
  – But what if a similar person likes something that has
    bad reviews from everyone else?


• Weighted Score that ranks the other people and
  takes into account similarity
Recommending Items

                Similarity (ED)   Titanic   Sim x Titanic   Seven   Sim x Seven


    Chris            0.23           4           0.92
    Paul             0.78           2           1.56         4         3.12


    Total                                       2.48                   3.12
  Sim Sum                                       1.01                   0.78
Total/Sim Sum                               2.455445545                 4
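The table can be reproduced with a small weighted-average function. This is a sketch: the function name is hypothetical, and the similarity values are taken directly from the table rather than recomputed.

```python
def weighted_recommendations(similarity, ratings):
    """Rank unseen items by similarity-weighted average rating."""
    totals, sim_sums = {}, {}
    for person, sim in similarity.items():
        if sim <= 0:
            continue  # ignore people with no positive similarity
        for item, rating in ratings.get(person, {}).items():
            totals[item] = totals.get(item, 0.0) + sim * rating
            sim_sums[item] = sim_sums.get(item, 0.0) + sim
    # Dividing by the similarity sum stops items rated by many
    # people from winning purely on volume.
    scores = {item: totals[item] / sim_sums[item] for item in totals}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Values from the table: similarity to Simon, and ratings for the
# films Simon has not seen.
similarity_to_simon = {"Chris": 0.23, "Paul": 0.78}
unseen_ratings = {
    "Chris": {"Titanic": 4},
    "Paul": {"Titanic": 2, "Seven": 4},
}
ranked = weighted_recommendations(similarity_to_simon, unseen_ratings)
```

`ranked` places Seven (about 4.0) ahead of Titanic (about 2.46), matching the Total/Sim Sum row.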
User Based Filtering - Conclusions

• Calculate Similarity between users
• Recommend based on similar users


• Similarity
   – Euclidean Distance Score
   – Pearson Coefficient – better for non-normalised data


• Problem – need to compare every user/item to every other
  user/item
Item Based Filtering
• Pre-compute most similar items for each item
  – Item similarities change less often than user
    similarities and can be re-used



• Create a weighted list of items most similar to
  user’s top rated items
Recommending Items

                    Rating          Titanic (ED) Rat x Titanic   Seven (ED)   Rat x Seven
Shawshank              5               0.084         0.42          0.366         1.83
 The Ghost             4               0.125          0.5          0.487         1.948
 Lock Stock            4               0.091         0.364         0.318         1.272
Love Actually          1               0.737         0.737         0.184         0.184



    Total                              1.037         2.021         1.355         5.234
 Normalised (Rating / Similarity)                    1.948                    3.862730627
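The same calculation in Python, as a sketch: the function name is hypothetical, and the item-item similarity values are copied from the table (in practice they would come from the pre-computed similarity data set).

```python
def item_based_score(user_ratings, similarity_to_candidate):
    """Similarity-weighted average of the user's own ratings."""
    rated = [i for i in user_ratings if i in similarity_to_candidate]
    weighted = sum(user_ratings[i] * similarity_to_candidate[i] for i in rated)
    sim_total = sum(similarity_to_candidate[i] for i in rated)
    return weighted / sim_total if sim_total else 0.0


# Simon's ratings, and each film's similarity to the two candidates,
# all taken from the table.
simon = {"Shawshank": 5, "The Ghost": 4, "Lock Stock": 4, "Love Actually": 1}
titanic_sim = {"Shawshank": 0.084, "The Ghost": 0.125,
               "Lock Stock": 0.091, "Love Actually": 0.737}
seven_sim = {"Shawshank": 0.366, "The Ghost": 0.487,
             "Lock Stock": 0.318, "Love Actually": 0.184}

titanic_score = item_based_score(simon, titanic_sim)
seven_score = item_based_score(simon, seven_sim)
```

Seven scores about 3.86 and Titanic about 1.95 (2.021 / 1.037, which the table truncates to 1.948), so Seven is again the stronger recommendation for Simon.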
Item Based Filtering - Conclusions

• Calculate Similarity between items
• Recommend based on user’s ratings for items


• Similarity (as before)
   – Euclidean Distance Score
   – Pearson Coefficient – better for non-normalised data



• Problem – need to maintain item similarity data set
Item vs. User Based Filtering
• Item based scales better
   – Need to maintain the similarities data set

• User based simpler to implement
• May (or may not) want to show users who is similar in
  terms of habits
• Perform equally on dense data sets
• Item based performs better on sparse data sets
Questions?
• Reference: Programming Collective Intelligence,
  Toby Segaran, O’Reilly 2007




• s.j.woodman@ncl.ac.uk

CSC 8101 Non Relational Databases

  • 1. Part 1: Non Relational Databases Part 2: Collaborative Filtering Simon Woodman [s.j.woodman@ncl.ac.uk]
  • 2. Outline • Part 1: Non-Relational Databases (NoSQL) – Trends forcing change – NoSQL database types – Graph Databases (Neo4J) – Demo • Part 2: Making Recommendations – Background/example – Pearson Score – User based – Item based
  • 4. Trend 1: Data Size Digital Information Created, Captured, Replicated worldwide 3000 2500 2000 Exabytes 1500 1000 500 0 2006 2007 2008 2009 2010 2011 2012 Source: IDC 2009
  • 5. Trend 2: Connectedness Trend 2: connectedness Giant Global Graph (GGG) Information connectivity Ontologies RDF Folksonomies Tagging Wikis User- generated content Blogs RSS Hypertext Text documents web 1.0 web 2.0 “web 3.0” 1990 2000 2010 2020 Source: http://nosql.mypopescu.com/post/342947902/presentation-graphs-neo4j-teh-awesome
  • 6. Trend 3: semi-structure • “The great majority of the data out there is not structured and [there’s] no way in the world you can force people to structure it.” [1] • Trend accelerated by the decentralization of content generation that is the hallmark of the age of participation (“web 2.0”) • Evolving applications [1] Stefano Mazzocci Apache and MIT
  • 7. Types of Databases • Relational • Key-Value Stores • BigTable Clones • Document Databases • Graph Databases
  • 8. Relational Databases • Data Model: Normalised, multi-table with referential integrity • Good for very static data – Payroll, accounts – Well understood – Not evolving • SQL Queries (joins etc.) • Good Tooling • Examples: Oracle, MySQL, Postgres, …
  • 9. Key-Value Stores • Data Model: (global) collection of K-V pairs • Massive Distributed HashMap • Partitioning and Replication usually ring based – Load Balancer round robins the requests – Hash(key) = partition – Partition map maintains partition -> node mapping – Quorum System (N, R, W), usually (3,2,2) • Scales Well (1000B rows) • How many apps need that? – Google, Amazon, Facebook etc. – <10 in the world • Examples: Dynomite, Voldemort, Tokyo [http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf]
  • 10. BigTable Clones • Data model: single table, column families • Distributed storage of semi-structured data (column families) • Scale: “Petabyte range” • Supports MapReduce well • Example: Hbase, Hypertable [http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf]
  • 11. Document Databases • Inspired by Lotus Notes • Data model: collections of K-V collections • Document: – Collection of K-V pairs (often JSON) – Often versioned • Scales: Dependant on implementation • Can (potentially) store entire 3 tier web app in the database (probably NOT the best architecture!) • Example: CouchDB, MongoDB
  • 12. Graph Databases • Inspired by Euler & graph theory • Data model: nodes, relationships, K-V on both • Scale: 10B entities • SPARQL Queries • No O/R Impedance mismatch • Semi Structured & Evolving Schema • Example: AllegroGraph, VertexDB, Neo4j
  • 13. Social Network Problem • System stores people and friends • Find all “friends of friends”
  • 14. RDBMS Solution • SQL: single join to get friends • SELECT p.name, p2.name FROM people AS p, people AS p2, friends AS f WHERE p.id = 1 AND p.id = f.id1 AND p2.id = f.id2; • SQL: 2-3 joins or subqueries to get “friends of friends” • i.e. Not trivial and doesn’t scale
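To make the "friends of friends" join concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout follows the slide's query (people, friends with id1 and id2); the sample rows, the name "Anna" and the assumption that friendships are stored in one direction only are illustrative, not from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friends (id1 INTEGER, id2 INTEGER);
    INSERT INTO people VALUES (1, 'Simon'), (2, 'Chris'), (3, 'Paul'), (4, 'Anna');
    INSERT INTO friends VALUES (1, 2), (2, 3), (2, 4);
""")

# Two hops need the friends table twice, plus people to resolve the names.
FRIENDS_OF_FRIENDS = """
    SELECT DISTINCT p3.name
    FROM friends AS f1
    JOIN friends AS f2 ON f2.id1 = f1.id2
    JOIN people  AS p3 ON p3.id = f2.id2
    WHERE f1.id1 = ? AND f2.id2 <> f1.id1
    ORDER BY p3.name
"""


def friends_of_friends(person_id):
    """Return the names reachable in exactly two friendship hops."""
    return [row[0] for row in conn.execute(FRIENDS_OF_FRIENDS, (person_id,))]


print(friends_of_friends(1))
```

Each extra hop adds another self-join on friends, which is the cost the slide is pointing at.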
  • 15. Graph DB Solution • Graph Traversal • pathExists(a,b) limit depth 2
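The depth-limited traversal can be sketched in plain Python with an adjacency list. This is not the Neo4j API; the function names mirror the slide's pathExists(a, b) only loosely, and the sample graph (including "Dave") is hypothetical.

```python
from collections import deque


def within_depth(graph, start, max_depth):
    """Breadth-first traversal returning {node: hops} for nodes within max_depth."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in graph.get(node, ()):
            if neighbour not in depths:
                depths[neighbour] = depths[node] + 1
                queue.append(neighbour)
    return depths


def path_exists(graph, a, b, max_depth=2):
    """True if b can be reached from a in at most max_depth hops."""
    return b in within_depth(graph, a, max_depth)


graph = {
    "Simon": ["Chris"],
    "Chris": ["Simon", "Paul"],
    "Paul": ["Chris", "Dave"],
    "Dave": ["Paul"],
}
friends_of_friends = sorted(
    name for name, hops in within_depth(graph, "Simon", 2).items() if hops == 2
)
print(friends_of_friends, path_exists(graph, "Simon", "Dave"))
```

The work done depends on the size of the neighbourhood visited rather than on the total number of people stored, which is the appeal of the traversal approach.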
  • 16. Neo4J Model • Nodes • Relationships (edges) • Properties on Both • Figure: example graph with nodes 1, 2 and 3; node properties name = “Simon”, job = “RA” and name = “Chris”; relationship properties type = “KNOWS”, age = 4 years
  • 18. Neo4J Model • Transactions • Reference Node • Indexes (Apache Lucene) • Visualisation – Neoclipse – The JIT
  • 20. Pros and Cons • “Whiteboard friendly” – fits domain models better • Scales up “enough” • Evolve Schema • Can represent semi-structured data • Good Performance for graph/network traversals • Lacks tool support • Harder to write ad-hoc queries (SPARQL vs. SQL)
  • 21. Important Reminders • Other options exist apart from the Relational Database • Fit the technology to the domain model, not the domain model to the technology
  • 22. Questions? • http://neo4j.org/ • Some material from [http://nosql.mypopescu.com/post/342947902/presentation-graphs-neo4j-teh-awesome]
  • 23. Part 2: Collaborative Filtering • Calculating Similarities • User based filtering • Item based filtering
  • 26. Why? • Sell more items • Increase market share • Better targeted advertising • Up sell rather than new-sell • Make more £££ • Not perfect – Bad recommendations – Inappropriate recommendations
  • 27. It can go wrong
  • 28. It will go wrong
  • 29. Preference Data • Movie Ratings: 5, 4, 3, 2, 1 • Online Shopping: Bought = 1, Didn’t Buy = 0 • Site Recommender: Like = 1, No vote = 0, Didn’t Like = -1
  • 30. Recommending Items • Step 1: Calculate similarities – either user-user or item-item • Step 2: Predict scores for “unseen” items • Step 3: Normalise and order
  • 31. Example Data: Movie Reviews • Films: Shawshank Redemption, The Ghost, Lock Stock, Love Actually, Titanic, Seven • Simon: 5 4 4 1 • Chris: 1 3 4 5 4 • Paul: 4 5 2 4
  • 32. Calculating Similarity • Method 1: Euclidean Distance Score • Compare Common Rankings • n-dimensional preference space • Score 0 – 1 • 1 = Identical • 0 = Highly dissimilar
  • 33. Calculating Euclidean Distance Score • Done for each pair of people • Difference in each axis • Square • Add them together • Add 1 (avoids divide by zero) • Square Root • Invert
  • 34. Chris and Simon • Difference in each axis – (5-1), (4-3) = 4, 1 • Square – 16, 1 • Add them together – 17 • Add 1 (avoids divide by zero) – = 18 • Square Root – = 4.24264069 • Invert – = 0.23570226
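The steps worked through above translate directly into a short function. This is a sketch following the slide's order of operations (sum of squares, add 1, square root, invert); the item keys are placeholders because the slide does not say which two films the ratings belong to, and returning 0 when two people share no items is an assumption.

```python
from math import sqrt


def euclidean_score(prefs_a, prefs_b):
    """Similarity in (0, 1]: 1 / sqrt(1 + sum of squared rating differences)."""
    shared = set(prefs_a) & set(prefs_b)
    if not shared:
        return 0.0  # assumption: no common items means no evidence of similarity
    sum_sq = sum((prefs_a[item] - prefs_b[item]) ** 2 for item in shared)
    return 1 / sqrt(1 + sum_sq)


# The two shared ratings used on the slide: (5, 1) and (4, 3).
simon = {"item_1": 5, "item_2": 4}
chris = {"item_1": 1, "item_2": 3}
print(round(euclidean_score(simon, chris), 8))  # 0.23570226
```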
  • 35. Euclidean Distance Score • Easy to calculate • Bad for people who are similar but consistently rate higher/lower
  • 36. Pearson Correlation Coefficient • More Complicated • Line of Best Fit between commonly rated items • Deals with grade inflation • Other measures – Jaccard Coefficient – Manhattan Distance
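A minimal sketch of the Pearson coefficient over commonly rated items shows how it deals with grade inflation. The function name, the sample raters and the choice to return 0 when fewer than two items are shared (or when one person's ratings do not vary) are assumptions for illustration.

```python
from math import sqrt


def pearson_score(prefs_a, prefs_b):
    """Pearson correlation over the items both people have rated (-1 to 1)."""
    shared = sorted(set(prefs_a) & set(prefs_b))
    n = len(shared)
    if n < 2:
        return 0.0  # assumption: too little overlap to fit a line
    xs = [prefs_a[item] for item in shared]
    ys = [prefs_b[item] for item in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat set of ratings has no defined correlation
    return cov / sqrt(var_x * var_y)


# Hypothetical raters: same taste, but one consistently scores two points higher.
harsh = {"a": 1, "b": 2, "c": 3}
generous = {"a": 3, "b": 4, "c": 5}
print(pearson_score(harsh, generous))
```

The two raters correlate perfectly here even though every individual rating differs by two points, which a distance-based score would penalise.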
  • 37. User based Filtering • Look at what similar people have liked but you haven’t seen? – Similar person likes something that has bad reviews from everyone else? • Weighted Score that ranks the other people and takes into account similarity
  • 38. Recommending Items • Columns: Similarity (ED), Titanic, Sim x Titanic, Seven, Sim x Seven • Chris: 0.23, 4, 0.92, not rated, not rated • Paul: 0.78, 2, 1.56, 4, 3.12 • Total: Titanic 2.48, Seven 3.12 • Sim Sum: Titanic 1.01, Seven 0.78 • Total / Sim Sum: Titanic 2.455445545, Seven 4
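The weighted score in the table above can be reproduced with a small function: each similar person's rating is multiplied by their similarity, and the total is divided by the sum of the similarities of the people who actually rated the film. The function name is hypothetical; the numbers are the ones from the slide.

```python
def predict_rating(similarities, ratings):
    """Similarity-weighted average of other users' ratings for one item."""
    total = 0.0
    sim_sum = 0.0
    for user, rating in ratings.items():
        sim = similarities[user]
        total += sim * rating
        sim_sum += sim
    return total / sim_sum if sim_sum else None


# Similarity (ED) of each person to Simon, and their ratings for the unseen films.
similarities = {"Chris": 0.23, "Paul": 0.78}
titanic = {"Chris": 4, "Paul": 2}
seven = {"Paul": 4}  # Chris has not rated Seven

print(predict_rating(similarities, titanic))  # about 2.455
print(predict_rating(similarities, seven))    # about 4.0
```

Dividing by the similarity sum keeps a film rated by many people from outranking one rated by few simply because more terms were added.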
  • 41. User Based Filtering - Conclusions • Calculate Similarity between users • Recommend based on similar users • Similarity – Euclidean Distance Score – Pearson Coefficient – better for non-normalised data • Problem – need to compare every user/item to every other user/item
  • 42. Item Based Filtering • Pre-compute most similar items for each item – Item similarities change less often than user similarities and can be re-used • Create a weighted list of items most similar to user’s top rated items
  • 43. Recommending Items • Columns: Rating, Titanic (ED), Rat x Titanic, Seven (ED), Rat x Seven • Shawshank: 5, 0.084, 0.42, 0.366, 1.83 • The Ghost: 4, 0.125, 0.5, 0.487, 1.948 • Lock Stock: 4, 0.091, 0.364, 0.318, 1.272 • Love Actually: 1, 0.737, 0.737, 0.184, 0.184 • Total: Titanic (ED) 1.037, Rat x Titanic 2.021, Seven (ED) 1.355, Rat x Seven 5.234 • Normalised (Rating / Similarity): Titanic 1.948, Seven 3.862730627
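The item-based table above follows the same pattern, but the weights are the precomputed similarities between each film the user has rated and the candidate film. A minimal sketch, with a hypothetical function name and the slide's figures as data:

```python
def predict_from_items(user_ratings, item_similarities):
    """Weight the user's own ratings by each rated item's similarity to the candidate."""
    weighted = 0.0
    sim_sum = 0.0
    for item, rating in user_ratings.items():
        sim = item_similarities.get(item)
        if sim is None:
            continue  # no precomputed similarity for this pair of items
        weighted += rating * sim
        sim_sum += sim
    return weighted / sim_sum if sim_sum else None


simon = {"Shawshank": 5, "The Ghost": 4, "Lock Stock": 4, "Love Actually": 1}
# Precomputed Euclidean distance scores between each rated film and the candidate.
titanic_sims = {"Shawshank": 0.084, "The Ghost": 0.125,
                "Lock Stock": 0.091, "Love Actually": 0.737}
seven_sims = {"Shawshank": 0.366, "The Ghost": 0.487,
              "Lock Stock": 0.318, "Love Actually": 0.184}

print(predict_from_items(simon, titanic_sims))  # about 1.949
print(predict_from_items(simon, seven_sims))    # about 3.863
```

Seven ranks above Titanic for Simon because it is most similar to the films he rated highly, while Titanic is most similar to the one he rated 1.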
  • 46. Item Based Filtering - Conclusions • Calculate Similarity between items • Recommend based on user’s ratings for items • Similarity (as before) – Euclidean Distance Score – Pearson Coefficient – better for non-normalised data • Problem – need to maintain item similarity data set
  • 47. Item vs. User Based Filtering • Item based scales better – Need to maintain the similarities data set • User based simpler to implement • May (or may not) want to show users who is similar in terms of habits • Perform equally on dense data sets • Item based performs better on sparse data sets
  • 48. Questions? • Reference: Programming Collective Intelligence, Toby Segaran, O’Reilly 2007 • s.j.woodman@ncl.ac.uk

Editor's Notes

  1. Information overload. We are creating too much data to be able to store it: digital cameras, video cameras, CCTV, VoIP, sensors, medical imaging.
  2. Information overload.
  3. Over time data has evolved to be more interlinked and connected. Hypertext has links. Blogs have pingbacks. Tagging groups related data. Ontologies formalise it more. In the GGG the relationships contain the information rather than the data items, e.g. friends on FB: the data was there before, but the relationships are the important part.
  4. Applications in the 70s and 80s were simple and rigid. That doesn’t work now with the interconnected world. Semi-structured data is bad for RDBMS. FB, Twitter, etc. have had to build their own databases.
  5. Used internally at Amazon in services like S3 and EC2. Quorum (N, R, W): N = number of replicas that will be written to; W = number of responses to wait for before a write succeeds; R = number of responses that must agree before a read is returned. This means that N-R/W nodes can go down and the system still functions.
  6. Meeting Notes (29/11/2011 10:32): Part 1: 40 mins