4. Trend 1: Data Size
[Chart: digital information created, captured, and replicated worldwide, 2006-2012, measured in exabytes (0-3,000)]
Source: IDC 2009
5. Trend 2: Connectedness
[Chart: information connectivity over time, 1990-2020: from text documents and hypertext (web 1.0), through blogs, RSS, wikis, tagging, folksonomies and user-generated content (web 2.0), to RDF and ontologies (“web 3.0”), culminating in the Giant Global Graph (GGG)]
Source: http://nosql.mypopescu.com/post/342947902/presentation-graphs-neo4j-teh-awesome
6. Trend 3: Semi-structure
• “The great majority of the data out there is not structured and [there’s] no way in the world you can force people to structure it.” [1]
• Trend accelerated by the decentralisation of content generation that is the hallmark of the age of participation (“web 2.0”)
• Evolving applications
[1] Stefano Mazzocchi, Apache and MIT
8. Relational Databases
• Data Model: Normalised, multi-table with referential integrity
• Good for very static data
– Payroll, accounts
– Well understood
– Not evolving
• SQL Queries (joins etc.)
• Good Tooling
• Examples: Oracle, MySQL, Postgres, …
9. Key-Value Stores
• Data Model: (global) collection of K-V pairs
• Massive Distributed HashMap
• Partitioning and Replication usually ring based
– Load balancer round-robins the requests
– Hash(key) = partition
– Partition map maintains partition -> node mapping
– Quorum System (N, R, W), usually (3,2,2)
• Scales Well (1000B rows)
• How many apps need that?
– Google, Amazon, Facebook etc.
– <10 in the world
• Examples: Dynomite, Voldemort, Tokyo
[http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf]
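The partitioning scheme in the bullets above can be sketched in a few lines. This is a toy illustration, not Dynamo's implementation: the ring size, node names, and replica placement are hypothetical, and real systems use consistent hashing with virtual nodes.

```python
import hashlib

NUM_PARTITIONS = 16                           # hypothetical ring size
NODES = ["node0", "node1", "node2", "node3"]  # hypothetical cluster

# Partition map maintains the partition -> node mapping
# (N = 3 replicas, placed on successive nodes around the ring)
partition_map = {p: [NODES[(p + i) % len(NODES)] for i in range(3)]
                 for p in range(NUM_PARTITIONS)}

def partition_for(key: str) -> int:
    """Hash(key) = partition, as on the slide."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def replicas_for(key: str) -> list:
    """Look up which nodes hold the key's partition."""
    return partition_map[partition_for(key)]

print(len(replicas_for("user:42")))  # → 3
```

Any client (or load balancer) with a copy of the partition map can route a request without coordination, which is what lets this design scale horizontally.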
10. BigTable Clones
• Data model: single table, column families
• Distributed storage of semi-structured data (column families)
• Scale: “Petabyte range”
• Supports MapReduce well
• Examples: HBase, Hypertable
[http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf]
11. Document Databases
• Inspired by Lotus Notes
• Data model: collections of K-V collections
• Document:
– Collection of K-V pairs (often JSON)
– Often versioned
• Scale: Dependent on implementation
• Can (potentially) store an entire 3-tier web app in the database (probably NOT the best architecture!)
• Examples: CouchDB, MongoDB
12. Graph Databases
• Inspired by Euler & graph theory
• Data model: nodes, relationships, K-V on both
• Scale: 10B entities
• SPARQL Queries
• No O/R Impedance mismatch
• Semi Structured & Evolving Schema
• Examples: AllegroGraph, VertexDB, Neo4j
14. RDBMS Solution
• SQL: single join to get friends
• SELECT p.name, p2.name
  FROM people AS p, people AS p2, friends AS f
  WHERE p.id = 1 AND p.id = f.id1 AND p2.id = f.id2;
• SQL: 2-3 joins or subqueries to get “friends of friends”
• i.e. not trivial, and doesn’t scale
20. Pros and Cons
• Pros:
– “Whiteboard friendly” – fits domain models better
– Scales up “enough”
– Evolving schema
– Can represent semi-structured data
– Good performance for graph/network traversals
• Cons:
– Lacks tool support
– Harder to write ad-hoc queries (SPARQL vs. SQL)
21. Important Reminders
• Other options exist apart from the relational database
• Fit the technology to the domain model, not the domain model to the technology
23. Part 2: Collaborative Filtering
• Calculating Similarities
• User based filtering
• Item based filtering
26. Why?
• Sell more items
• Increase market share
• Better targeted advertising
• Up-sell rather than new-sell
• Make more £££
• Not perfect
– Bad recommendations
– Inappropriate recommendations
33. Calculating the Euclidean Distance Score
• Done for each pair of people
• Take the difference in each axis
• Square each difference
• Add them together
• Add 1 (avoids divide by zero)
• Take the square root
• Invert
34. Chris and Simon
• Difference in each axis
– (5-1), (4-3) = 4, 1
• Square
– 16, 1
• Add them together
– 17
• Add 1 (avoids divide by zero)
– = 18
• Square Root
– = 4.24264069
• Invert
– = 0.23570226
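The steps above translate directly into a short function following the slide's formula, 1 / sqrt(sum of squared differences + 1). The movie names below are illustrative; the slide only gives the rating differences 4 and 1.

```python
from math import sqrt

def euclidean_score(a: dict, b: dict) -> float:
    """Similarity between two people's ratings, per the slide:
    difference per axis, square, sum, add 1, square root, invert."""
    shared = [item for item in a if item in b]  # commonly rated items
    if not shared:
        return 0.0
    sum_sq = sum((a[item] - b[item]) ** 2 for item in shared)
    return 1 / sqrt(sum_sq + 1)

# Chris and Simon from the worked example: differences of 4 and 1
chris = {"Titanic": 5, "Seven": 4}
simon = {"Titanic": 1, "Seven": 3}
print(round(euclidean_score(chris, simon), 8))  # → 0.23570226
```

The +1 means two identical rating sets score 1.0 rather than triggering a divide by zero.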
35. Euclidean Distance Score
• Easy to calculate
• Bad for people who are similar but consistently rate higher/lower
36. Pearson Correlation Coefficient
• More Complicated
• Line of Best Fit between commonly rated items
• Deals with grade inflation
• Other measures
– Jaccard Coefficient
– Manhattan Distance
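A sketch of the Pearson calculation over commonly rated items (the textbook sum-based form; the ratings below are illustrative):

```python
from math import sqrt

def pearson_score(a: dict, b: dict) -> float:
    """Pearson correlation coefficient over items both users rated."""
    shared = [item for item in a if item in b]
    n = len(shared)
    if n == 0:
        return 0.0
    sum_a = sum(a[i] for i in shared)
    sum_b = sum(b[i] for i in shared)
    sum_a_sq = sum(a[i] ** 2 for i in shared)
    sum_b_sq = sum(b[i] ** 2 for i in shared)
    sum_ab = sum(a[i] * b[i] for i in shared)
    num = sum_ab - (sum_a * sum_b / n)
    den = sqrt((sum_a_sq - sum_a ** 2 / n) * (sum_b_sq - sum_b ** 2 / n))
    return 0.0 if den == 0 else num / den

# A "grade inflator" who rates everything one point higher
# still correlates perfectly, which is how this deals with
# consistently high/low raters:
a = {"Titanic": 3, "Seven": 4, "Shawshank": 5}
b = {"Titanic": 4, "Seven": 5, "Shawshank": 6}
print(pearson_score(a, b))  # → 1.0
```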
37. User based Filtering
• Look at what similar people have liked but you haven’t seen
– What if a similar person likes something that has bad reviews from everyone else?
• Use a weighted score that ranks the other people and takes similarity into account
38. Recommending Items
              Similarity (ED)   Titanic   Sim x Titanic   Seven   Sim x Seven
Chris         0.23              4         0.92
Paul          0.78              2         1.56            4       3.12
Total                                     2.48                    3.12
Sim Sum                                   1.01                    0.78
Total/Sim Sum                             2.455445545             4
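The weighted score in the table above can be reproduced with a short sketch (the function name is mine; the similarities and ratings come from the slide):

```python
def recommend(similarities: dict, ratings: dict) -> dict:
    """Predicted rating per item: sum(sim * rating) / sum(sim),
    summing only over the people who actually rated the item."""
    totals, sim_sums = {}, {}
    for person, sim in similarities.items():
        for item, rating in ratings[person].items():
            totals[item] = totals.get(item, 0) + sim * rating
            sim_sums[item] = sim_sums.get(item, 0) + sim
    return {item: totals[item] / sim_sums[item] for item in totals}

similarities = {"Chris": 0.23, "Paul": 0.78}
ratings = {"Chris": {"Titanic": 4}, "Paul": {"Titanic": 2, "Seven": 4}}
scores = recommend(similarities, ratings)
print(round(scores["Titanic"], 2))  # → 2.46
print(round(scores["Seven"], 2))    # → 4.0
```

Dividing by the similarity sum rather than the head count is what stops an item rated by many dissimilar people from dominating.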
41. User Based Filtering - Conclusions
• Calculate similarity between users
• Recommend based on similar users
• Similarity
– Euclidean Distance Score
– Pearson Coefficient – better for non-normalised data
• Problem – need to compare every user/item to every other user/item
42. Item Based Filtering
• Pre-compute the most similar items for each item
– Item similarities change less often than user similarities and can be re-used
• Create a weighted list of the items most similar to the user’s top rated items
43. Recommending Items
               Rating   Titanic (ED)   Rat x Titanic   Seven (ED)   Rat x Seven
Shawshank      5        0.084          0.42            0.366        1.83
The Ghost      4        0.125          0.5             0.487        1.948
Lock Stock     4        0.091          0.364           0.318        1.272
Love Actually  1        0.737          0.737           0.184        0.184
Total                   1.037          2.021           1.355        5.234
Normalised (Rating / Similarity)       1.948                        3.862730627
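The same weighting works item-to-item, as a sketch using the similarities and ratings from the table (function name is mine):

```python
def item_based_recommend(user_ratings: dict, item_sims: dict) -> dict:
    """Predicted score per unseen item:
    sum(rating * similarity) / sum(similarity)."""
    totals, sim_sums = {}, {}
    for rated_item, rating in user_ratings.items():
        for candidate, sim in item_sims.get(rated_item, {}).items():
            totals[candidate] = totals.get(candidate, 0) + rating * sim
            sim_sums[candidate] = sim_sums.get(candidate, 0) + sim
    return {c: totals[c] / sim_sums[c] for c in totals}

user_ratings = {"Shawshank": 5, "The Ghost": 4,
                "Lock Stock": 4, "Love Actually": 1}
item_sims = {  # pre-computed similarity of each rated item to each candidate
    "Shawshank":     {"Titanic": 0.084, "Seven": 0.366},
    "The Ghost":     {"Titanic": 0.125, "Seven": 0.487},
    "Lock Stock":    {"Titanic": 0.091, "Seven": 0.318},
    "Love Actually": {"Titanic": 0.737, "Seven": 0.184},
}
scores = item_based_recommend(user_ratings, item_sims)
print(round(scores["Titanic"], 3))  # → 1.949
print(round(scores["Seven"], 3))    # → 3.863
```

(2.021 / 1.037 is 1.9489, shown truncated as 1.948 in the table.) Note that `item_sims` is the pre-computed data set the slide says must be maintained.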
46. Item Based Filtering - Conclusions
• Calculate Similarity between items
• Recommend based on user’s ratings for items
• Similarity (as before)
– Euclidean Distance Score
– Pearson Coefficient – better for non-normalised data
• Problem – need to maintain item similarity data set
47. Item vs. User Based Filtering
• Item based scales better
– Need to maintain the similarities data set
• User based is simpler to implement
• May (or may not) want to show users who is similar in terms of habits
• Both perform equally well on dense data sets
• Item based performs better on sparse data sets
Information overload: we are creating too much data to be able to store it. Sources include digital cameras, video cameras, CCTV, VoIP, sensors and medical imaging.

Over time, data has evolved to be more interlinked and connected: hypertext has links, blogs have pingbacks, tagging groups related data, and ontologies formalise it further. In the GGG the relationships contain information, rather than the data items. For example, friends on Facebook: the data was there before, but the relationships are the important part.

Applications in the 70s and 80s were simple and rigid; that doesn't work in today's interconnected world. Semi-structured data is a poor fit for an RDBMS, which is why Facebook, Twitter, etc. have had to build their own databases.

Dynamo is used internally at Amazon in services like S3 and EC2. Quorum (N, R, W): N = number of replicas that will be written to; W = number of responses to wait for before a write succeeds; R = number of responses that must agree before a read is returned. This means N-R (for reads) or N-W (for writes) nodes can go down and the system still functions.
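The quorum arithmetic in that note can be checked with a toy function. This is a sketch of the conditions only, not Dynamo's implementation; the R + W > N check is the standard condition for read and write sets to overlap.

```python
def quorum_ok(n: int, r: int, w: int, nodes_up: int) -> dict:
    """Check whether reads and writes can still reach quorum,
    and whether read and write sets are guaranteed to intersect."""
    return {
        "writes_possible": nodes_up >= w,   # W acks needed for a write
        "reads_possible": nodes_up >= r,    # R responses needed for a read
        "overlap": r + w > n,               # reads see the latest write
    }

# The usual (N, R, W) = (3, 2, 2): one node can fail
print(quorum_ok(3, 2, 2, nodes_up=2))
# → {'writes_possible': True, 'reads_possible': True, 'overlap': True}
```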