4. Trend 1: Data Size
[Chart: digital information created, captured, and replicated worldwide, 2006-2012, measured in exabytes (0-3,000)]
Source: IDC 2009
5. Trend 2: Connectedness
[Chart: information connectivity over time, 1990-2020: from text documents and hypertext (web 1.0), through blogs, RSS, wikis, tagging, folksonomies and user-generated content (web 2.0), to RDF and ontologies (“web 3.0”), culminating in the Giant Global Graph (GGG)]
Source: http://nosql.mypopescu.com/post/342947902/presentation-graphs-neo4j-teh-awesome
6. Trend 3: Semi-structure
• “The great majority of the data out there is not structured and [there’s] no way in the world you can force people to structure it.” [1]
• Trend accelerated by the decentralisation of content generation that is the hallmark of the age of participation (“web 2.0”)
• Evolving applications
[1] Stefano Mazzocchi, Apache and MIT
8. Relational Databases
• Data Model: Normalised, multi-table with referential integrity
• Good for very static data
– Payroll, accounts
– Well understood
– Not evolving
• SQL Queries (joins etc.)
• Good Tooling
• Examples: Oracle, MySQL, Postgres, …
9. Key-Value Stores
• Data Model: (global) collection of K-V pairs
• Massive Distributed HashMap
• Partitioning and Replication usually ring based
– Load balancer round-robins the requests
– Hash(key) = partition
– Partition map maintains partition -> node mapping
– Quorum System (N, R, W), usually (3,2,2)
• Scales Well (1000B rows)
• How many apps need that?
– Google, Amazon, Facebook etc.
– <10 in the world
• Examples: Dynomite, Voldemort, Tokyo
[http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf]
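The partitioning scheme in the bullets above can be sketched in a few lines. This is a toy illustration, not Dynamo's implementation: the ring size, node names, and replica placement are hypothetical, and real systems use consistent hashing with virtual nodes.

```python
import hashlib

NUM_PARTITIONS = 16                           # hypothetical ring size
NODES = ["node0", "node1", "node2", "node3"]  # hypothetical cluster

# Partition map maintains the partition -> node mapping
# (N = 3 replicas, placed on successive nodes around the ring)
partition_map = {p: [NODES[(p + i) % len(NODES)] for i in range(3)]
                 for p in range(NUM_PARTITIONS)}

def partition_for(key: str) -> int:
    """Hash(key) = partition, as on the slide."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def replicas_for(key: str) -> list:
    """Look up which nodes hold the key's partition."""
    return partition_map[partition_for(key)]

print(len(replicas_for("user:42")))  # → 3
```

Any client (or load balancer) with a copy of the partition map can route a request without coordination, which is what lets this design scale horizontally.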
10. BigTable Clones
• Data model: single table, column families
• Distributed storage of semi-structured data (column families)
• Scale: “Petabyte range”
• Supports MapReduce well
• Examples: HBase, Hypertable
[http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf]
11. Document Databases
• Inspired by Lotus Notes
• Data model: collections of K-V collections
• Document:
– Collection of K-V pairs (often JSON)
– Often versioned
• Scale: Dependent on implementation
• Can (potentially) store an entire 3-tier web app in the database (probably NOT the best architecture!)
• Examples: CouchDB, MongoDB
12. Graph Databases
• Inspired by Euler & graph theory
• Data model: nodes, relationships, K-V on both
• Scale: 10B entities
• SPARQL Queries
• No O/R Impedance mismatch
• Semi Structured & Evolving Schema
• Examples: AllegroGraph, VertexDB, Neo4j
14. RDBMS Solution
• SQL: single join to get friends
• SELECT p.name, p2.name
  FROM people AS p, people AS p2, friends AS f
  WHERE p.id = 1 AND p.id = f.id1 AND p2.id = f.id2;
• SQL: 2-3 joins or subqueries to get “friends of friends”
• i.e. not trivial, and doesn’t scale
20. Pros and Cons
• Pros:
– “Whiteboard friendly” – fits domain models better
– Scales up “enough”
– Evolving schema
– Can represent semi-structured data
– Good performance for graph/network traversals
• Cons:
– Lacks tool support
– Harder to write ad-hoc queries (SPARQL vs. SQL)
21. Important Reminders
• Other options exist apart from the relational database
• Fit the technology to the domain model, not the domain model to the technology
23. Part 2: Collaborative Filtering
• Calculating Similarities
• User based filtering
• Item based filtering
26. Why?
• Sell more items
• Increase market share
• Better targeted advertising
• Up-sell rather than new-sell
• Make more £££
• Not perfect
– Bad recommendations
– Inappropriate recommendations
33. Calculating the Euclidean Distance Score
• Done for each pair of people
• Take the difference in each axis
• Square each difference
• Add them together
• Add 1 (avoids divide by zero)
• Take the square root
• Invert
34. Chris and Simon
• Difference in each axis
– (5-1), (4-3) = 4, 1
• Square
– 16, 1
• Add them together
– 17
• Add 1 (avoids divide by zero)
– = 18
• Square Root
– = 4.24264069
• Invert
– = 0.23570226
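The steps above translate directly into a short function following the slide's formula, 1 / sqrt(sum of squared differences + 1). The movie names below are illustrative; the slide only gives the rating differences 4 and 1.

```python
from math import sqrt

def euclidean_score(a: dict, b: dict) -> float:
    """Similarity between two people's ratings, per the slide:
    difference per axis, square, sum, add 1, square root, invert."""
    shared = [item for item in a if item in b]  # commonly rated items
    if not shared:
        return 0.0
    sum_sq = sum((a[item] - b[item]) ** 2 for item in shared)
    return 1 / sqrt(sum_sq + 1)

# Chris and Simon from the worked example: differences of 4 and 1
chris = {"Titanic": 5, "Seven": 4}
simon = {"Titanic": 1, "Seven": 3}
print(round(euclidean_score(chris, simon), 8))  # → 0.23570226
```

The +1 means two identical rating sets score 1.0 rather than triggering a divide by zero.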
35. Euclidean Distance Score
• Easy to calculate
• Bad for people who are similar but consistently rate higher/lower
36. Pearson Correlation Coefficient
• More Complicated
• Line of Best Fit between commonly rated items
• Deals with grade inflation
• Other measures
– Jaccard Coefficient
– Manhattan Distance
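A sketch of the Pearson calculation over commonly rated items (the textbook sum-based form; the ratings below are illustrative):

```python
from math import sqrt

def pearson_score(a: dict, b: dict) -> float:
    """Pearson correlation coefficient over items both users rated."""
    shared = [item for item in a if item in b]
    n = len(shared)
    if n == 0:
        return 0.0
    sum_a = sum(a[i] for i in shared)
    sum_b = sum(b[i] for i in shared)
    sum_a_sq = sum(a[i] ** 2 for i in shared)
    sum_b_sq = sum(b[i] ** 2 for i in shared)
    sum_ab = sum(a[i] * b[i] for i in shared)
    num = sum_ab - (sum_a * sum_b / n)
    den = sqrt((sum_a_sq - sum_a ** 2 / n) * (sum_b_sq - sum_b ** 2 / n))
    return 0.0 if den == 0 else num / den

# A "grade inflator" who rates everything one point higher
# still correlates perfectly, which is how this deals with
# consistently high/low raters:
a = {"Titanic": 3, "Seven": 4, "Shawshank": 5}
b = {"Titanic": 4, "Seven": 5, "Shawshank": 6}
print(pearson_score(a, b))  # → 1.0
```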
37. User based Filtering
• Look at what similar people have liked but you haven’t seen
– What if a similar person likes something that has bad reviews from everyone else?
• Use a weighted score that ranks the other people and takes similarity into account
38. Recommending Items
              Similarity (ED)   Titanic   Sim x Titanic   Seven   Sim x Seven
Chris         0.23              4         0.92
Paul          0.78              2         1.56            4       3.12
Total                                     2.48                    3.12
Sim Sum                                   1.01                    0.78
Total/Sim Sum                             2.455445545             4
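The weighted score in the table above can be reproduced with a short sketch (the function name is mine; the similarities and ratings come from the slide):

```python
def recommend(similarities: dict, ratings: dict) -> dict:
    """Predicted rating per item: sum(sim * rating) / sum(sim),
    summing only over the people who actually rated the item."""
    totals, sim_sums = {}, {}
    for person, sim in similarities.items():
        for item, rating in ratings[person].items():
            totals[item] = totals.get(item, 0) + sim * rating
            sim_sums[item] = sim_sums.get(item, 0) + sim
    return {item: totals[item] / sim_sums[item] for item in totals}

similarities = {"Chris": 0.23, "Paul": 0.78}
ratings = {"Chris": {"Titanic": 4}, "Paul": {"Titanic": 2, "Seven": 4}}
scores = recommend(similarities, ratings)
print(round(scores["Titanic"], 2))  # → 2.46
print(round(scores["Seven"], 2))    # → 4.0
```

Dividing by the similarity sum rather than the head count is what stops an item rated by many dissimilar people from dominating.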
41. User Based Filtering - Conclusions
• Calculate similarity between users
• Recommend based on similar users
• Similarity
– Euclidean Distance Score
– Pearson Coefficient – better for non-normalised data
• Problem – need to compare every user/item to every other user/item
42. Item Based Filtering
• Pre-compute the most similar items for each item
– Item similarities change less often than user similarities and can be re-used
• Create a weighted list of the items most similar to the user’s top rated items
43. Recommending Items
               Rating   Titanic (ED)   Rat x Titanic   Seven (ED)   Rat x Seven
Shawshank      5        0.084          0.42            0.366        1.83
The Ghost      4        0.125          0.5             0.487        1.948
Lock Stock     4        0.091          0.364           0.318        1.272
Love Actually  1        0.737          0.737           0.184        0.184
Total                   1.037          2.021           1.355        5.234
Normalised (Rating / Similarity)       1.948                        3.862730627
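The same weighting works item-to-item, as a sketch using the similarities and ratings from the table (function name is mine):

```python
def item_based_recommend(user_ratings: dict, item_sims: dict) -> dict:
    """Predicted score per unseen item:
    sum(rating * similarity) / sum(similarity)."""
    totals, sim_sums = {}, {}
    for rated_item, rating in user_ratings.items():
        for candidate, sim in item_sims.get(rated_item, {}).items():
            totals[candidate] = totals.get(candidate, 0) + rating * sim
            sim_sums[candidate] = sim_sums.get(candidate, 0) + sim
    return {c: totals[c] / sim_sums[c] for c in totals}

user_ratings = {"Shawshank": 5, "The Ghost": 4,
                "Lock Stock": 4, "Love Actually": 1}
item_sims = {  # pre-computed similarity of each rated item to each candidate
    "Shawshank":     {"Titanic": 0.084, "Seven": 0.366},
    "The Ghost":     {"Titanic": 0.125, "Seven": 0.487},
    "Lock Stock":    {"Titanic": 0.091, "Seven": 0.318},
    "Love Actually": {"Titanic": 0.737, "Seven": 0.184},
}
scores = item_based_recommend(user_ratings, item_sims)
print(round(scores["Titanic"], 3))  # → 1.949
print(round(scores["Seven"], 3))    # → 3.863
```

(2.021 / 1.037 is 1.9489, shown truncated as 1.948 in the table.) Note that `item_sims` is the pre-computed data set the slide says must be maintained.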
46. Item Based Filtering - Conclusions
• Calculate Similarity between items
• Recommend based on user’s ratings for items
• Similarity (as before)
– Euclidean Distance Score
– Pearson Coefficient – better for non-normalised data
• Problem – need to maintain item similarity data set
47. Item vs. User Based Filtering
• Item based scales better
– Need to maintain the similarities data set
• User based is simpler to implement
• May (or may not) want to show users who is similar in terms of habits
• Both perform equally well on dense data sets
• Item based performs better on sparse data sets
Information overload: we are creating too much data to be able to store it. Sources include digital cameras, video cameras, CCTV, VoIP, sensors and medical imaging.

Over time, data has evolved to be more interlinked and connected: hypertext has links, blogs have pingbacks, tagging groups related data, and ontologies formalise it further. In the GGG the relationships contain information, rather than the data items. For example, friends on Facebook: the data was there before, but the relationships are the important part.

Applications in the 70s and 80s were simple and rigid; that doesn't work in today's interconnected world. Semi-structured data is a poor fit for an RDBMS, which is why Facebook, Twitter, etc. have had to build their own databases.

Dynamo is used internally at Amazon in services like S3 and EC2. Quorum (N, R, W): N = number of replicas that will be written to; W = number of responses to wait for before a write succeeds; R = number of responses that must agree before a read is returned. This means N-R (for reads) or N-W (for writes) nodes can go down and the system still functions.
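The quorum arithmetic in that note can be checked with a toy function. This is a sketch of the conditions only, not Dynamo's implementation; the R + W > N check is the standard condition for read and write sets to overlap.

```python
def quorum_ok(n: int, r: int, w: int, nodes_up: int) -> dict:
    """Check whether reads and writes can still reach quorum,
    and whether read and write sets are guaranteed to intersect."""
    return {
        "writes_possible": nodes_up >= w,   # W acks needed for a write
        "reads_possible": nodes_up >= r,    # R responses needed for a read
        "overlap": r + w > n,               # reads see the latest write
    }

# The usual (N, R, W) = (3, 2, 2): one node can fail
print(quorum_ok(3, 2, 2, nodes_up=2))
# → {'writes_possible': True, 'reads_possible': True, 'overlap': True}
```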