Entity-Centric Indexing
Mark Harwood @elasticmark
4/6/2015

Mark Harwood - Building Entity Centric Indexes - NoSQL matters Dublin 2015

1,684 views


Sometimes we need to step back and take a look at the bigger picture: not just counting huge piles of individual log records, but reasoning about the behaviors of the people who are ultimately generating this firehose of data. While your DevOps folks care deeply about log records from a machine utilization perspective, marketing wants to know what these records tell us about the customers' needs.

Elasticsearch aggregations are a great feature but are not a panacea. We can happily use them to summarise complex things like the number of web requests per day broken down by geography and browser type on a busy website, but we would quickly run out of memory if we tried to calculate something as simple as a single number for the average duration of visitor web sessions using the very same dataset. Why does this occur? A web session duration is an example of a behavioural attribute not held on any one log record; it has to be derived by finding the first and last records for each session in our weblogs, which requires some complex query expressions and a lot of memory to connect all the data points.

We can maintain a more useful joined-up picture if we run an ongoing background process to fuse related events from one index into “entity-centric” summaries in another index, e.g.:
• Web log events summarised into “web session” entities
• Road-worthiness test results summarised into “car” entities
• Reviews in a marketplace summarised into a “reviewer” entity

Using real data, this session will demonstrate how to incrementally build entity-centric indexes alongside event-centric indexes by using simple scripts to uncover interesting behaviours that accumulate over time. We'll explore:
• Which cars are driven long distances after failing roadworthiness tests?
• Which website visitors look to be behaving like “bots”?
• Which seller in my marketplace has employed an army of “shills” to boost his feedback rating?

Attendees will leave this session with all the tools required to begin building entity-centric indexes and using that data to derive richer business insights across every department in their organization.
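
To make the session-duration point concrete, here is a minimal sketch in plain Python of fusing log events into one summary per web session. It is an illustration under stated assumptions, not the talk's code: the event tuples, the field names (`first`, `last`, `entry_page`, `exit_page`, `duration`) and the `build_sessions` helper are all hypothetical, and no Elasticsearch calls are involved.

```python
from statistics import mean

# Hypothetical log events: (session_id, unix_timestamp_seconds, page).
# They arrive interleaved, as they would in a time-based event index.
events = [
    ("s1", 100, "/home"),
    ("s2", 105, "/home"),
    ("s1", 160, "/cars"),
    ("s2", 110, "/contact"),
    ("s1", 400, "/checkout"),
]


def build_sessions(events):
    """Fuse events into one summary dict per session (the "entity")."""
    sessions = {}
    # Sort by entity ID and then time, so each session's events are
    # consumed together and in chronological order.
    for session_id, ts, page in sorted(events, key=lambda e: (e[0], e[1])):
        session = sessions.setdefault(
            session_id,
            {"first": ts, "last": ts, "entry_page": page,
             "exit_page": page, "num_events": 0},
        )
        session["last"] = ts
        session["exit_page"] = page
        session["num_events"] += 1
        # The derived attribute lives on the entity, not on any one event.
        session["duration"] = session["last"] - session["first"]
    return sessions


sessions = build_sessions(events)

# Once the entities exist, the "expensive" question is a trivial average.
average_duration = mean(s["duration"] for s in sessions.values())
print(average_duration)
```

The cost of joining events is paid once, when the entities are built; questions such as average duration or most common exit page then read a single field per session.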

Published in: Data & Analytics


  1. Entity-Centric Indexing. Mark Harwood, @elasticmark, 4/6/2015
  2. Entity-centric indexes (or “when aggregations don’t cut it”). www.elastic.co
  3. A typical “event-centric” deployment. Diagram labels: event stream; time-based event indexes. (Copyright Elastic 2015. Copying, publishing and/or distributing without written permission is strictly prohibited.)
  4. Problem: some aggregations are expensive. We need to join all event-level data together at query time. Example: using web server log data, answer the question “How long on average do customers spend on my site?”
  5. How to cripple elasticsearch with a bucket explosion: (1) ask a question about values that need to be derived from multiple documents (e.g. deriving a web session’s duration); (2) make the joining key a high-cardinality field, e.g. something like “IP address”; (3) extra points if you use no routing of your documents, so that related content is spray-gunned across multiple shards.
  6. Solution: a “pay-as-you-go” model for the costs of fusing data
  7. Solution: an “entity-centric” model. Diagram labels: usual stream of events; time-based event indexes; periodic extracts sorted by entity ID and time; entity-based summary indexes.
  8. Entity-centric queries
     • WebSessions: “How long on average do my customers spend on my site?”; “Which users behave like bots?”; “What is the most common exit page?”
     • Bank Accounts: “Does this new payment match the typical spending behaviour of bank account X?”
  9. Entity-centric queries
     • Buyers: “What do the users who bought product X also buy?”; “Which buyers behave like ‘shills’ and who are they promoting?”
     • Cars: “Which cars drove long distances after failing a road worthiness test?”
  10. Use case: web log analytics
  11. Use case: GFORCES
     • Analyses website traffic for retailers and manufacturers in the automotive industry
     • Summarises many behaviours over time, e.g. unique numbers of visitors per month, and engagement (average session durations)
     • Faced scaling issues producing some results from raw events
  12. Results of moving to entity-centric indexing
     • Data store contains 150m events generated by 26m user sessions
     • Event-centric aggregations were taking ~25 seconds
     • Equivalent entity-centric aggregations take <50ms
     • Simplified queries for common entry pages, common exit pages etc.
  13. Worked example: Amazon marketplace reviews, building profiles for reviewers. Play along! Code + data here: bit.ly/entcent
  14. An “entity-centric” model
     • AmazonReviews (an event-centric index): loaded from reviews.csv by loadEvents.sh. Review event fields: rating, seller, reviewer, date.
     • AmazonReviewers (an entity-centric index): built by buildEntities.sh, which:
       • Drops and creates the reviewers index.
       • Uses the Python client to query and scroll the list of reviews sorted by reviewerId and time.
       • Pushes _update requests from Python to ~400k “Reviewer” documents, each containing bundles of their recent reviews, using the bulk indexing API.
       • Relies on a shard-side Groovy script to collapse the multiple reviews into a single reviewer JSON document summarising behaviour.
     • Reviewer entity fields: positivity, num sellers reviewed, last 50 reviews, profile (“newbie”, “fanboy” etc.)
  15. Anatomy of an entity indexing Groovy script. Callouts: initialize if new document; loop to consolidate latest events; re-run risk profile logic; load stored state. Store the script in ES_HOME/config/scripts/foo.groovy
  16. Insight: which sellers have a lot of fanboys? Seller #187 has more than his fair share of “fanboy” reviewers …
  17. Drilling down into seller #187’s fanboys: suspiciously synchronised behaviour
  18. Worked example: UK 2013 car road worthiness tests
  19. Example background
     • In the UK all vehicles must pass an annual roadworthiness test, called an MOT (named after the Ministry of Transport).
     • It is illegal to drive a car that has failed an MOT (unless driving home from a test or to a repair centre).
     • Taxis and other forms of public transport have to be tested more frequently: every 6 months.
     • All data is freely available from data.gov.uk but with anonymised vehicle ID and inexact test locations.
  20. Example background
     • MOTs index: loaded from mots.csv by loadMOTs.sh, which:
       • Drops and creates the mots index.
       • Uses the Python client to bulk load all 37m road worthiness test results for 2013 (data source: http://data.gov.uk/).
     • Cars index: built by buildEntities.sh, which:
       • Drops and creates the cars index.
       • Registers CarProfileUpdater.groovy as a stored script.
       • Uses the Python client to query and scroll the list of MOT test results sorted by vehicle ID and time.
       • Pushes _update requests from Python to ~27m “Car” documents, each containing bundles of related MOT test results, using the bulk indexing API.
       • Relies on a shard-side Groovy script to collapse the multiple tests into a single summary JSON document for a car, deriving summaries.
     • MOT event fields: result (pass/fail), vehicle ID, make + model + age, mileage, test date, test location
     • Car entity fields: make + model + age; last test result, date, location; miles driven while failed; days between fail and fix; complete test history; suspected bad mileometer readings
  21. Data fusion logic: car attributes derived from 3 test result documents (1, 2, 3). Diagram labels: test date; mile-o-meter reading; daysForFix; badReading?; milesDrivenAfterFailure; mile-o-meterRewind
  22. Insight: who is driving failed vehicles? Q: Why is there an unexpected peak in milesDrivenWithFailure around 6 months? A: Taxis
  23. Insight: taxis keep on trucking after failures.
  24. Recycling user behaviours: a user-centric index as a recommendation engine
  25. Example background: Movielens data
     • A public dataset* of 10m movie ratings made by 71k users
     • One elasticsearch document per user with a list of their movie ratings
     * http://files.grouplens.org/datasets/movielens/ml-10m-README.html
  26. “Uncommonly common” user behaviours
  27. Conclusions
  28. Entity centric indexing: advantages
     • Efficient and simple queries
     • Advanced analytics/insights
     • Can provide a cheaper data retention policy (daily->weekly->monthly roll-ups)
     • Can reuse existing elasticsearch APIs or build entity documents using external technologies
  29. Entity centric indexing: tips
     • Avoid “fat entities”: use forgetful collections (priority queues, circular buffers, HyperLogLog)
     • Avoid pointless updates: use ctx.op=“none” to avoid writes of insignificant changes
     • Consider options for reducing event volumes: use aggregations when gathering events; reduce related events in the event-gathering script that issues updates
     • Parallelise the pull of event information
  30. Entity centric indexing
     • Incremental entity updates can be achieved by querying all events since the timestamp of the last run
     • Data integrity: implement policies for handling any failures in performing entity updates, and for retiring old entities (use of TTL?)
  31. Questions? @elasticmark
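
The car example and the tips on the last slides can be sketched together in plain Python. This is a hedged illustration, not the CarProfileUpdater.groovy script from the talk: `new_car`, `update_car`, `MAX_HISTORY` and the dictionary field names are hypothetical, the history cap of 50 is an assumed value, and the returned string only mimics the idea behind ctx.op=“none” rather than calling Elasticsearch.

```python
from collections import deque
from datetime import date

MAX_HISTORY = 50  # assumed cap; a circular buffer keeps the entity from growing "fat"


def new_car():
    """Initial state for a car entity that has no stored document yet."""
    return {
        "history": deque(maxlen=MAX_HISTORY),  # forgetful collection
        "last_test_date": None,
        "miles_driven_while_failed": 0,
        "days_between_fail_and_fix": None,
    }


def update_car(car, tests):
    """Fold a bundle of test events into the car entity.

    Returns "index" if the entity changed, or "none" if the bundle held
    nothing new, so the caller can skip a pointless write.
    """
    # Incremental update: only consider tests newer than the last one seen.
    fresh = sorted(
        (t for t in tests
         if car["last_test_date"] is None or t["date"] > car["last_test_date"]),
        key=lambda t: t["date"],
    )
    if not fresh:
        return "none"
    for test in fresh:
        previous = car["history"][-1] if car["history"] else None
        if previous and previous["result"] == "fail":
            # Mileage gained since a failed test was driven "while failed";
            # negative differences (a rewound reading) are clamped to zero.
            car["miles_driven_while_failed"] += max(
                0, test["mileage"] - previous["mileage"])
            if test["result"] == "pass":
                car["days_between_fail_and_fix"] = (
                    test["date"] - previous["date"]).days
        car["history"].append(test)
        car["last_test_date"] = test["date"]
    return "index"


# Hypothetical test events for one vehicle.
failed = {"date": date(2013, 1, 10), "result": "fail", "mileage": 50000}
fixed = {"date": date(2013, 1, 24), "result": "pass", "mileage": 50900}

car = new_car()
first_op = update_car(car, [failed, fixed])   # new events: entity changes
second_op = update_car(car, [failed, fixed])  # replayed events: nothing to write
print(first_op, second_op, car["miles_driven_while_failed"])
```

Replaying the same bundle produces no change, which is what makes re-running the gathering process after a failure safe in this sketch.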