7. Analyze & Report
Discover & Explore
Structured | Semi-Structured | Unstructured
SQL | SQL++ | Java/C++/Pig/Hive
Production Data Warehousing | Contextual-Complex Analytics | Structure the Unstructured
Large Concurrent User-base | Deep, Seasonal, Consumable Data Sets | Detect Patterns
Data Warehouse | Data Warehouse + Behavioral | Hadoop
Enterprise-class System | Low End Enterprise-class System | Commodity Hardware System
8+PB | 60+PB | 40+PB
10. Data
questions later
structure later
(<$0.04/GB, <$80/2TB)
single HDFS instances >50PB
Value > Cost
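The two storage figures on this slide agree with each other, and a quick calculation shows what they imply at the 50 PB scale. This is a rough sketch, not a cost model: it assumes decimal units (1 TB = 1,000 GB) and counts raw disk only. HDFS keeps multiple copies of each block (three by default), and servers, power, and operations are ignored, so the real cost is higher. The function names are made up for illustration.

```python
# Assumption: decimal units (1 TB = 1,000 GB, 1 PB = 1,000,000 GB).
GB_PER_TB = 1_000
GB_PER_PB = 1_000_000


def price_per_gb(drive_price_usd, drive_size_tb):
    """Price per GB for a drive of the given price and size."""
    return drive_price_usd / (drive_size_tb * GB_PER_TB)


def raw_disk_cost(petabytes, usd_per_gb):
    """Raw disk cost for one copy of the data, nothing else included."""
    return petabytes * GB_PER_PB * usd_per_gb


usd_per_gb = price_per_gb(80, 2)            # the slide's $80 per 2 TB drive
cost_50pb = raw_disk_cost(50, usd_per_gb)   # the slide's 50 PB HDFS instance
print(f"${usd_per_gb:.2f}/GB, about ${cost_50pb:,.0f} of raw disk for 50 PB")
```

At the slide's upper-bound price, one copy of 50 PB is roughly $2 million of disk, which is the kind of number "Value > Cost" has to beat.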
11. Designing for the Unknown
>85% of analytical workload is NEW & Unknown
The metrics you know are cheap
The metrics you don’t know are expensive – but high in potential ROI
Exploration & Testing are core pillars of an analytics-driven
organization
14.
Site | Key | Expansion | Top Query | Note
US | diaries | diary | |
US | baggies | baggy | |
US | cranberries | cranberry | |
US | jogging | jog | |
US | fishing sticker | fish stickers | |
UK | panels | panelling | |
UK | protection | protecter | |
UK | lining | lined | |
UK | animation | animated | |
UK | trucks | trucking | |
UK | edging | edges | |
UK | nets | netting | |
15.
Site | Key | Expansion | Top Query | Note
US | diaries | diary | vampire diaries |
US | baggies | baggy | patagonia baggies | good for patagonia baggy, not good alone
US | cranberries | cranberry | the cranberries |
US | jogging | jog | jogging stroller |
US | fishing sticker | fish stickers | fishing sticker | sports vs. kids rooms
UK | panels | panelling | fence panels |
UK | protection | protecter | mcafee total protection 2012 | screen protecter is top US query
UK | lining | lined | pink lining changing bag |
UK | animation | animated | animation cel |
UK | trucks | trucking | corgi trucks |
UK | edging | edges | garden edging |
UK | nets | netting | purse nets |
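The "Top Query" column can be produced by plain counting over a query log. The sketch below is one possible way to do it, not the method used at eBay: for a given key it returns the most frequent logged query that contains the key as a whole phrase, so a reviewer can judge whether the expansion still fits the dominant intent. The function names are hypothetical and the counts in `sample_counts` are invented for illustration.

```python
def contains_phrase(query, key):
    """True if the tokens of `key` appear consecutively in `query`."""
    q, k = query.split(), key.split()
    return any(q[i:i + len(k)] == k for i in range(len(q) - len(k) + 1))


def top_query_for_key(query_counts, key):
    """Most frequent query containing `key`, or None if no query does."""
    matches = {q: c for q, c in query_counts.items() if contains_phrase(q, key)}
    return max(matches, key=matches.get) if matches else None


# Invented counts, for illustration only.
sample_counts = {
    "vampire diaries": 900,
    "diaries": 300,
    "diary": 250,
    "the cranberries": 400,
    "cranberries": 120,
    "jogging stroller": 700,
}

for key in ("diaries", "cranberries", "jogging"):
    print(key, "->", top_query_for_key(sample_counts, key))
```

With these sample counts, "diaries" maps to "vampire diaries", which is exactly the case where expanding to "diary" would hurt.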
16.
17.
18. Value > Cost
$’s per year in incremental revenue
www.wallpapertimes.com
19.
20.
21. Toys and Hobbies
ATC > Artist trading card in ART
ATC > Automatic Tool Change in Business and Industrial
22. German Compound Words
• German compound words can be arbitrarily created and extremely long
Adidastrainingsanzug (Adidas track suit)
Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
(beef labeling regulation & delegation of supervision law)
• Syntactically, words can be combined and split in many ways.
• Some words shouldn’t be de-compounded.
beiden (both) – bei(at) den(the)
• Too many candidates for
Granitpflastersteine (granite paving stones)
Granit(granite) pflastersteine(cobblestones)
Granit(granite) pflaster(paving/band-aid) steine(stones)
• Binding characters
Hochzeitsschuhe (grammatically correct, 593 hits on ebay.de)
Hochzeitschuhe (129 hits on ebay.de).
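The problems listed above show up even in a very small dictionary-based splitter. The sketch below is a simplified illustration, not eBay's decompounder: it enumerates every way to cut a word into entries of a lexicon, optionally skipping a binding "s" between parts. The function name, the tiny lexicon, and the choice of "s" as the only binding character are assumptions made for the example.

```python
def split_compound(word, lexicon, linkers=("", "s")):
    """Return every split of `word` into lexicon entries.

    After each non-final part, one of `linkers` (a binding element such as
    "s", or nothing) may be skipped before the next part starts.
    """
    word = word.lower()
    results = []

    def walk(start, parts):
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece not in lexicon:
                continue
            if end == len(word):
                results.append(parts + [piece])
                continue
            for linker in linkers:
                nxt = end + len(linker)
                # A linker must be followed by at least one more character.
                if word.startswith(linker, end) and nxt < len(word):
                    walk(nxt, parts + [piece])

    walk(0, [])
    return results


# Tiny illustrative lexicon; a real one would come from a dictionary or query logs.
lexicon = {"granit", "pflaster", "steine", "pflastersteine",
           "hochzeit", "schuhe", "bei", "den", "beiden"}

for w in ("Granitpflastersteine", "Hochzeitsschuhe", "Hochzeitschuhe", "beiden"):
    print(w, "->", split_compound(w, lexicon))
```

Granitpflastersteine yields two candidate splits, both spellings of the wedding-shoes compound reduce to the same parts, and "beiden" is wrongly offered as bei + den. A lexicon alone cannot pick the right candidate, so some ranking signal (for example, query or listing frequencies) is still needed.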
23. Synonyms derived from top queries in item query clusters
texas instruments ba ii plus | ti ba ii plus
brighton handbag | brighton purse
lenovo x200 | thinkpad x200
king bedspread | king coverlet
rockabilly dress | swing dress
1963 ford falcon | 63 falcon
jessica simpson hair extensions | jessica simpson hairdo
Abbreviations/acronyms derived from query transitions
stanford ky | stanford kentucky
dc sub | dc subwoofer
snowboard helmet l | snowboard helmet large
motorcycle cam | motorcycle camera
diamond amp | diamond amplifier
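Mining abbreviations from query transitions can be sketched as counting. The example below is a guess at the general idea, not the production algorithm: it takes pairs of consecutive queries from a session, keeps the pairs that differ in exactly one token, and counts the cases where the shorter token looks like an abbreviation of the longer one (same first letter, letters appear in order). The function names, the abbreviation rule, and the sample transitions are all assumptions for illustration.

```python
from collections import Counter


def is_abbreviation(short, full):
    """True if `short` starts like `full` and its letters occur in order in `full`."""
    if len(short) >= len(full) or short[0] != full[0]:
        return False
    remaining = iter(full)
    return all(ch in remaining for ch in short)  # in-order subsequence check


def abbreviation_candidates(transitions):
    """Count (abbreviation, expansion) token pairs seen in query transitions."""
    counts = Counter()
    for before, after in transitions:
        a, b = before.split(), after.split()
        if len(a) != len(b):
            continue
        diffs = [(x, y) for x, y in zip(a, b) if x != y]
        if len(diffs) != 1:
            continue
        x, y = diffs[0]
        if is_abbreviation(x, y):
            counts[(x, y)] += 1
        elif is_abbreviation(y, x):
            counts[(y, x)] += 1
    return counts


# Invented session transitions, for illustration only.
sample_transitions = [
    ("stanford ky", "stanford kentucky"),
    ("dc sub", "dc subwoofer"),
    ("dc subwoofer", "dc sub"),
    ("snowboard helmet l", "snowboard helmet large"),
    ("motorcycle cam", "motorcycle helmet"),
]

for pair, n in abbreviation_candidates(sample_transitions).most_common():
    print(pair, n)
```

A frequency threshold on the counts would be the natural next step, since a single transition is weak evidence.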
Editor's notes
I work at eBay, every second…
BLANK SLIDE
Grocery store – 2 cans of soup
Point of No Return – people haven’t changed: motivation is still the same, everyone loves free, don’t waste hard-earned resources, make intelligent decisions. Technology HAS changed, and it is still accelerating; behavior is easier to capture and analyze.
Skip - Costco – Netflix – Blu-Ray – players I considered – players I DIDN’T consider – person asked, data collected - WHY? – mobile phone – 3-5x speed of home network – blend online/offline – just commerce
You are in business to make money. How do you know if the changes you make, make money? You HAVE to test. You can’t manage what you don’t measure. Testing is crucial.
Image: http://www.wallpapertimes.com/files/q/Yf/4j/qYf4jp9q86379020_800x600.jpg
Is my data BIG enough? Who cares. I don’t really care about defining how big "big" is. Big is whatever you need for detail-level (not aggregate) analysis.
Image: http://www.skimountaineer.com/ROF/OcAnt/BigBen/BigBenHeardIsland.jpg
Beyond aggregate data: session-level detail.
Item impression data – logging the items people DON’T click. We always knew what items people clicked on (view item page log). What about the items people did NOT click on? That needs impression logging; they’re just as informative.
Let’s bring this closer to home for you.
Market basket data – buy this, buy that.
Cowboy hats – detailed data.
Before we talk about the systems we have in place, let’s take a look at what happens in the industry and describe the buzzword of the year: Big Data.
A big data warehouse is a data warehouse that is an order of magnitude bigger than the one you have. So data volume alone is not what Big Data is about. The key change is the form of the data and its processing requirements. Since 2003, more data is processed in 2 days than mankind produced in the last 40,000 years. The rise of the machines!
Classical data warehousing stores data attributes in columns, nicely separated by the source application or the ETL process. That data is usually generated by direct user interactions and clearly defined transactions. The big boost in data volume comes from new data types like free-form text, audio, video, pictures, and graphs that do not easily fit into the structures of a database, or that pose quite a challenge to process.
The third key characteristic of Big Data is velocity, both in speed of processing and in speed of change. Initial use cases of Hadoop like spam filtering imply real-time processing combined with tremendous amounts of data.
With this in mind, let’s look at what analytics systems we have in place today.
What do we have at eBay? A DW for analysts comfortable with SQL & reporting. Hadoop for developers. You don’t have to do everything all at once: start and evolve.
Data is growing. Land it ONCE.
Add Moore’s law graphic. Get back-up data for data rate change. Jeff H slides?
In "The Physics of Data," a talk given in August 2009, Google VP Marissa Mayer noted that there have been three big changes to Internet data in recent times: Speed (real-time data); Scale ("unprecedented processing power"); Sensors ("new kinds of data"). Mayer went on to say that there were 5 exabytes of data online in 2002, which had risen to 281 exabytes in 2009. That's growth of about 56 times over seven years. Partly, she said, this has been the result of people uploading more data. Mayer said that the average person uploaded 15 times more data in 2009 than they did in 2006.
http://blog.appro.com/the-big-data-challenge-for-data-intensive-computing-applications/
http://www.enterpriseirregulars.com/40616/the-enterprise-opportunity-of-big-data-closing-the-clue-gap/
http://www.ameinfo.com/231603.html
http://www.f5.com/images/news-press-events/data-growth-monster.png
http://www.veecom.co.uk/2010/the-difficulties-of-streaming-video-over-3g/
http://www.kurzweilai.net/the-law-of-accelerating-returns
http://techcrunch.com/2010/03/16/big-data-freedom/
Let me summarize before turning to the search behavioral data I work with, to show you how you can use these principles to analyze your data.
Would you throw away money? Collect data: what seems big and expensive today will be cheap and valuable tomorrow. Don’t throw good data away.
Embed analytics in your business. Make it easy.
Agile Analytics is the ability to support analytical requirements in a TIMELY manner, irrespective of their complexity. Enable business agility vs. development agility. Agile Analytics enables the business to make decisions quickly and accurately.
Image from http://jonmell.co.uk/enterprise-20-enables-business-agility/
Documents are not enough anymore. Need behavioral data – Yandex beating Google in Russia. Why? They have users: refrigerators in Moscow vs. isolated small town.
How do we do this? Simple counting – that’s it, you "just" have to count.
Image: http://www.csie.ntnu.edu.tw/~u91029/Matching.html
Detail matters. Context is important.
"beef labeling regulation & delegation of supervision law" - long word