2. Agenda
CPG Mission and Value Proposition
Fit within the Yahoo Stack
Drill-down: User Generated Content (UGC)
Drill-down: User Location
Drill-down: Web Extractions
Drill-down: Trending
Q&A
Yahoo! Presentation, Confidential 2
3. Cloud Platform Group Mission
Create a global, scalable platform built on
science that enables rapid innovation and
delivery of personalized, monetizable
experiences across devices.
Yahoo! Presentation, Confidential 3 3/29/2012
4. CPG Value Proposition
1 Agility with Stability
LEGO powered by Content Agility
6. ILLUSTRATIVE SAMPLE
CPG powers all of Yahoo! today
DISPLAY ADS
powered by Hadoop
3x improvement in accuracy of ad placements and our ability to forecast supply over legacy systems
MAIL
powered by Edge, Storage, Ranking, & Hadoop
40% faster download time, 300K+ spam mails blocked/sec
FRONT PAGE
powered by CORE
Increased CTR by +263% for Today Module by serving the right content to the right user (over pre-CORE)
LEGO (YPP)
powered by Content Agility
Reduce time to launch new sites from quarters to weeks
SOCIAL CHROME
powered by Social Platform
Over 22M net cumulative installs since launch; integrated into News, Games, Movies, OMG, TV
LIVESTAND
powered by Mobile & Cocktails Presentation Services
Seamlessly distribute content across devices in an experience that is elegant and personalized
7. User Generated Content
Unified, scalable platform that enables self expression and gets users to
connect over content
USE CASE
Increase content stickiness and user retention; drive repeat usage across the Yahoo! network
RESULTS
UGC platforms are used by over 200 Yahoo! properties with over 650M UGC actions per year
Comments: 6M comments per month
Message Boards: 1/3 of US Finance traffic from MB
Ratings & Reviews: 40M user ratings per month
Polls: 1.2M poll votes per month
SOLUTION
UGC Cloud is a scalable, real-time platform that lets users express themselves, resulting in increased user engagement and a vibrant Yahoo! community
8. User Generated Content – Applications
Improving Comment Quality
Three-pronged approach – machine, human, and community moderation
300M analyzed, 70M blocked with machine moderation
Reactive Volume (cost of reacting to abuse) avoided
Sentiment Slider
http://news.yahoo.com/open-business-free-agency-set-begin-211828913--spt.html
10. User Generated Content – In the Works
Topical Organization of Comments
Social Conversations
11. User Location
Store, manage & share user locations and locations of interest to create
deeply personal digital experiences
USE CASE
User location information was siloed, inconsistent, and not shareable across properties and users
RESULTS
Properties can launch location aware services with faster time to market on a single platform
237M users with 550M locations
LOCDROP
Management, Authorization, and Control
Normalized, Geo-Aware User Locations
Centralized, Consistent, and Contextual
Accurate, Relevant, Valuable Experiences
Increase Content, Targeting and Revenues
SOLUTION
Create a single data store of user locations, shareable across Yahoo! properties and advertising systems
14. User Generated Places: Enable users to submit (and curate) a
location if one does not exist
Android Messenger Use Case
User cannot find a place and decides to create
a new location to check-in
User is asked for permission to detect current
location from device
User's location is marked on a map. This is
used to get the lat/long of the created place
User enters a location “Russian Tea Room”
A new location is stored in UGP platform and
the user is checked-in to this location
User has an option to curate the locations
created by other users
UGP platform enables algorithmic curation
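The check-in flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the UGP platform's API: `Place`, `PlaceStore`, `create_and_check_in`, `curate`, and the vote-threshold curation rule are hypothetical names and choices introduced for the example, and the coordinates are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Place:
    """A user-submitted place (hypothetical record shape)."""
    name: str
    lat: float
    lon: float
    created_by: str
    votes: int = 0                      # net curation votes from other users
    check_ins: list = field(default_factory=list)


class PlaceStore:
    """In-memory stand-in for a place store (an assumption, not the real backend)."""

    def __init__(self, approve_threshold: int = 3):
        self.places = {}
        self.approve_threshold = approve_threshold

    def find(self, name):
        return self.places.get(name.lower())

    def create_and_check_in(self, user, name, lat, lon, location_permitted):
        # The device location is used only if the user granted permission.
        if not location_permitted:
            raise PermissionError("user declined location detection")
        # Create the place only when it does not already exist.
        place = self.find(name)
        if place is None:
            place = Place(name=name, lat=lat, lon=lon, created_by=user)
            self.places[name.lower()] = place
        # The user is checked in to the new or existing place.
        place.check_ins.append(user)
        return place

    def curate(self, voter, name, upvote):
        # Other users curate; the creator cannot vote on their own place.
        place = self.find(name)
        if place is None or voter == place.created_by:
            return False
        place.votes += 1 if upvote else -1
        return True

    def is_approved(self, name):
        # Toy algorithmic curation rule: approve once net votes reach the threshold.
        place = self.find(name)
        return place is not None and place.votes >= self.approve_threshold


store = PlaceStore(approve_threshold=2)
place = store.create_and_check_in("alice", "Russian Tea Room", 40.765, -73.979, True)
store.curate("bob", "Russian Tea Room", upvote=True)
store.curate("carol", "Russian Tea Room", upvote=True)
print(place.check_ins, store.is_approved("Russian Tea Room"))
```

A production system would presumably also deduplicate near-identical names and nearby coordinates; the sketch keys places by lower-cased name only to stay short.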
15. KAFE: Technologies*
[Diagram: extraction technologies arranged by Editorial Effort and Precision]
Web Content: Large Aggregator Websites (e.g. amazon); Small Websites (e.g. community sites); Behind the Form sites (Deep Web)
Technologies: Manual SDE Rules; S.D.E; Dapper; PSOX (Y! Labs) – unsupervised extractions from a large number of websites
Other labels: Bing WCC; YST HVC Live Pages (LLFS); KAFE
Consumers: W.O.O; Properties (Goldrush, Dish-a-wish, Restaurant Photos); Legacy Backend
* Supports Multiple Sources of Data and Multiple Technologies
18. Answers Not Links
S-DEKAFE XSL Rules
Creating Vertical Search Experiences for
Recipes
19. Answers Not Links
PSOX-Unsupervised Extractions
Looking for where to buy Amana dishwashers? Y! Goldrush
Craving hummus in Sunnyvale? Y! Dish-a-Wish
20. Enhanced Listings
Dappfactory
Before:
After:
• Taken from Roadmap deck for Y! Local by Erin Johns
• Data being provided to Y! Local, Front End revamp on Local Roadmap
21. Local Events for N.I.L.E
Extracted using Dappfactory
As of Feb ‘12, over 22,000 events for 250 US cities have been extracted using Dappfactory
22. Data Extraction – Challenges
Technology whitespace
Head – Fully manual extraction scales fine and gives high precision.
Torso – Mostly human-assisted learning. Recall and precision drop,
but remain acceptable for production use.
Tail – The only option is ML models with no human in the loop.
Recall and precision need a lot of improvement.
Semantic Web initiatives – Web of Objects
Linked Open Data formats (RDFa, OWL, SPARQL)
LOD Cloud – A few thousand data sets, tens of billions of
interlinked facts.
Confhopper – Sample/demo application
Unstructured Corpus – NLP Extraction
Systems/Engineering challenges – low-latency
processing, tokenization/parsing, international support
Sciences challenges –
polysemy, synonymy, aboutness/concepts, sentiment analysis.
CAP – Contextual Analysis Platform
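The head/torso/tail trade-off above is stated in terms of precision and recall. As a minimal sketch of how the two are computed for one extraction run, assuming the extracted records can be compared against a hand-labeled gold set (the function name and the sample records are hypothetical):

```python
def precision_recall(extracted: set, gold: set) -> tuple:
    """Return (precision, recall) of extracted records against a gold set."""
    true_positives = len(extracted & gold)
    # Precision: share of extracted records that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold records that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (name, city) records, for illustration only.
gold = {("Cafe A", "Sunnyvale"), ("Cafe B", "Sunnyvale"),
        ("Cafe C", "San Jose"), ("Cafe D", "San Jose")}
extracted = {("Cafe A", "Sunnyvale"), ("Cafe B", "Sunnyvale"),
             ("Cafe X", "Sunnyvale")}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

Here two of three extracted records are correct (precision 0.67) and two of four gold records are found (recall 0.50).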
23. TimeSense – usecases/business value proposition
Search Suggestions in SD box – Timesense-powered suggestions triggered for 6% of all gossip requests
US FP Trending Now local pool for a given DMA powered by TS – 6% CTR lift attributed to local terms
Trending searches in Left Rail on Yahoo US SRP – triggered
for ~6% of all user queries
TW FP Trending Now automated by
Timesense API
Plumbing, Monetization, & Games
24. TimeSense
In Bucket
AUTOMATED trending module on shopping.yahoo.com : First module with no editorial intervention, vertically categorized
trends, fast refresh and rotating terms
Soon to Launch
HK, TW, and KR automated trends modules on FP, Mail, OMG, News, etc.
Editorial Power users of Timesense
• Search Forecasting Editorial Team – updates sent twice a day to 500+ subscribers
• FP Trending Now team
• Regional content programming, search editorial, and SEO teams: US, UK, HK, TW, IN [Q1 launch – all INTLs]
Upcoming
• Trending Now Syndication for Yahoo Hosted Search partners – via BOSS
• Trending Image experience
• Trending Now 2.0 automation expansion
25. Trending topic detection – Challenges
Systems Challenges
• Low latency requirement
• GBs of data analyzed from multiple data sources every 5
minutes
• Scalability – different verticals, segmented models.
• High Availability requirement
Sciences Challenges
Algorithmic improvements for near real time detection without
precision loss
Short Phrase Categorization
Deduping/Clustering – intent detection
Segmentation/Smoothing – Age/gender/Behavioral Tracking
Categories/Geography – signal sparsity with fine grained
segmentation.
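The sparsity problem named above can be illustrated with a small scoring sketch. It is not the TimeSense algorithm, which the slides do not describe; it assumes a simple ratio of the current 5-minute window count to the historical mean, with additive smoothing so that rare terms in sparse segments do not produce extreme scores. The function name, the smoothing constant, and the sample counts are all hypothetical.

```python
from collections import Counter


def trend_scores(current: Counter, history: list, smoothing: float = 5.0) -> dict:
    """Score each term by its current count over its smoothed historical mean.

    `current` holds term counts for the latest window; `history` is a list of
    Counters for earlier windows. `smoothing` is added to both numerator and
    denominator to damp ratios for terms with very small counts.
    """
    n = len(history)
    scores = {}
    for term, count in current.items():
        # Counter returns 0 for terms missing from a historical window.
        baseline = sum(h[term] for h in history) / n if n else 0.0
        scores[term] = (count + smoothing) / (baseline + smoothing)
    return scores


# Hypothetical query counts per 5-minute window.
history = [Counter({"weather": 100, "nfl draft": 5}),
           Counter({"weather": 110, "nfl draft": 5})]
current = Counter({"weather": 105, "nfl draft": 45, "rare term": 2})

ranked = sorted(trend_scores(current, history).items(),
                key=lambda kv: kv[1], reverse=True)
for term, score in ranked:
    print(f"{term}: {score:.2f}")
```

With these numbers "nfl draft" scores 5.00, "rare term" 1.40, and "weather" 1.00. Without smoothing, "rare term" would divide by a zero baseline, which is the kind of noise fine-grained segments tend to produce.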
Editor's Notes
From Siloed to Platform
Earlier everything was a technology and a data silo. Built one-off.
CPG
1. We had to get everyone on the same technology – stable unified platform services for powering innovation.
Trade-off: Agile (does not scale later) vs. Stable. People usually give on one vs. the other as they hurry to market. We can talk about tradeoffs in scale, latency, security, etc. Bring up the M&A example of RMX. All acquisition integrations have faced the same problem. RMX’s 300MM impressions did not scale (agility choice), we are now at 12B NGDs. We rebuilt the backend storage etc. Dapper was the same way.
2. Once everyone’s on the same system, then we can share data, apply science to data on the “platform” at scale to derive business value.
We can bring up LEGO as an example for siloed properties brought to a common content platform. Sherpa example for structured data storage, disparate MySQL to a common data store.
CPG powers ALL of Yahoo!
1. Display Ads (Emphasis on Hadoop)
7 clusters, 15K nodes, 17T/day, 10PB (APT 11 4PB, RMX 16 5.8PB). Categorize ads, BT targeting, predict user response, traffic protection.
Hadoop helps Yahoo! target billions of impressions per day across one of the largest ad networks in the world by processing declared data and recent activity to segment users and determine the right ad to serve in milliseconds. 3x improvement in accuracy of ad placements and our ability to forecast supply over legacy systems (MyNA/Panama & AWACS/All Warehouse Access System).
“Predict” – critical to serving apparatus
Machine Learned Categorization for Ads and Queries to automatically assign categories to web pages, ads, and queries
Keystone – Contextual Ads: predict and model user response based on all user context, including page content, user attributes like behavioral and geographical data, referrals to the page (how the user got there), and information about the publisher page.
Display Supply and Demand Forecasting
Future (supply) inventory forecasting
NGD: pricing forecasting computation – advisory use
NGD: estimate clicks from impressions
Traffic protection
Execute the trade in serving, and clean it up later for bad traffic, before it hits the revenue system
2. Mail (Emphasis on Cloud Services)
YCPI has been shown to improve download speed by over 40% for Mail. Hadoop helps block over 300,000 spam mails/sec globally.
24B
Mail, the best monetized product at Yahoo! and at the heart of the Yahoo! network, fully leverages the power of Cloud. At the same time, it also leverages several other platform capabilities such as Membership services (over 226,000 new good accounts created per day for U.S. Mail alone, 72M successful logins/day). MobStor is used to solve the attachment de-dup problem to increase efficiency. Ranking Systems (Vespa) is used to search through the mailboxes/folders.
3. Lego (Emphasis on Core Content Services)
135 regional Media sites have moved to Content Agility last year alone.
Leverages Content Agility as the single, grid-based, highly scalable CMS instead of the siloed approaches for CMS, front-end development, and editorial that properties had earlier (pre-Lego). Lego provides reusable UI modules and shared tools to reduce time to launch new sites from quarters to weeks. Content Agility and Lego power the content network and bring agility to Yahoo! properties.
4. Front Page (Emphasis on KAPS – Personalization)
CORE increased CTR by +263% for Today Module vs. pre-CORE
CORE enables a real-time feedback loop across properties, leveraging user interest, intent, and context to optimize user engagement. Increase engagement by showing the right content to users with input from science & human editors. CORE delivers the most relevant experience on the Web by serving the right content to the right user.
5. Social Chrome (Emphasis on KAPS – Social)
Over 22M net cumulative installs since launch, 620K Facebook referrals generated daily. Daily active users crossed 1MM within 5 weeks of launch. Vitamix/Vitality powers social chrome on all Y! properties worldwide to increase user engagement by surfacing relevant activities from friends. Vitamix provides Facebar of Friends with activity, activity history of a friend, friends activity feed, and friends activity on top articles. Several ranking-type initiatives for 2012 (rank friends, show most shared article, etc.).
6. Livestand (Emphasis on MPS – Cocktails)
Leverages Cocktails, a presentation platform and application framework built on YUI3, to create connected experiences across devices with a single codebase. Provides a simple way for publishers and advertisers to seamlessly distribute content across devices in an experience that is elegant and personalized – single serving stack across applications, framework, and runtime.
Fragmented approaches slow innovation and create tech debt – one stack per device class (web 1.0, web 2.0, iOS/Android, and Feature Phones). Cocktails provides reusable modules across devices and properties, server side JavaScript execution engine, high-efficiency HTTP server for personalization/ 2-way browser-server communications, and cloud hosted applications for easy deployment and bucket testing.
What if the location does not exist? Especially for building the business listings database, User Generated Places (UGP) provides the capability to crowdsource and algorithmically curate locations.
Ingesting over 10,000 RSS feeds with an average of 3000 / day.
Location as a key pillar of Personalization.
Crowdsourced with confidence levels (Messenger use case).
For properties that require more precision, like Local and Travel, we are adopting a multi-pronged approach:
Extractions from the deep web
Early work with Sciences to apply algos