Yahoo! Hack India: Hyderabad 2013 | Building Data Products At Scale

2. DATAWEAVE: WHAT DO WE DO?
• Aggregate large amounts of data publicly available on the web, and serve it to businesses in readily usable forms
• Serve actionable data through APIs, visualizations, and dashboards
• Provide a reporting and analytics layer on top of datasets and APIs

3. DATAWEAVE PLATFORM
[Diagram labels: API Feeds; Data Services; Dashboards; Visualizations and Widgets; Data APIs; Big Data Platform; Attributes. Input data (unstructured, spread across sources, and temporally changing): Pricing Data, Open Government Data, Social Media Data]

4. HOW DOES IT WORK? (1)
• Crawling/Scraping: from a large number of data sources
• Cleaning/Deduplication: remove as much noise as possible
• Data Normalization: represent related data together in standard forms
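The cleaning, normalization, and deduplication steps above can be sketched roughly as follows. This is a minimal illustration, not DataWeave's actual pipeline: the record fields (`title`, `price`, `source`), the sample data, and all function names are assumptions made for the example.

```python
import re

def clean(record):
    """Collapse runs of whitespace and strip the ends of every field."""
    return {k: re.sub(r"\s+", " ", v).strip() for k, v in record.items()}

def normalize(record):
    """Map a raw record to a standard form (hypothetical schema):
    lowercase title, price as a float with currency text removed."""
    price = float(re.sub(r"[^\d.]", "", record["price"]))
    return {"title": record["title"].lower(), "price": price,
            "source": record["source"]}

def deduplicate(records):
    """Keep the first record seen for each (title, source) pair."""
    seen, unique = set(), []
    for r in records:
        key = (r["title"], r["source"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# Hypothetical scraped records, including one noisy duplicate.
raw = [
    {"title": "  Acme  Phone X ", "price": "INR 9,999", "source": "shop-a"},
    {"title": "Acme Phone X", "price": "INR 9,999", "source": "shop-a"},
    {"title": "ACME Phone X", "price": "10499.00", "source": "shop-b"},
]

products = deduplicate([normalize(clean(r)) for r in raw])
for p in products:
    print(p)
```

Normalizing before deduplicating matters here: the first two records only become identical once whitespace and case differences are removed.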

5. HOW DOES IT WORK? (2)
• Store/Index: store optimally to support several complex queries
• Create "Views": on top of data for easy consumption, through APIs, visualizations, dashboards, and reports
• Package data as a product: to solve a bunch of related pain points in a certain domain (e.g., PriceWeave for retail)

6. AGGREGATION AND EXTRACTION
[Diagram labels: Extraction Layer (Offline Extraction of Factual Data); Aggregation Layer (Distributed Crawler Infrastructure); Public Data on the Web]

7. AGGREGATION LAYER
Customized crawler infrastructure
• vertical-specific crawlers
• capable of crawling the "deep web"
Highly scalable
• 500+ websites on a daily basis
• more with the addition of hardware
Robust to failures (404s, timeouts, server restarts)
• stateless distributed workers
• crawl state maintained in a separate data store
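One way to picture "stateless workers, crawl state in a separate data store" is the sketch below. It is an assumption-laden toy: `CrawlStateStore` is an in-memory stand-in for an external store, `fake_fetch` simulates 404s and timeouts, and the retry limit of 3 is arbitrary. None of these names come from the talk.

```python
from collections import deque

class CrawlStateStore:
    """In-memory stand-in for an external store that holds all crawl state."""
    def __init__(self, urls, max_attempts=3):
        self.pending = deque(urls)
        self.attempts = {u: 0 for u in urls}
        self.done, self.failed = {}, set()
        self.max_attempts = max_attempts

    def claim(self):
        """Hand out the next URL to crawl, or None when nothing is pending."""
        return self.pending.popleft() if self.pending else None

    def report(self, url, page):
        """Record the outcome; failed URLs are re-queued until the limit."""
        self.attempts[url] += 1
        if page is not None:
            self.done[url] = page
        elif self.attempts[url] < self.max_attempts:
            self.pending.append(url)  # may be retried by any worker
        else:
            self.failed.add(url)

def worker(store, fetch):
    """Stateless worker: it keeps nothing between URLs, so it can crash
    or restart without losing crawl progress."""
    while (url := store.claim()) is not None:
        try:
            page = fetch(url)
        except Exception:
            page = None
        store.report(url, page)

calls = {}
def fake_fetch(url):
    """Hypothetical fetcher: one URL always 404s, one times out once."""
    calls[url] = calls.get(url, 0) + 1
    if url.endswith("/missing"):
        raise LookupError("404")
    if url.endswith("/flaky") and calls[url] == 1:
        raise TimeoutError("timed out")
    return f"<html>{url}</html>"

store = CrawlStateStore([
    "http://example.com/a",
    "http://example.com/flaky",
    "http://example.com/missing",
])
worker(store, fake_fetch)
print(sorted(store.done), store.failed)
```

Because the worker holds no state of its own, scaling out would amount to running more workers against the same store; a real store would also need atomic claims, which this sketch does not model.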

8. DATA EXTRACTION LAYER
• Extract as many data points from crawled pages as possible
• Completely offline process, independent of crawling
• Highly parallelized -- scales in a straightforward manner
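Because extraction runs offline over pages that are already stored, each page can be processed independently. The sketch below illustrates that idea with a thread pool from the standard library; the stored pages, the HTML patterns, and the `extract` function are hypothetical, and a production system would more likely use a process pool or a cluster and a proper HTML parser.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stored crawl output: URL -> raw HTML.
snapshots = {
    "http://example.com/p/1":
        '<h1 class="title">Acme Phone X</h1><span class="price">9999</span>',
    "http://example.com/p/2":
        '<h1 class="title">Acme Tablet</h1><span class="price">15499</span>',
}

TITLE = re.compile(r'<h1 class="title">(.*?)</h1>')
PRICE = re.compile(r'<span class="price">(\d+)</span>')

def extract(item):
    """Pull data points out of one stored page; no network access needed."""
    url, html = item
    title = TITLE.search(html)
    price = PRICE.search(html)
    return {
        "url": url,
        "title": title.group(1) if title else None,
        "price": int(price.group(1)) if price else None,
    }

# Each page is independent, so the work maps cleanly onto a pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    records = list(pool.map(extract, snapshots.items()))

for r in records:
    print(r)
```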

9. NORMALIZATION
[Diagram labels: Normalization Layer (Machine Learning Techniques: Remove Noise, Fill Gaps in Data, Represent Data, Clustering); Extraction Layer (Offline Extraction of Factual Data); Knowledge Base]

10. NORMALIZATION LAYER
• Remove noise, remove duplicates
• Gather data from multiple sources and fill "gaps" in info
• Normalize data points to a standard internal representation
• Cluster related data together (Machine Learning techniques)
• Build a "knowledge base" -- continuous learning
• "Human in the loop" for data validation
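As a rough illustration of clustering related records and filling gaps across sources, the sketch below groups records by title-token overlap and then merges each group. The talk only says "Machine Learning techniques"; the greedy Jaccard-similarity grouping used here is a simple stand-in, and the threshold, field names, and sample records are all assumptions.

```python
def tokens(title):
    """Lowercased set of words in a title."""
    return set(title.lower().split())

def jaccard(a, b):
    """Overlap between two token sets, from 0.0 to 1.0."""
    return len(a & b) / len(a | b)

def cluster(records, threshold=0.5):
    """Greedy single pass: join the first cluster whose first member
    is similar enough, otherwise start a new cluster."""
    clusters = []
    for r in records:
        for c in clusters:
            if jaccard(tokens(r["title"]), tokens(c[0]["title"])) >= threshold:
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters

def merge(cluster_records):
    """Fill gaps: take the first non-missing value seen for each field."""
    merged = {}
    for r in cluster_records:
        for k, v in r.items():
            if merged.get(k) is None and v is not None:
                merged[k] = v
    return merged

# Hypothetical records for the same product from two sources, plus another.
records = [
    {"title": "Acme Phone X 16GB", "color": None, "weight_g": 140},
    {"title": "acme phone x 16gb black", "color": "black", "weight_g": None},
    {"title": "Bolt Laptop 15", "color": "silver", "weight_g": 2100},
]

clusters = cluster(records)
merged = [merge(c) for c in clusters]
for m in merged:
    print(m)
```

The first two records land in one cluster, and the merged result carries the weight from one source and the color from the other.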

11. DATA STORAGE AND SERVING
[Diagram labels: Data APIs, Visualizations, Dashboards, Reports; Serving Layer (Highly Responsive): Indexes, Views, Filters, Pre-Computed Results; Serving Layer, Distributed Data Storage: Crawl Snapshots, Processed Data, Clustered Data]

12. DATA STORAGE LAYER
• Store snapshots of crawl data -- never throw away raw data!
• Store processed data -- both individual data points as well as "clusters" of related data points
• Distributed data stores
• Highly scalable -- add more hardware
• Highly available -- replication
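The "never throw away raw data" rule suggests an append-only snapshot store. The sketch below shows one possible shape, assuming an in-memory dictionary in place of a distributed store; the `SnapshotStore` class, its methods, and the sample pages are hypothetical, and replication is not modeled.

```python
import hashlib

class SnapshotStore:
    """Append-only store: every crawl of a URL is kept, never overwritten."""
    def __init__(self):
        self._snapshots = {}  # url -> list of (crawl_date, digest, raw_html)

    def put(self, url, crawl_date, raw_html):
        """Append a snapshot and return a content digest for it."""
        digest = hashlib.sha256(raw_html.encode("utf-8")).hexdigest()
        self._snapshots.setdefault(url, []).append((crawl_date, digest, raw_html))
        return digest

    def history(self, url):
        """All (crawl_date, digest) pairs for a URL, oldest first."""
        return [(d, h) for d, h, _ in self._snapshots.get(url, [])]

    def latest(self, url):
        """Raw HTML of the most recent snapshot."""
        return self._snapshots[url][-1][2]

store = SnapshotStore()
url = "http://example.com/p/1"
store.put(url, "2013-07-13", '<span class="price">9999</span>')
store.put(url, "2013-07-14", '<span class="price">9499</span>')

print(store.history(url))
print(store.latest(url))
```

Keeping every snapshot means extraction and normalization can be rerun later with improved logic, and changes over time (such as the price drop above) remain recoverable.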

13. SERVING LAYER
This is the system as far as a user is concerned! It must be highly responsive.
Process data offline and periodically push it to the serving layer:
• create indexes for fast data retrieval
• create views to serve queries that are known a priori
• minimize computation to the extent possible
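The split between offline processing and a responsive serving layer can be sketched as below: an offline step builds an index and a precomputed view, and the online step is a plain lookup. The record schema, the "cheapest offer per product" query, and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical processed data pushed from the offline pipeline.
processed = [
    {"product": "acme phone x", "source": "shop-a", "price": 9999},
    {"product": "acme phone x", "source": "shop-b", "price": 10499},
    {"product": "bolt laptop 15", "source": "shop-a", "price": 45000},
]

def build_serving_data(records):
    """Offline step: build an index (product -> offers) and a view
    answering a query known in advance (cheapest offer per product)."""
    index = defaultdict(list)
    for r in records:
        index[r["product"]].append(r)
    cheapest = {p: min(offers, key=lambda o: o["price"])
                for p, offers in index.items()}
    return dict(index), cheapest

index, cheapest_view = build_serving_data(processed)

def cheapest_offer(product):
    """Online step: a dictionary lookup, no aggregation at request time."""
    return cheapest_view.get(product)

print(cheapest_offer("acme phone x"))
```

Rerunning `build_serving_data` on a schedule and swapping in the new structures corresponds to the "periodically push" step on the slide.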

14. DATAWEAVE PLATFORM
[Diagram labels: API Feeds; Data Services; Dashboards; Visualizations and Widgets; Data APIs; Big Data Platform; Attributes. Input data (unstructured, spread across sources, and temporally changing): Pricing Data, Open Government Data, Social Media Data]

15. THANK YOU
Sanket Patil
sanket@dataweave.in
+91-9900063093
2013 DataWeave
On Facebook: www.facebook.com/DataWeave
Catch us on Twitter: @dataweavein
www.dataweave.in
