2. DATAWEAVE: WHAT DO WE DO?
• Aggregate large amounts of data publicly available on the web, and
serve it to businesses in readily usable forms
• Serve actionable data through APIs, visualizations, and dashboards
• Provide a reporting and analytics layer on top of datasets and APIs
3. DATAWEAVE PLATFORM
[Diagram. Inputs (unstructured, spread across sources, and temporally changing): pricing data, open government data, social media data. Core: Big Data Platform. Outputs: API feeds, data services, dashboards, visualizations and widgets, data APIs.]
4. HOW DOES IT WORK - 1?
• Crawling/Scraping:
from a large number of data sources
• Cleaning/Deduplication:
remove as much noise as possible
• Data Normalization:
represent related data together in standard forms
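The cleaning, deduplication, and normalization steps above can be sketched roughly as follows. This is a minimal illustration, not the actual DataWeave pipeline: the record fields (`title`, `price`, `source`), the sample values, and all function names are assumptions made for the example.

```python
import re

PRICE = re.compile(r"\d[\d,]*(?:\.\d+)?")  # first number, with optional thousands separators

def clean(record):
    """Remove noise: collapse runs of whitespace and trim every field."""
    return {k: re.sub(r"\s+", " ", v).strip() for k, v in record.items()}

def normalize(record):
    """Map a raw record to a standard form: lower-case title, price as a float."""
    price = float(PRICE.search(record["price"]).group().replace(",", ""))
    return {"title": record["title"].lower(), "price": price, "source": record["source"]}

def deduplicate(records):
    """Keep the first record seen for each (title, price, source) key."""
    seen, unique = set(), []
    for r in records:
        key = (r["title"], r["price"], r["source"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# Hypothetical raw records, as they might come out of a crawl.
raw = [
    {"title": "  Acme  Phone X ", "price": "Rs. 9,999", "source": "shop-a"},
    {"title": "Acme Phone X", "price": "9999.00", "source": "shop-a"},
    {"title": "Acme Phone X", "price": "Rs 10,499", "source": "shop-b"},
]

records = deduplicate([normalize(clean(r)) for r in raw])
```

In this sketch deduplication runs after normalization, so the first two records, which differ only in formatting, collapse into one.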
5. HOW DOES IT WORK - 2?
• Store/Index:
store optimally to support several complex queries
• Create "Views":
on top of data for easy consumption, through APIs, visualizations,
dashboards, and reports
• Package data as a product:
to solve a bunch of related pain points in a certain domain (e.g.,
PriceWeave for retail)
7. AGGREGATION LAYER
Customized crawler infrastructure
• vertical specific crawlers
• capable of crawling the "deep web"
Highly Scalable
• 500+ websites on a daily basis
• more with the addition of hardware
Robust to failures (404s, timeouts, server restarts)
• stateless distributed workers
• crawl state maintained in a separate data store
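One way to picture stateless workers with externally held crawl state is the sketch below. It is an assumption-laden illustration: `CrawlStateStore` is a hypothetical in-memory stand-in for the separate data store, `fake_fetch` simulates failures instead of making real HTTP requests, and the retry limit of 3 is arbitrary.

```python
import queue

class CrawlStateStore:
    """In-memory stand-in for the separate data store that holds crawl state."""

    def __init__(self, urls, max_attempts=3):
        self.pending = queue.Queue()
        for url in urls:
            self.pending.put(url)
        self.attempts = {url: 0 for url in urls}
        self.done = {}        # url -> fetched body
        self.failed = set()   # urls that exhausted their attempts
        self.max_attempts = max_attempts

    def claim(self):
        """Hand out the next pending URL, or None when nothing is left."""
        try:
            return self.pending.get_nowait()
        except queue.Empty:
            return None

    def report_success(self, url, body):
        self.done[url] = body

    def report_failure(self, url):
        """Requeue the URL until it has failed max_attempts times."""
        self.attempts[url] += 1
        if self.attempts[url] < self.max_attempts:
            self.pending.put(url)
        else:
            self.failed.add(url)

def worker(store, fetch):
    """Stateless worker: all progress lives in the store, so a restarted
    worker simply resumes claiming URLs."""
    while (url := store.claim()) is not None:
        try:
            store.report_success(url, fetch(url))
        except Exception:
            store.report_failure(url)

# Simulated fetcher: one URL always fails (a 404), one times out once.
calls = {}
def fake_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if url.endswith("/missing"):
        raise LookupError("404")
    if url.endswith("/flaky") and calls[url] == 1:
        raise TimeoutError("timed out")
    return f"<html>{url}</html>"

store = CrawlStateStore([
    "http://example.com/a",
    "http://example.com/flaky",
    "http://example.com/missing",
])
worker(store, fake_fetch)
```

Because a worker keeps nothing between iterations, any number of them could run against the same store; a real deployment would need a store with atomic claim semantics.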
8. DATA EXTRACTION LAYER
• Extract as many data points from crawled pages as possible
• Completely offline process, independent of crawling
• Highly parallelized -- scales in a straightforward manner
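As a rough sketch of offline, parallel extraction: the code below reads stored page snapshots rather than live pages, and extracts each page independently of the others. The snapshot keys, the HTML markup, and the regex-based extractor are all assumptions for illustration; a thread pool stands in for whatever distributed execution the real system uses.

```python
from concurrent.futures import ThreadPoolExecutor
import re

# Hypothetical crawl snapshots, already stored by the aggregation layer.
SNAPSHOTS = {
    "shop-a/1": '<h1 class="title">Acme Phone X</h1><span class="price">9999</span>',
    "shop-b/7": '<h1 class="title">Acme Phone X (Black)</h1><span class="price">10499</span>',
}

# Deliberately simple pattern: pulls the text of elements whose class is title or price.
FIELD = re.compile(r'class="(title|price)">([^<]*)<')

def extract(item):
    """Extract data points from one snapshot; needs no other page and no network."""
    key, html = item
    return key, dict(FIELD.findall(html))

# Pages are independent, so the work can be mapped over a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    extracted = dict(pool.map(extract, SNAPSHOTS.items()))
```

Since `extract` depends only on its own input, adding workers (or machines) increases throughput without changing the code, which is the sense in which this step scales in a straightforward manner.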
10. NORMALIZATION LAYER
• Remove noise, remove duplicates
• Gather data from multiple sources and fill "gaps" in info
• Normalize data points to a standard internal representation
• Cluster related data together (Machine Learning techniques)
• Build a "knowledge base" -- continuous learning
• "Human in the loop" for data validation
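Clustering related records and filling gaps from multiple sources might look like the sketch below. The slide mentions machine learning techniques without naming them; the token-overlap (Jaccard) similarity and the 0.5 threshold used here are a simple stand-in chosen for the example, and the record fields are hypothetical.

```python
def tokens(title):
    """Lower-cased set of words in a title."""
    return set(title.lower().split())

def jaccard(a, b):
    """Overlap between two token sets, from 0.0 to 1.0."""
    return len(a & b) / len(a | b)

def cluster(records, threshold=0.5):
    """Greedy clustering: attach each record to the first cluster whose
    first member has a similar enough title, else start a new cluster."""
    clusters = []
    for r in records:
        for c in clusters:
            if jaccard(tokens(r["title"]), tokens(c[0]["title"])) >= threshold:
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters

def merge(cluster_records):
    """Fill gaps: for each field, take the first non-missing value in the cluster."""
    merged = {}
    for r in cluster_records:
        for field, value in r.items():
            if value is not None and merged.get(field) is None:
                merged[field] = value
    return merged

# Hypothetical records for the same product from two sources, plus an unrelated one.
records = [
    {"title": "Acme Phone X 64GB", "brand": "Acme", "colour": None},
    {"title": "acme phone x 64gb black", "brand": None, "colour": "black"},
    {"title": "Zenith Laptop 15", "brand": "Zenith", "colour": None},
]

clusters = cluster(records)
product = merge(clusters[0])
```

Here the first two records land in one cluster and the merged record takes its brand from one source and its colour from the other. Borderline matches are where a "human in the loop" check would fit.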
11. DATA STORAGE AND SERVING
[Diagram. Serving Layer (highly responsive): indexes, views, filters, pre-computed results; feeds data APIs, visualizations, dashboards, and reports. Distributed Data Storage: crawl snapshots, processed data, clustered data.]
12. DATA STORAGE LAYER
• Store snapshots of crawl data -- never throw away raw data!
• Store processed data -- both individual data points as well as
"clusters" of related data points
• Distributed data stores
• Highly scalable -- add more hardware
• Highly available -- replication
13. SERVING LAYER
This is the system as far as a user is concerned!
Must be highly responsive
Process data offline and periodically push it to the serving layer
• create Indexes for fast data retrieval
• create views to serve queries that are known a priori
• minimize computation to the extent possible
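The idea of processing offline and pushing results to the serving layer can be sketched as follows. The record fields, the two precomputed structures (a brand index and a lowest-price view), and the class and function names are assumptions for illustration only.

```python
from collections import defaultdict

def build_serving_snapshot(processed):
    """Offline step: build an index and a view from processed records."""
    by_brand = defaultdict(list)   # index: brand -> product ids
    lowest_price = {}              # view: product id -> (price, source)
    for r in processed:
        if r["id"] not in by_brand[r["brand"]]:
            by_brand[r["brand"]].append(r["id"])
        best = lowest_price.get(r["id"])
        if best is None or r["price"] < best[0]:
            lowest_price[r["id"]] = (r["price"], r["source"])
    return {"by_brand": dict(by_brand), "lowest_price": lowest_price}

class ServingLayer:
    """Answers queries with dictionary lookups only; no per-request computation."""

    def __init__(self):
        self.snapshot = {"by_brand": {}, "lowest_price": {}}

    def push(self, snapshot):
        """Called periodically by the offline job; swaps in the whole snapshot."""
        self.snapshot = snapshot

    def products_for_brand(self, brand):
        return self.snapshot["by_brand"].get(brand, [])

    def lowest_price(self, product_id):
        return self.snapshot["lowest_price"].get(product_id)

# Hypothetical processed data.
processed = [
    {"id": "p1", "brand": "Acme", "price": 9999.0, "source": "shop-a"},
    {"id": "p1", "brand": "Acme", "price": 10499.0, "source": "shop-b"},
    {"id": "p2", "brand": "Zenith", "price": 45000.0, "source": "shop-a"},
]

serving = ServingLayer()
serving.push(build_serving_snapshot(processed))
```

The view answers a query known a priori ("what is the lowest price for this product?") at request time with a single lookup, at the cost of results being only as fresh as the last push.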
14. DATAWEAVE PLATFORM
[Diagram: repeats the DataWeave Platform diagram from slide 3.]