BUILDING
DATA PRODUCTS AT SCALE
DATAWEAVE: WHAT DO WE DO?
• Aggregate large amounts of data publicly available on the web, and
serve it to businesses in readily usable forms
• Serve actionable data through APIs, visualizations, and dashboards
• Provide a reporting and analytics layer on top of datasets and APIs
DATAWEAVE PLATFORM
[Diagram: inputs labeled Pricing Data, Open Government Data, and Social Media Data (unstructured, spread across sources, and temporally changing); a central Big Data Platform; outputs labeled Data APIs, API Feeds, Data Services, Dashboards, and Visualizations and Widgets]
HOW DOES IT WORK? - 1
• Crawling/Scraping:
from a large number of data sources
• Cleaning/Deduplication:
remove as much noise as possible
• Data Normalization:
represent related data together in standard forms
HOW DOES IT WORK? - 2
• Store/Index:
store optimally to support several complex queries
• Create "Views":
on top of data for easy consumption, through APIs, visualizations,
dashboards, and reports
• Package data as a product:
to solve a bunch of related pain points in a certain domain (e.g.,
PriceWeave for retail)
AGGREGATION AND EXTRACTION
[Diagram: Public Data on the Web; Aggregation Layer (Distributed Crawler Infrastructure); Extraction Layer (Offline Extraction of Factual Data)]
AGGREGATION LAYER
Customized crawler infrastructure
• vertical-specific crawlers
• capable of crawling the "deep web"
Highly Scalable
• 500+ websites on a daily basis
• more with the addition of hardware
Robust to failures (404s, timeouts, server restarts)
• stateless distributed workers
• crawl state maintained in a separate data store
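
The two robustness points above can be sketched as follows. This is a minimal illustration, not DataWeave's crawler: `CrawlStateStore`, `run_worker`, and `make_flaky_fetch` are hypothetical names, the store is an in-memory stand-in for a separate data store, and the fetcher only simulates timeouts and 404s.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CrawlStateStore:
    """In-memory stand-in for the separate data store holding crawl state."""
    pending: deque = field(default_factory=deque)
    attempts: dict = field(default_factory=dict)
    done: dict = field(default_factory=dict)
    failed: set = field(default_factory=set)

    def claim(self):
        # Hand out the next URL, or None when nothing is left to crawl.
        return self.pending.popleft() if self.pending else None


def run_worker(store, fetch, max_attempts=3):
    """Stateless worker: all progress lives in `store`, none in the worker."""
    while (url := store.claim()) is not None:
        store.attempts[url] = store.attempts.get(url, 0) + 1
        try:
            store.done[url] = fetch(url)
        except Exception:
            # On a 404, timeout, or similar failure, requeue the URL until
            # its attempt budget is spent, then record it as failed.
            if store.attempts[url] < max_attempts:
                store.pending.append(url)
            else:
                store.failed.add(url)


def make_flaky_fetch():
    """Hypothetical fetcher: times out once per URL, always 404s on /missing."""
    seen = set()

    def fetch(url):
        if url.endswith("/missing"):
            raise LookupError("404")
        if url not in seen:
            seen.add(url)
            raise TimeoutError("timed out")
        return f"<html>{url}</html>"

    return fetch


store = CrawlStateStore(
    pending=deque(["http://a.example/p1", "http://a.example/missing"])
)
run_worker(store, make_flaky_fetch())
print(store.done, store.failed)
```

Because the worker keeps nothing between iterations, a restarted or additional worker can pick up from the same store without coordination.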
DATA EXTRACTION LAYER
• Extract as many data points from crawled pages as possible
• Completely offline process, independent of crawling
• Highly parallelized -- scales in a straightforward manner
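
A small sketch of offline, parallel extraction over already-crawled pages. The class names `title` and `price`, the sample pages, and the helper names are assumptions made for illustration; a thread pool stands in for whatever distributed workers a real deployment would use.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collects the text of elements whose class attribute is a known field."""
    FIELDS = {"title", "price"}  # hypothetical class names

    def __init__(self):
        super().__init__()
        self.record = {}
        self._field = None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.FIELDS:
            self._field = cls

    def handle_data(self, data):
        if self._field and data.strip():
            self.record[self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        # Stop capturing once the current element closes.
        self._field = None


def extract(snapshot):
    """Turn one stored (url, html) snapshot into a record of data points."""
    url, html = snapshot
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return {"url": url, **parser.record}


def extract_all(snapshots, workers=4):
    # Pages are independent of each other, so the work maps onto a pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract, snapshots))


snapshots = [
    ("http://shop.example/1",
     '<h1 class="title">Red Mug</h1><span class="price">Rs. 199</span>'),
    ("http://shop.example/2", '<h1 class="title">Blue Mug</h1>'),
]
records = extract_all(snapshots)
print(records)
```

Since extraction reads only stored snapshots, it can be rerun with new rules without crawling the sites again.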
NORMALIZATION
[Diagram: Extraction Layer (Offline Extraction of Factual Data); Normalization Layer (Remove Noise, Fill Gaps in Data, Represent Data, Clustering); Machine Learning Techniques; Knowledge Base]
NORMALIZATION LAYER
• Remove noise, remove duplicates
• Gather data from multiple sources and fill "gaps" in info
• Normalize data points to a standard internal representation
• Cluster related data together (Machine Learning techniques)
• Build a "knowledge base" -- continuous learning
• "Human in the loop" for data validation
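
The normalize, cluster, and fill-gaps steps above could look roughly like this. The schema (`title`, `price`, `brand`) and all function names are hypothetical, and plain token overlap (Jaccard similarity) is used here only as a simple stand-in for the machine learning techniques the slide mentions.

```python
import re


def normalize(record):
    """Map a raw record to a standard internal form (hypothetical schema)."""
    title = re.sub(r"\s+", " ", record.get("title", "")).strip().lower()
    price = record.get("price")
    if isinstance(price, str):
        # Pull the first number out of strings such as "Rs. 499" or "1,299.00".
        match = re.search(r"\d+(?:\.\d+)?", price.replace(",", ""))
        price = float(match.group()) if match else None
    return {"title": title, "price": price, "brand": record.get("brand")}


def jaccard(a, b):
    """Token overlap between two titles, between 0.0 and 1.0."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def cluster(records, threshold=0.5):
    """Greedy single-pass clustering on title similarity."""
    clusters = []
    for rec in records:
        for group in clusters:
            if jaccard(rec["title"], group[0]["title"]) >= threshold:
                group.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


def merge(group):
    """Fill gaps: keep the first non-missing value seen for each field."""
    merged = {}
    for rec in group:
        for key, value in rec.items():
            if merged.get(key) is None:
                merged[key] = value
    return merged


raw = [
    {"title": "Acme  Steel Bottle 1L", "price": "Rs. 499", "brand": None},
    {"title": "acme steel bottle 1l", "price": None, "brand": "Acme"},
    {"title": "Zed Running Shoes", "price": "1,299.00"},
]
products = [merge(g) for g in cluster([normalize(r) for r in raw])]
print(products)
```

The first two records describe the same item from two sources; after clustering, the merged record carries the price from one and the brand from the other.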
DATA STORAGE AND SERVING
[Diagram: Data APIs, Visualizations, Dashboards, Reports; Serving Layer (Highly Responsive; Indexes, Views, Filters, Pre-Computed Results); Distributed Data Storage (Crawl Snapshots, Processed Data, Clustered Data)]
DATA STORAGE LAYER
• Store snapshots of crawl data -- never throw away raw data!
• Store processed data -- both individual data points as well as
"clusters" of related data points
• Distributed data stores
• Highly scalable -- add more hardware
• Highly available -- replication
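
A toy sketch of two ideas from this slide: snapshots are appended rather than overwritten, and each write goes to more than one node. `SnapshotStore` and its parameters are hypothetical, and in-memory dictionaries stand in for real distributed storage nodes.

```python
import hashlib
from collections import defaultdict


class SnapshotStore:
    """Append-only snapshot store that replicates each write across nodes."""

    def __init__(self, node_count=3, replicas=2):
        self.nodes = [defaultdict(list) for _ in range(node_count)]
        self.replicas = replicas

    def _replica_ids(self, url):
        # A stable hash maps the same URL to the same nodes on every call.
        digest = hashlib.sha1(url.encode()).hexdigest()
        start = int(digest, 16) % len(self.nodes)
        return [(start + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, url, crawled_at, html):
        # Append instead of overwrite, so older raw snapshots are kept.
        for i in self._replica_ids(url):
            self.nodes[i][url].append((crawled_at, html))

    def history(self, url, down=frozenset()):
        # Read from the first replica that is not marked as down.
        for i in self._replica_ids(url):
            if i not in down:
                return sorted(self.nodes[i].get(url, []))
        raise RuntimeError("no live replica for " + url)


snapshot_store = SnapshotStore()
page = "http://shop.example/1"
snapshot_store.put(page, "2013-07-01", "<p>Rs. 499</p>")
snapshot_store.put(page, "2013-07-02", "<p>Rs. 449</p>")

# Simulate losing the first replica; the history is still readable.
primary = snapshot_store._replica_ids(page)[0]
print(snapshot_store.history(page, down={primary}))
```

Keeping every dated snapshot is what makes it possible to reprocess old crawls later or to track how a value such as a price changes over time.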
SERVING LAYER
This is the system as far as a user is concerned!
Must be highly responsive
Process data offline and periodically push it to the serving layer
• create indexes for fast data retrieval
• create views to serve queries that are known a priori
• minimize computation to the extent possible
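
One way to picture "process offline, push periodically, minimize computation" is the sketch below. The product fields, the "cheapest per category" query, and the names `build_serving_data` and `ServingLayer` are all assumptions chosen for illustration.

```python
from collections import defaultdict


def build_serving_data(products):
    """Offline job: build an index and a view for a query known in advance."""
    by_category = defaultdict(list)  # index: category -> products
    for product in products:
        by_category[product["category"]].append(product)

    cheapest = {  # view: cheapest product in each category
        category: min(items, key=lambda p: p["price"])
        for category, items in by_category.items()
    }
    return {"by_category": dict(by_category), "cheapest": cheapest}


class ServingLayer:
    """Answers requests from pre-computed data only."""

    def __init__(self):
        self._data = {"by_category": {}, "cheapest": {}}

    def push(self, data):
        # Replace the served data with a freshly computed batch.
        self._data = data

    def cheapest_in(self, category):
        # The request path is a dictionary lookup; nothing is computed here.
        return self._data["cheapest"].get(category)


catalog = [
    {"title": "steel bottle", "category": "kitchen", "price": 499.0},
    {"title": "glass bottle", "category": "kitchen", "price": 349.0},
    {"title": "running shoes", "category": "footwear", "price": 1299.0},
]
serving = ServingLayer()
serving.push(build_serving_data(catalog))
print(serving.cheapest_in("kitchen"))
```

The trade-off is freshness for speed: answers are only as recent as the last push, but each request costs a lookup rather than a scan.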
THANK YOU
Sanket Patil
sanket@dataweave.in
+91-9900063093
2013 Dataweave
On Facebook www.facebook.com/DataWeave
Catch us on Twitter @dataweavein
www.dataweave.in

Yahoo! Hack India: Hyderabad 2013 | Building Data Products At Scale
