SlideShare ist ein Scribd-Unternehmen logo
1 von 26
Downloaden Sie, um offline zu lesen
Crossing Analytics Systems: Case for
Integrated Provenance in Data Lakes
Isuru Suriarachchi and Beth Plale
School of Informatics and Computing
Indiana University
IEEE E-science 2016 : Hot Topics
The Data Lake has arisen within
last couple of years as
conceptualization of data
management framework with
flexibility to support multiple
data processing tools needed for
truly Big Data analytics.
Data Warehouse
• Supports multidimensional analytical processing
– Online Analytical Processing (OLAP) or
Multidimensional OLAP
• Numeric facts (measures) categorized by
dimensions creating vector space (OLAP cube).
• Interface is matrix interface like Pivot tables
• Schema is star schema, snowflake schema
• Storage is largely relational database
Credit: https://www.linkedin.com/topic/data-warehouse-architecture
Data Warehouse Architecture
• ETL: Extraction, Transformation, Load
Challenging the Warehouse: Big Data
• From numerous sources
– social media, sensor data, IoT devices, server logs,
clickstream etc.
• Not all numeric (quantitative) thus differently
structured
– Structured, semi-structured, unstructured
• Continuously generated or archived
Suitability of Data Warehouse for
Today’s Big Data
• ETL imposes burden
– Schema on write
– Inflexibility/inefficiency at ingest time
– Information loss upon schema translation
• Weak fit for popular Big Data analytical tools
(e.g., Spark, Hadoop) and data serving
platforms (e.g., HDFS, S3)
Data Lake
• A scalable storage infrastructure with no schema
enforcement at ingest
• Data ingested in raw form: no loss
• Schema-on-read
• Integrated Transformations
– With e.g., Hadoop, Spark
IngestAPI
Data
Data Lake
Clickstream
Sensor data
IoT Devices
Social Media
Could Platforms
Server Logs Metadata Lineage
Transform Transform Transform
Data Data Data
Analysis
Big Data Processing Frameworks
Ex: Hadoop, Spark, Storm
Data Lake Challenges
• Increased flexibility leads to harder
manageability
– Differently typed data can be easily dumped into
the Data Lake
– Data products can be in different stages of their
lifecycle: raw, half processed, processed etc.
– Can easily turn into “data swamps”
• Requires traceability!!..
– Provenance can help
Data Provenance
• Information about activities, entities and people
who involved in producing a data product
• Standards
– OPM
– PROV
• If a Data Lake ensures that every data product’s
provenance is in place starting from data
product’s origin, critical traceability can be had
What provenance perspective could
bring to a Data Lake?
• Track origins of data, chained transformations
• Contribute to reuse determinations of trust
and quality
• React!! Minimally constrain what enters a
Lake?
Challenges in Provenance Capturing
• Chains of Transformations
– Different analytics systems: Hadoop, Spark etc.
• Need is end to end integrated provenance across
transformations
• System specific provenance
collection methods are less
useful
– Integration/stitching
problems
– E.g.: RAMP, HadoopProv
for Hadoop
Solution to minimal lake governance
• All components in lake stream provenance to
central provenance subsystem
– Stores provenance for long term queries
– Monitors provenance stream in real time
• Event in stream represented by edge in
provenance graph
• Global lake wide policy: Uniform Persistent ID
(PID) (Handle, UUIDs, DOIs) attached to all
data objects in Data Lake
– required to guarantee integrated provenance
Model
• PID assigned to all data objects
– granularity
• Transformations T1, T2, and T3
– Distributed
– May use different frameworks
d1 T1
d2
d3
d4
d5
d6
d7
d8T2 T3
d1d3
d4
d6
d7
d8
Chain of
transformations
sharing Ids
Backward
provenance
from central
provenance store
Provenance traces integrate across systems of
Data Lake
Reference Architecture
IngestAPI
Batch Processing
Ex: Hadoop, Spark
Lineage
Raw Data from
various sources
Transformations
Workflow Engines
Ex: Kepler
Legacy Scripts
Stream Processing
Ex: Storm, Spark
Monitoring
Debugging Reproducing Data Quality
QueriesVisualization
Data Data Data
Data
Import
Lineage
Data
Export
Data Lake
Messaging System
Ingest API
Query API
ProvenanceSubsystem
Prov Stream
Processing
Prov
Storage
Prov
Stream
• Real-time provenance stream processing
• Stored provenance for long term usage
Prototype Use Case
• Different frameworks used
– Flume: Captures tweets and write into HDFS
– Hadoop Job: Computes hashtag counts
– Spark Job: Computes category counts
Central provenance store
• Uses Komadu
– A distributed
provenance
collection tool
– Visualization,
Custom Queries
I. Suriarachchi, Q. Zhou and B. Plale (2015). Komadu: A Capture and Visualization System for Scientific Data
Provenance. Journal of Open Research Software 3(1):e4
Client Library
• Log4j like API for provenance capture
• Dedicated thread pool in provenance layer
• Batching to minimize network overhead
Application Layer API
Komadu Client Layer
RabbitMQ Client Layer
client.addGeneration(A, E)
batching
prov thread
pool
RabbitMQ
Server
Komadu
Client Library
Use case evaluation
• Flume, Hadoop and Spark jobs instrumented
using Komadu client libraries
• Jobs stream provenance events into central
provenance store (Komadu)
• Persistent IDs (UUID) assigned for each data
object at entry to data lake; PID persists
thereafter with data object
Use case evaluation: experimental
environment
• 5 small VM instances, 2 2.5GhZ cores, 4 GB
RAM, 50 GB local storage
• 4 VM instances used for HDFS cluster
• 3.23 GB Twitter data collected over 5 days
running Flume on master node
• Hadoop and Spark set up on top of HDFS
cluster
• Separate instance for RabbitMQ and Komadu
Use case evaluation: Metrics
• Batch size:
– impact of batch size on provenance capture efficiency.
Measured by total execution time for Hadoop using
provenance event batching mechanism in Komadu
library
• Overhead of provenance capture:
– Measured against total tool-specific execution time
– measure overhead of customized value field (in key
value pair)
– Measure overhead of provenance capture for Hadoop
and Spark
Batch Size Test
• Hadoop job execution times with varying
batch sizes
• Optimal batch size: ~5000 KB
Overhead: Hadoop
• custom val: emits PID with key value pair
as (#nba, <2, id>) instead of (#nba, 2)
• data prov HDFS: writes provenance into HDFS,
used by HadoopProv and RAMP
Overhead: Spark
• Higher provenance capture overhead
compared to Hadoop
Future Work
• Performance overhead is prohibitively high
– decouple PID assignment from execution?
Examine granularity
• Live provenance stream processing for real
time monitoring/reaction
• Explore minimal provenance at on-line rates
and more comprehensive provenance at off-
line rates
Work funded in part by National Science
Foundation OCI-0940824
IEEE E-science 2016 : Hot Topics

Weitere ähnliche Inhalte

Was ist angesagt?

NoSQL – Data Center Centric Application Enablement
NoSQL – Data Center Centric Application EnablementNoSQL – Data Center Centric Application Enablement
NoSQL – Data Center Centric Application EnablementDATAVERSITY
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streamsJoey Echeverria
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudRick Bilodeau
 
Integrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesIntegrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesDataWorks Summit
 
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Data Con LA
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networkspbelko82
 
Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryDataWorks Summit/Hadoop Summit
 
Building Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesBuilding Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesArvind Prabhakar
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleFlink Forward
 
Data Streaming For Big Data
Data Streaming For Big DataData Streaming For Big Data
Data Streaming For Big DataSeval Çapraz
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsPat Patterson
 
Lightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and DruidLightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and DruidDataWorks Summit
 
Mission-Critical, Real-Time Fault-Detection for NASA's Deep Space Network usi...
Mission-Critical, Real-Time Fault-Detection for NASA's Deep Space Network usi...Mission-Critical, Real-Time Fault-Detection for NASA's Deep Space Network usi...
Mission-Critical, Real-Time Fault-Detection for NASA's Deep Space Network usi...confluent
 
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTestistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTurkish Testing Board
 

Was ist angesagt? (20)

NoSQL – Data Center Centric Application Enablement
NoSQL – Data Center Centric Application EnablementNoSQL – Data Center Centric Application Enablement
NoSQL – Data Center Centric Application Enablement
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Accelerating Data Warehouse Modernization
Accelerating Data Warehouse ModernizationAccelerating Data Warehouse Modernization
Accelerating Data Warehouse Modernization
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
 
Integrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesIntegrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query Engines
 
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Data streaming fundamentals
Data streaming fundamentalsData streaming fundamentals
Data streaming fundamentals
 
Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive Industry
 
Data platform evolution
Data platform evolutionData platform evolution
Data platform evolution
 
Lambda-less Stream Processing @Scale in LinkedIn
Lambda-less Stream Processing @Scale in LinkedIn Lambda-less Stream Processing @Scale in LinkedIn
Lambda-less Stream Processing @Scale in LinkedIn
 
Building Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesBuilding Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion Pipelines
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at Scale
 
Data Streaming For Big Data
Data Streaming For Big DataData Streaming For Big Data
Data Streaming For Big Data
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSets
 
Lightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and DruidLightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and Druid
 
Mission-Critical, Real-Time Fault-Detection for NASA's Deep Space Network usi...
Mission-Critical, Real-Time Fault-Detection for NASA's Deep Space Network usi...Mission-Critical, Real-Time Fault-Detection for NASA's Deep Space Network usi...
Mission-Critical, Real-Time Fault-Detection for NASA's Deep Space Network usi...
 
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTestistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
 

Andere mochten auch

TEXTO FINAL DE VOTACIÓN DEL PROYECTO DE LEY ORGÁNICA DE ECONOMÍA POPULAR Y SO...
TEXTO FINAL DE VOTACIÓN DEL PROYECTO DE LEY ORGÁNICA DE ECONOMÍA POPULAR Y SO...TEXTO FINAL DE VOTACIÓN DEL PROYECTO DE LEY ORGÁNICA DE ECONOMÍA POPULAR Y SO...
TEXTO FINAL DE VOTACIÓN DEL PROYECTO DE LEY ORGÁNICA DE ECONOMÍA POPULAR Y SO...Rocío Albán Torres
 
HPF_III_Evaluation_Report_Final_2016
HPF_III_Evaluation_Report_Final_2016HPF_III_Evaluation_Report_Final_2016
HPF_III_Evaluation_Report_Final_2016Asrade Abate
 
#EatWaterford - The Importance of Social Media
#EatWaterford - The Importance of Social Media#EatWaterford - The Importance of Social Media
#EatWaterford - The Importance of Social MediaSarah Dennison
 
Antibiotic Resistance presentation
Antibiotic Resistance presentationAntibiotic Resistance presentation
Antibiotic Resistance presentationpearse o brien
 
importancia de implantes
importancia de implantes importancia de implantes
importancia de implantes paolaarce1946
 
Infantry S3 Article SEP-OCT12
Infantry S3 Article SEP-OCT12Infantry S3 Article SEP-OCT12
Infantry S3 Article SEP-OCT12Mark Leslie
 
Derechos humanos de los delincuentes
Derechos humanos de los delincuentesDerechos humanos de los delincuentes
Derechos humanos de los delincuentesmario reyes
 
Simulation theory 510 13 ne`matov
Simulation theory 510 13 ne`matovSimulation theory 510 13 ne`matov
Simulation theory 510 13 ne`matovShohruh Nematov
 
ENERGY SAVING BY AIR WASHER PUMPS #5 ASH PLANT AT GTPS
ENERGY SAVING BY AIR WASHER PUMPS #5 ASH PLANT AT GTPSENERGY SAVING BY AIR WASHER PUMPS #5 ASH PLANT AT GTPS
ENERGY SAVING BY AIR WASHER PUMPS #5 ASH PLANT AT GTPSDilipkumar Pandya
 
aspida Presentation booklet GA1 update 3-1
aspida Presentation booklet GA1 update 3-1aspida Presentation booklet GA1 update 3-1
aspida Presentation booklet GA1 update 3-1Marios Vakanas
 
Prokochuk_Irina_architectural magazine Sporuda
Prokochuk_Irina_architectural magazine SporudaProkochuk_Irina_architectural magazine Sporuda
Prokochuk_Irina_architectural magazine SporudaIra Prokopchuk
 
BUILDING HOMES FOR HEROES-Charity-Info-v1d
BUILDING HOMES FOR HEROES-Charity-Info-v1dBUILDING HOMES FOR HEROES-Charity-Info-v1d
BUILDING HOMES FOR HEROES-Charity-Info-v1dChristie Christian
 

Andere mochten auch (20)

TEXTO FINAL DE VOTACIÓN DEL PROYECTO DE LEY ORGÁNICA DE ECONOMÍA POPULAR Y SO...
TEXTO FINAL DE VOTACIÓN DEL PROYECTO DE LEY ORGÁNICA DE ECONOMÍA POPULAR Y SO...TEXTO FINAL DE VOTACIÓN DEL PROYECTO DE LEY ORGÁNICA DE ECONOMÍA POPULAR Y SO...
TEXTO FINAL DE VOTACIÓN DEL PROYECTO DE LEY ORGÁNICA DE ECONOMÍA POPULAR Y SO...
 
報告
報告報告
報告
 
HPF_III_Evaluation_Report_Final_2016
HPF_III_Evaluation_Report_Final_2016HPF_III_Evaluation_Report_Final_2016
HPF_III_Evaluation_Report_Final_2016
 
#EatWaterford - The Importance of Social Media
#EatWaterford - The Importance of Social Media#EatWaterford - The Importance of Social Media
#EatWaterford - The Importance of Social Media
 
Antibiotic Resistance presentation
Antibiotic Resistance presentationAntibiotic Resistance presentation
Antibiotic Resistance presentation
 
Findaflorist
FindafloristFindaflorist
Findaflorist
 
Aprendizaje autónomo
Aprendizaje autónomoAprendizaje autónomo
Aprendizaje autónomo
 
Frugal Diner Final
Frugal Diner FinalFrugal Diner Final
Frugal Diner Final
 
importancia de implantes
importancia de implantes importancia de implantes
importancia de implantes
 
Computacionnnnnn
ComputacionnnnnnComputacionnnnnn
Computacionnnnnn
 
Infantry S3 Article SEP-OCT12
Infantry S3 Article SEP-OCT12Infantry S3 Article SEP-OCT12
Infantry S3 Article SEP-OCT12
 
Graphic Design Portfolio
Graphic Design PortfolioGraphic Design Portfolio
Graphic Design Portfolio
 
Derechos humanos de los delincuentes
Derechos humanos de los delincuentesDerechos humanos de los delincuentes
Derechos humanos de los delincuentes
 
odyssea-refer
odyssea-referodyssea-refer
odyssea-refer
 
Simulation theory 510 13 ne`matov
Simulation theory 510 13 ne`matovSimulation theory 510 13 ne`matov
Simulation theory 510 13 ne`matov
 
ENERGY SAVING BY AIR WASHER PUMPS #5 ASH PLANT AT GTPS
ENERGY SAVING BY AIR WASHER PUMPS #5 ASH PLANT AT GTPSENERGY SAVING BY AIR WASHER PUMPS #5 ASH PLANT AT GTPS
ENERGY SAVING BY AIR WASHER PUMPS #5 ASH PLANT AT GTPS
 
aspida Presentation booklet GA1 update 3-1
aspida Presentation booklet GA1 update 3-1aspida Presentation booklet GA1 update 3-1
aspida Presentation booklet GA1 update 3-1
 
Prokochuk_Irina_architectural magazine Sporuda
Prokochuk_Irina_architectural magazine SporudaProkochuk_Irina_architectural magazine Sporuda
Prokochuk_Irina_architectural magazine Sporuda
 
BUILDING HOMES FOR HEROES-Charity-Info-v1d
BUILDING HOMES FOR HEROES-Charity-Info-v1dBUILDING HOMES FOR HEROES-Charity-Info-v1d
BUILDING HOMES FOR HEROES-Charity-Info-v1d
 
Artist resume 2015
Artist resume 2015Artist resume 2015
Artist resume 2015
 

Ähnlich wie Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes

Survey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big DataSurvey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big DataLuiz Henrique Zambom Santana
 
Hadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming ArchitectureHadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming ArchitectureInSemble
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptxbetalab
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL David Smelker
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMichael Hiskey
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...DataWorks Summit
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemInSemble
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...ssuserd3a367
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseDataStax
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web developmentTung Nguyen
 
Harness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeHarness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeSaurabh K. Gupta
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindAvere Systems
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Spark Summit
 

Ähnlich wie Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes (20)

Operational-Analytics
Operational-AnalyticsOperational-Analytics
Operational-Analytics
 
Survey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big DataSurvey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big Data
 
Hadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming ArchitectureHadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming Architecture
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Lecture1
Lecture1Lecture1
Lecture1
 
JOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big DataJOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big Data
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax Enterprise
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Harness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeHarness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data Lake
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
 

Kürzlich hochgeladen

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Kürzlich hochgeladen (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes

  • 1. Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes Isuru Suriarachchi and Beth Plale School of Informatics and Computing Indiana University IEEE E-science 2016 : Hot Topics
  • 2. The Data Lake has arisen within last couple of years as conceptualization of data management framework with flexibility to support multiple data processing tools needed for truly Big Data analytics.
  • 3. Data Warehouse • Supports multidimensional analytical processing – Online Analytical Processing (OLAP) or Multidimensional OLAP • Numeric facts (measures) categorized by dimensions creating vector space (OLAP cube). • Interface is matrix interface like Pivot tables • Schema is star schema, snowflake schema • Storage is largely relational database
  • 4. Credit: https://www.linkedin.com/topic/data-warehouse-architecture Data Warehouse Architecture • ETL: Extraction, Transformation, Load
  • 5. Challenging the Warehouse: Big Data • From numerous sources – social media, sensor data, IoT devices, server logs, clickstream etc. • Not all numeric (quantitative) thus differently structured – Structured, semi-structured, unstructured • Continuously generated or archived
  • 6. Suitability of Data Warehouse for Today’s Big Data • ETL imposes burden – Schema on write – Inflexibility/inefficiency at ingest time – Information loss upon schema translation • Weak fit for popular Big Data analytical tools (e.g., Spark, Hadoop) and data serving platforms (e.g., HDFS, S3)
  • 7. Data Lake • A scalable storage infrastructure with no schema enforcement at ingest • Data ingested in raw form: no loss • Schema-on-read • Integrated Transformations – With e.g., Hadoop, Spark IngestAPI Data Data Lake Clickstream Sensor data IoT Devices Social Media Could Platforms Server Logs Metadata Lineage Transform Transform Transform Data Data Data Analysis Big Data Processing Frameworks Ex: Hadoop, Spark, Storm
  • 8. Data Lake Challenges • Increased flexibility leads to harder manageability – Differently typed data can be easily dumped into the Data Lake – Data products can be in different stages of their lifecycle: raw, half processed, processed etc. – Can easily turn into “data swamps” • Requires traceability!!.. – Provenance can help
  • 9. Data Provenance • Information about activities, entities and people who involved in producing a data product • Standards – OPM – PROV • If a Data Lake ensures that every data product’s provenance is in place starting from data product’s origin, critical traceability can be had
  • 10. What provenance perspective could bring to a Data Lake? • Track origins of data, chained transformations • Contribute to reuse determinations of trust and quality • React!! Minimally constrain what enters a Lake?
  • 11. Challenges in Provenance Capturing • Chains of Transformations – Different analytics systems: Hadoop, Spark etc. • Need is end to end integrated provenance across transformations • System specific provenance collection methods are less useful – Integration/stitching problems – E.g.: RAMP, HadoopProv for Hadoop
  • 12. Solution to minimal lake governance • All components in lake stream provenance to central provenance subsystem – Stores provenance for long term queries – Monitors provenance stream in real time • Event in stream represented by edge in provenance graph • Global lake wide policy: Uniform Persistent ID (PID) (Handle, UUIDs, DOIs) attached to all data objects in Data Lake – required to guarantee integrated provenance
  • 13. Model • PID assigned to all data objects – granularity • Transformations T1, T2, and T3 – Distributed – May use different frameworks d1 T1 d2 d3 d4 d5 d6 d7 d8T2 T3 d1d3 d4 d6 d7 d8 Chain of transformations sharing Ids Backward provenance from central provenance store
  • 14. Provenance traces integrate across systems of Data Lake
  • 15. Reference Architecture IngestAPI Batch Processing Ex: Hadoop, Spark Lineage Raw Data from various sources Transformations Workflow Engines Ex: Kepler Legacy Scripts Stream Processing Ex: Storm, Spark Monitoring Debugging Reproducing Data Quality QueriesVisualization Data Data Data Data Import Lineage Data Export Data Lake Messaging System Ingest API Query API ProvenanceSubsystem Prov Stream Processing Prov Storage Prov Stream • Real-time provenance stream processing • Stored provenance for long term usage
  • 16. Prototype Use Case • Different frameworks used – Flume: Captures tweets and write into HDFS – Hadoop Job: Computes hashtag counts – Spark Job: Computes category counts
  • 17. Central provenance store • Uses Komadu – A distributed provenance collection tool – Visualization, Custom Queries I. Suriarachchi, Q. Zhou and B. Plale (2015). Komadu: A Capture and Visualization System for Scientific Data Provenance. Journal of Open Research Software 3(1):e4
  • 18. Client Library • Log4j like API for provenance capture • Dedicated thread pool in provenance layer • Batching to minimize network overhead Application Layer API Komadu Client Layer RabbitMQ Client Layer client.addGeneration(A, E) batching prov thread pool RabbitMQ Server Komadu Client Library
  • 19. Use case evaluation • Flume, Hadoop and Spark jobs instrumented using Komadu client libraries • Jobs stream provenance events into central provenance store (Komadu) • Persistent IDs (UUID) assigned for each data object at entry to data lake; PID persists thereafter with data object
  • 20. Use case evaluation: experimental environment • 5 small VM instances, 2 2.5GhZ cores, 4 GB RAM, 50 GB local storage • 4 VM instances used for HDFS cluster • 3.23 GB Twitter data collected over 5 days running Flume on master node • Hadoop and Spark set up on top of HDFS cluster • Separate instance for RabbitMQ and Komadu
  • 21. Use case evaluation: Metrics • Batch size: – impact of batch size on provenance capture efficiency. Measured by total execution time for Hadoop using provenance event batching mechanism in Komadu library • Overhead of provenance capture: – Measured against total tool-specific execution time – measure overhead of customized value field (in key value pair) – Measure overhead of provenance capture for Hadoop and Spark
  • 22. Batch Size Test • Hadoop job execution times with varying batch sizes • Optimal batch size: ~5000 KB
  • 23. Overhead: Hadoop • custom val: emits PID with key value pair as (#nba, <2, id>) instead of (#nba, 2) • data prov HDFS: writes provenance into HDFS, used by HadoopProv and RAMP
  • 24. Overhead: Spark • Higher provenance capture overhead compared to Hadoop
  • 25. Future Work • Performance overhead is prohibitively high – decouple PID assignment from execution? Examine granularity • Live provenance stream processing for real time monitoring/reaction • Explore minimal provenance at on-line rates and more comprehensive provenance at off- line rates
  • 26. Work funded in part by National Science Foundation OCI-0940824 IEEE E-science 2016 : Hot Topics