SlideShare ist ein Scribd-Unternehmen logo
1 von 19
Practical Guide to
Architecting Data Lakes
Presented By Avinash Ramineni
Agenda
• About Clairvoyant
• What is Data Lake ?
• Features of Data Lake
• Tools
• Implementation Challenges
• Questions
3Page
Clairvoyant
4Page
Clairvoyant Services
5Page
What is a Data Lake
“ A data lake is an enterprise-wide system for storing and analyzing disparate sources of data in
their native formats”
“A data lake is a central location in which to store all your data, regardless of its source or format.”
“Is Data lake a replacement or complimentary to EDW ? ”
“Is Data lake just a storage layer ? ”
“ Just having a Hadoop environment is a data lake ? ”
6Page
Data Lake Attributes
• Data Democratization
• Data Discovery
• Data Lineage
• Self-Service capabilities
• Metadata Management
7Page
Data Lake
8Page
Self Service Analytics
9Page
Data Governance
• Data Acquisition - what, when, where of data
• Data Organization – Structure, format
• Data Catalog – what data exists in the lake
• Capturing Metadata
• Data Lineage
• Data Quality
• Data Profile
• Provenance of data at file and record levels
• Business names, descriptions
• Data Provisioning
10Page
11Page
Data Lineage
12Page
Data Lake Challenges
13Page
Guidelines
• Expect structured , semi-structure, unstructured data
• store a metadata or tag for location of schema, unstructured
• Store a copy of raw input
• Raw first mile copy of the data so that we can recover our business or almost
• Replay the business if we need to
• Data Standardization – data clensing as a workflow after ingest
• Use a format that supports your data
• Automate metadata management
14Page
Data Lake Security
15Page
Data Security
16Page
Implementation Challenges
• Change Data Capture
• Mysql – binlog readers
• Oracle - tungsten
• Updating the deltas on to the data lake
• Reusable Data movement workflows
• One workflow for table ? (Generate Dynamic workflows based on metadata)
• Needs to be driven of metadata
• Schema changes on the Source end
• Streaming Data
• Partitioning Strategies on the Data Lake
• Configure them into metadata
17Page
Tools /
Products
• Smart Catalogs
• Waterline Data Inventory
• Collibra Catalog
• Data Lake Management
• Zaloni Bedrock
• Informatica Intelligent Data Lake
• Data Governance and Metadata Management
• Cloudera Navigator
• Apache Atlas
• Collibra Data Governance
• Oracle BigData Catalog
18Page
Data Lake Trends
• Data Lakes on Cloud
• IOT Data Lakes
• Logical Data Lakes
• Unified View of data that exists across data stores
• Data Discovery Portals
19Page
Questions
• Principal @ Clairvoyant
• Email: avinash@clairvoyantsoft.com
• LinkedIn: https://www.linkedin.com/in/avinashramineni

Weitere ähnliche Inhalte

Mehr von clairvoyantllc

Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015 clairvoyantllc
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
Databricks Community Cloud
Databricks Community CloudDatabricks Community Cloud
Databricks Community Cloudclairvoyantllc
 
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014clairvoyantllc
 
Event Driven Architectures - Phoenix Java Users Group 2013
Event Driven Architectures - Phoenix Java Users Group 2013Event Driven Architectures - Phoenix Java Users Group 2013
Event Driven Architectures - Phoenix Java Users Group 2013clairvoyantllc
 
Strata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniStrata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash Ramineniclairvoyantllc
 
HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015clairvoyantllc
 

Mehr von clairvoyantllc (8)

Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
Databricks Community Cloud
Databricks Community CloudDatabricks Community Cloud
Databricks Community Cloud
 
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
 
Event Driven Architectures - Phoenix Java Users Group 2013
Event Driven Architectures - Phoenix Java Users Group 2013Event Driven Architectures - Phoenix Java Users Group 2013
Event Driven Architectures - Phoenix Java Users Group 2013
 
Strata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniStrata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash Ramineni
 
HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015
 

Kürzlich hochgeladen

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 

Kürzlich hochgeladen (20)

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 

Practical guide to architecting data lakes - Avinash Ramineni

  • 1. Practical Guide to Architecting Data Lakes Presented By Avinash Ramineni
  • 2. Agenda • About Clairvoyant • What is Data Lake ? • Features of Data Lake • Tools • Implementation Challenges • Questions
  • 5. 5Page What is a Data Lake “ A data lake is an enterprise-wide system for storing and analyzing disparate sources of data in their native formats” “A data lake is a central location in which to store all your data, regardless of its source or format.” “Is Data lake a replacement or complimentary to EDW ? ” “Is Data lake just a storage layer ? ” “ Just having a Hadoop environment is a data lake ? ”
  • 6. 6Page Data Lake Attributes • Data Democratization • Data Discovery • Data Lineage • Self-Service capabilities • Metadata Management
  • 9. 9Page Data Governance • Data Acquisition - what, when, where of data • Data Organization – Structure, format • Data Catalog – what data exists in the lake • Capturing Metadata • Data Lineage • Data Quality • Data Profile • Provenance of data at file and record levels • Business names, descriptions • Data Provisioning
  • 13. 13Page Guidelines • Expect structured , semi-structure, unstructured data • store a metadata or tag for location of schema, unstructured • Store a copy of raw input • Raw first mile copy of the data so that we can recover our business or almost • Replay the business if we need to • Data Standardization – data clensing as a workflow after ingest • Use a format that supports your data • Automate metadata management
  • 16. 16Page Implementation Challenges • Change Data Capture • Mysql – binlog readers • Oracle - tungsten • Updating the deltas on to the data lake • Reusable Data movement workflows • One workflow for table ? (Generate Dynamic workflows based on metadata) • Needs to be driven of metadata • Schema changes on the Source end • Streaming Data • Partitioning Strategies on the Data Lake • Configure them into metadata
  • 17. 17Page Tools / Products • Smart Catalogs • Waterline Data Inventory • Collibra Catalog • Data Lake Management • Zaloni Bedrock • Informatica Intelligent Data Lake • Data Governance and Metadata Management • Cloudera Navigator • Apache Atlas • Collibra Data Governance • Oracle BigData Catalog
  • 18. 18Page Data Lake Trends • Data Lakes on Cloud • IOT Data Lakes • Logical Data Lakes • Unified View of data that exists across data stores • Data Discovery Portals
  • 19. 19Page Questions • Principal @ Clairvoyant • Email: avinash@clairvoyantsoft.com • LinkedIn: https://www.linkedin.com/in/avinashramineni