Cruising in Data Lake:
From zero to scale
HERE Technologies | 2019
Sneha Chaphalkar, Nikita Voronin
HERE in numbers
• 200 countries mapped
• 30+ years of experience transforming location technology
• 8,000+ employees in 56 countries, focused on delivering the world’s best map and location technologies
• HERE Maps on board of 100M vehicles and counting
• 28 TB of map data collected per day
• 4 of 5 in-car navigation systems in Europe and North America use HERE maps
• 400+ HERE cars collecting data for maps
• 700,000 3D data points per second per car
“Origins” for our Data Lake – HERE HD Live Map
© 2019 HERE
• Critical role in the self-driving car ecosystem.
• Four key building blocks enable autonomous vehicles: sensing (the eyes of the vehicle), perception, decision making, and high-definition maps, which we call the brain of the vehicle.
• “Map as an extended sensor”: high-definition maps give the vehicle more strategic insight, allowing the car to make more proactive decisions.
“Origins” for our Data Lake
• Multi-layer product, enriched with petabyte-scale data from multiple sources
• Complex graph of batch/streaming jobs, which makes it difficult to untangle and explore data dependencies
Several valuable feeds:
• Source feeds;
• Predictive/training model feeds;
• Automated process feeds;
• Human-in-the-loop interactions.
Lake Range
• 25+ sub-systems processing data
• 30+ TB active data
• 90+ TB total history
• Geo-spatial: regions, multi-regions
• Slow data cadence: every 1-2 hours, daily
• Fast data cadence: under 1 min
• Data-As-A-Service: process metrics, business metrics, live content metrics, quality metrics
What is a Data Lake?
• Centralized repository to store all relational and non-relational data;
• No rigid design: bring raw data with schema-on-read capability, and also prepare data for analysis;
• Multi-faceted use cases:
  • Exploratory analysis;
  • Predictive analytics;
  • Machine learning;
  • Batch reporting;
  • Reverse lookup.
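A minimal sketch of the schema-on-read idea, assuming raw records are kept as JSON lines. The field names (`road_id`, `speed_limit`, `source`) and the helper `read_with_schema` are hypothetical illustrations, not taken from the talk.

```python
import json

# Raw records are stored exactly as they arrived; later records may carry extra fields.
RAW_LINES = [
    '{"road_id": 1, "speed_limit": 50}',
    '{"road_id": 2, "speed_limit": 80, "source": "sensor"}',
]


def read_with_schema(lines, schema):
    """Apply a reader-chosen schema at read time (hypothetical helper).

    `schema` maps column name -> default used when a record lacks the column.
    Fields not named in the schema are simply ignored by this reader.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({col: record.get(col, default) for col, default in schema.items()})
    return rows


# Two readers project the same raw data differently, without rewriting storage.
speed_view = read_with_schema(RAW_LINES, {"road_id": None, "speed_limit": None})
source_view = read_with_schema(RAW_LINES, {"road_id": None, "source": "unknown"})
```

Each consumer decides which columns it needs, so a new field in the raw data does not force a change on existing readers.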
Foundation Blocks for Data analytics platform
• Transformers
• Storage
• Query Engines/Aggregators
Mile Marker 0
Challenges for Data Warehouse
• Blending across multiple DWs
• ETL jobs with a fixed schema: less flexibility
• Long cycle time for column changes, because multiple teams/pipelines are involved
• Single point of failure
• DB maintenance and tuning
• Contention for reads/writes
• High costs
(Diagram: Data Providers, DW (high cost), BI Queries)
Mile Marker 50
AWS EMR Spark Transformers & AWS S3 for storage
• Easy to produce with Spark SQL
• No schema maintenance (“just put more columns”)
• “Readers” are separated from “Writers”
  • Easy to scale
  • No need to orchestrate
  • No competition for resources*
• Easy to share
• Good compression ratio
(Diagram: Transformers, Storage, Query Engines/Aggregators, BI Queries)
Buoys
Guidelines for Storage in AWS S3 Data lake
• Uniform data: single format + single interface.
• Frequent incremental updates: data comes in different forms and formats, from different sources. Update daily.
• One Table = One Source: preserve the external references to combine the data later on.
• Consistency & availability first: data is useless if you can’t use it.
• Security & compliance: compliance with corporate standards + Hello, GDPR!
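The "One Table = One Source" guideline can be illustrated with a small sketch: each table keeps its source's own identifier, and tables are only combined at read time. The table contents, the `road_ref` key, and the helper `join_on_reference` are hypothetical.

```python
# Each table comes from exactly one source and keeps that source's own identifier.
roads = [  # hypothetical table loaded from a "roads" source
    {"road_ref": "R-1", "name": "Main St"},
    {"road_ref": "R-2", "name": "Harbor Rd"},
]
speed_limits = [  # hypothetical table loaded from a "limits" source
    {"road_ref": "R-1", "limit_kmh": 50},
    {"road_ref": "R-3", "limit_kmh": 30},
]


def join_on_reference(left, right, key):
    """Inner-join two lists of row dicts on a shared external reference."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


combined = join_on_reference(roads, speed_limits, "road_ref")
```

Because neither table is rewritten to fit the other, each source can be refreshed independently and the join is repeated when needed.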
Storage
Log or History
• Records every change
• Append writes
• Partitioned by date*
Latest state
• No duplicates
• Partitioned by hash
• Copy-on-write
Lookups, aggregates
• Materialized view
• Partitioned by date*
* or any other monotonically increasing key
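A rough sketch of how the log table and the latest-state table could relate, under stated assumptions: the record fields (`key`, `ts`, `limit`), the bucket count, and all function names are hypothetical, and a real pipeline would do this in Spark over Parquet files rather than in-memory lists.

```python
import hashlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count for the "latest state" table


def log_partition(change):
    """Log/history: partition by the date of the change (append-only)."""
    return f"date={change['ts'][:10]}"


def state_partition(key):
    """Latest state: partition by a stable hash of the record key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"bucket={int(digest, 16) % NUM_BUCKETS}"


def latest_state(log):
    """Derive the latest state from the log: one row per key, newest change wins.

    A fresh result is built on every call instead of mutating earlier output,
    which mirrors the copy-on-write idea at a very small scale.
    """
    newest = {}
    for change in sorted(log, key=lambda c: c["ts"]):
        newest[change["key"]] = change
    buckets = defaultdict(list)
    for key, change in newest.items():
        buckets[state_partition(key)].append(change)
    return dict(buckets)


log = [
    {"key": "road-1", "ts": "2019-05-16T10:00:00", "limit": 50},
    {"key": "road-2", "ts": "2019-05-16T11:00:00", "limit": 80},
    {"key": "road-1", "ts": "2019-05-17T01:15:00", "limit": 30},
]
state = latest_state(log)
```

The log keeps all three changes, while the derived state holds exactly one row per key.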
Principles
Versioned change sets
• Immutable data
Dependency tracking
• “Additional content v5 uses base content v8”
Compatibility
• If two sources have matching dependencies, then they are “compatible”.
Append! Merge! Aggregate!
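One possible reading of the compatibility rule, sketched below: each versioned change set records the versions of the sources it was built from, and two change sets are treated as compatible when every dependency they share is pinned to the same version. This interpretation, the function name `compatible`, and the `signs_v2` / `lanes_v3` sources are assumptions for illustration.

```python
def compatible(deps_a, deps_b):
    """Return True when all dependencies shared by both change sets match.

    Each argument maps dependency name -> version. Change sets with no
    dependency in common count as compatible under this reading.
    """
    shared = deps_a.keys() & deps_b.keys()
    return all(deps_a[name] == deps_b[name] for name in shared)


# "Additional content v5 uses base content v8"
additional_v5 = {"base_content": 8}
signs_v2 = {"base_content": 8}  # hypothetical source built on the same base version
lanes_v3 = {"base_content": 7}  # hypothetical source built on an older base version

can_combine = compatible(additional_v5, signs_v2)
must_rebuild = not compatible(additional_v5, lanes_v3)
```

With immutable, versioned change sets, a check like this only needs the recorded dependency versions and never has to inspect the data itself.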
Speed Boosters
Fast & Slow data pipelines
table/
  date=2019-05-17/
    01:15.parquet.snappy
    01:30.parquet.snappy
    …
    12:45.parquet.snappy
  date=2019-05-16/
    daily.parquet.snappy
• Know your customers
  • How fresh should the data really be?
  • How much are they willing to pay?
  • Is it worth it?
• Fast avenue
  • Writes frequently → many files → slow reads;
  • Limited scope. Example: past 24h only;
  • Cadence: every 15 minutes.
• Slow avenue
  • Buffers → few files → faster reads;
  • Full scope;
  • Cadence: every 4h.
• Combine both!
  • Append often & compress daily.
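The "append often & compress daily" idea can be sketched as a small planner that picks which date partitions still hold many small files and should be rewritten as one daily file. The path layout follows the example above; the function name `plan_compaction` and the sample file list are hypothetical, and the actual rewrite (for example a Spark job) is out of scope here.

```python
from datetime import date

DAILY_FILE = "daily.parquet.snappy"  # name of the compacted file, as in the layout above


def plan_compaction(files, today):
    """Return {partition: [small files]} for partitions older than `today`
    that have not yet been compacted into a single daily file."""
    plan = {}
    for path in files:
        _table, partition, name = path.split("/")
        day = date.fromisoformat(partition.split("=", 1)[1])
        if day < today and name != DAILY_FILE:
            plan.setdefault(partition, []).append(path)
    return plan


files = [
    "table/date=2019-05-17/01:15.parquet.snappy",
    "table/date=2019-05-17/01:30.parquet.snappy",
    "table/date=2019-05-16/daily.parquet.snappy",
    "table/date=2019-05-15/23:30.parquet.snappy",
    "table/date=2019-05-15/23:45.parquet.snappy",
]
plan = plan_compaction(files, today=date(2019, 5, 17))
```

The current day keeps its frequent small files for freshness, while older days are candidates for a single larger file that is cheaper to read.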
In a nutshell
(Diagram: Transformers, Storage, Query Engines/Aggregators; labels: Low cost, High cost)
What’s next?
Auto-pilot
Cruising Auto-Pilot
Libraries to integrate with the Data Lake
• Data compression
• Data partitioning scheme
• Data dependency tracking
Monitoring and alerting
• Centralized solutions: ELK, Splunk
• Latency monitoring
Bring Your Own Data – Create a platform
Framework for boilerplate abstractions
Built-in data-design patterns
• Historical tracking
• Latest and greatest state of data
Cost optimization
• Auto-scaling techniques
• Query cost
• TTL on storage
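As a small illustration of a TTL on date-partitioned storage, the sketch below selects partitions older than a retention window. The function name `expired_partitions`, the partition names, and the retention values are hypothetical; on S3 the deletion itself would more likely be delegated to a bucket lifecycle rule than done by hand.

```python
from datetime import date, timedelta


def expired_partitions(partitions, today, ttl_days):
    """Return the date partitions whose date is older than `ttl_days` days."""
    cutoff = today - timedelta(days=ttl_days)
    return [p for p in partitions if date.fromisoformat(p.split("=", 1)[1]) < cutoff]


partitions = ["date=2019-05-17", "date=2019-04-01", "date=2019-02-10"]
expired_30 = expired_partitions(partitions, today=date(2019, 5, 17), ttl_days=30)
expired_90 = expired_partitions(partitions, today=date(2019, 5, 17), ttl_days=90)
```

Different tables can carry different TTLs, for example a short window for the fast avenue and a long one for history.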
Questions?
© 2019 HERE16

Weitere ähnliche Inhalte

Was ist angesagt?

Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumModernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumAmazon Web Services
 
DataStax: Steps to successfully implementing NoSQL in the enterprise
DataStax: Steps to successfully implementing NoSQL in the enterpriseDataStax: Steps to successfully implementing NoSQL in the enterprise
DataStax: Steps to successfully implementing NoSQL in the enterpriseDataStax Academy
 
Getting Started With Amazon Redshift
Getting Started With Amazon Redshift Getting Started With Amazon Redshift
Getting Started With Amazon Redshift Matillion
 
How R Developers Can Build and Share Data and AI Applications that Scale with...
How R Developers Can Build and Share Data and AI Applications that Scale with...How R Developers Can Build and Share Data and AI Applications that Scale with...
How R Developers Can Build and Share Data and AI Applications that Scale with...Databricks
 
ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1Mark Kromer
 
Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018Amazon Web Services
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudMark Kromer
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaDatabricks
 
(ADV303) MediaMath’s Data Revolution with Amazon Kinesis and Amazon EMR | AWS...
(ADV303) MediaMath’s Data Revolution with Amazon Kinesis and Amazon EMR | AWS...(ADV303) MediaMath’s Data Revolution with Amazon Kinesis and Amazon EMR | AWS...
(ADV303) MediaMath’s Data Revolution with Amazon Kinesis and Amazon EMR | AWS...Amazon Web Services
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at AirbnbHao Wang
 
ESRI ERUC 2014 - Easy Automation for Process Efficiencies
ESRI ERUC 2014 - Easy Automation for Process EfficienciesESRI ERUC 2014 - Easy Automation for Process Efficiencies
ESRI ERUC 2014 - Easy Automation for Process EfficienciesTammy Kobliuk
 
Why Open Source Works for DevOps Monitoring
Why Open Source Works for DevOps MonitoringWhy Open Source Works for DevOps Monitoring
Why Open Source Works for DevOps MonitoringDevOps.com
 
01 supermapiserverintroduction
01 supermapiserverintroduction01 supermapiserverintroduction
01 supermapiserverintroductionGeoMedeelel
 
Power bi - enterprise cloud reporting platform Azure Bootcamp 19
Power bi - enterprise cloud reporting platform Azure Bootcamp 19Power bi - enterprise cloud reporting platform Azure Bootcamp 19
Power bi - enterprise cloud reporting platform Azure Bootcamp 19Ivan Donev
 
Visualising Lidar Data in ArcGIS Pro - Training - Esri UK Annual Conference 2017
Visualising Lidar Data in ArcGIS Pro - Training - Esri UK Annual Conference 2017Visualising Lidar Data in ArcGIS Pro - Training - Esri UK Annual Conference 2017
Visualising Lidar Data in ArcGIS Pro - Training - Esri UK Annual Conference 2017Esri UK
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudMark Kromer
 
2013 Enterprise Track, Building GIS, Decision Support, and Location Intellige...
2013 Enterprise Track, Building GIS, Decision Support, and Location Intellige...2013 Enterprise Track, Building GIS, Decision Support, and Location Intellige...
2013 Enterprise Track, Building GIS, Decision Support, and Location Intellige...GIS in the Rockies
 

Was ist angesagt? (20)

Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumModernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
 
DataStax: Steps to successfully implementing NoSQL in the enterprise
DataStax: Steps to successfully implementing NoSQL in the enterpriseDataStax: Steps to successfully implementing NoSQL in the enterprise
DataStax: Steps to successfully implementing NoSQL in the enterprise
 
StreamSet ETL tool
StreamSet  ETL toolStreamSet  ETL tool
StreamSet ETL tool
 
Getting Started With Amazon Redshift
Getting Started With Amazon Redshift Getting Started With Amazon Redshift
Getting Started With Amazon Redshift
 
How R Developers Can Build and Share Data and AI Applications that Scale with...
How R Developers Can Build and Share Data and AI Applications that Scale with...How R Developers Can Build and Share Data and AI Applications that Scale with...
How R Developers Can Build and Share Data and AI Applications that Scale with...
 
ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1
 
Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018Modernise your Data Warehouse - AWS Summit Sydney 2018
Modernise your Data Warehouse - AWS Summit Sydney 2018
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks Delta
 
(ADV303) MediaMath’s Data Revolution with Amazon Kinesis and Amazon EMR | AWS...
(ADV303) MediaMath’s Data Revolution with Amazon Kinesis and Amazon EMR | AWS...(ADV303) MediaMath’s Data Revolution with Amazon Kinesis and Amazon EMR | AWS...
(ADV303) MediaMath’s Data Revolution with Amazon Kinesis and Amazon EMR | AWS...
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at Airbnb
 
ESRI ERUC 2014 - Easy Automation for Process Efficiencies
ESRI ERUC 2014 - Easy Automation for Process EfficienciesESRI ERUC 2014 - Easy Automation for Process Efficiencies
ESRI ERUC 2014 - Easy Automation for Process Efficiencies
 
Why Open Source Works for DevOps Monitoring
Why Open Source Works for DevOps MonitoringWhy Open Source Works for DevOps Monitoring
Why Open Source Works for DevOps Monitoring
 
01 supermapiserverintroduction
01 supermapiserverintroduction01 supermapiserverintroduction
01 supermapiserverintroduction
 
Introduction to knime
Introduction to knimeIntroduction to knime
Introduction to knime
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Power bi - enterprise cloud reporting platform Azure Bootcamp 19
Power bi - enterprise cloud reporting platform Azure Bootcamp 19Power bi - enterprise cloud reporting platform Azure Bootcamp 19
Power bi - enterprise cloud reporting platform Azure Bootcamp 19
 
Visualising Lidar Data in ArcGIS Pro - Training - Esri UK Annual Conference 2017
Visualising Lidar Data in ArcGIS Pro - Training - Esri UK Annual Conference 2017Visualising Lidar Data in ArcGIS Pro - Training - Esri UK Annual Conference 2017
Visualising Lidar Data in ArcGIS Pro - Training - Esri UK Annual Conference 2017
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
 
2013 Enterprise Track, Building GIS, Decision Support, and Location Intellige...
2013 Enterprise Track, Building GIS, Decision Support, and Location Intellige...2013 Enterprise Track, Building GIS, Decision Support, and Location Intellige...
2013 Enterprise Track, Building GIS, Decision Support, and Location Intellige...
 

Ähnlich wie Cruising in data lake from zero to scale

Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...DataWorks Summit/Hadoop Summit
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsAshish Mrig
 
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWSACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWSAWS User Group Kochi
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data AnalyticsAmazon Web Services
 
Building_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdfBuilding_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdfAmazon Web Services
 
Knowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceKnowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceCambridge Semantics
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftAmazon Web Services
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Building Fast Applications for Streaming Data
Building Fast Applications for Streaming DataBuilding Fast Applications for Streaming Data
Building Fast Applications for Streaming Datafreshdatabos
 
High Value Business Intelligence for IBM Platform compute environments
High Value Business Intelligence for IBM Platform compute environmentsHigh Value Business Intelligence for IBM Platform compute environments
High Value Business Intelligence for IBM Platform compute environmentsGabor Samu
 
The Future of Data Warehousing and Data Integration
The Future of Data Warehousing and Data IntegrationThe Future of Data Warehousing and Data Integration
The Future of Data Warehousing and Data IntegrationEric Kavanagh
 
Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/MLPreparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/MLAmazon Web Services
 
Moving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalMoving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalVMware Tanzu Korea
 
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...Amazon Web Services
 
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Hortonworks
 
Hp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar SlidesHp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar SlidesHortonworks
 
Evolution from SAP ECC6 to SAP S/4HANA.pptx
Evolution from SAP ECC6 to SAP S/4HANA.pptxEvolution from SAP ECC6 to SAP S/4HANA.pptx
Evolution from SAP ECC6 to SAP S/4HANA.pptxRiponKumarPaul
 

Ähnlich wie Cruising in data lake from zero to scale (20)

Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data Platforms
 
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWSACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data Analytics
 
Building_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdfBuilding_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdf
 
Knowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceKnowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data Science
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Building Fast Applications for Streaming Data
Building Fast Applications for Streaming DataBuilding Fast Applications for Streaming Data
Building Fast Applications for Streaming Data
 
High Value Business Intelligence for IBM Platform compute environments
High Value Business Intelligence for IBM Platform compute environmentsHigh Value Business Intelligence for IBM Platform compute environments
High Value Business Intelligence for IBM Platform compute environments
 
The Future of Data Warehousing and Data Integration
The Future of Data Warehousing and Data IntegrationThe Future of Data Warehousing and Data Integration
The Future of Data Warehousing and Data Integration
 
Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/MLPreparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML
 
Moving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalMoving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from Pivotal
 
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
 
Ask bigger questions
Ask bigger questionsAsk bigger questions
Ask bigger questions
 
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
 
Hp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar SlidesHp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar Slides
 
Evolution from SAP ECC6 to SAP S/4HANA.pptx
Evolution from SAP ECC6 to SAP S/4HANA.pptxEvolution from SAP ECC6 to SAP S/4HANA.pptx
Evolution from SAP ECC6 to SAP S/4HANA.pptx
 

Mehr von John Varghese

Lessons Learned From Cloud Migrations: Planning is Everything
Lessons Learned From Cloud Migrations: Planning is EverythingLessons Learned From Cloud Migrations: Planning is Everything
Lessons Learned From Cloud Migrations: Planning is EverythingJohn Varghese
 
Leveraging AWS Cloudfront & S3 Services to Deliver Static Assets of a SPA
Leveraging AWS Cloudfront & S3 Services to Deliver Static Assets of a SPALeveraging AWS Cloudfront & S3 Services to Deliver Static Assets of a SPA
Leveraging AWS Cloudfront & S3 Services to Deliver Static Assets of a SPAJohn Varghese
 
AWS Transit Gateway-Benefits and Best Practices
AWS Transit Gateway-Benefits and Best PracticesAWS Transit Gateway-Benefits and Best Practices
AWS Transit Gateway-Benefits and Best PracticesJohn Varghese
 
Bridging Operations and Development With Observabilty
Bridging Operations and Development With ObservabiltyBridging Operations and Development With Observabilty
Bridging Operations and Development With ObservabiltyJohn Varghese
 
Security Observability for Cloud Based Applications
Security Observability for Cloud Based ApplicationsSecurity Observability for Cloud Based Applications
Security Observability for Cloud Based ApplicationsJohn Varghese
 
Building an IoT System to Protect My Lunch
Building an IoT System to Protect My LunchBuilding an IoT System to Protect My Lunch
Building an IoT System to Protect My LunchJohn Varghese
 
Building a Highly Secure S3 Bucket
Building a Highly Secure S3 BucketBuilding a Highly Secure S3 Bucket
Building a Highly Secure S3 BucketJohn Varghese
 
Reduce Amazon RDS Costs up to 50% with Proxies
Reduce Amazon RDS Costs up to 50% with ProxiesReduce Amazon RDS Costs up to 50% with Proxies
Reduce Amazon RDS Costs up to 50% with ProxiesJohn Varghese
 
Keynote - Lead the change around you
Keynote - Lead the change around youKeynote - Lead the change around you
Keynote - Lead the change around youJohn Varghese
 
AWS Systems manager 2019
AWS Systems manager 2019AWS Systems manager 2019
AWS Systems manager 2019John Varghese
 
Acd19 kubertes cluster at scale on aws at intuit
Acd19 kubertes cluster at scale on aws at intuitAcd19 kubertes cluster at scale on aws at intuit
Acd19 kubertes cluster at scale on aws at intuitJohn Varghese
 
Emerging job trends and best practices in the aws community
Emerging job trends and best practices in the aws communityEmerging job trends and best practices in the aws community
Emerging job trends and best practices in the aws communityJohn Varghese
 
Automating security in aws with divvy cloud
Automating security in aws with divvy cloudAutomating security in aws with divvy cloud
Automating security in aws with divvy cloudJohn Varghese
 
AWS temporary credentials challenges in prevention detection mitigation
AWS temporary credentials   challenges in prevention detection mitigationAWS temporary credentials   challenges in prevention detection mitigation
AWS temporary credentials challenges in prevention detection mitigationJohn Varghese
 
Securing aws workloads with embedded application security
Securing aws workloads with embedded application securitySecuring aws workloads with embedded application security
Securing aws workloads with embedded application securityJohn Varghese
 
Of CORS thats a thing how CORS in the cloud still kills security
Of CORS thats a thing how CORS in the cloud still kills securityOf CORS thats a thing how CORS in the cloud still kills security
Of CORS thats a thing how CORS in the cloud still kills securityJohn Varghese
 
Native cloud security monitoring
Native cloud security monitoringNative cloud security monitoring
Native cloud security monitoringJohn Varghese
 
Last year in AWS - 2019
Last year in AWS - 2019Last year in AWS - 2019
Last year in AWS - 2019John Varghese
 
Gpu accelerated BERT deployment on aws
Gpu accelerated BERT deployment on awsGpu accelerated BERT deployment on aws
Gpu accelerated BERT deployment on awsJohn Varghese
 

Mehr von John Varghese (20)

Lessons Learned From Cloud Migrations: Planning is Everything
Lessons Learned From Cloud Migrations: Planning is EverythingLessons Learned From Cloud Migrations: Planning is Everything
Lessons Learned From Cloud Migrations: Planning is Everything
 
Leveraging AWS Cloudfront & S3 Services to Deliver Static Assets of a SPA
Leveraging AWS Cloudfront & S3 Services to Deliver Static Assets of a SPALeveraging AWS Cloudfront & S3 Services to Deliver Static Assets of a SPA
Leveraging AWS Cloudfront & S3 Services to Deliver Static Assets of a SPA
 
AWS Transit Gateway-Benefits and Best Practices
AWS Transit Gateway-Benefits and Best PracticesAWS Transit Gateway-Benefits and Best Practices
AWS Transit Gateway-Benefits and Best Practices
 
Bridging Operations and Development With Observabilty
Bridging Operations and Development With ObservabiltyBridging Operations and Development With Observabilty
Bridging Operations and Development With Observabilty
 
Security Observability for Cloud Based Applications
Security Observability for Cloud Based ApplicationsSecurity Observability for Cloud Based Applications
Security Observability for Cloud Based Applications
 
Who Broke My Crypto
Who Broke My CryptoWho Broke My Crypto
Who Broke My Crypto
 
Building an IoT System to Protect My Lunch
Building an IoT System to Protect My LunchBuilding an IoT System to Protect My Lunch
Building an IoT System to Protect My Lunch
 
Building a Highly Secure S3 Bucket
Building a Highly Secure S3 BucketBuilding a Highly Secure S3 Bucket
Building a Highly Secure S3 Bucket
 
Reduce Amazon RDS Costs up to 50% with Proxies
Reduce Amazon RDS Costs up to 50% with ProxiesReduce Amazon RDS Costs up to 50% with Proxies
Reduce Amazon RDS Costs up to 50% with Proxies
 
Keynote - Lead the change around you
Keynote - Lead the change around youKeynote - Lead the change around you
Keynote - Lead the change around you
 
AWS Systems manager 2019
AWS Systems manager 2019AWS Systems manager 2019
AWS Systems manager 2019
 
Acd19 kubertes cluster at scale on aws at intuit
Acd19 kubertes cluster at scale on aws at intuitAcd19 kubertes cluster at scale on aws at intuit
Acd19 kubertes cluster at scale on aws at intuit
 
Emerging job trends and best practices in the aws community
Emerging job trends and best practices in the aws communityEmerging job trends and best practices in the aws community
Emerging job trends and best practices in the aws community
 
Automating security in aws with divvy cloud
Automating security in aws with divvy cloudAutomating security in aws with divvy cloud
Automating security in aws with divvy cloud
 
AWS temporary credentials challenges in prevention detection mitigation
AWS temporary credentials   challenges in prevention detection mitigationAWS temporary credentials   challenges in prevention detection mitigation
AWS temporary credentials challenges in prevention detection mitigation
 
Securing aws workloads with embedded application security
Securing aws workloads with embedded application securitySecuring aws workloads with embedded application security
Securing aws workloads with embedded application security
 
Of CORS thats a thing how CORS in the cloud still kills security
Of CORS thats a thing how CORS in the cloud still kills securityOf CORS thats a thing how CORS in the cloud still kills security
Of CORS thats a thing how CORS in the cloud still kills security
 
Native cloud security monitoring
Native cloud security monitoringNative cloud security monitoring
Native cloud security monitoring
 
Last year in AWS - 2019
Last year in AWS - 2019Last year in AWS - 2019
Last year in AWS - 2019
 
Gpu accelerated BERT deployment on aws
Gpu accelerated BERT deployment on awsGpu accelerated BERT deployment on aws
Gpu accelerated BERT deployment on aws
 

Cruising in data lake from zero to scale

  • 1. Cruising in Data Lake: From zero to scale. HERE Technologies | 2019. Sneha Chaphalkar, Nikita Voronin
  • 2. HERE in numbers: 200 countries mapped; 30+ years of experience transforming location technology; 8,000+ employees in 56 countries focused on delivering the world’s best map and location technologies; HERE Maps on board of 100M vehicles and counting; 28 TB of map data collected per day; 4 of 5 in-car navigation systems in Europe and North America use HERE maps; 400 HERE cars collecting data for maps; 700,000 3D data points per second per car.
  • 3. “Origins” for our Data Lake: HERE HD Live Map. Critical role in the self-driving car ecosystem. Four key building blocks enable autonomous vehicles: sensing (the eyes of the vehicle), perception, decision making, and high-definition maps, which we call the brain of the vehicle. “Map as an extended sensor”: high-definition maps give the vehicle more strategic insight, allowing the car to make more proactive decisions. © 2019 HERE
  • 4. “Origins” for our Data Lake: a multi-layer product with petabyte-scale data enriched from multiple sources; a complex graph of batch/streaming jobs that makes it difficult to untangle and explore data dependencies. Several valuable feeds: source feeds; predictive/training model feeds; automated process feeds; human-in-the-loop interactions.
  • 5. Lake Range (diagram labels): processing data; 25+ sub-systems; 30+ TB active data; 90+ TB total history; geo-spatial regions, multi-regions; slow data cadence: every 1-2 hours, daily; fast data cadence: under 1 min; Data-as-a-Service; process metrics, business metrics, live content metrics, quality metrics.
  • 6. What is a Data Lake? A centralized repository to store all relational and non-relational data. No rigid design: bring raw data with schema-on-read capability, and also prepare data for analysis. Multi-faceted use cases: batch reporting; exploratory analysis; predictive analytics; machine learning; reverse lookup.
  • 7. Foundation blocks for a data analytics platform: Transformers; Storage; Query Engines/Aggregators.
  • 8. Mile Marker 0: challenges for the Data Warehouse (DW). Blending across multiple DWs; ETL jobs with a fixed schema and less flexibility; long cycle time for column changes due to multiple teams/pipelines; single point of failure; DB maintenance and tuning; contention for reads/writes; high costs. Diagram labels: Data Providers; BI Queries; High Cost DW.
  • 9. Mile Marker 50: AWS EMR Spark transformers & AWS S3 for storage. Easy to produce with Spark SQL; no schema maintenance (“just put more columns”); “readers” are separated from “writers” (easy to scale; no need to orchestrate; no competition for resources*); easy to share; good compression ratio. Diagram labels: Transformers; Storage; Query Engines/Aggregators; BI Queries.
  • 10. Buoys: guidelines for storage in an AWS S3 data lake. Uniform data: single format + single interface. Frequent incremental updates. Data comes in different forms and formats, from different sources. Update daily. One Table = One Source: preserve the external references to combine the data later on. Consistency & availability first: data is useless if you can’t use it. Security & compliance: compliance with corporate standards + Hello, GDPR!
  • 11. Mile Marker 50, Storage. Log or History: records every change; append writes; partitioned by date*. Latest state: no duplicates; partitioned by hash; copy-on-write. Lookups, aggregates: materialized view; partitioned by date*. (* or any other monotonically increasing key.) Principles: versioned change sets; immutable data; dependency tracking (“Additional content v5 uses base content v8”); compatibility (if two sources have matching dependencies, then they are “compatible”). Append! Merge! Aggregate!
  • 12. Speed Boosters: fast & slow data pipelines. Know your customers: How fresh should the data really be? How much are they willing to pay? Is it worth it? Fast avenue: writes frequently → many files → slow reads; limited scope (example: past 24h only); cadence: every 15 minutes. Slow avenue: buffers → few files → faster reads; full scope; cadence: every 4h. Combine both: append often & compress daily. Example layout: table/date=2019-05-17/ holds 01:15.parquet.snappy, 01:30.parquet.snappy, …, 12:45.parquet.snappy; table/date=2019-05-16/ holds daily.parquet.snappy.
  • 13. In a nutshell (diagram labels): Transformers; Storage; Query Engines/Aggregators; low cost; high cost.
  • 15. Cruising Auto-Pilot: Bring Your Own Data, create a platform. Libraries to integrate with the Data Lake: data compression; data partitioning scheme; data dependency tracking. Monitoring and alerting: centralized solutions (ELK, Splunk); latency monitoring. Framework for boilerplate abstractions. Built-in data-design patterns: historical tracking; latest-and-greatest state of data. Cost optimization: auto-scaling techniques; query cost; TTL on storage.
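
The storage patterns named in the slides (an append-only history partitioned by date, a de-duplicated latest state, and dependency-based compatibility) can be sketched in a few lines. This is a minimal in-memory illustration in plain Python, not the HERE implementation: the record fields (`key`, `version`, `value`), the `roads` table, and the function names are all hypothetical, and reading "matching dependencies" as "every shared dependency is pinned to the same version" is an assumption.

```python
from collections import defaultdict
from datetime import date


def partition_path(table: str, day: date) -> str:
    # Hive-style date partition, mirroring the "table/date=2019-05-17/" layout.
    return f"{table}/date={day.isoformat()}/"


def append_change(log: dict, table: str, day: date, key: str, version: int, value: str) -> None:
    # History pattern: never overwrite, only append to the day's partition.
    log[partition_path(table, day)].append({"key": key, "version": version, "value": value})


def latest_state(log: dict) -> dict:
    # Latest-state pattern: one record per key, highest version wins.
    latest = {}
    for records in log.values():
        for rec in records:
            current = latest.get(rec["key"])
            if current is None or rec["version"] > current["version"]:
                latest[rec["key"]] = rec
    return latest


def compatible(deps_a: dict, deps_b: dict) -> bool:
    # Assumed reading of "matching dependencies": every dependency that both
    # sources declare must be pinned to the same version.
    shared = deps_a.keys() & deps_b.keys()
    return all(deps_a[name] == deps_b[name] for name in shared)


# Hypothetical sample data: two days of changes to a "roads" table.
log = defaultdict(list)
append_change(log, "roads", date(2019, 5, 16), "segment-1", 1, "speed=50")
append_change(log, "roads", date(2019, 5, 17), "segment-1", 2, "speed=30")
append_change(log, "roads", date(2019, 5, 17), "segment-2", 1, "speed=80")

state = latest_state(log)
print(sorted(log))                   # partitions written so far
print(state["segment-1"]["value"])   # value from the newest version of segment-1
print(compatible({"base": 8}, {"base": 8, "extra": 5}))
```

The history stays immutable and the latest state is derived from it, so the state can be rebuilt at any time. In a Spark-on-S3 setup the same split would typically map to date-partitioned Parquet writes plus a periodic merge job, but that mapping is not spelled out in the slides.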