SlideShare ist ein Scribd-Unternehmen logo
1 von 35
Downloaden Sie, um offline zu lesen
Data Engineering
April 2019 - STL Big Data User Group
Discussion Topics
Introductions
Background - Chris
Distributed systems - Chris
Day in the life - Tim
Demo - Kit
Why Data Engineering?
● It's not just about analytics.
● Treating data as an asset.
● Data science is built upon the work of good data engineering.
● The data we collect is changing. The way we store and access it must as well.
● Automate, automate, automate.
Distributed Systems
What does it mean?
A distributed system is a system whose components are located on
different networked computers, which communicate and coordinate
their actions by passing messages to one another.
Distributed Compute
1. Overview
2. Scaling
3. What does it mean for you?
Data Architecture - Lambda & Kappa
Lambda: Batch and Speed Layer
● Stores raw data
● Computes tables in batch
● Has streaming component
● Performs data modeling
● Exposes end tables to users
Kappa: Speed Layer
● All data as a stream
● In stream processing
● Real-time views
A Day In The Life
What’s not modern about this architecture?
● Cron job runs once a day to check for presence of a file on an FTP server
● CSV file is downloaded and loaded into an RDBMS star schema
Message Queues
● Publishers
● Exchanges
● Queues
● Consumers
● Messages are removed from
queues once a consumer reads
them
Scaling Up Message Queues
What if a single consumer can’t keep up with the volume?
Scaling Them Up Even Further...
What happens if each message needs to go to more than one set of consumers?
Distributed Logs
● Topic - a named stream of messages
● Partition - an ordered list of message. A topic is made up of a set of partitions
● Producer - a process that publishes messages to a topic
● Consumer - a process that consumes messages from a topic
● Consumer group - a named group of consumers who together consume
messages from a set of topics
Topics can be consumed by multiple consumer groups independently.
Each consumer group is responsible for tracking its own position in message
processing
Messages stay in the log for the defined retention period (are not removed once all
consumer groups have processed them)
Distributed Logs
Data Formats
Apache Avro
● Row oriented
● Fast at writes
● Rich support for schema evolution
●
● {
● "type" : "record",
● "namespace" : "STL Big Data",
● "name" : "Inventory",
● "fields" : [
● { "name" : "date" , "type" : "date" },
● { "name" : "price" , "type" : "int" },
● { "name" : "size" , "type" : "int" }
● ]
● }
Apache Avro
● Row oriented
● Fast at writes
● Rich support for schema evolution
●
● {
● "type" : "record",
● "namespace" : "STL Big Data",
● "name" : "Inventory",
● "fields" : [
● { "name" : "date" , "type" : "date" },
● { "name" : "price" , "type" : "int" },
● { "name" : "size" , "type" : "int" },
● { "name" : "qty", "type" : [ "null", "int" ], "default" : null }
● ]
● }
Column Oriented Storage
● Instead of storing all columns for a row together on disk, store each column
together on disk
● Enables higher compression ratios
● What kind of queries will you be executing against the data?
Apache ORC
● Column oriented
● Enables fast reading
● Compresses extremely well
● Commonly used with Hive and Presto
● Column oriented
● Enables fast reading
● Compresses well
● Commonly used with with Impala
Apache Parquet
Compression Comparison
Format Size
CSV 2.5 GB
Avro 1.2 GB
Parquet 817 MB
ORC 796 MB
Source: Netflix Prize Data https://www.kaggle.com/netflix-inc/netflix-prize-data
Demo
https://github.com/kitmenke/spark-hello-world
AWS Kinesis
API
AWS Comprehend
Appendix
Best in class modern architecture
… At 1904labs, we have a passion for open source
technologies, strive to build cloud first applications, and
are motivated by our desire to transform businesses into
data-driven enterprises.
Infrastructure
Bare Metal / Servers On Site
Build & Maintain
Bundled Vendors
Cloud
These open source tools
are already used broadly
across industries today
Open-Source Adoption in the Industry
1904labs’ primary focus is to
help our clients design, build
and operate world class and
modern data, development
operations and decision science
capabilities...
...Utilizing best of breed open
source and/or commercial open
source tools and platforms.

Weitere ähnliche Inhalte

Was ist angesagt?

MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
Salil Navgire
 
Cassandra
CassandraCassandra
Cassandra
robjk
 

Was ist angesagt? (20)

NoSQL in the context of Social Web
NoSQL in the context of Social WebNoSQL in the context of Social Web
NoSQL in the context of Social Web
 
InfluxDB Internals
InfluxDB InternalsInfluxDB Internals
InfluxDB Internals
 
What’s new in Alluxio 2: from seamless operations to structured data management
What’s new in Alluxio 2: from seamless operations to structured data managementWhat’s new in Alluxio 2: from seamless operations to structured data management
What’s new in Alluxio 2: from seamless operations to structured data management
 
MongoDB
MongoDBMongoDB
MongoDB
 
MongoDB for the SQL Server
MongoDB for the SQL ServerMongoDB for the SQL Server
MongoDB for the SQL Server
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
 
Big data in the cloud
Big data in the cloudBig data in the cloud
Big data in the cloud
 
NoSQL
NoSQLNoSQL
NoSQL
 
Open source Technology
Open source TechnologyOpen source Technology
Open source Technology
 
Mongodb lab
Mongodb labMongodb lab
Mongodb lab
 
Cassandra
CassandraCassandra
Cassandra
 
Practical Use of a NoSQL
Practical Use of a NoSQLPractical Use of a NoSQL
Practical Use of a NoSQL
 
Scalability truths and serverless architectures
Scalability truths and serverless architecturesScalability truths and serverless architectures
Scalability truths and serverless architectures
 
Small intro to Big Data - Old version
Small intro to Big Data - Old versionSmall intro to Big Data - Old version
Small intro to Big Data - Old version
 
Oss as a competitive advantage
Oss as a competitive advantageOss as a competitive advantage
Oss as a competitive advantage
 
Introducción a NoSQL
Introducción a NoSQLIntroducción a NoSQL
Introducción a NoSQL
 
Intoduction of FIrebase Realtime Database
Intoduction of FIrebase Realtime DatabaseIntoduction of FIrebase Realtime Database
Intoduction of FIrebase Realtime Database
 
MongoDB SF Ruby
MongoDB SF RubyMongoDB SF Ruby
MongoDB SF Ruby
 
NoSQL
NoSQLNoSQL
NoSQL
 
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander ZaitsevWebinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
 

Ähnlich wie Data engineering Stl Big Data IDEA user group

(Current22) Let's Monitor The Conditions at the Conference
(Current22) Let's Monitor The Conditions at the Conference(Current22) Let's Monitor The Conditions at the Conference
(Current22) Let's Monitor The Conditions at the Conference
Timothy Spann
 
Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm Concepts
André Dias
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty
 

Ähnlich wie Data engineering Stl Big Data IDEA user group (20)

Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
Large Data Analyze With PyTables
Large Data Analyze With PyTablesLarge Data Analyze With PyTables
Large Data Analyze With PyTables
 
PyTables
PyTablesPyTables
PyTables
 
Py tables
Py tablesPy tables
Py tables
 
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...
 
(Current22) Let's Monitor The Conditions at the Conference
(Current22) Let's Monitor The Conditions at the Conference(Current22) Let's Monitor The Conditions at the Conference
(Current22) Let's Monitor The Conditions at the Conference
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Handout: 'Open Source Tools & Resources'
Handout: 'Open Source Tools & Resources'Handout: 'Open Source Tools & Resources'
Handout: 'Open Source Tools & Resources'
 
Big data @ Hootsuite analtyics
Big data @ Hootsuite analtyicsBig data @ Hootsuite analtyics
Big data @ Hootsuite analtyics
 
Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm Concepts
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
Distributed real time stream processing- why and how
Distributed real time stream processing- why and howDistributed real time stream processing- why and how
Distributed real time stream processing- why and how
 
PyTables
PyTablesPyTables
PyTables
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 

Mehr von Adam Doyle

Mehr von Adam Doyle (20)

ML Ops.pptx
ML Ops.pptxML Ops.pptx
ML Ops.pptx
 
Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering Roles
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations Presentation
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFI
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop Development
 
The new big data
The new big dataThe new big data
The new big data
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEA
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech Stack
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does data
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analytics
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-making
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science Lifecycle
 

Kürzlich hochgeladen

Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
vexqp
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
cnajjemba
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
wsppdmt
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 

Kürzlich hochgeladen (20)

DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 

Data engineering Stl Big Data IDEA user group

  • 1. Data Engineering April 2019 - STL Big Data User Group
  • 2. Discussion Topics Introductions Background - Chris Distributed systems - Chris Day in the life - Tim Demo - Kit
  • 3. Why Data Engineering? ● It's not just about analytics. ● Treating data as an asset. ● Data science is built upon the work of good data engineering. ● The data we collect is changing. The way we store and access it must as well. ● Automate, automate, automate.
  • 4.
  • 6. What does it mean? A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another.
  • 7.
  • 8. Distributed Compute 1. Overview 2. Scaling 3. What does it mean for you?
  • 9. Data Architecture - Lambda & Kappa Lambda: Batch and Speed Layer ● Stores raw data ● Computes tables in batch ● Has streaming component ● Performs data modeling ● Exposes end tables to users Kappa: Speed Layer ● All data as a stream ● In stream processing ● Real-time views
  • 10. A Day In The Life
  • 11. What’s not modern about this architecture? ● Cron job runs once a day to check for presence of a file on an FTP server ● CSV file is downloaded and loaded into an RDBMS star schema
  • 12. Message Queues ● Publishers ● Exchanges ● Queues ● Consumers ● Messages are removed from queues once a consumer reads them
  • 13. Scaling Up Message Queues What if a single consumer can’t keep up with the volume?
  • 14. Scaling Them Up Even Further... What happens if each message needs to go to more than one set of consumers?
  • 15. Distributed Logs ● Topic - a named stream of messages ● Partition - an ordered list of message. A topic is made up of a set of partitions ● Producer - a process that publishes messages to a topic ● Consumer - a process that consumes messages from a topic ● Consumer group - a named group of consumers who together consume messages from a set of topics Topics can be consumed by multiple consumer groups independently. Each consumer group is responsible for tracking its own position in message processing Messages stay in the log for the defined retention period (are not removed once all consumer groups have processed them)
  • 18. Apache Avro ● Row oriented ● Fast at writes ● Rich support for schema evolution ● ● { ● "type" : "record", ● "namespace" : "STL Big Data", ● "name" : "Inventory", ● "fields" : [ ● { "name" : "date" , "type" : "date" }, ● { "name" : "price" , "type" : "int" }, ● { "name" : "size" , "type" : "int" } ● ] ● }
  • 19. Apache Avro ● Row oriented ● Fast at writes ● Rich support for schema evolution ● ● { ● "type" : "record", ● "namespace" : "STL Big Data", ● "name" : "Inventory", ● "fields" : [ ● { "name" : "date" , "type" : "date" }, ● { "name" : "price" , "type" : "int" }, ● { "name" : "size" , "type" : "int" }, ● { "name" : "qty", "type" : [ "null", "int" ], "default" : null } ● ] ● }
  • 20. Column Oriented Storage ● Instead of storing all columns for a row together on disk, store each column together on disk ● Enables higher compression ratios ● What kind of queries will you be executing against the data?
  • 21. Apache ORC ● Column oriented ● Enables fast reading ● Compresses extremely well ● Commonly used with Hive and Presto ● Column oriented ● Enables fast reading ● Compresses well ● Commonly used with with Impala Apache Parquet
  • 22. Compression Comparison Format Size CSV 2.5 GB Avro 1.2 GB Parquet 817 MB ORC 796 MB Source: Netflix Prize Data https://www.kaggle.com/netflix-inc/netflix-prize-data
  • 26.
  • 27. Best in class modern architecture
  • 28. … At 1904labs, we have a passion for open source technologies, strive to build cloud first applications, and are motivated by our desire to transform businesses into data-driven enterprises.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33. Infrastructure Bare Metal / Servers On Site Build & Maintain Bundled Vendors Cloud
  • 34.
  • 35. These open source tools are already used broadly across industries today Open-Source Adoption in the Industry 1904labs’ primary focus is to help our clients design, build and operate world class and modern data, development operations and decision science capabilities... ...Utilizing best of breed open source and/or commercial open source tools and platforms.