data_engineering_basics.pdf


Ketan Patil

Here's your guide to the absolute basics of data engineering: what it is and why it matters.


DEMYSTIFYING DATA
ENGINEERING
BASICS & GETTING STARTED

Source: The AI Hierarchy of Needs - Monica Rogati

Natural Language Processing, Artificial Intelligence, Machine Learning and Deep
Learning all need a strong data foundation.

Where to begin?
There is nothing! A huge mess.

● Data engineers design and build pipelines that transform and transport
data into a format such that, by the time it reaches the data scientists or
other end users, it is in a highly usable state. These pipelines must take
data from many disparate sources and collect it into a single
warehouse that represents the data uniformly as a single source of truth.
● Designing, building and scaling systems that organize data for analytics.
● Data engineers prepare the big data infrastructure that data scientists
rely on for their analysis.
● Data engineering is the process of designing and building systems that let
people collect and analyze raw data from multiple sources and formats.

SKILL SET
Software Development + Cloud Computing + Big Data + Databases

ROLES
Data Engineer:
● Data engineers work in a variety of settings to build systems that collect, manage, and convert raw
data into usable information for data scientists and business analysts to interpret.
Data Scientist:
● Data scientists use linear algebra and multivariable calculus to create new insight from existing data.
Business Analyst:
● Analysis and exploration of historical data → identify trends, patterns & understand the information →
drive business change

ETL (EXTRACT, TRANSFORM, LOAD)
the absolute core of Data Engineering
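
The three ETL steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how any particular tool does it: the two inline sources, the `payments` table, and the helper names `extract`, `transform`, `load`, and `run_etl` are all hypothetical, and SQLite stands in for a real warehouse.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw inputs standing in for two disparate sources.
CSV_SOURCE = "user_id,amount\n1,10.50\n2,not_a_number\n3,7.25\n"
JSON_SOURCE = '[{"user": 4, "total": "3.00"}, {"user": 5, "total": "12.75"}]'


def extract():
    """Extract: read raw records from each source without changing them."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    json_rows = json.loads(JSON_SOURCE)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Transform: map both layouts onto one (user_id, amount) schema."""
    unified = [(row["user_id"], row["amount"]) for row in csv_rows]
    unified += [(row["user"], row["total"]) for row in json_rows]

    clean = []
    for user_id, amount in unified:
        try:
            clean.append((int(user_id), float(amount)))
        except ValueError:
            continue  # drop records whose values cannot be parsed
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a single destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (user_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run the three steps in order and return the rows that were loaded."""
    csv_rows, json_rows = extract()
    rows = transform(csv_rows, json_rows)
    load(rows, conn)
    return rows


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # in-memory stand-in warehouse
    print(run_etl(connection))
```

The malformed record (`not_a_number`) is dropped during the transform step, so only clean rows reach the destination table.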

V’s of BIG DATA
Volume
◾ How much data you have
Velocity
◾ How fast data is getting to you
Variety
◾ How different your data is
Veracity
◾ How reliable your data is

TYPES
Unstructured/Raw data
● Unprocessed data in the format used at the source: text, CSV, image, video, etc.
● High Latency
● No schema applied
● Stored in Google Cloud Storage, AWS S3
● Tools such as Snowflake and MongoDB provide their own ways to query unstructured data
Structured/Processed data
● Raw data with schema applied
● Stored in event tables/destinations in pipelines
● Analytics query language: ideally SQL-like
● Low latency data ingestion
● Read-focused: queries read over a large portion of the data
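
To make "raw data with schema applied" concrete, here is a small Python sketch. The `ClickEvent` record, its three fields, and the comma-separated input lines are assumptions made up for illustration; they do not come from the slides.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ClickEvent:
    """Hypothetical schema: every field has a name and a type."""
    user_id: int
    page: str
    timestamp: datetime


def apply_schema(raw_line: str) -> ClickEvent:
    """Turn one raw, untyped text line into a typed, structured record."""
    user_id, page, timestamp = raw_line.strip().split(",")
    return ClickEvent(int(user_id), page, datetime.fromisoformat(timestamp))


# Raw data: plain strings exactly as they might arrive from a source.
raw_lines = [
    "42,/home,2024-05-01T12:00:00\n",
    "7,/pricing,2024-05-01T12:00:05\n",
]

# Structured data: the same information with the schema applied.
events = [apply_schema(line) for line in raw_lines]
```

Once the schema is applied, downstream code can filter or aggregate on `user_id` or `timestamp` without re-parsing strings.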

STREAM PROCESSING
Process data on the fly, as it comes in

Batch vs Stream
 | Batch Processing | Stream Processing
Data scope | Processing over all or most of the data set | Processing over data in a rolling window or the most recent data record
Data size | Large batches of data | Individual records or micro-batches of a few records
Latency | Minutes to hours | On the order of seconds or milliseconds
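
The difference in data scope can be sketched in Python as follows. This is an illustrative toy, assuming a hypothetical list of sensor readings and a window of three records; real batch and stream engines work on far larger inputs.

```python
from collections import deque


def batch_average(readings):
    """Batch: one result computed over the whole data set at once."""
    return sum(readings) / len(readings)


def stream_averages(readings, window_size=3):
    """Stream: emit an updated result per record over a rolling window."""
    window = deque(maxlen=window_size)  # oldest record falls out automatically
    for value in readings:
        window.append(value)
        yield sum(window) / len(window)


readings = [10, 20, 30, 40, 50]  # hypothetical sensor readings

batch_result = batch_average(readings)
stream_results = list(stream_averages(readings))

print(batch_result)    # 30.0
print(stream_results)  # [10.0, 15.0, 20.0, 30.0, 40.0]
```

The batch function needs all the data before it can answer, while the streaming generator produces an answer after every incoming record using only the most recent ones.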

MAP REDUCE
● MapReduce is a processing technique and a
programming model for distributed computing.
● The algorithm contains two important tasks,
namely Map and Reduce. The Map task takes a set of data
and converts it into another set of data, where
individual elements are broken down into tuples
(key/value pairs).
● The Reduce task takes the output from
a Map task as its input and combines those data tuples
into a smaller set of tuples. As the order of the
name MapReduce implies, the Reduce task is always
performed after the Map task.
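
The Map and Reduce tasks described above can be sketched with the classic word-count example. This is a single-process Python illustration of the idea only, not Hadoop's API; the function names and sample documents are hypothetical, and the grouping step in between (often called shuffle) is done here with a plain dictionary.

```python
from collections import defaultdict


def map_phase(document):
    """Map: break one document into (key, value) tuples, one per word."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values that share a key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine each key's values into a smaller set of tuples."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["big data big pipelines", "data pipelines move data"]

# In a real cluster each document could be mapped on a different machine.
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))

print(counts)  # {'big': 2, 'data': 3, 'pipelines': 2, 'move': 1}
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across many workers.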

DATA STORAGE
Relational Database (SQL)
Document Store (NoSQL)

The Data Engineering
Cookbook
https://github.com/andkret/Cookbook

Connect:
● Ketan (LinkedIn)
○ Computer Science ‘24 Grad @ Michigan Tech
○ Ex-Data Engineer @ Abzooba: Abzooba is one of the top 50 Best Data Science firms in
India to work for. It focuses on developing the highest quality analytics products and
services using expertise in Big Data and Cloud, AI, and ML.
○ A constant Learner

33. Connect: ● Ketan (LinkedIn) ○ Computer Science ‘24 Grad @ Michigan Tech ○ Ex - Data Engineer @ Abzooba : Abzooba is one of the top 50 Best Data Science firms in India to work for. Focuses on developing the highest quality analytics products and services using expertise in Big Data and Cloud, AI, and ML. ○ A constant Learner