SlideShare ist ein Scribd-Unternehmen logo
1 von 67
Big Data in 200KM/h
Omid Vahdaty, Big Data Ninja
Disclaimer
● I am not the best, I simply love
what I do VERY much.
● You are more than welcome to
challenge me or anything I have
to say as I could be wrong.
● This Lecture has evolves over
time, this is the 4th iteration.
● Feel Free to send me comments
A little bit about Me
In the Past(web,api, ops db, data warehouse)
5
Then came Big Data...
6
Then came the cloud...
7
Then came the invoice ...
It keeps
growing….
TODAY’S BIG DATA
APPLICATION STACK
PaaS and DC...
MY AWS BIG DATA
APPLICATION STACK
With some “Myst”...
Jargon
Big data?
Architecture?
Considerations?
Challenges?
How to get started?
Big Jargon & Basics concepts u should know
https://amazon-aws-big-data-demystified.ninja/2019/02/18/big-data-jargon-faqs-
and-everything-you-wanted-to-know-and-didnt-ask-about-big-data/
● What is Big Data?
● scale out / up ?
● structured/semi structured/unstructured data ?
● ACID?
● OLAP VS OLTP? == Analytics VS operational
● DIY = Do it Yourself
● PaaS = Platform as a service
Big Data =
When your data outgrows
your infrastructure ability to process
● Volume (x TB processing per day)
● Velocity ( x GB/s)
● Variety (JSON, CSV, events etc)
● Veracity (how much of the data is accurate?)
Challenges creating big data architecture?
● What is the business use case ? How fast do u need the insights?
○ 15 min - 24 hours delay and above → use batch
○ Less than 15 min?
■ Might be batch - depends data source is files or events?.
■ Streaming?
● Sub seconds delay?
● Sub minute delay?
■ Streaming with in flight analytics ?
○ How complex is the compute jobs? Aggregations? joins?
Challenges creating big data architecture?
● What is the Velocity?
○ Under 100K events per second? Not a problem
○ Over 1M events per second? Costly. But doable.
○ Over 1B events per seconds? Not trivial at all.
● Volume ?
○ ~1TB a day ? Not a problem
○ Over ? it depends.
○ Over a petabyte? Well…. It depends.
● Veracity (how are you going to handle different data sources?)
○ Structured (CSV)
○ Semi structured (JSON,XML)
○ Unstructured (pictures, movies etc)
Challenges creating big data architecture?
● Performance targets?
● Costs targets?
● Security restrictions?
● Regulation restriction? privacy?
● Which technology to choose?
● Datacenter or cloud?
● Latency?
● Throughput?
● Concurrency?
● Security Access patterns?
● Pass? Max 7 technologies
● Iaas? Max 4 technologies
Cloud Architecture rules of thumb...
● Decoupling
● Rule of thumb: max 3 technologies in dc, 7 tech max in cloud
○ Do use more b/c: maintenance
○ Training time
○ complexity/simplicity
How to get started on big data architecture?
Get answers to the below:
1. What is the business use case?
a. Volume? velocity? Variety? vercity/
b. Did you map all of data sources?
2. Where should we build a data platform?
a. What is the product? → requirements.
b. Cloud? datacenter?
3. Architecture?
a. DIY? Paas? , Pay as you go? Or fixed? Decoupled?
b. Fast? Cheap? simple?
4. Did you Communicate you plans?
5. Did you map all known challenges?
Where to build the data platform?
Big Data Architecture:
● Data Ingestion
○ Online
■ messaging
■ Streaming
○ Offline
■ Batch
■ Performance aspects
● Data Transformation (Hive)
○ JSON, CSV, TXT, PARQUET, Binary
● Data Modeling - (R, ML, AI, DEEP, SPARK)
● Data Visualization (choose your poison)
● PII regulation + GPDR regulation
● And: Performance... Cost… Security… Simple... Cloud best practices...
Big Data Generic Architecture
Data Ingestion
(file based ETL from remote DC)
Data Transformation ( row to colunar + cleansing)
Data Modeling ( joins/agg/ML/R)
Data Presentation
Text,
RAW
Data Ingestion
A layer in your big data architecture
designed to do one thing : ingest
data via Batch or Streaming, I.e
move (only) data from point A to
point B. from source data to the
next layer in the architecture
(decoupled).
Big Data Generic Architecture | Data Ingestion
Data Ingestion
Data Transformation
Data Modeling
Data Presentation
S3 Considerations
● Security
○ at rest: server side S3-Managed Keys (SSE-S3)
○ at transit: SSL / VPN
○ Hardening: user, IP ACL, write permission only.
● Upload
○ AWS s3 cli
○ Multi part upload
○ Aborting Incomplete Multipart Uploads Using a
Bucket Lifecycle Policy
○ Consider S3 CLI Sync command instead of CP
Sqoop - ETL
● Open source , part of EMR
● HDFS to RDMS and back. Via JDBC.
● E.g BiDirectional ETL from RDS to HDFS
● Unlikely use case: ETL from customer source operational DB.
Flume & Kafka
● Opens source project for streaming & messaging
● Popular
● Generic
● Good practice for many use cases. (a meetup by it self)
● Highly durable, scalable, extension etc.
● Downside : DIY, Non trivial to get started
Data Transfer Options
● Direct Connect (4GB/s?)
● For all other use case
○ S3 multipart upload
○ Compression
○ Security
■ Data at motion
■ Data at rest
Kinesis family of products
● Kinesis Stream - collect@source and near real time processing
○ Near real time
○ High throughput
○ Low cost
○ Easy administration - set desired level of capacity
○ Delivery to : s3,redshift, Dynamo, ...
○ Ingress 1mb, egress 2mbs. Upto 1000 Transaction per second.
○ Not managed!
● Kinesis Analytics - in flight analytics.
● Kin. Firehose - Park you data @ destination.
Kinesis Firehose - for Data parking
● Not for fast lane - no in flight analytics
● Ingest , transform and load to:
○ Kinesis
○ S3
○ Redshift
○ elastic search
● Managed Service
Comparison of Kinesis products
● Streams
○ Sub 1 sec processing latency
○ Choice of stream processor (generic)
○ For smaller events
● Firehose
○ Zero admin
○ 4 targets built in (redshift, s3, search, etc)
○ Buffering 60 sec minimum.
○ For larger “events”
Data
Transformation
A layer in your big data architecture
designed to : Transform and
Cleanse data (row data to columnar
data and convert data types, Fix
bugs in data)
Big Data Generic Architecture | Transformation
Data Ingestion
S3
Data Transformation
Data Modeling
Data Presentation
Quick word about hadoop Ecosystem
EMR ecosystem
● Hive
● Pig
● Hue
● Spark
● Oozie
● Presto
● Ganglia
● Zookeeper (hbase)
● zeppelin
EMR Architecture
● Master node
● Core nodes - like data nodes (with storage: HDFS)
● Task nodes - (extends compute)
● Does Not have Standby Master node
● Best for transient cluster (goes up and down every night)
EMR lesson learned...
● Bigger instance type is good architecture
● Use spot instances - for the tasks only.
● Don't always use TEZ (MR? Spark?)
● Make sure your choose instance with network optimized
● Resize cluster is not recommended
● Bootstrap to automate cluster upon provisioning
● Use Steps to automate steps on running cluster
● Use Glue to share Hive MetaStore
● Good Cost reduction article on EMR
So use EMR for ...
● Most dominant
○ Hive
○ Spark
○ Presto
● And many more….
● Good for:
○ Data transformation
○ Data modeling
○ Batch
○ Machine learning
Hive
● SQL over hadoop.
● Engine: spark, tez, MR
● JDBC / ODBC
● Not good when need to shuffle.
● Not peta scale.
● SerDe json, parquet,regex,text etc.
● Dynamic partitions
● Insert overwrite
● Data Transformation
● Convert to Columnar
Presto
● SQL over hadoop
● Not good always for join on 2 large tables.
● Limited by memory
● Not fault tolerant like hive.
● Optimized for ad hoc queries
● No insert overwrite
● No dynamic partitions.
● Has some connectors : redshift and more
● https://amazon-aws-big-data-
demystified.ninja/2018/07/02/aws-emr-
presto-demystified-everything-you-
wanted-to-know-about-presto/
Pig
● Distributed Shell scripting
● Generating SQL like operations.
● Engine: MR, Tez
● S3, DynamoDB access
● Use Case: for data science who don't know
SQL, for system people, for those who want
to avoid java/scala
● Fair fight compared to hive in term of
performance only
● Good for unstructured files ETL : file to file ,
and use sqoop.
Hue
● Hadoop user experience
● Logs in real time and failures.
● Multiple users
● Native access to S3.
● File browser to HDFS.
● Manipulate metascore
● Job Browser
● Query editor
● Hbase browser
● Sqoop editor, oozier editor, Pig Editor
Orchestration
● EMR Oozie
○ Opens source workflow
■ Workflow: graph of action
■ Coordinator: scheduler jobs
○ Support: hive, sqoop , spark etc.
● Other options: AirFlow, Knime, Luigi, Azkaban,AWS Data Pipeline
Big Data Generic Architecture | Transformation
Data Ingestion
S3
Data Transformation
Data Modeling
Data Visualization
Data Modeling
A layer in your big data architecture
designed to Model data: Joins,
Aggregations, nightly jobs, Machine
learning
Big Data Generic Architecture | Modeling
Data Ingestion
S3
Data Transformation
Data Modeling
Data Presentation
Spark
● In memory
● X10 to X100 times faster from hive
● Good optimizer for distribution
● Rich API
● Spark SQL
● Spark Streaming
● Spark ML (ML lib)
● Spark GraphX (DB graphs)
● SparkR
Spark Streaming
● Near real time (1 sec latency)
● like batch of 1sec windows
● Streaming jobs with API
● DIY = Not relevant to us...
Spark ML
● Classification
● Regression
● Collaborative filtering
● Clustering
● Decomposition
● Code: java, scala, python, sparkR
Spark flavours
● Standalone
● With yarn
● With mesos
Spark SQL
● Same syntax as hive
● Optional JDBC via thrift
● Non trivial learning curve
● Upto X10 faster than hive.
● Works well with Zeppelin (out of the box)
● Does not replaces Hive
● Spark not always faster than hive
● insert overwrite -
Apache Zeppelin
● Notebook - visualizer
● Built in spark integration
● Interactive data analytics
● Easy collaboration.
● Uses SQL
● works on top of Hive/ Spark SQL
● Inside EMR.
● Uses in the background:
○ Shiro
○ Livy
Redshift
● OLAP, not OLTP→ analytics , not transaction
● Fully SQL
● Fully ACID
● No indexing
● Fully managed
● Petabyte Scale
● MPP
● Can create slow queue for queries
○ which are long lasting.
● DO NOT USE FOR transformation.
● Good for : DW, Complex Joins.
EMR vs Redshift
● How much data loaded and unloaded?
● Which operations need to performed?
● Recycling data? → EMR
● History to be analyzed again and again ? → emr
● What the data needs to end up? BI?
● Use spectrum in some use cases. (aggregations)?
● Raw data? S3.
● When to use emr and when redshift?
Hive VS. Redshift
● Amount of concurrency ? low → hive, high → redshift
● Access to customers? Redshift? athena?
● Transformation, Unstructured , batch, ETL → hive.
● Peta scale ? redshift
● Complex joins → Redshift
Sage Maker
● Web notebook (jupiter
based) for data science
● Connects to all your data
sources (s3,athena etc)
● Help you manage the
entire lifecycle machine
learning
● Managed Service
● Used to create a ML to
predict cookie gender
AWS Glue
Shared meta store
Helps with some data transformation (managed service)
Automatic Schema discovery
AWD RDS (aurora, postgres, mysql)
● We used RDS aurora as Operational DB
● We did not use it for big data analytics although it supports upto 64Tb
● It is row based.
● The syntax is missing analytical functions
Big Data Generic Architecture | Modeling
Data Ingestion
S3
Data Transformation
Data Modeling
Data Presentation
Data
Presentation Used ONLY for presenting data for
operational applications or BI, Use
managed service to ensure HA.
Big Data Generic Architecture | Presentation
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Athena
● Presto SQL
● In memory
● Hive metastore for DDL functionality
○ Complex data types
○ Multiple formats
○ Partitions
● Now supports CTAS (No inserts are supported)
● Good for:
○ Read only SQL,
○ Ad hoc query,
○ low cost,
○ Managed
● Good cost reduction article on athena
Summery Take Away Message &
Action Items
Big Data Generic Architecture | Summary
Data Ingestion
S3
Data Transformation
Data Modeling
Data Presentation
Big Data Architecture ?
Faster! Cheaper! Simpler!
How to get started | Call for Action
Lectures: AWS Big Data Demystified lectures #1 until #4
AWS Big Data Demystified Meetup Big Data Demystified meetup
Stay in touch...
● Omid Vahdaty
● +972-54-2384178
● https://big-data-demystified.ninja/
● Join our meetups subscribe to youtube channels
○ https://www.meetup.com/AWS-Big-Data-Demystified/
○ https://www.meetup.com/Big-Data-Demystified/
○ Big Data Demystified YouTube
○ AWS Big Data Demystified YouTube

Weitere ähnliche Inhalte

Was ist angesagt?

Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceHBaseCon
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKKriangkrai Chaonithi
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeItai Yaffe
 
introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)Farzin Bagheri
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ IndixRajesh Muppalla
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLdatamantra
 
When to Use MongoDB
When to Use MongoDBWhen to Use MongoDB
When to Use MongoDBMongoDB
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidCharles Allen
 
When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...MongoDB
 
Improving Organizational Knowledge with Natural Language Processing Enriched ...
Improving Organizational Knowledge with Natural Language Processing Enriched ...Improving Organizational Knowledge with Natural Language Processing Enriched ...
Improving Organizational Knowledge with Natural Language Processing Enriched ...DataWorks Summit
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerMichael Spector
 
What Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and GoWhat Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and GoScyllaDB
 
Traveloka's journey to no ops streaming analytics
Traveloka's journey to no ops streaming analyticsTraveloka's journey to no ops streaming analytics
Traveloka's journey to no ops streaming analyticsRendy Bambang Junior
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageKai Sasaki
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data PipelineManish Kumar
 
HBaseCon 2015: HBase @ CyberAgent
HBaseCon 2015: HBase @ CyberAgentHBaseCon 2015: HBase @ CyberAgent
HBaseCon 2015: HBase @ CyberAgentHBaseCon
 

Was ist angesagt? (20)

Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
 
Google Cloud Spanner Preview
Google Cloud Spanner PreviewGoogle Cloud Spanner Preview
Google Cloud Spanner Preview
 
introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ Indix
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
 
When to Use MongoDB
When to Use MongoDBWhen to Use MongoDB
When to Use MongoDB
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & Druid
 
When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...
 
Improving Organizational Knowledge with Natural Language Processing Enriched ...
Improving Organizational Knowledge with Natural Language Processing Enriched ...Improving Organizational Knowledge with Natural Language Processing Enriched ...
Improving Organizational Knowledge with Natural Language Processing Enriched ...
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
What Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and GoWhat Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and Go
 
Traveloka's journey to no ops streaming analytics
Traveloka's journey to no ops streaming analyticsTraveloka's journey to no ops streaming analytics
Traveloka's journey to no ops streaming analytics
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud Storage
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data Pipeline
 
HBaseCon 2015: HBase @ CyberAgent
HBaseCon 2015: HBase @ CyberAgentHBaseCon 2015: HBase @ CyberAgent
HBaseCon 2015: HBase @ CyberAgent
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 

Ähnlich wie Big Data in 200 km/h | AWS Big Data Demystified #1.3

Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSKimmo Kantojärvi
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the CloudAmihay Zer-Kavod
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseHakan Ilter
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadKrivoy Rog IT Community
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dan Lynn
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...NETWAYS
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data Omid Vahdaty
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsZhenxiao Luo
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Spark Summit
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simpleDori Waldman
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaDataWorks Summit
 
Building data "Py-pelines"
Building data "Py-pelines"Building data "Py-pelines"
Building data "Py-pelines"Rob Winters
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native PlatformSunil Govindan
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native PlatformSunil Govindan
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaDatabricks
 

Ähnlich wie Big Data in 200 km/h | AWS Big Data Demystified #1.3 (20)

Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the Cloud
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use Case
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High load
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simple
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 
Building data "Py-pelines"
Building data "Py-pelines"Building data "Py-pelines"
Building data "Py-pelines"
 
Shootout at the PAAS Corral
Shootout at the PAAS CorralShootout at the PAAS Corral
Shootout at the PAAS Corral
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
 

Mehr von Omid Vahdaty

Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup Omid Vahdaty
 
Couchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedCouchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedOmid Vahdaty
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedOmid Vahdaty
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedOmid Vahdaty
 
The technology of fake news between a new front and a new frontier | Big Dat...
The technology of fake news  between a new front and a new frontier | Big Dat...The technology of fake news  between a new front and a new frontier | Big Dat...
The technology of fake news between a new front and a new frontier | Big Dat...Omid Vahdaty
 
Making your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedMaking your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedOmid Vahdaty
 
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...Omid Vahdaty
 
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...Omid Vahdaty
 
Aerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedOmid Vahdaty
 
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...Omid Vahdaty
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...Omid Vahdaty
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystifiedOmid Vahdaty
 
Emr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedOmid Vahdaty
 
Zeppelin and spark sql demystified
Zeppelin and spark sql demystifiedZeppelin and spark sql demystified
Zeppelin and spark sql demystifiedOmid Vahdaty
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis Omid Vahdaty
 
Introduction to aws dynamo db
Introduction to aws dynamo dbIntroduction to aws dynamo db
Introduction to aws dynamo dbOmid Vahdaty
 
Introduction to NoSql
Introduction to NoSqlIntroduction to NoSql
Introduction to NoSqlOmid Vahdaty
 
Introduction to ETL process
Introduction to ETL process Introduction to ETL process
Introduction to ETL process Omid Vahdaty
 

Mehr von Omid Vahdaty (20)

Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
 
Couchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedCouchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data Demystified
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data Demystified
 
The technology of fake news between a new front and a new frontier | Big Dat...
The technology of fake news  between a new front and a new frontier | Big Dat...The technology of fake news  between a new front and a new frontier | Big Dat...
The technology of fake news between a new front and a new frontier | Big Dat...
 
Making your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedMaking your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data Demystified
 
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
 
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
 
Aerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data Demystified
 
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
 
Emr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystified
 
Zeppelin and spark sql demystified
Zeppelin and spark sql demystifiedZeppelin and spark sql demystified
Zeppelin and spark sql demystified
 
Aws s3 security
Aws s3 securityAws s3 security
Aws s3 security
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
 
Introduction to aws dynamo db
Introduction to aws dynamo dbIntroduction to aws dynamo db
Introduction to aws dynamo db
 
Hive vs. Impala
Hive vs. ImpalaHive vs. Impala
Hive vs. Impala
 
Introduction to NoSql
Introduction to NoSqlIntroduction to NoSql
Introduction to NoSql
 
Introduction to ETL process
Introduction to ETL process Introduction to ETL process
Introduction to ETL process
 

Kürzlich hochgeladen

2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.elesangwon
 
signals in triangulation .. ...Surveying
signals in triangulation .. ...Surveyingsignals in triangulation .. ...Surveying
signals in triangulation .. ...Surveyingsapna80328
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSneha Padhiar
 
Levelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument methodLevelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument methodManicka Mamallan Andavar
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfDrew Moseley
 
Ch10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfCh10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfChristianCDAM
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Erbil Polytechnic University
 
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithmComputer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithmDeepika Walanjkar
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSsandhya757531
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfalene1
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESkarthi keyan
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Coursebim.edu.pl
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingBootNeck1
 
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdfDEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdfAkritiPradhan2
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdfHafizMudaserAhmad
 
Cost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionCost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionSneha Padhiar
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdfCaalaaAbdulkerim
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solidnamansinghjarodiya
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Romil Mishra
 

Kürzlich hochgeladen (20)

2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
 
signals in triangulation .. ...Surveying
signals in triangulation .. ...Surveyingsignals in triangulation .. ...Surveying
signals in triangulation .. ...Surveying
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
 
Levelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument methodLevelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument method
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdf
 
Ch10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfCh10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdf
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
 
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithmComputer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithm
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Course
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event Scheduling
 
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdfDEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdf
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf
 
Cost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionCost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based question
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdf
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solid
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________
 

Big Data in 200 km/h | AWS Big Data Demystified #1.3

  • 1. Big Data in 200KM/h Omid Vahdaty, Big Data Ninja
  • 2. Disclaimer ● I am not the best, I simply love what I do VERY much. ● You are more than welcome to challenge me or anything I have to say as I could be wrong. ● This Lecture has evolves over time, this is the 4th iteration. ● Feel Free to send me comments
  • 3. A little bit about Me
  • 4.
  • 5. In the Past(web,api, ops db, data warehouse) 5
  • 6. Then came Big Data... 6
  • 7. Then came the cloud... 7
  • 8. Then came the invoice ...
  • 10. TODAY’S BIG DATA APPLICATION STACK PaaS and DC...
  • 11. MY AWS BIG DATA APPLICATION STACK With some “Myst”...
  • 13. Big Jargon & Basics concepts u should know https://amazon-aws-big-data-demystified.ninja/2019/02/18/big-data-jargon-faqs- and-everything-you-wanted-to-know-and-didnt-ask-about-big-data/ ● What is Big Data? ● scale out / up ? ● structured/semi structured/unstructured data ? ● ACID? ● OLAP VS OLTP? == Analytics VS operational ● DIY = Do it Yourself ● PaaS = Platform as a service
  • 14. Big Data = When your data outgrows your infrastructure ability to process ● Volume (x TB processing per day) ● Velocity ( x GB/s) ● Variety (JSON, CSV, events etc) ● Veracity (how much of the data is accurate?)
  • 15. Challenges creating big data architecture? ● What is the business use case ? How fast do u need the insights? ○ 15 min - 24 hours delay and above → use batch ○ Less than 15 min? ■ Might be batch - depends data source is files or events?. ■ Streaming? ● Sub seconds delay? ● Sub minute delay? ■ Streaming with in flight analytics ? ○ How complex is the compute jobs? Aggregations? joins?
  • 16. Challenges creating big data architecture? ● What is the Velocity? ○ Under 100K events per second? Not a problem ○ Over 1M events per second? Costly. But doable. ○ Over 1B events per seconds? Not trivial at all. ● Volume ? ○ ~1TB a day ? Not a problem ○ Over ? it depends. ○ Over a petabyte? Well…. It depends. ● Veracity (how are you going to handle different data sources?) ○ Structured (CSV) ○ Semi structured (JSON,XML) ○ Unstructured (pictures, movies etc)
  • 17. Challenges creating big data architecture? ● Performance targets? ● Costs targets? ● Security restrictions? ● Regulation restriction? privacy? ● Which technology to choose? ● Datacenter or cloud? ● Latency? ● Throughput? ● Concurrency? ● Security Access patterns? ● Pass? Max 7 technologies ● Iaas? Max 4 technologies
  • 18. Cloud Architecture rules of thumb... ● Decoupling ● Rule of thumb: max 3 technologies in dc, 7 tech max in cloud ○ Do use more b/c: maintenance ○ Training time ○ complexity/simplicity
  • 19. How to get started on big data architecture? Get answers to the below: 1. What is the business use case? a. Volume? velocity? Variety? vercity/ b. Did you map all of data sources? 2. Where should we build a data platform? a. What is the product? → requirements. b. Cloud? datacenter? 3. Architecture? a. DIY? Paas? , Pay as you go? Or fixed? Decoupled? b. Fast? Cheap? simple? 4. Did you Communicate you plans? 5. Did you map all known challenges?
  • 20. Where to build the data platform?
  • 21. Big Data Architecture: ● Data Ingestion ○ Online ■ messaging ■ Streaming ○ Offline ■ Batch ■ Performance aspects ● Data Transformation (Hive) ○ JSON, CSV, TXT, PARQUET, Binary ● Data Modeling - (R, ML, AI, DEEP, SPARK) ● Data Visualization (choose your poison) ● PII regulation + GPDR regulation ● And: Performance... Cost… Security… Simple... Cloud best practices...
  • 22. Big Data Generic Architecture Data Ingestion (file based ETL from remote DC) Data Transformation ( row to colunar + cleansing) Data Modeling ( joins/agg/ML/R) Data Presentation Text, RAW
  • 23. Data Ingestion A layer in your big data architecture designed to do one thing : ingest data via Batch or Streaming, I.e move (only) data from point A to point B. from source data to the next layer in the architecture (decoupled).
  • 24. Big Data Generic Architecture | Data Ingestion Data Ingestion Data Transformation Data Modeling Data Presentation
  • 25. S3 Considerations ● Security ○ at rest: server side S3-Managed Keys (SSE-S3) ○ at transit: SSL / VPN ○ Hardening: user, IP ACL, write permission only. ● Upload ○ AWS s3 cli ○ Multi part upload ○ Aborting Incomplete Multipart Uploads Using a Bucket Lifecycle Policy ○ Consider S3 CLI Sync command instead of CP
  • 26. Sqoop - ETL ● Open source , part of EMR ● HDFS to RDMS and back. Via JDBC. ● E.g BiDirectional ETL from RDS to HDFS ● Unlikely use case: ETL from customer source operational DB.
  • 27. Flume & Kafka ● Opens source project for streaming & messaging ● Popular ● Generic ● Good practice for many use cases. (a meetup by it self) ● Highly durable, scalable, extension etc. ● Downside : DIY, Non trivial to get started
  • 28. Data Transfer Options ● Direct Connect (4GB/s?) ● For all other use case ○ S3 multipart upload ○ Compression ○ Security ■ Data at motion ■ Data at rest
  • 29. Kinesis family of products ● Kinesis Stream - collect@source and near real time processing ○ Near real time ○ High throughput ○ Low cost ○ Easy administration - set desired level of capacity ○ Delivery to : s3,redshift, Dynamo, ... ○ Ingress 1mb, egress 2mbs. Upto 1000 Transaction per second. ○ Not managed! ● Kinesis Analytics - in flight analytics. ● Kin. Firehose - Park you data @ destination.
  • 30. Kinesis Firehose - for Data parking ● Not for fast lane - no in flight analytics ● Ingest , transform and load to: ○ Kinesis ○ S3 ○ Redshift ○ elastic search ● Managed Service
  • 31. Comparison of Kinesis products ● Streams ○ Sub 1 sec processing latency ○ Choice of stream processor (generic) ○ For smaller events ● Firehose ○ Zero admin ○ 4 targets built in (redshift, s3, search, etc) ○ Buffering 60 sec minimum. ○ For larger “events”
  • 32. Data Transformation A layer in your big data architecture designed to : Transform and Cleanse data (row data to columnar data and convert data types, Fix bugs in data)
  • 33. Big Data Generic Architecture | Transformation Data Ingestion S3 Data Transformation Data Modeling Data Presentation
  • 34. Quick word about hadoop Ecosystem
  • 35. EMR ecosystem ● Hive ● Pig ● Hue ● Spark ● Oozie ● Presto ● Ganglia ● Zookeeper (hbase) ● zeppelin
  • 36. EMR Architecture ● Master node ● Core nodes - like data nodes (with storage: HDFS) ● Task nodes - (extends compute) ● Does Not have Standby Master node ● Best for transient cluster (goes up and down every night)
  • 37. EMR lesson learned... ● Bigger instance type is good architecture ● Use spot instances - for the tasks only. ● Don't always use TEZ (MR? Spark?) ● Make sure your choose instance with network optimized ● Resize cluster is not recommended ● Bootstrap to automate cluster upon provisioning ● Use Steps to automate steps on running cluster ● Use Glue to share Hive MetaStore ● Good Cost reduction article on EMR
  • 38. So use EMR for ... ● Most dominant ○ Hive ○ Spark ○ Presto ● And many more…. ● Good for: ○ Data transformation ○ Data modeling ○ Batch ○ Machine learning
  • 39. Hive ● SQL over hadoop. ● Engine: spark, tez, MR ● JDBC / ODBC ● Not good when need to shuffle. ● Not peta scale. ● SerDe json, parquet,regex,text etc. ● Dynamic partitions ● Insert overwrite ● Data Transformation ● Convert to Columnar
  • 40. Presto ● SQL over hadoop ● Not good always for join on 2 large tables. ● Limited by memory ● Not fault tolerant like hive. ● Optimized for ad hoc queries ● No insert overwrite ● No dynamic partitions. ● Has some connectors : redshift and more ● https://amazon-aws-big-data- demystified.ninja/2018/07/02/aws-emr- presto-demystified-everything-you- wanted-to-know-about-presto/
  • 41. Pig ● Distributed Shell scripting ● Generating SQL like operations. ● Engine: MR, Tez ● S3, DynamoDB access ● Use Case: for data science who don't know SQL, for system people, for those who want to avoid java/scala ● Fair fight compared to hive in term of performance only ● Good for unstructured files ETL : file to file , and use sqoop.
  • 42. Hue ● Hadoop user experience ● Logs in real time and failures. ● Multiple users ● Native access to S3. ● File browser to HDFS. ● Manipulate metascore ● Job Browser ● Query editor ● Hbase browser ● Sqoop editor, oozier editor, Pig Editor
  • 43. Orchestration ● EMR Oozie ○ Opens source workflow ■ Workflow: graph of action ■ Coordinator: scheduler jobs ○ Support: hive, sqoop , spark etc. ● Other options: AirFlow, Knime, Luigi, Azkaban,AWS Data Pipeline
  • 44. Big Data Generic Architecture | Transformation Data Ingestion S3 Data Transformation Data Modeling Data Visualization
  • 45. Data Modeling A layer in your big data architecture designed to Model data: Joins, Aggregations, nightly jobs, Machine learning
  • 46. Big Data Generic Architecture | Modeling Data Ingestion S3 Data Transformation Data Modeling Data Presentation
  • 47. Spark ● In memory ● X10 to X100 times faster from hive ● Good optimizer for distribution ● Rich API ● Spark SQL ● Spark Streaming ● Spark ML (ML lib) ● Spark GraphX (DB graphs) ● SparkR
  • 48. Spark Streaming ● Near real time (1 sec latency) ● like batch of 1sec windows ● Streaming jobs with API ● DIY = Not relevant to us...
  • 49. Spark ML ● Classification ● Regression ● Collaborative filtering ● Clustering ● Decomposition ● Code: java, scala, python, sparkR
  • 50. Spark flavours ● Standalone ● With yarn ● With mesos
  • 51. Spark SQL ● Same syntax as hive ● Optional JDBC via thrift ● Non trivial learning curve ● Upto X10 faster than hive. ● Works well with Zeppelin (out of the box) ● Does not replaces Hive ● Spark not always faster than hive ● insert overwrite -
  • 52. Apache Zeppelin ● Notebook - visualizer ● Built in spark integration ● Interactive data analytics ● Easy collaboration. ● Uses SQL ● works on top of Hive/ Spark SQL ● Inside EMR. ● Uses in the background: ○ Shiro ○ Livy
  • 53. Redshift ● OLAP, not OLTP→ analytics , not transaction ● Fully SQL ● Fully ACID ● No indexing ● Fully managed ● Petabyte Scale ● MPP ● Can create slow queue for queries ○ which are long lasting. ● DO NOT USE FOR transformation. ● Good for : DW, Complex Joins.
  • 54. EMR vs Redshift ● How much data loaded and unloaded? ● Which operations need to performed? ● Recycling data? → EMR ● History to be analyzed again and again ? → emr ● What the data needs to end up? BI? ● Use spectrum in some use cases. (aggregations)? ● Raw data? S3. ● When to use emr and when redshift?
  • 55. Hive VS. Redshift ● Amount of concurrency ? low → hive, high → redshift ● Access to customers? Redshift? athena? ● Transformation, Unstructured , batch, ETL → hive. ● Peta scale ? redshift ● Complex joins → Redshift
  • 56. Sage Maker ● Web notebook (jupiter based) for data science ● Connects to all your data sources (s3,athena etc) ● Help you manage the entire lifecycle machine learning ● Managed Service ● Used to create a ML to predict cookie gender
  • 57. AWS Glue Shared meta store Helps with some data transformation (managed service) Automatic Schema discovery
  • 58. AWD RDS (aurora, postgres, mysql) ● We used RDS aurora as Operational DB ● We did not use it for big data analytics although it supports upto 64Tb ● It is row based. ● The syntax is missing analytical functions
  • 59. Big Data Generic Architecture | Modeling Data Ingestion S3 Data Transformation Data Modeling Data Presentation
  • 60. Data Presentation Used ONLY for presenting data for operational applications or BI, Use managed service to ensure HA.
  • 61. Big Data Generic Architecture | Presentation Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 62. Athena ● Presto SQL ● In memory ● Hive metastore for DDL functionality ○ Complex data types ○ Multiple formats ○ Partitions ● Now supports CTAS (No inserts are supported) ● Good for: ○ Read only SQL, ○ Ad hoc query, ○ low cost, ○ Managed ● Good cost reduction article on athena
  • 63. Summery Take Away Message & Action Items
  • 64. Big Data Generic Architecture | Summary Data Ingestion S3 Data Transformation Data Modeling Data Presentation
  • 65. Big Data Architecture ? Faster! Cheaper! Simpler!
  • 66. How to get started | Call for Action Lectures: AWS Big Data Demystified lectures #1 until #4 AWS Big Data Demystified Meetup Big Data Demystified meetup
  • 67. Stay in touch... ● Omid Vahdaty ● +972-54-2384178 ● https://big-data-demystified.ninja/ ● Join our meetups subscribe to youtube channels ○ https://www.meetup.com/AWS-Big-Data-Demystified/ ○ https://www.meetup.com/Big-Data-Demystified/ ○ Big Data Demystified YouTube ○ AWS Big Data Demystified YouTube