AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English

AWS Big Data Demystified #1.1
Big Data Architecture
Lessons Learned
Omid Vahdaty, Big Data Ninja
Disclaimer
● I am not trying to sell anything.
● This is purely knowledge transfer session.
● You are more than welcome to challenge each slide, during the lecture and
afterwards :)
● This lecture was released at 2018, take into account things change over time.
When the data outgrows your ability to process
● Volume (100TB processing per day)
● Velocity (4GB/s)
● Variety (JSON, CSV, events etc)
● Veracity (how much of the data is accurate?)
In the past (web,api, ops db, data warehouse)
API
DW
TODAY’S BIG DATA
APPLICATION STACK
PaaS and DC...
MY BIG DATA
APPLICATION STACK
“demystified”...
MY AWS BIG DATA
APPLICATION STACK
With some “Myst”...
Challenges creating big data architecture?
● What is the business use case ? How fast do u need the insights?
○ 15 min - 24 hours delay and above → use batch
○ Less than 15 min?
■ Might be batch - depends data source is files or events?.
■ Streaming?
● Sub seconds delay?
● Sub minute delay?
■ Streaming with in flight analytics ?
○ How complex is the compute jobs? Aggregations? joins?
Challenges creating big data architecture?
● What is the velocity?
○ Under 100K events per second? Not a problem
○ Over 1M events per second? Costly. But doable.
○ Over 1B events per seconds? Not trivial at all.
● Volume ?
○ ~1TB a day ? Not a problem
○ Over ? it depends.
○ Over a petabyte? Well…. It depends.
● Veracity (how are you going to handle different data sources?)
○ Structured (CSV)
○ Semi structured (JSON,XML)
○ Unstructured (pictures, movies etc)
Challenges creating big data architecture?
● Performance targets?
● Costs targets?
● Security restrictions?
● Regulation restriction? privacy?
● Which technology to choose?
● Datacenter or cloud?
● Latency?
● Throughput?
● Concurrency?
● Security Access patterns?
● Pass? Max 7 technologies
● Iaas? Max 4 technologies
Cloud Architecture rules of thumb...
● Decouple :
○ Store
○ Process
○ Store
○ Process
○ insight...
● Rule of thumb: max 3 technologies in dc, 7 tech max in cloud
○ Do use more b/c: maintenance
○ Training time
○ complexity/simplicity
Use Case 1: Analyzing browsing history
● Data Collection: browsing history from an ISP
● Product - derives user intent and interest for marketing purposes.
● Challenges
○ Velocity: 1 TB per day
○ History of: 3M
○ Remote DC
○ Enterprise grade security
○ Privacy
Use Case 2: Insights from location based data
● Data collection: from a Mobile operator
● Products:
○ derives user intent and interest for marketing purposes.
○ derive location based intent for marketing purposes.
● Challenges
○ Velocity: 4GB/s …
○ Scalability: Rate was expected double every year...
○ Remote DC
○ Enterprise grade security
○ Privacy
○ Edge analytics
Use Case 3: Analyzing location based events.
● Data collection: streaming
● Product: building location based audiences
● Challenges: minimizing DevOps work on maintenance of a
DIY streaming system
So what is the product?
● Big data platform that
○ collects data from multiple sources
○ Analyzes the data
○ Generates insights :
■ Smart Segments (online marketing)
■ Smart reports (for marketer)
■ Audience analysis (for agencies)
● Customers?
○ Marketers
○ Publishers
○ Agencies
My Big Data platform is about:
● Data Collection
○ Online
■ messaging
■ Streaming
○ Offline
■ Batch
■ Performance aspects
● Data Transformation (Hive)
○ JSON, CSV, TXT, PARQUET, Binary
● Data Modeling - (R, ML, AI, DEEP, SPARK)
● Data Visualization (choose your poison)
● PII regulation + GPDR regulation
● And: Performance... Cost… Security… Simple... Cloud best practices...
Big Data Generic Architecture
Data Collection
(file based ETL from remote DC)
Data Transformation ( row to colunar + cleansing)
Data Modeling ( joins/agg/ML/R)
Data Visualization
Text,
RAW
Big Data Generic Architecture | Data Collection
Data Collection
Data Transformation
Data Modeling
Data Visualization
Batch Data collection considerations
● Every hour , about 30GB compressed CSV file
● Why s3
○ Multi part upload
○ S3 CLI
○ S3 SDK
○ (tip : gzip! )
● Why Client - needs to run at remote DC
● Why NOT your own client
○ Involves code →
■ Bugs?
■ maintenance
○ Don't analyze data at Edge , since you cant go back in time.
● Why Not Streaming?
○ less accurate
○ Expensive
S3 Considerations
● Security
○ at rest: server side S3-Managed Keys (SSE-S3)
○ at transit: SSL / VPN
○ Hardening: user, IP ACL, write permission only.
● Upload
○ AWS s3 cli
○ Multi part upload
○ Aborting Incomplete Multipart Uploads Using a
Bucket Lifecycle Policy
○ Consider S3 CLI Sync command instead of CP
Sqoop - ETL
● Open source , part of EMR
● HDFS to RDMS and back. Via JDBC.
● E.g BiDirectional ETL from RDS to HDFS
● Unlikely use case: ETL from customer source DB.
Flume & Kafka
● Opens source project for streaming & messaging
● Popular
● Generic
● Good practice for many use cases. (a meetup by it self)
● Highly durable, scalable, extension etc.
● Downside : DIY, Non trivial to get started
Data Transfer Options
● VPN
● Direct Connect (4GB/s?)
● For all other use case
○ S3 multipart upload
○ Compression
○ Security
■ Data at motion
■ Data at rest
● bandwidth
Quick intro to Stream collection
● Kinesis Client Library (code)
● AWS lambda (code)
● EMR (managed hadoop)
● Third party (DIY)
○ Spark streaming (latency min =1 sec) , near real time, with lot of libraries.
○ Storm - Most real time (sub millisec), java code based.
○ Flink (similar to spark)
Kinesis
● Stream - collect@source and near real time processing
○ Near real time
○ High throughput
○ Low cost
○ Easy administration - set desired level of capacity
○ Delivery to : s3,redshift, Dynamo, ...
○ Ingress 1mb, egress 2mbs. Upto 1000 Transaction per second.
○ Not managed!
● Analytics - in flight analytics.
● Firehose - Park you data @ destination.
Firehose - for Data parking
● Not for fast lane - no in flight analytics
● Capture , transform and load.
○ Kinesis
○ S3
○ Redshift
○ elastic search
● Managed Service
Comparison of Kinesis product
● Streams
○ Sub 1 sec processing latency
○ Choice of stream processor (generic)
○ For smaller events
● Firehose
○ Zero admin
○ 4 targets built in (redshift, s3, search, etc)
○ Buffering 60 sec minimum.
○ For larger “events”
Big Data Generic Architecture | Data Collection
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Big Data Generic Architecture | Transformation
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
EMR ecosystem
● Hive
● Pig
● Hue
● Spark
● Oozie
● Presto
● Ganglia
● Zookeeper (hbase)
● zeppelin
EMR Architecture
● Master node
● Core nodes - like data nodes (with storage: HDFS)
● Task nodes - (extends compute)
● Does Not have Standby Master node
● Best for transient cluster (goes up and down every night)
EMR lesson learned...
● Bigger instance type is good architecture
● Use spot instances - for the tasks.
● Don't always use TEZ (MR? Spark?)
● Make sure your choose instance with network optimized
● Resize cluster is not recommended
● Bootstrap to automate cluster upon provisioning
● Use Steps to automate steps on running cluster
● Use Glue to share Hive MetaStore
So use EMR for ...
● Most dominant
○ Hive
○ Spark
○ Presto
● And many more….
● Good for:
○ Data transformation
○ Data modeling
○ Batch
○ Machine learning
Hive
● SQL over hadoop.
● Engine: spark, tez, MR
● JDBC / ODBC
● Not good when need to shuffle.
● Not peta scale.
● SerDe json, parquet,regex,text etc.
● Dynamic partitions
● Insert overwrite
● Data Transformation
● Convert to Columnar
Presto
● SQL over hadoop
● Not good always for join on 2 large tables.
● Limited by memory
● Not fault tolerant like hive.
● Optimized for ad hoc queries
● No insert overwrite
● No dynamic partitions.
● Has some connectors : redshift and more
● https://amazon-aws-big-data-
demystified.ninja/2018/07/02/aws-emr-
presto-demystified-everything-you-
wanted-to-know-about-presto/
Pig
● Distributed Shell scripting
● Generating SQL like operations.
● Engine: MR, Tez
● S3, DynamoDB access
● Use Case: for data science who don't know
SQL, for system people, for those who want
to avoid java/scala
● Fair fight compared to hive in term of
performance only
● Good for unstructured files ETL : file to file ,
and use sqoop.
Hue
● Hadoop user experience
● Logs in real time and failures.
● Multiple users
● Native access to S3.
● File browser to HDFS.
● Manipulate metascore
● Job Browser
● Query editor
● Hbase browser
● Sqoop editor, oozier editor, Pig Editor
Orchestration
● EMR Oozie
○ Opens source workflow
■ Workflow: graph of action
■ Coordinator: scheduler jobs
○ Support: hive, sqoop , spark etc.
● Other: AirFlow, Knime, Luigi, Azkaban,AWS Data Pipeline
Big Data Generic Architecture | Transformation
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Big Data Generic Architecture | Modeling
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Spark
● In memory
● X10 to X100 times faster
● Good optimizer for distribution
● Rich API
● Spark SQL
● Spark Streaming
● Spark ML (ML lib)
● Spark GraphX (DB graphs)
● SparkR
Spark Streaming
● Near real time (1 sec latency)
● like batch of 1sec windows
● Streaming jobs with API
● Not relevant to us...
Spark ML
● Classification
● Regression
● Collaborative filtering
● Clustering
● Decomposition
● Code: java, scala, python, sparkR
Spark flavours
● Standalone
● With yarn
● With mesos
Spark Downside
● Compute intensive
● Performance gain over mapreduce is not guaranteed.
● Streaming processing is actually batch with very small window.
● Different behaviour between hive and spark SQL
Spark SQL
● Same syntax as hive
● Optional JDBC via thrift
● Non trivial learning curve
● Upto X10 faster than hive.
● Works well with Zeppelin (out of the box)
● Does not replaces Hive
● Spark not always faster than hive
● insert overwrite -
Apache Zeppelin
● Notebook - visualizer
● Built in spark integration
● Interactive data analytics
● Easy collaboration.
● Uses SQL
● work s on top of Hive/ SparkSQL
● Inside EMR.
● Uses in the background:
○ Shiro
○ Livy
R + spark R
● Open source package for statistical computing.
● Works with EMR
● “Matlab” equivalent
● Works with spark
● Not for developer :) for statistician
● R is single threaded - use spark R to distribute.
● Not everything works perfect.
Redshift
● OLAP, not OLTP→ analytics , not transaction
● Fully SQL
● Fully ACID
● No indexing
● Fully managed
● Petabyte Scale
● MPP
● Can create slow queue for queries
○ which are long lasting.
● DO NOT USE FOR transformation.
● Good for : DW, Complex Joins.
Redshift spectrum
● Extension of Redshift, use external table on S3.
● Require redshift cluster.
● Not possible for CTAS to s3, complex data structure, joins.
● Good for
○ Read only Queries
○ Aggregations on Exabyte.
EMR vs Redshift
● How much data loaded and unloaded?
● Which operations need to performed?
● Recycling data? → EMR
● History to be analyzed again and again ? → emr
● What the data needs to end up? BI?
● Use spectrum in some use cases. (aggregations)?
● Raw data? s3.
Hive VS. Redshift
● Amount of concurrency ? low → hive, high → redshift
● Access to customers? Redshift?
● Transformation, Unstructured , batch, ETL → hive.
● Peta scale ? redshift
● Complex joins → Redshift
Big Data Generic Architecture | Modeling
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Big Data Generic Architecture | Visualize
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Athena
● Presto SQL
● In memory
● Hive metastore for DDL functionality
○ Complex data types
○ Multiple formats
○ Partitions
● Good for:
○ Read only SQL,
○ Ad hoc query,
○ low cost,
○ managed
Visualize
● QuickSight
● Managed Visualizer, simple, cheap
Big Data Generic Architecture | Summary
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Summary: Lesson learned
● Productivity of Data Science and Data engineering
○ Common language of both teams IS SQL!
○ Spark cluster has many bridges: SparkR, Spark ML, SparkSQL , Spark core.
● Minimize the amount DB’s used
○ Different syntax (presto/hive/redshift)
○ Different data types
○ Minimize ETLS via External Tables+Glue!
● Not always Streaming is justified (what is the business use case? PaaS?)
● Spark SQL
○ Sometimes faster than redshift
○ Sometimes slower than hive
○ Learning curve is non trivial
● Smart Big Data Architecture is all about:
○ Faster, Cheaper, Simpler, More Secured.
Stay in touch...
● Omid Vahdaty
● +972-54-2384178
● https://amazon-aws-big-data-demystified.ninja/
● Join our meetup, FB group and youtube channel
○ https://www.meetup.com/AWS-Big-Data-Demystified/
○ https://www.facebook.com/groups/amazon.aws.big.data.demystified/
○ https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
1 von 59

Recomendados

Databricks + Snowflake: Catalyzing Data and AI Initiatives von
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks
1.8K views22 Folien
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ... von
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...Amazon Web Services
1.4K views37 Folien
AWS Glue - let's get stuck in! von
AWS Glue - let's get stuck in!AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!Chris Taylor
846 views28 Folien
A Thorough Comparison of Delta Lake, Iceberg and Hudi von
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
11.1K views27 Folien
Data Lakehouse, Data Mesh, and Data Fabric (r2) von
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
6.3K views30 Folien
Cloud comparison - AWS vs Azure vs Google von
Cloud comparison - AWS vs Azure vs GoogleCloud comparison - AWS vs Azure vs Google
Cloud comparison - AWS vs Azure vs GooglePatrick Pierson
1.8K views37 Folien

Más contenido relacionado

Was ist angesagt?

Azure data platform overview von
Azure data platform overviewAzure data platform overview
Azure data platform overviewJames Serra
19.4K views57 Folien
Build Real-Time Applications with Databricks Streaming von
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingDatabricks
964 views13 Folien
Snowflake Overview von
Snowflake OverviewSnowflake Overview
Snowflake OverviewSnowflake Computing
20.6K views21 Folien
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic von
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar ZecevicDataScienceConferenc1
77 views42 Folien
Scaling your Data Pipelines with Apache Spark on Kubernetes von
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
2.1K views37 Folien
Welcome & AWS Big Data Solution Overview von
Welcome & AWS Big Data Solution OverviewWelcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution OverviewAmazon Web Services
7.3K views52 Folien

Was ist angesagt?(20)

Azure data platform overview von James Serra
Azure data platform overviewAzure data platform overview
Azure data platform overview
James Serra19.4K views
Build Real-Time Applications with Databricks Streaming von Databricks
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
Databricks964 views
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic von DataScienceConferenc1
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
Scaling your Data Pipelines with Apache Spark on Kubernetes von Databricks
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks2.1K views
Apache Iceberg - A Table Format for Hige Analytic Datasets von Alluxio, Inc.
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.6.6K views
대규모 온프레미스 하둡 마이그레이션을 위한 실행 전략과 최적화 방안 소개-유철민, AWS Data Architect / 박성열,AWS Pr... von Amazon Web Services Korea
대규모 온프레미스 하둡 마이그레이션을 위한 실행 전략과 최적화 방안 소개-유철민, AWS Data Architect / 박성열,AWS Pr...대규모 온프레미스 하둡 마이그레이션을 위한 실행 전략과 최적화 방안 소개-유철민, AWS Data Architect / 박성열,AWS Pr...
대규모 온프레미스 하둡 마이그레이션을 위한 실행 전략과 최적화 방안 소개-유철민, AWS Data Architect / 박성열,AWS Pr...
AWS EMR Cost optimization von SANG WON PARK
AWS EMR Cost optimizationAWS EMR Cost optimization
AWS EMR Cost optimization
SANG WON PARK1.3K views
Learn to Use Databricks for the Full ML Lifecycle von Databricks
Learn to Use Databricks for the Full ML LifecycleLearn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML Lifecycle
Databricks643 views
Understanding cloud with Google Cloud Platform von Dr. Ketan Parmar
Understanding cloud with Google Cloud PlatformUnderstanding cloud with Google Cloud Platform
Understanding cloud with Google Cloud Platform
Dr. Ketan Parmar8.4K views
Snowflake Data Science and AI/ML at Scale von Adam Doyle
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
Adam Doyle649 views
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트) von Amazon Web Services Korea
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)

Similar a AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English

AWS Big Data Demystified #1: Big data architecture lessons learned von
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
1.7K views56 Folien
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned von
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned Omid Vahdaty
1.2K views81 Folien
Big Data in 200 km/h | AWS Big Data Demystified #1.3 von
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
473 views67 Folien
Introduction to AWS Big Data von
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data Omid Vahdaty
533 views86 Folien
NetflixOSS Meetup season 3 episode 1 von
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
21.4K views90 Folien
Netflix Open Source Meetup Season 4 Episode 2 von
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
19.4K views77 Folien

Similar a AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English(20)

AWS Big Data Demystified #1: Big data architecture lessons learned von Omid Vahdaty
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty1.7K views
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned von Omid Vahdaty
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
Omid Vahdaty1.2K views
Big Data in 200 km/h | AWS Big Data Demystified #1.3 von Omid Vahdaty
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty473 views
Introduction to AWS Big Data von Omid Vahdaty
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
Omid Vahdaty533 views
NetflixOSS Meetup season 3 episode 1 von Ruslan Meshenberg
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
Ruslan Meshenberg21.4K views
Netflix Open Source Meetup Season 4 Episode 2 von aspyker
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker19.4K views
Spark Meetup at Uber von Databricks
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
Databricks15.8K views
Apache Storm Concepts von André Dias
Apache Storm ConceptsApache Storm Concepts
Apache Storm Concepts
André Dias950 views
TRHUG 2015 - Veloxity Big Data Migration Use Case von Hakan Ilter
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use Case
Hakan Ilter2.3K views
Building data "Py-pelines" von Rob Winters
Building data "Py-pelines"Building data "Py-pelines"
Building data "Py-pelines"
Rob Winters299 views
Make your data fly - Building data platform in AWS von Kimmo Kantojärvi
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
Kimmo Kantojärvi319 views
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S... von Spark Summit
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit1.8K views
Machine learning and big data @ uber a tale of two systems von Zhenxiao Luo
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
Zhenxiao Luo2.2K views
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016 von Adrianos Dadis
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Adrianos Dadis5.1K views
The Parquet Format and Performance Optimization Opportunities von Databricks
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks8.2K views
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ... von Hernan Costante
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Hernan Costante1.9K views
Effectively deploying hadoop to the cloud von Avinash Ramineni
Effectively  deploying hadoop to the cloudEffectively  deploying hadoop to the cloud
Effectively deploying hadoop to the cloud
Avinash Ramineni275 views

Más de Omid Vahdaty

Data Pipline Observability meetup von
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup Omid Vahdaty
371 views24 Folien
Couchbase Data Platform | Big Data Demystified von
Couchbase Data Platform | Big Data DemystifiedCouchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedOmid Vahdaty
232 views62 Folien
Machine Learning Essentials Demystified part2 | Big Data Demystified von
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedOmid Vahdaty
544 views112 Folien
Machine Learning Essentials Demystified part1 | Big Data Demystified von
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedOmid Vahdaty
636 views76 Folien
The technology of fake news between a new front and a new frontier | Big Dat... von
The technology of fake news  between a new front and a new frontier | Big Dat...The technology of fake news  between a new front and a new frontier | Big Dat...
The technology of fake news between a new front and a new frontier | Big Dat...Omid Vahdaty
238 views60 Folien
Making your analytics talk business | Big Data Demystified von
Making your analytics talk business | Big Data DemystifiedMaking your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedOmid Vahdaty
354 views48 Folien

Más de Omid Vahdaty(20)

Data Pipline Observability meetup von Omid Vahdaty
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
Omid Vahdaty371 views
Couchbase Data Platform | Big Data Demystified von Omid Vahdaty
Couchbase Data Platform | Big Data DemystifiedCouchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data Demystified
Omid Vahdaty232 views
Machine Learning Essentials Demystified part2 | Big Data Demystified von Omid Vahdaty
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
Omid Vahdaty544 views
Machine Learning Essentials Demystified part1 | Big Data Demystified von Omid Vahdaty
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data Demystified
Omid Vahdaty636 views
The technology of fake news between a new front and a new frontier | Big Dat... von Omid Vahdaty
The technology of fake news  between a new front and a new frontier | Big Dat...The technology of fake news  between a new front and a new frontier | Big Dat...
The technology of fake news between a new front and a new frontier | Big Dat...
Omid Vahdaty238 views
Making your analytics talk business | Big Data Demystified von Omid Vahdaty
Making your analytics talk business | Big Data DemystifiedMaking your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data Demystified
Omid Vahdaty354 views
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H... von Omid Vahdaty
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
Omid Vahdaty379 views
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy... von Omid Vahdaty
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
Omid Vahdaty1.1K views
Aerospike meetup july 2019 | Big Data Demystified von Omid Vahdaty
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data Demystified
Omid Vahdaty289 views
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei... von Omid Vahdaty
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
Omid Vahdaty143 views
AWS Big Data Demystified #4 data governance demystified [security, networ... von Omid Vahdaty
AWS Big Data Demystified #4   data governance demystified   [security, networ...AWS Big Data Demystified #4   data governance demystified   [security, networ...
AWS Big Data Demystified #4 data governance demystified [security, networ...
Omid Vahdaty642 views
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r... von Omid Vahdaty
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
Omid Vahdaty683 views
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive von Omid Vahdaty
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
Omid Vahdaty2.2K views
Amazon aws big data demystified | Introduction to streaming and messaging flu... von Omid Vahdaty
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Omid Vahdaty467 views
Emr spark tuning demystified von Omid Vahdaty
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
Omid Vahdaty7.7K views
Emr zeppelin & Livy demystified von Omid Vahdaty
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystified
Omid Vahdaty1.5K views
Zeppelin and spark sql demystified von Omid Vahdaty
Zeppelin and spark sql demystifiedZeppelin and spark sql demystified
Zeppelin and spark sql demystified
Omid Vahdaty763 views
Introduction to streaming and messaging flume,kafka,SQS,kinesis von Omid Vahdaty
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Omid Vahdaty2.1K views
Introduction to aws dynamo db von Omid Vahdaty
Introduction to aws dynamo dbIntroduction to aws dynamo db
Introduction to aws dynamo db
Omid Vahdaty901 views

Último

MongoDB.pdf von
MongoDB.pdfMongoDB.pdf
MongoDB.pdfArthyR3
49 views6 Folien
Searching in Data Structure von
Searching in Data StructureSearching in Data Structure
Searching in Data Structureraghavbirla63
17 views8 Folien
Plant Design Report-Oil Refinery.pdf von
Plant Design Report-Oil Refinery.pdfPlant Design Report-Oil Refinery.pdf
Plant Design Report-Oil Refinery.pdfSafeen Yaseen Ja'far
7 views10 Folien
Design of machine elements-UNIT 3.pptx von
Design of machine elements-UNIT 3.pptxDesign of machine elements-UNIT 3.pptx
Design of machine elements-UNIT 3.pptxgopinathcreddy
37 views31 Folien
DESIGN OF SPRINGS-UNIT4.pptx von
DESIGN OF SPRINGS-UNIT4.pptxDESIGN OF SPRINGS-UNIT4.pptx
DESIGN OF SPRINGS-UNIT4.pptxgopinathcreddy
21 views47 Folien
Global airborne satcom market report von
Global airborne satcom market reportGlobal airborne satcom market report
Global airborne satcom market reportdefencereport78
6 views13 Folien

Último(20)

MongoDB.pdf von ArthyR3
MongoDB.pdfMongoDB.pdf
MongoDB.pdf
ArthyR349 views
Design of machine elements-UNIT 3.pptx von gopinathcreddy
Design of machine elements-UNIT 3.pptxDesign of machine elements-UNIT 3.pptx
Design of machine elements-UNIT 3.pptx
gopinathcreddy37 views
ASSIGNMENTS ON FUZZY LOGIC IN TRAFFIC FLOW.pdf von AlhamduKure
ASSIGNMENTS ON FUZZY LOGIC IN TRAFFIC FLOW.pdfASSIGNMENTS ON FUZZY LOGIC IN TRAFFIC FLOW.pdf
ASSIGNMENTS ON FUZZY LOGIC IN TRAFFIC FLOW.pdf
AlhamduKure8 views
Design of Structures and Foundations for Vibrating Machines, Arya-ONeill-Pinc... von csegroupvn
Design of Structures and Foundations for Vibrating Machines, Arya-ONeill-Pinc...Design of Structures and Foundations for Vibrating Machines, Arya-ONeill-Pinc...
Design of Structures and Foundations for Vibrating Machines, Arya-ONeill-Pinc...
csegroupvn8 views
Créativité dans le design mécanique à l’aide de l’optimisation topologique von LIEGE CREATIVE
Créativité dans le design mécanique à l’aide de l’optimisation topologiqueCréativité dans le design mécanique à l’aide de l’optimisation topologique
Créativité dans le design mécanique à l’aide de l’optimisation topologique
LIEGE CREATIVE8 views
2023Dec ASU Wang NETR Group Research Focus and Facility Overview.pptx von lwang78
2023Dec ASU Wang NETR Group Research Focus and Facility Overview.pptx2023Dec ASU Wang NETR Group Research Focus and Facility Overview.pptx
2023Dec ASU Wang NETR Group Research Focus and Facility Overview.pptx
lwang78180 views
GDSC Mikroskil Members Onboarding 2023.pdf von gdscmikroskil
GDSC Mikroskil Members Onboarding 2023.pdfGDSC Mikroskil Members Onboarding 2023.pdf
GDSC Mikroskil Members Onboarding 2023.pdf
gdscmikroskil63 views
REACTJS.pdf von ArthyR3
REACTJS.pdfREACTJS.pdf
REACTJS.pdf
ArthyR337 views
Web Dev Session 1.pptx von VedVekhande
Web Dev Session 1.pptxWeb Dev Session 1.pptx
Web Dev Session 1.pptx
VedVekhande17 views
SUMIT SQL PROJECT SUPERSTORE 1.pptx von Sumit Jadhav
SUMIT SQL PROJECT SUPERSTORE 1.pptxSUMIT SQL PROJECT SUPERSTORE 1.pptx
SUMIT SQL PROJECT SUPERSTORE 1.pptx
Sumit Jadhav 22 views

AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English

  • 1. AWS Big Data Demystified #1.1 Big Data Architecture Lessons Learned Omid Vahdaty, Big Data Ninja
  • 2. Disclaimer ● I am not trying to sell anything. ● This is purely knowledge transfer session. ● You are more than welcome to challenge each slide, during the lecture and afterwards :) ● This lecture was released at 2018, take into account things change over time.
  • 3. When the data outgrows your ability to process ● Volume (100TB processing per day) ● Velocity (4GB/s) ● Variety (JSON, CSV, events etc) ● Veracity (how much of the data is accurate?)
  • 4. In the past (web,api, ops db, data warehouse) API DW
  • 5. TODAY’S BIG DATA APPLICATION STACK PaaS and DC...
  • 6. MY BIG DATA APPLICATION STACK “demystified”...
  • 7. MY AWS BIG DATA APPLICATION STACK With some “Myst”...
  • 8. Challenges creating big data architecture? ● What is the business use case ? How fast do u need the insights? ○ 15 min - 24 hours delay and above → use batch ○ Less than 15 min? ■ Might be batch - depends data source is files or events?. ■ Streaming? ● Sub seconds delay? ● Sub minute delay? ■ Streaming with in flight analytics ? ○ How complex is the compute jobs? Aggregations? joins?
  • 9. Challenges creating big data architecture? ● What is the velocity? ○ Under 100K events per second? Not a problem ○ Over 1M events per second? Costly. But doable. ○ Over 1B events per seconds? Not trivial at all. ● Volume ? ○ ~1TB a day ? Not a problem ○ Over ? it depends. ○ Over a petabyte? Well…. It depends. ● Veracity (how are you going to handle different data sources?) ○ Structured (CSV) ○ Semi structured (JSON,XML) ○ Unstructured (pictures, movies etc)
  • 10. Challenges creating big data architecture? ● Performance targets? ● Costs targets? ● Security restrictions? ● Regulation restriction? privacy? ● Which technology to choose? ● Datacenter or cloud? ● Latency? ● Throughput? ● Concurrency? ● Security Access patterns? ● Pass? Max 7 technologies ● Iaas? Max 4 technologies
  • 11. Cloud Architecture rules of thumb... ● Decouple : ○ Store ○ Process ○ Store ○ Process ○ insight... ● Rule of thumb: max 3 technologies in dc, 7 tech max in cloud ○ Do use more b/c: maintenance ○ Training time ○ complexity/simplicity
  • 12. Use Case 1: Analyzing browsing history ● Data Collection: browsing history from an ISP ● Product - derives user intent and interest for marketing purposes. ● Challenges ○ Velocity: 1 TB per day ○ History of: 3M ○ Remote DC ○ Enterprise grade security ○ Privacy
  • 13. Use Case 2: Insights from location based data ● Data collection: from a Mobile operator ● Products: ○ derives user intent and interest for marketing purposes. ○ derive location based intent for marketing purposes. ● Challenges ○ Velocity: 4GB/s … ○ Scalability: Rate was expected double every year... ○ Remote DC ○ Enterprise grade security ○ Privacy ○ Edge analytics
  • 14. Use Case 3: Analyzing location based events. ● Data collection: streaming ● Product: building location based audiences ● Challenges: minimizing DevOps work on maintenance of a DIY streaming system
  • 15. So what is the product? ● Big data platform that ○ collects data from multiple sources ○ Analyzes the data ○ Generates insights : ■ Smart Segments (online marketing) ■ Smart reports (for marketer) ■ Audience analysis (for agencies) ● Customers? ○ Marketers ○ Publishers ○ Agencies
  • 16. My Big Data platform is about: ● Data Collection ○ Online ■ messaging ■ Streaming ○ Offline ■ Batch ■ Performance aspects ● Data Transformation (Hive) ○ JSON, CSV, TXT, PARQUET, Binary ● Data Modeling - (R, ML, AI, DEEP, SPARK) ● Data Visualization (choose your poison) ● PII regulation + GPDR regulation ● And: Performance... Cost… Security… Simple... Cloud best practices...
  • 17. Big Data Generic Architecture Data Collection (file based ETL from remote DC) Data Transformation ( row to colunar + cleansing) Data Modeling ( joins/agg/ML/R) Data Visualization Text, RAW
  • 18. Big Data Generic Architecture | Data Collection Data Collection Data Transformation Data Modeling Data Visualization
  • 19. Batch Data collection considerations ● Every hour , about 30GB compressed CSV file ● Why s3 ○ Multi part upload ○ S3 CLI ○ S3 SDK ○ (tip : gzip! ) ● Why Client - needs to run at remote DC ● Why NOT your own client ○ Involves code → ■ Bugs? ■ maintenance ○ Don't analyze data at Edge , since you cant go back in time. ● Why Not Streaming? ○ less accurate ○ Expensive
  • 20. S3 Considerations ● Security ○ at rest: server side S3-Managed Keys (SSE-S3) ○ at transit: SSL / VPN ○ Hardening: user, IP ACL, write permission only. ● Upload ○ AWS s3 cli ○ Multi part upload ○ Aborting Incomplete Multipart Uploads Using a Bucket Lifecycle Policy ○ Consider S3 CLI Sync command instead of CP
  • 21. Sqoop - ETL ● Open source , part of EMR ● HDFS to RDMS and back. Via JDBC. ● E.g BiDirectional ETL from RDS to HDFS ● Unlikely use case: ETL from customer source DB.
  • 22. Flume & Kafka ● Opens source project for streaming & messaging ● Popular ● Generic ● Good practice for many use cases. (a meetup by it self) ● Highly durable, scalable, extension etc. ● Downside : DIY, Non trivial to get started
  • 23. Data Transfer Options ● VPN ● Direct Connect (4GB/s?) ● For all other use case ○ S3 multipart upload ○ Compression ○ Security ■ Data at motion ■ Data at rest ● bandwidth
  • 24. Quick intro to Stream collection ● Kinesis Client Library (code) ● AWS lambda (code) ● EMR (managed hadoop) ● Third party (DIY) ○ Spark streaming (latency min =1 sec) , near real time, with lot of libraries. ○ Storm - Most real time (sub millisec), java code based. ○ Flink (similar to spark)
  • 25. Kinesis ● Stream - collect@source and near real time processing ○ Near real time ○ High throughput ○ Low cost ○ Easy administration - set desired level of capacity ○ Delivery to : s3,redshift, Dynamo, ... ○ Ingress 1mb, egress 2mbs. Upto 1000 Transaction per second. ○ Not managed! ● Analytics - in flight analytics. ● Firehose - Park you data @ destination.
  • 26. Firehose - for Data parking ● Not for fast lane - no in flight analytics ● Capture , transform and load. ○ Kinesis ○ S3 ○ Redshift ○ elastic search ● Managed Service
  • 27. Comparison of Kinesis product ● Streams ○ Sub 1 sec processing latency ○ Choice of stream processor (generic) ○ For smaller events ● Firehose ○ Zero admin ○ 4 targets built in (redshift, s3, search, etc) ○ Buffering 60 sec minimum. ○ For larger “events”
  • 28. Big Data Generic Architecture | Data Collection Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 29. Big Data Generic Architecture | Transformation Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 30. EMR ecosystem ● Hive ● Pig ● Hue ● Spark ● Oozie ● Presto ● Ganglia ● Zookeeper (hbase) ● zeppelin
  • 31. EMR Architecture ● Master node ● Core nodes - like data nodes (with storage: HDFS) ● Task nodes - (extends compute) ● Does Not have Standby Master node ● Best for transient cluster (goes up and down every night)
  • 32. EMR lesson learned... ● Bigger instance type is good architecture ● Use spot instances - for the tasks. ● Don't always use TEZ (MR? Spark?) ● Make sure your choose instance with network optimized ● Resize cluster is not recommended ● Bootstrap to automate cluster upon provisioning ● Use Steps to automate steps on running cluster ● Use Glue to share Hive MetaStore
  • 33. So use EMR for ... ● Most dominant ○ Hive ○ Spark ○ Presto ● And many more…. ● Good for: ○ Data transformation ○ Data modeling ○ Batch ○ Machine learning
  • 34. Hive ● SQL over hadoop. ● Engine: spark, tez, MR ● JDBC / ODBC ● Not good when need to shuffle. ● Not peta scale. ● SerDe json, parquet,regex,text etc. ● Dynamic partitions ● Insert overwrite ● Data Transformation ● Convert to Columnar
  • 35. Presto ● SQL over hadoop ● Not good always for join on 2 large tables. ● Limited by memory ● Not fault tolerant like hive. ● Optimized for ad hoc queries ● No insert overwrite ● No dynamic partitions. ● Has some connectors : redshift and more ● https://amazon-aws-big-data- demystified.ninja/2018/07/02/aws-emr- presto-demystified-everything-you- wanted-to-know-about-presto/
  • 36. Pig ● Distributed Shell scripting ● Generating SQL like operations. ● Engine: MR, Tez ● S3, DynamoDB access ● Use Case: for data science who don't know SQL, for system people, for those who want to avoid java/scala ● Fair fight compared to hive in term of performance only ● Good for unstructured files ETL : file to file , and use sqoop.
  • 37. Hue ● Hadoop user experience ● Logs in real time and failures. ● Multiple users ● Native access to S3. ● File browser to HDFS. ● Manipulate metascore ● Job Browser ● Query editor ● Hbase browser ● Sqoop editor, oozier editor, Pig Editor
  • 38. Orchestration ● EMR Oozie ○ Opens source workflow ■ Workflow: graph of action ■ Coordinator: scheduler jobs ○ Support: hive, sqoop , spark etc. ● Other: AirFlow, Knime, Luigi, Azkaban,AWS Data Pipeline
  • 39. Big Data Generic Architecture | Transformation Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 40. Big Data Generic Architecture | Modeling Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 41. Spark ● In memory ● X10 to X100 times faster ● Good optimizer for distribution ● Rich API ● Spark SQL ● Spark Streaming ● Spark ML (ML lib) ● Spark GraphX (DB graphs) ● SparkR
  • 42. Spark Streaming ● Near real time (1 sec latency) ● like batch of 1sec windows ● Streaming jobs with API ● Not relevant to us...
  • 43. Spark ML ● Classification ● Regression ● Collaborative filtering ● Clustering ● Decomposition ● Code: java, scala, python, sparkR
  • 44. Spark flavours ● Standalone ● With yarn ● With mesos
  • 45. Spark Downside ● Compute intensive ● Performance gain over mapreduce is not guaranteed. ● Streaming processing is actually batch with very small window. ● Different behaviour between hive and spark SQL
  • 46. Spark SQL ● Same syntax as hive ● Optional JDBC via thrift ● Non trivial learning curve ● Upto X10 faster than hive. ● Works well with Zeppelin (out of the box) ● Does not replaces Hive ● Spark not always faster than hive ● insert overwrite -
  • 47. Apache Zeppelin ● Notebook - visualizer ● Built in spark integration ● Interactive data analytics ● Easy collaboration. ● Uses SQL ● work s on top of Hive/ SparkSQL ● Inside EMR. ● Uses in the background: ○ Shiro ○ Livy
  • 48. R + spark R ● Open source package for statistical computing. ● Works with EMR ● “Matlab” equivalent ● Works with spark ● Not for developer :) for statistician ● R is single threaded - use spark R to distribute. ● Not everything works perfect.
  • 49. Redshift ● OLAP, not OLTP→ analytics , not transaction ● Fully SQL ● Fully ACID ● No indexing ● Fully managed ● Petabyte Scale ● MPP ● Can create slow queue for queries ○ which are long lasting. ● DO NOT USE FOR transformation. ● Good for : DW, Complex Joins.
  • 50. Redshift spectrum ● Extension of Redshift, use external table on S3. ● Require redshift cluster. ● Not possible for CTAS to s3, complex data structure, joins. ● Good for ○ Read only Queries ○ Aggregations on Exabyte.
  • 51. EMR vs Redshift ● How much data loaded and unloaded? ● Which operations need to performed? ● Recycling data? → EMR ● History to be analyzed again and again ? → emr ● What the data needs to end up? BI? ● Use spectrum in some use cases. (aggregations)? ● Raw data? s3.
  • 52. Hive VS. Redshift ● Amount of concurrency ? low → hive, high → redshift ● Access to customers? Redshift? ● Transformation, Unstructured , batch, ETL → hive. ● Peta scale ? redshift ● Complex joins → Redshift
  • 53. Big Data Generic Architecture | Modeling Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 54. Big Data Generic Architecture | Visualize Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 55. Athena ● Presto SQL ● In memory ● Hive metastore for DDL functionality ○ Complex data types ○ Multiple formats ○ Partitions ● Good for: ○ Read only SQL, ○ Ad hoc query, ○ low cost, ○ managed
  • 56. Visualize ● QuickSight ● Managed Visualizer, simple, cheap
  • 57. Big Data Generic Architecture | Summary Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 58. Summary: Lesson learned ● Productivity of Data Science and Data engineering ○ Common language of both teams IS SQL! ○ Spark cluster has many bridges: SparkR, Spark ML, SparkSQL , Spark core. ● Minimize the amount DB’s used ○ Different syntax (presto/hive/redshift) ○ Different data types ○ Minimize ETLS via External Tables+Glue! ● Not always Streaming is justified (what is the business use case? PaaS?) ● Spark SQL ○ Sometimes faster than redshift ○ Sometimes slower than hive ○ Learning curve is non trivial ● Smart Big Data Architecture is all about: ○ Faster, Cheaper, Simpler, More Secured.
  • 59. Stay in touch... ● Omid Vahdaty ● +972-54-2384178 ● https://amazon-aws-big-data-demystified.ninja/ ● Join our meetup, FB group and youtube channel ○ https://www.meetup.com/AWS-Big-Data-Demystified/ ○ https://www.facebook.com/groups/amazon.aws.big.data.demystified/ ○ https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber