SlideShare ist ein Scribd-Unternehmen logo
1 von 56
Big Data Architecture
Lessons Learned
Omid Vahdaty, Big Data Ninja
In the past (web,api, ops db, data warehouse)
API
DW
When the data outgrows your ability to process
● Volume (100TB processing per day)
● Velocity (4GB/s)
● Variety (JSON, CSV, events etc)
● Veracity (how much of the data is accurate?)
TODAY’S BIG DATA
APPLICATION STACK
PaaS and DC...
MY BIG DATA
APPLICATION STACK
“demystified”...
MY AWS BIG DATA
APPLICATION STACK
With some “Myst”...
Use Case 1: Analyzing browsing history
● Data Collection: browsing history from an ISP
● Product - derives user intent and interest for marketing purposes.
● Challenges
○ Velocity: 1 TB per day
○ History of: 3M
○ Remote DC
○ Enterprise grade security
○ Privacy
Use Case 2: Insights from location based data
● Data collection: from a Mobile operator
● Products:
○ derives user intent and interest for marketing purposes.
○ derive location based intent for marketing purposes.
● Challenges
○ Velocity: 4GB/s …
○ Scalability: Rate was expected double every year...
○ Remote DC
○ Enterprise grade security
○ Privacy
○ Edge analytics
Use Case 3: Analyzing location based events.
● Data collection: streaming
● Product: building location based audiences
● Challenges: minimizing DevOps work on maintenance of a DIY streaming
system
So what is the product?
● Big data platform that
○ collects data from multiple sources
○ Analyzes the data
○ Generates insights :
■ Smart Segments (online marketing)
■ Smart reports (for marketer)
■ Audience analysis (for agencies)
● Customers?
○ Marketers
○ Publishers
○ Agencies
My Big Data platform is about:
● Data Collection
○ Online
■ messaging
■ Streaming
○ Offline
■ Batch
■ Performance aspects
● Data Transformation (Hive)
○ JSON, CSV, TXT, PARQUET, Binary
● Data Modeling - (R, ML, AI, DEEP, SPARK)
● Data Visualization (choose your poison)
● PII regulation + GPDR regulation
● And: Performance... Cost… Security… Simple... Cloud best practices...
Cloud Architecture Tips
● Decouple :
○ Store
○ Process
○ Store
○ Process
○ insight...
● Rule of thumb: max 3 technologies in dc, 7 tech max in cloud
○ Do use more b/c: maintenance
○ Training time
○ complexity/simplicity
Big Data Architecture considerations
● unStructured? Structured? Semi structured?
● Latency?
● Throughput?
● Concurrency?
● Security Access patterns?
● Pass? Max 7 technologies
● Iaas? Max 4 technologies
Data ingestions Architecture challenges
● Durability
● HA
● Ingestions types:
○ Batch
○ Stream
Big Data Generic Architecture
Data Collection
(file based ETL from remote DC)
Data Transformation ( row to colunar + cleansing)
Data Modeling ( joins/agg/ML/R)
Data Visualization
Text,
RAW
Big Data Generic Architecture | Data Collection
Data Collection
Data Transformation
Data Modeling
Data Visualization
Batch Data collection considerations
● Every hour , about 30GB compressed CSV file
● Why s3
○ Multi part upload
○ S3 CLI
○ S3 SDK
○ (tip : gzip! )
● Why Client - needs to run at remote DC
● Why NOT your own client
○ Involves code →
■ Bugs?
■ maintenance
○ Don't analyze data at Edge , since you cant go back in time.
● Why Not Streaming?
○ less accurate
○ Expensive
S3 Considerations
● Security
○ at rest: server side S3-Managed Keys (SSE-S3)
○ at transit: SSL / VPN
○ Hardening: user, IP ACL, write permission only.
● Upload
○ AWS s3 cli
○ Multi part upload
○ Aborting Incomplete Multipart Uploads Using a
Bucket Lifecycle Policy
○ Consider S3 CLI Sync command instead of CP
Sqoop - ETL
● Open source , part of EMR
● HDFS to RDMS and back. Via JDBC.
● E.g BiDirectional ETL from RDS to HDFS
● Unlikely use case: ETL from customer source DB.
Flume & Kafka
● Opens source project for streaming & messaging
● Popular
● Generic
● Good practice for many use cases. (a meetup by it self)
● Highly durable, scalable, extension etc.
● Downside : DIY, Non trivial to get started
Data Transfer Options
● VPN
● Direct Connect (4GB/s?)
● For all other use case
○ S3 multipart upload
○ Compression
○ Security
■ Data at motion
■ Data at rest
● bandwidth
Quick intro to Stream collection
● Kinesis Client Library (code)
● AWS lambda (code)
● EMR (managed hadoop)
● Third party (DIY)
○ Spark streaming (latency min =1 sec) , near real time, with lot of libraries.
○ Storm - Most real time (sub millisec), java code based.
○ Flink (similar to spark)
Kinesis
● Stream - collect@source and near real time processing
○ Near real time
○ High throughput
○ Low cost
○ Easy administration - set desired level of capacity
○ Delivery to : s3,redshift, Dynamo, ...
○ Ingress 1mb, egress 2mbs. Upto 1000 Transaction per second.
○ Not managed!
● Analytics - in flight analytics.
● Firehose - Park you data @ destination.
Firehose - for Data parking
● Not for fast lane - no in flight analytics
● Capture , transform and load.
○ Kinesis
○ S3
○ Redshift
○ elastic search
● Managed Service
Comparison of Kinesis product
● Stream
○ Sub 1 sec processing latency
○ Choice of stream processor (generic)
○ For smaller events
● Firehose
○ Zero admin
○ 4 targets built in (redshift, s3, search, etc)
○ Buffering 60 sec minimum.
○ For larger “events”
Big Data Generic Architecture | Data Collection
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Big Data Generic Architecture | Transformation
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
EMR ecosystem
● Hive
● Pig
● Hue
● Spark
● Oozie
● Presto
● Ganglia
● Zookeeper (hbase)
● zeppelin
EMR Architecture
● Master node
● Core nodes - like data nodes (with storage: HDFS)
● Task nodes - (extends compute)
● Does Not have Standby Master node
● Best for transient cluster (goes up and down every night)
EMR lesson learned...
● Bigger instance type is good architecture
● Use spot instances - for the tasks.
● Don't always use TEZ (MR? Spark?)
● Make sure your choose instance with network
optimized
● Resize cluster is not recommended
● Bootstrap to automate cluster upon provisioning
● Use Steps to automate steps on running cluster
● Use Glue to share Hive MetaStore
So used EMR for ...
● Most dominant
○ Hive
○ Spark
○ Presto
● And many more….
● Good for:
○ Data transformation
○ Data modeling
○ Batch
○ Machine learning
Hive
● SQL over hadoop.
● Engine: spark, tez, MR
● JDBC / ODBC
● Not good when need to shuffle.
● Not peta scale.
● SerDe json, parquet,regex,text etc.
● Dynamic partitions
● Insert overwrite
● Data Transformation
● Convert to Columnar
Presto
● SQL over hadoop
● Not good always for join on 2 large tables.
● Limited by memory
● Not fault tolerant like hive.
● Optimized for ad hoc queries
● No insert overwrite
● No dynamic partitions.
Pig
● Distributed Shell scripting
● Generating SQL like operations.
● Engine: MR, Tez
● S3, DynamoDB access
● Use Case: for data science who don't know
SQL, for system people, for those who want to
avoid java/scala
● Fair fight compared to hive in term of
performance only
● Good for unstructured files ETL : file to file ,
and use sqoop.
Hue
● Hadoop user experience
● Logs in real time and failures.
● Multiple users
● Native access to S3.
● File browser to HDFS.
● Manipulate metascore
● Job Browser
● Query editor
● Hbase browser
● Sqoop editor, oozier editor, Pig Editor
Orchestration
● EMR Oozie
○ Opens source workflow
■ Workflow: graph of action
■ Coordinator: scheduler jobs
○ Support: hive, sqoop , spark etc.
● Other: AirFlow, Knime, Luigi,
Azkaban,AWS Data Pipeline
Big Data Generic Architecture | Transformation
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Big Data Generic Architecture | Modeling
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Spark
● In memory
● X10 to X100 times faster
● Good optimizer for distribution
● Rich API
● Spark SQL
● Spark Streaming
● Spark ML (ML lib)
● Spark GraphX (DB graphs)
● SparkR
Spark Streaming
● Near real time (1 sec latency)
● like batch of 1sec windows
● Streaming jobs with API
● Not relevant to us...
Spark ML
● Classification
● Regression
● Collaborative filtering
● Clustering
● Decomposition
● Code: java, scala, python, sparkR
Spark flavours
● Standalone
● With yarn
● With mesos
Spark Downside
● Compute intensive
● Performance gain over mapreduce is not guaranteed.
● Streaming processing is actually batch with very small window.
● Different behaviour between hive and spark SQL
Spark SQL
● Same syntax as hive
● Optional JDBC via thrift
● Non trivial learning curve
● Upto X10 faster than hive.
● Works well with Zeppelin (out of the box)
● Does not replaces Hive
● Spark not always faster than hive
● insert overwrite - not supported on partitions!
Apache Zeppelin
● Notebook - visualizer
● Built in spark integration
● Interactive data analytics
● Easy collaboration.
● Uses SQL
● work s on top of Hive/ SparkSQL
● Inside EMR.
● Uses in the background:
○ Shiro
○ Livy
R + spark R
● Open source package for statistical computing.
● Works with EMR
● “Matlab” equivalent
● Works with spark
● Not for developer :) for statistician
● R is single threaded - use spark R to distribute.
● Not everything works perfect.
Redshift
● OLAP, not OLTP→ analytics , not transaction
● Fully SQL
● Fully ACID
● No indexing
● Fully managed
● Petabyte Scale
● MPP
● Can create slow queue for queries
○ which are long lasting.
● DO NOT USE FOR transformation.
● Good for : DW, Complex Joins.
Redshift spectrum
● Extension of Redshift, use external table on S3.
● Require redshift cluster.
● Not possible for CTAS to s3, complex data structure, joins.
● Good for
○ Read only Queries
○ Aggregations on Exabyte.
EMR vs Redshift
● How much data loaded and unloaded?
● Which operations need to performed?
● Recycling data? → EMR
● History to be analyzed again and again ? → emr
● What the data needs to end up? BI?
● Use spectrum in some use cases. (aggregations)?
● Raw data? s3.
Hive VS. Redshift
● Amount of concurrency ? low → hive, high → redshift
● Access to customers? Redshift?
● Transformation, Unstructured , batch, ETL → hive.
● Peta scale ? redshift
● Complex joins → Redshift
Big Data Generic Architecture | Modeling
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Big Data Generic Architecture | Visualize
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Athena
● Presto SQL
● In memory
● Hive metastore for DDL functionality
○ Complex data types
○ Multiple formats
○ Partitions
● Good for:
○ Read only SQL,
○ Ad hoc query,
○ low cost,
○ managed
Visualize
● QuickSight
● Managed Visualizer, simple, cheap
Big Data Generic Architecture | Summary
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Summary: Lesson learned
● Productivity of Data Science and Data engineering
○ Common language of both teams IS SQL!
○ Spark cluster has many bridges: SparkR, Spark ML, SparkSQL , Spark core.
● Minimize the amount DB’s used
○ Different syntax (presto/hive/redshift)
○ Different data types
○ Minimize ETLS via External Tables+Glue!
● Not always Streaming is justified (what is the business use case? PaaS?)
● Spark SQL
○ Sometimes faster than redshift
○ Sometimes slower than hive
○ Learning curve is non trivial
● Smart Big Data Architecture is all about:
○ Faster, Cheaper, Simpler, More Secured.

Weitere ähnliche Inhalte

Was ist angesagt?

Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Imply
 
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...DataWorks Summit
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and DeploymentCisco Canada
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageKai Sasaki
 
Big data vahidamiri-datastack.ir
Big data vahidamiri-datastack.irBig data vahidamiri-datastack.ir
Big data vahidamiri-datastack.irdatastack
 
The architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWSThe architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWSTreasure Data, Inc.
 
Welcome: MariaDB today and our vision for the future
Welcome: MariaDB today and our vision for the futureWelcome: MariaDB today and our vision for the future
Welcome: MariaDB today and our vision for the futureMariaDB plc
 
In Search of Database Nirvana: Challenges of Delivering HTAP
In Search of Database Nirvana: Challenges of Delivering HTAPIn Search of Database Nirvana: Challenges of Delivering HTAP
In Search of Database Nirvana: Challenges of Delivering HTAPHBaseCon
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Databricks
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural LessonsCassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural LessonsDataStax
 
Interactive query in hadoop
Interactive query in hadoopInteractive query in hadoop
Interactive query in hadoopRommel Garcia
 
Data & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architectureData & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architectureNiels Naglé
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
Meetup Google BigQuery powered by ai
Meetup Google BigQuery powered by aiMeetup Google BigQuery powered by ai
Meetup Google BigQuery powered by aiIdo Volff
 
Real-Time Analytics in Transactional Applications by Brian Bulkowski
Real-Time Analytics in Transactional Applications by Brian BulkowskiReal-Time Analytics in Transactional Applications by Brian Bulkowski
Real-Time Analytics in Transactional Applications by Brian BulkowskiData Con LA
 

Was ist angesagt? (20)

Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)
 
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
Introducing Amazon Aurora
Introducing Amazon AuroraIntroducing Amazon Aurora
Introducing Amazon Aurora
 
Big data in Azure
Big data in AzureBig data in Azure
Big data in Azure
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud Storage
 
Big data vahidamiri-datastack.ir
Big data vahidamiri-datastack.irBig data vahidamiri-datastack.ir
Big data vahidamiri-datastack.ir
 
The architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWSThe architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWS
 
Welcome: MariaDB today and our vision for the future
Welcome: MariaDB today and our vision for the futureWelcome: MariaDB today and our vision for the future
Welcome: MariaDB today and our vision for the future
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
In Search of Database Nirvana: Challenges of Delivering HTAP
In Search of Database Nirvana: Challenges of Delivering HTAPIn Search of Database Nirvana: Challenges of Delivering HTAP
In Search of Database Nirvana: Challenges of Delivering HTAP
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural LessonsCassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
 
Interactive query in hadoop
Interactive query in hadoopInteractive query in hadoop
Interactive query in hadoop
 
Data & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architectureData & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architecture
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Meetup Google BigQuery powered by ai
Meetup Google BigQuery powered by aiMeetup Google BigQuery powered by ai
Meetup Google BigQuery powered by ai
 
Real-Time Analytics in Transactional Applications by Brian Bulkowski
Real-Time Analytics in Transactional Applications by Brian BulkowskiReal-Time Analytics in Transactional Applications by Brian Bulkowski
Real-Time Analytics in Transactional Applications by Brian Bulkowski
 

Ähnlich wie AWS Big Data Demystified #1: Big data architecture lessons learned

Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data Omid Vahdaty
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 
Effectively deploying hadoop to the cloud
Effectively  deploying hadoop to the cloudEffectively  deploying hadoop to the cloud
Effectively deploying hadoop to the cloudAvinash Ramineni
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Ganesh Raju
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterLinaro
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSKimmo Kantojärvi
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the CloudAmihay Zer-Kavod
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkNicola Ferraro
 
Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm ConceptsAndré Dias
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsZhenxiao Luo
 
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Adrianos Dadis
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simpleDori Waldman
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding HadoopAhmed Ossama
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseHakan Ilter
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Spark Summit
 

Ähnlich wie AWS Big Data Demystified #1: Big data architecture lessons learned (20)

Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
Effectively deploying hadoop to the cloud
Effectively  deploying hadoop to the cloudEffectively  deploying hadoop to the cloud
Effectively deploying hadoop to the cloud
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the Cloud
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache Spark
 
Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm Concepts
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simple
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use Case
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 

Mehr von Omid Vahdaty

Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup Omid Vahdaty
 
Couchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedCouchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedOmid Vahdaty
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedOmid Vahdaty
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedOmid Vahdaty
 
The technology of fake news between a new front and a new frontier | Big Dat...
The technology of fake news  between a new front and a new frontier | Big Dat...The technology of fake news  between a new front and a new frontier | Big Dat...
The technology of fake news between a new front and a new frontier | Big Dat...Omid Vahdaty
 
Making your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedMaking your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedOmid Vahdaty
 
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...Omid Vahdaty
 
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...Omid Vahdaty
 
Aerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedOmid Vahdaty
 
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...Omid Vahdaty
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...Omid Vahdaty
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive Omid Vahdaty
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Omid Vahdaty
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystifiedOmid Vahdaty
 
Emr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedOmid Vahdaty
 
Zeppelin and spark sql demystified
Zeppelin and spark sql demystifiedZeppelin and spark sql demystified
Zeppelin and spark sql demystifiedOmid Vahdaty
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis Omid Vahdaty
 
Introduction to aws dynamo db
Introduction to aws dynamo dbIntroduction to aws dynamo db
Introduction to aws dynamo dbOmid Vahdaty
 

Mehr von Omid Vahdaty (20)

Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
 
Couchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedCouchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data Demystified
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data Demystified
 
The technology of fake news between a new front and a new frontier | Big Dat...
The technology of fake news  between a new front and a new frontier | Big Dat...The technology of fake news  between a new front and a new frontier | Big Dat...
The technology of fake news between a new front and a new frontier | Big Dat...
 
Making your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedMaking your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data Demystified
 
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
 
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
 
Aerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data Demystified
 
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
 
Emr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystified
 
Zeppelin and spark sql demystified
Zeppelin and spark sql demystifiedZeppelin and spark sql demystified
Zeppelin and spark sql demystified
 
Aws s3 security
Aws s3 securityAws s3 security
Aws s3 security
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
 
Introduction to aws dynamo db
Introduction to aws dynamo dbIntroduction to aws dynamo db
Introduction to aws dynamo db
 
Hive vs. Impala
Hive vs. ImpalaHive vs. Impala
Hive vs. Impala
 

Kürzlich hochgeladen

BSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxBSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxNiranjanYadav41
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 
Crystal Structure analysis and detailed information pptx
Crystal Structure analysis and detailed information pptxCrystal Structure analysis and detailed information pptx
Crystal Structure analysis and detailed information pptxachiever3003
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadaditya806802
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfRajuKanojiya4
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
 
Risk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectRisk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectErbil Polytechnic University
 
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectDM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectssuserb6619e
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Romil Mishra
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solidnamansinghjarodiya
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxVelmuruganTECE
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdfCaalaaAbdulkerim
 
Crushers to screens in aggregate production
Crushers to screens in aggregate productionCrushers to screens in aggregate production
Crushers to screens in aggregate productionChinnuNinan
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 

Kürzlich hochgeladen (20)

BSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxBSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptx
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
Crystal Structure analysis and detailed information pptx
Crystal Structure analysis and detailed information pptxCrystal Structure analysis and detailed information pptx
Crystal Structure analysis and detailed information pptx
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasad
 
National Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdfNational Level Hackathon Participation Certificate.pdf
National Level Hackathon Participation Certificate.pdf
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
Risk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectRisk Management in Engineering Construction Project
Risk Management in Engineering Construction Project
 
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectDM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solid
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptx
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdf
 
Crushers to screens in aggregate production
Crushers to screens in aggregate productionCrushers to screens in aggregate production
Crushers to screens in aggregate production
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 

AWS Big Data Demystified #1: Big data architecture lessons learned

  • 1. Big Data Architecture Lessons Learned Omid Vahdaty, Big Data Ninja
  • 2. In the past (web,api, ops db, data warehouse) API DW
  • 3. When the data outgrows your ability to process ● Volume (100TB processing per day) ● Velocity (4GB/s) ● Variety (JSON, CSV, events etc) ● Veracity (how much of the data is accurate?)
  • 4. TODAY’S BIG DATA APPLICATION STACK PaaS and DC...
  • 5. MY BIG DATA APPLICATION STACK “demystified”...
  • 6. MY AWS BIG DATA APPLICATION STACK With some “Myst”...
  • 7. Use Case 1: Analyzing browsing history ● Data Collection: browsing history from an ISP ● Product - derives user intent and interest for marketing purposes. ● Challenges ○ Velocity: 1 TB per day ○ History of: 3M ○ Remote DC ○ Enterprise grade security ○ Privacy
  • 8. Use Case 2: Insights from location based data ● Data collection: from a Mobile operator ● Products: ○ derives user intent and interest for marketing purposes. ○ derive location based intent for marketing purposes. ● Challenges ○ Velocity: 4GB/s … ○ Scalability: Rate was expected double every year... ○ Remote DC ○ Enterprise grade security ○ Privacy ○ Edge analytics
  • 9. Use Case 3: Analyzing location based events. ● Data collection: streaming ● Product: building location based audiences ● Challenges: minimizing DevOps work on maintenance of a DIY streaming system
  • 10. So what is the product? ● Big data platform that ○ collects data from multiple sources ○ Analyzes the data ○ Generates insights : ■ Smart Segments (online marketing) ■ Smart reports (for marketer) ■ Audience analysis (for agencies) ● Customers? ○ Marketers ○ Publishers ○ Agencies
  • 11. My Big Data platform is about: ● Data Collection ○ Online ■ messaging ■ Streaming ○ Offline ■ Batch ■ Performance aspects ● Data Transformation (Hive) ○ JSON, CSV, TXT, PARQUET, Binary ● Data Modeling - (R, ML, AI, DEEP, SPARK) ● Data Visualization (choose your poison) ● PII regulation + GPDR regulation ● And: Performance... Cost… Security… Simple... Cloud best practices...
  • 12. Cloud Architecture Tips ● Decouple : ○ Store ○ Process ○ Store ○ Process ○ insight... ● Rule of thumb: max 3 technologies in dc, 7 tech max in cloud ○ Do use more b/c: maintenance ○ Training time ○ complexity/simplicity
  • 13. Big Data Architecture considerations ● unStructured? Structured? Semi structured? ● Latency? ● Throughput? ● Concurrency? ● Security Access patterns? ● Pass? Max 7 technologies ● Iaas? Max 4 technologies
  • 14. Data ingestions Architecture challenges ● Durability ● HA ● Ingestions types: ○ Batch ○ Stream
  • 15. Big Data Generic Architecture Data Collection (file based ETL from remote DC) Data Transformation ( row to colunar + cleansing) Data Modeling ( joins/agg/ML/R) Data Visualization Text, RAW
  • 16. Big Data Generic Architecture | Data Collection Data Collection Data Transformation Data Modeling Data Visualization
  • 17. Batch Data collection considerations ● Every hour , about 30GB compressed CSV file ● Why s3 ○ Multi part upload ○ S3 CLI ○ S3 SDK ○ (tip : gzip! ) ● Why Client - needs to run at remote DC ● Why NOT your own client ○ Involves code → ■ Bugs? ■ maintenance ○ Don't analyze data at Edge , since you cant go back in time. ● Why Not Streaming? ○ less accurate ○ Expensive
  • 18. S3 Considerations ● Security ○ at rest: server side S3-Managed Keys (SSE-S3) ○ at transit: SSL / VPN ○ Hardening: user, IP ACL, write permission only. ● Upload ○ AWS s3 cli ○ Multi part upload ○ Aborting Incomplete Multipart Uploads Using a Bucket Lifecycle Policy ○ Consider S3 CLI Sync command instead of CP
  • 19. Sqoop - ETL ● Open source , part of EMR ● HDFS to RDMS and back. Via JDBC. ● E.g BiDirectional ETL from RDS to HDFS ● Unlikely use case: ETL from customer source DB.
  • 20. Flume & Kafka ● Opens source project for streaming & messaging ● Popular ● Generic ● Good practice for many use cases. (a meetup by it self) ● Highly durable, scalable, extension etc. ● Downside : DIY, Non trivial to get started
  • 21. Data Transfer Options ● VPN ● Direct Connect (4GB/s?) ● For all other use case ○ S3 multipart upload ○ Compression ○ Security ■ Data at motion ■ Data at rest ● bandwidth
  • 22. Quick intro to Stream collection ● Kinesis Client Library (code) ● AWS lambda (code) ● EMR (managed hadoop) ● Third party (DIY) ○ Spark streaming (latency min =1 sec) , near real time, with lot of libraries. ○ Storm - Most real time (sub millisec), java code based. ○ Flink (similar to spark)
  • 23. Kinesis ● Stream - collect@source and near real time processing ○ Near real time ○ High throughput ○ Low cost ○ Easy administration - set desired level of capacity ○ Delivery to : s3,redshift, Dynamo, ... ○ Ingress 1mb, egress 2mbs. Upto 1000 Transaction per second. ○ Not managed! ● Analytics - in flight analytics. ● Firehose - Park you data @ destination.
  • 24. Firehose - for Data parking ● Not for fast lane - no in flight analytics ● Capture , transform and load. ○ Kinesis ○ S3 ○ Redshift ○ elastic search ● Managed Service
  • 25. Comparison of Kinesis product ● Stream ○ Sub 1 sec processing latency ○ Choice of stream processor (generic) ○ For smaller events ● Firehose ○ Zero admin ○ 4 targets built in (redshift, s3, search, etc) ○ Buffering 60 sec minimum. ○ For larger “events”
  • 26. Big Data Generic Architecture | Data Collection Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 27. Big Data Generic Architecture | Transformation Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 28. EMR ecosystem ● Hive ● Pig ● Hue ● Spark ● Oozie ● Presto ● Ganglia ● Zookeeper (hbase) ● zeppelin
  • 29. EMR Architecture ● Master node ● Core nodes - like data nodes (with storage: HDFS) ● Task nodes - (extends compute) ● Does Not have Standby Master node ● Best for transient cluster (goes up and down every night)
  • 30. EMR lesson learned... ● Bigger instance type is good architecture ● Use spot instances - for the tasks. ● Don't always use TEZ (MR? Spark?) ● Make sure your choose instance with network optimized ● Resize cluster is not recommended ● Bootstrap to automate cluster upon provisioning ● Use Steps to automate steps on running cluster ● Use Glue to share Hive MetaStore
  • 31. So used EMR for ... ● Most dominant ○ Hive ○ Spark ○ Presto ● And many more…. ● Good for: ○ Data transformation ○ Data modeling ○ Batch ○ Machine learning
  • 32. Hive ● SQL over hadoop. ● Engine: spark, tez, MR ● JDBC / ODBC ● Not good when need to shuffle. ● Not peta scale. ● SerDe json, parquet,regex,text etc. ● Dynamic partitions ● Insert overwrite ● Data Transformation ● Convert to Columnar
  • 33. Presto ● SQL over hadoop ● Not good always for join on 2 large tables. ● Limited by memory ● Not fault tolerant like hive. ● Optimized for ad hoc queries ● No insert overwrite ● No dynamic partitions.
  • 34. Pig ● Distributed Shell scripting ● Generating SQL like operations. ● Engine: MR, Tez ● S3, DynamoDB access ● Use Case: for data science who don't know SQL, for system people, for those who want to avoid java/scala ● Fair fight compared to hive in term of performance only ● Good for unstructured files ETL : file to file , and use sqoop.
  • 35. Hue ● Hadoop user experience ● Logs in real time and failures. ● Multiple users ● Native access to S3. ● File browser to HDFS. ● Manipulate metascore ● Job Browser ● Query editor ● Hbase browser ● Sqoop editor, oozier editor, Pig Editor
  • 36. Orchestration ● EMR Oozie ○ Opens source workflow ■ Workflow: graph of action ■ Coordinator: scheduler jobs ○ Support: hive, sqoop , spark etc. ● Other: AirFlow, Knime, Luigi, Azkaban,AWS Data Pipeline
  • 37. Big Data Generic Architecture | Transformation Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 38. Big Data Generic Architecture | Modeling Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 39. Spark ● In memory ● X10 to X100 times faster ● Good optimizer for distribution ● Rich API ● Spark SQL ● Spark Streaming ● Spark ML (ML lib) ● Spark GraphX (DB graphs) ● SparkR
  • 40. Spark Streaming ● Near real time (1 sec latency) ● like batch of 1sec windows ● Streaming jobs with API ● Not relevant to us...
  • 41. Spark ML ● Classification ● Regression ● Collaborative filtering ● Clustering ● Decomposition ● Code: java, scala, python, sparkR
  • 42. Spark flavours ● Standalone ● With yarn ● With mesos
  • 43. Spark Downside ● Compute intensive ● Performance gain over mapreduce is not guaranteed. ● Streaming processing is actually batch with very small window. ● Different behaviour between hive and spark SQL
  • 44. Spark SQL ● Same syntax as hive ● Optional JDBC via thrift ● Non trivial learning curve ● Upto X10 faster than hive. ● Works well with Zeppelin (out of the box) ● Does not replaces Hive ● Spark not always faster than hive ● insert overwrite - not supported on partitions!
  • 45. Apache Zeppelin ● Notebook - visualizer ● Built in spark integration ● Interactive data analytics ● Easy collaboration. ● Uses SQL ● work s on top of Hive/ SparkSQL ● Inside EMR. ● Uses in the background: ○ Shiro ○ Livy
  • 46. R + spark R ● Open source package for statistical computing. ● Works with EMR ● “Matlab” equivalent ● Works with spark ● Not for developer :) for statistician ● R is single threaded - use spark R to distribute. ● Not everything works perfect.
  • 47. Redshift ● OLAP, not OLTP→ analytics , not transaction ● Fully SQL ● Fully ACID ● No indexing ● Fully managed ● Petabyte Scale ● MPP ● Can create slow queue for queries ○ which are long lasting. ● DO NOT USE FOR transformation. ● Good for : DW, Complex Joins.
  • 48. Redshift spectrum ● Extension of Redshift, use external table on S3. ● Require redshift cluster. ● Not possible for CTAS to s3, complex data structure, joins. ● Good for ○ Read only Queries ○ Aggregations on Exabyte.
  • 49. EMR vs Redshift ● How much data loaded and unloaded? ● Which operations need to performed? ● Recycling data? → EMR ● History to be analyzed again and again ? → emr ● What the data needs to end up? BI? ● Use spectrum in some use cases. (aggregations)? ● Raw data? s3.
  • 50. Hive VS. Redshift ● Amount of concurrency ? low → hive, high → redshift ● Access to customers? Redshift? ● Transformation, Unstructured , batch, ETL → hive. ● Peta scale ? redshift ● Complex joins → Redshift
  • 51. Big Data Generic Architecture | Modeling Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 52. Big Data Generic Architecture | Visualize Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 53. Athena ● Presto SQL ● In memory ● Hive metastore for DDL functionality ○ Complex data types ○ Multiple formats ○ Partitions ● Good for: ○ Read only SQL, ○ Ad hoc query, ○ low cost, ○ managed
  • 54. Visualize ● QuickSight ● Managed Visualizer, simple, cheap
  • 55. Big Data Generic Architecture | Summary Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 56. Summary: Lesson learned ● Productivity of Data Science and Data engineering ○ Common language of both teams IS SQL! ○ Spark cluster has many bridges: SparkR, Spark ML, SparkSQL , Spark core. ● Minimize the amount DB’s used ○ Different syntax (presto/hive/redshift) ○ Different data types ○ Minimize ETLS via External Tables+Glue! ● Not always Streaming is justified (what is the business use case? PaaS?) ● Spark SQL ○ Sometimes faster than redshift ○ Sometimes slower than hive ○ Learning curve is non trivial ● Smart Big Data Architecture is all about: ○ Faster, Cheaper, Simpler, More Secured.