9. GLACIER
CONCEPTS
▸ Keep all your data at a much lower cost
▸ Move data automatically from S3 to Glacier with lifecycle rules
▸ Meets compliance requirements for data retention
▸ Vault Lock (enforced with lockable vault lock policies)
16. AWS KINESIS STREAMS
CONCEPTS
▸ Streams - ordered sequence of data records
▸ Data record - Sequence Number, Partition Key, Data Blob
▸ 1MB max per data record
▸ Retention period - 24 hours, extendable to 7 days
▸ Producers, Consumers
▸ Shards
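As a sketch of how the concepts above fit together: Kinesis decides which shard a data record lands on by MD5-hashing its partition key into a 128-bit key space that is split across the shards. The function below is an illustrative approximation, not the service's actual code (real shards carry explicit StartingHashKey/EndingHashKey ranges):

```python
import hashlib

def shard_for_key(partition_key, num_shards):
    """Approximate the Kinesis routing rule: MD5-hash the partition key
    into the 128-bit key space, which is split evenly across shards."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    shard_size = 2 ** 128 // num_shards
    # min() guards the edge case where num_shards does not divide 2**128
    return min(hash_key // shard_size, num_shards - 1)

# Records with the same partition key always land on the same shard,
# which is what preserves their ordering within that key.
assert shard_for_key("ticker-AMZN", 4) == shard_for_key("ticker-AMZN", 4)
```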
26. AWS KINESIS ANALYTICS
STREAMING SQL
▸ Tumbling Window
[...] GROUP BY
FLOOR(("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP
'1970-01-01 00:00:00') SECOND / 10 TO SECOND)
▸ Sliding Window
SELECT AVG(change) OVER W1 as avg_change
FROM "SOURCE_SQL_STREAM_001"
WINDOW W1 AS (PARTITION BY ticker_symbol RANGE INTERVAL
'10' SECOND PRECEDING)
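The FLOOR(... SECOND / 10 TO SECOND) expression in the tumbling-window query is just integer bucketing of the row time into 10-second windows; a minimal Python equivalent of that GROUP BY key:

```python
def tumbling_bucket(epoch_seconds, window_seconds=10):
    """Equivalent of the FLOOR(... SECOND / 10 TO SECOND) GROUP BY key:
    every timestamp inside the same window maps to the same bucket."""
    return (int(epoch_seconds) // window_seconds) * window_seconds

# Timestamps 103s and 109s fall into the same 10-second window...
assert tumbling_bucket(103) == tumbling_bucket(109) == 100
# ...while 110s starts a new window, hence a new group.
assert tumbling_bucket(110) == 110
```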
30. SQS
CONCEPTS
▸ Simple Queue Service
▸ Send, store, and retrieve messages between applications
▸ Acts as a buffer
▸ At least once delivery
▸ FIFO is also supported
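Because delivery is at-least-once, a standard-queue consumer should be idempotent. A minimal sketch of consumer-side de-duplication (the message tuples and the in-memory seen set are illustrative; a real consumer would persist the seen IDs durably):

```python
def process_once(messages, handler, seen=None):
    """Skip duplicate deliveries by remembering handled message IDs.
    'messages' is a list of (message_id, body) tuples - an illustrative
    stand-in for what an SQS receive call would return."""
    seen = set() if seen is None else seen
    results = []
    for msg_id, body in messages:
        if msg_id in seen:
            continue  # duplicate delivery: already handled, skip it
        seen.add(msg_id)
        results.append(handler(body))
    return results

# The duplicate delivery of message "1" is processed only once.
batch = [("1", "hello"), ("2", "world"), ("1", "hello")]
assert process_once(batch, str.upper) == ["HELLO", "WORLD"]
```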
31. SQS
CONCEPTS
▸ Queues are created in regions
▸ Up to 14 days retention (4 days by default)
▸ No message priority support - emulate it with 2 queues
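The usual workaround for the missing priorities is the two-queue pattern: poll the high-priority queue first and fall back to the low-priority one only when it is empty. A minimal in-memory sketch (with SQS, each list would be a separate queue URL polled via receive_message):

```python
def next_message(high_queue, low_queue):
    """Emulate priorities with two queues: always drain the
    high-priority queue before touching the low-priority one."""
    if high_queue:
        return high_queue.pop(0)
    if low_queue:
        return low_queue.pop(0)
    return None  # both queues empty

high, low = ["urgent-1"], ["bulk-1", "bulk-2"]
assert next_message(high, low) == "urgent-1"  # high queue wins
assert next_message(high, low) == "bulk-1"    # then fall back to low
```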
35. IOT
CONCEPTS
▸ Managed Cloud Platform for the Internet of Things
▸ Billions of devices, trillions of messages
▸ Messages can be routed to other AWS services and other
devices
36. IOT
IOT AND BIG DATA
▸ IoT devices produce data
▸ Analyze streaming data real time
▸ Process and store data
▸ Don't worry about capacity, scaling, infrastructure
41. DATA PIPELINE
TERMINOLOGY
▸ A web service to process and move data between AWS
compute and storage services or on-premises data sources
▸ ETL Workflow
▸ Runs on EC2 instances or an EMR cluster that is
provisioned automatically
46. DYNAMO DB
CONCEPTS
▸ Fully managed NoSQL database
▸ No visible "servers"
▸ Single-digit millisecond latency
▸ Document & Key-Value models
▸ No storage limitations
▸ Runs on SSD
47. DYNAMO DB
CONCEPTS
▸ Collection of Tables
▸ Performance is set on the tables
▸ Write Capacity Units - number of 1KB blocks written per second
▸ Read Capacity Units - number of 4KB blocks read per second
▸ Eventually consistent by default (data replicated across 3 facilities in a region)
▸ Strongly consistent reads supported
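A small sketch of the capacity-unit arithmetic above, assuming one WCU covers a 1KB write per second and one RCU covers a 4KB strongly consistent read per second (eventually consistent reads cost half):

```python
import math

def write_capacity_units(item_size_bytes, writes_per_second):
    """One WCU = one write/s of an item up to 1KB; larger items
    consume one WCU per started 1KB block."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second

def read_capacity_units(item_size_bytes, reads_per_second,
                        strongly_consistent=True):
    """One RCU = one strongly consistent read/s of an item up to 4KB;
    eventually consistent reads cost half as much."""
    rcu = math.ceil(item_size_bytes / 4096) * reads_per_second
    return rcu if strongly_consistent else math.ceil(rcu / 2)

# A 3KB item written 10 times per second: ceil(3KB / 1KB) * 10 = 30 WCU.
assert write_capacity_units(3 * 1024, 10) == 30
# Read 10 times per second: ceil(3KB / 4KB) * 10 = 10 RCU,
# or 5 RCU with eventually consistent reads.
assert read_capacity_units(3 * 1024, 10) == 10
assert read_capacity_units(3 * 1024, 10, strongly_consistent=False) == 5
```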
49. DYNAMO DB
DATA TYPES
▸ String
▸ Number
▸ Binary
▸ Bool
▸ Null
▸ Document (List/Map)
▸ Set
50. DYNAMO DB
DYNAMO DB IN THE AWS ECOSYSTEM
▸ On EMR Dynamo DB is integrated with Hive
▸ Copy data to/from S3 with Data Pipeline
▸ Lambda can be used as triggers
▸ Move data into Redshift with the COPY command
56. EMR
CORE NODE
▸ Like a slave node
▸ Runs tasks
▸ HDFS
▸ Data node, Node Manager, Application Master
57. EMR
TASK NODE
▸ Like a slave node
▸ No HDFS
▸ Can be added/removed from a running cluster
▸ Extra compute capacity
58. EMR
STORAGE OPTIONS
▸ Instance Store
▸ ephemeral - data is deleted when the instance terminates or is lost
▸ use when High I/O performance is necessary
▸ EBS for HDFS
▸ EMRFS
59. EMR
EMRFS
▸ Implementation of the HDFS interface
▸ Wrapper over S3
▸ Resize or terminate clusters without losing data
▸ Multiple clusters can point to the same data in S3
▸ EMRFS & HDFS
▸ S3DistCp
60. EMR
EMRFS - CONSISTENT VIEWS
▸ S3
▸ Read-after-write consistency for new objects
▸ Eventually consistent for overwrites & deletes
▸ Solve it with the EMRFS consistent view option
82. AWS BIG DATA
READING MATERIALS
▸ Kinesis
▸ Kinesis Firehose Transformation with Lambda
▸ Implementing producers with Amazon Kinesis Producer Library
▸ EMR
▸ Best practices for Amazon EMR (2013)
▸ Lambda
▸ Big Data processing with serverless MapReduce
▸ Redshift
▸ Redshift Table Design
▸ Optimizing for Star Schemas and Interleaved Sorting
83. AWS BIG DATA
HOMEWORK - OPTIONAL
1. Create a transient EMR cluster that automatically starts
your application (Hive, Pig, Spark, etc...)
2. Create a Redshift cluster, experiment with the COPY
command (bulk load rows) vs INSERT statement
‣ at least 100 rows
Note: Make sure you do not forget to delete the created
resources ...