The AWS Workshop Series Online is a series of live webinars designed for IT professionals who want to leverage the AWS Cloud to build and transform their business, are new to the AWS Cloud, or are looking to further expand their skills and expertise. In this series, we will cover 'Modern Data Architectures for Real-time Analytics and Engagement'.
5. Modern Data Architecture
[Diagram: sources feed an ingest layer that splits into a speed (real-time) path and a scale (batch) path, serving data analysts, data scientists, business users, engagement platforms, and automation/events.]
6. Real-time Pipeline
[The same architecture diagram — sources, ingest, speed (real-time) and scale (batch) paths, serving layer — with the real-time pipeline highlighted.]
[Diagram: machines, devices, mobile apps, and clickstreams stream data into Amazon Kinesis.]
9. Amazon Kinesis Stream
A stream scales linearly with shards. Each shard supports 1,000 records/sec or 1 MB/sec of writes, and 5 reads/sec or 2 MB/sec of reads:
1 SHARD: 1,000 TPS or 1 MB in / 5 TPS or 2 MB out
2 SHARDS: 2,000 TPS or 2 MB in / 10 TPS or 4 MB out
3 SHARDS: 3,000 TPS or 3 MB in / 15 TPS or 6 MB out
Retention: 24 hours to 7 days
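Because capacity scales linearly per shard, sizing a stream is simple arithmetic. A minimal sketch, using the per-shard limits from the slide (function and parameter names are illustrative, not an AWS API):

```python
import math

# Per-shard Kinesis Streams limits from the slide:
# ingest: 1,000 records/sec or 1 MB/sec; egress: 5 reads/sec or 2 MB/sec.
WRITE_RECORDS_PER_SHARD = 1000
WRITE_MB_PER_SHARD = 1
READ_MB_PER_SHARD = 2

def shards_needed(records_per_sec, write_mb_per_sec, read_mb_per_sec):
    """Return the minimum shard count that satisfies all three limits."""
    return max(
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD),
        math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD),
        math.ceil(read_mb_per_sec / READ_MB_PER_SHARD),
    )

# e.g. 2,500 records/sec averaging ~1 KB each, read once by one consumer:
print(shards_needed(2500, 2.5, 2.5))  # → 3
```

Note that whichever dimension is the bottleneck (record count, write bandwidth, or read bandwidth) determines the shard count.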
12. Kinesis Producer Library
• Writes to one or more Amazon Kinesis Streams
• Retry mechanism
• Uses PutRecords
• Aggregates records
• Integrates with the Amazon KCL to de-aggregate
• Submits Amazon CloudWatch metrics
13. Kinesis Agent
• Monitors files and sends new data records to your delivery stream
• Handles file rotation, checkpointing, and retry upon failures
• Delivers all data in a reliable, timely, and simple manner
• Emits Amazon CloudWatch metrics
43. New X1 Instance - Tons of Memory
• Large-scale, in-memory applications
• Intel® Xeon® E7 8880 v3 Haswell processors
• Up to 2TB of memory
• Up to 128 vCPUs per instance
44. Intel® Processor Technologies
Intel® AVX – Dramatically increases performance for highly parallel HPC workloads such as life science engineering, data mining, financial analysis, and media processing
Intel® AES-NI – Enhances security with new encryption instructions that reduce the performance penalty associated with encrypting/decrypting data
Intel® Turbo Boost Technology – Increases computing power with performance that adapts to spikes in workloads
Intel® Transactional Synchronization Extensions (TSX) – Enables execution of independent transactions to accelerate throughput
P-state & C-state control – Provides granular tuning of core performance and sleep states to improve overall application performance
The second option for writing to Kinesis Streams is the Kinesis Producer Library (KPL).
It has a number of advantages over writing directly to the endpoint. The KPL:
Writes to one or more Amazon Kinesis Streams with an automatic, configurable retry mechanism.
Uses PutRecords to write multiple records to multiple shards per request.
Aggregates user records to increase payload size and improve throughput.
Integrates seamlessly with the Amazon KCL to de-aggregate batched records.
Submits Amazon CloudWatch metrics on your behalf to provide visibility into producer performance.
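The aggregation point above is the key throughput win: many small user records are packed into one larger Kinesis record, and the KCL unpacks them on the consumer side. A minimal sketch of the idea — this is not the real KPL wire format, just an illustration of batching by payload size:

```python
# Illustrative sketch (not the actual KPL protocol): pack many small user
# records into larger batches to raise payload size per PutRecords call.

def aggregate(user_records, max_bytes=25_000):
    """Pack user records into batches no larger than max_bytes each."""
    batches, current, size = [], [], 0
    for rec in user_records:
        if current and size + len(rec) > max_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        batches.append(current)
    return batches

def deaggregate(batches):
    """Recover the original user records (the KCL does this for consumers)."""
    return [rec for batch in batches for rec in batch]

records = [b"x" * 1000 for _ in range(100)]   # 100 x 1 KB user records
batches = aggregate(records)
assert deaggregate(batches) == records         # lossless round trip
print(len(records), "user records ->", len(batches), "Kinesis records")
# → 100 user records -> 4 Kinesis records
```

The round trip is lossless, which is why KPL aggregation is transparent to a KCL consumer.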
The second option for getting data into Firehose is to use the Kinesis Agent, which lets you get log files into Firehose easily by monitoring those files.
The Kinesis Client Library (KCL) simplifies parallel processing of streaming big data by allowing customers to write simple applications for processing records.
The KCL is a Java library that customers compile into their data-processing application; an application built with it is called a Kinesis worker.
The library notifies the customer's processing code when there is new data to process.
Kinesis workers often process data and write output to an external data store such as DynamoDB, EMR, Redshift, or S3.
The KCL ensures that every record within a Kinesis stream is processed at least once.
The KCL also takes care of ensuring that there is always a worker process for each shard.
You can put the EC2 instances into an Auto Scaling group to deal with increased load.
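The contract between the KCL and your code is small: the library hands your processor batches of new records, and you checkpoint after processing to get at-least-once semantics. A minimal sketch of that contract (the class and method names here are hypothetical; the real KCL is the Java library described above):

```python
# Hypothetical sketch of the KCL record-processor contract.

class TopicCountProcessor:
    def __init__(self):
        self.counts = {}
        self.checkpoint = None    # last sequence number safely processed

    def process_records(self, records):
        """Called with (sequence_number, topic) pairs when new data arrives."""
        for seq, topic in records:
            self.counts[topic] = self.counts.get(topic, 0) + 1
            self.checkpoint = seq  # checkpointing after processing => at-least-once
        # a real worker would now write counts to DynamoDB, Redshift, S3, etc.

proc = TopicCountProcessor()
proc.process_records([(1, "aws"), (2, "kinesis"), (3, "aws")])
print(proc.counts, proc.checkpoint)  # → {'aws': 2, 'kinesis': 1} 3
```

If the worker crashes before checkpointing, the records since the last checkpoint are redelivered — hence "at least once," not "exactly once."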
The desired functionality: a twitter-trends.com website that displays the top-10 current Twitter trends.
On AWS you might start out by simply creating an Elastic Beanstalk website.
Unfortunately the Twitter firehose feed is quickly becoming too big to process on a single machine.
We will need to somehow divide up the work so that it can be done in parallel on multiple machines and then coalesced into an output that can be vended by our Beanstalk web server.
What we'd like to do is compute a local "top-10" value for all the tweets coming into each orange processing box and then combine all the local top-10 lists into a global top-10 list at the web server.
The problem is that, without some sort of sorting/grouping function, each processing box will receive a random selection of tweets, making it impossible for any one box to know how many tweets are occurring for any given topic.
We can solve this problem by taking a page from the map/reduce world. If we were computing the top-10 tweet topics for a stand-alone batch of tweets then what we would do is treat the topic of each tweet as a partition key. Each map task would take in a fraction of all the tweets, group by topic, and send the count for each topic to the appropriate reduce task responsible for that topic.
We can do something equivalent by introducing an intermediate stage that takes in the stream of tweets and orders them by topic. Each processing box would then pull only the tweets for the set of topics that it is responsible for. This would enable each box to be the authoritative counter of tweets for some subset of all currently active tweet topics. This is essentially the streaming equivalent of the map/reduce processing paradigm.
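The scheme above can be sketched in a few lines: hash the topic (the partition key) to route every tweet for a topic to the same worker, count locally, then merge at the web server. A simplified sketch — a real system would ship only each worker's local top-k rather than its full counter, and `worker_for` and `global_top10` are illustrative names:

```python
from collections import Counter

def worker_for(topic, n_workers):
    """Partition by topic: the same topic always lands on the same worker."""
    return hash(topic) % n_workers

def global_top10(local_counters):
    """Merge per-worker counts into a global top-10 at the web server."""
    total = Counter()
    for c in local_counters:
        total.update(c)  # safe: each topic is counted on exactly one worker
    return total.most_common(10)

# simulate a small batch of tweets spread across 4 processing boxes
tweets = ["aws"] * 5 + ["kinesis"] * 3 + ["s3"] * 2
locals_ = [Counter() for _ in range(4)]
for t in tweets:
    locals_[worker_for(t, 4)][t] += 1

print(global_top10(locals_))  # → [('aws', 5), ('kinesis', 3), ('s3', 2)]
```

Because partitioning makes each worker the authoritative counter for its topics, each local top-10 is exact, and merging them gives the correct global list.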
You need to do your own shard management manually and scale out to multiple EC2 instances yourself.
The Kinesis Client Library automatically load-balances the number of shards acquired by an EC2 instance across all the participating EC2 instances.
If we create an Auto Scaling group (ASG) for our instances, we can manage the number of EC2 instances required automatically, by defining a CloudWatch metric that the ASG uses to determine when to add or remove EC2 instances.
Suppose our initial estimate of how many EC2 instances we needed was low, or suppose that we get an increase in traffic.
The CloudWatch metric will breach the threshold at which the ASG spins up another EC2 instance.
When the client library code on each EC2 instance next queries the shard-management table in DynamoDB, each instance will notice that there is an underloaded EC2 instance, and some of the instances will relinquish their lock on one of the shards they own.
The result will be that the new EC2 instance will then be able to acquire the lock on one or more shards and thereby take over some of the processing of the stream.
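The rebalancing described above can be modeled as moving leases from the busiest worker to the idlest until the counts differ by at most one. A toy model of the KCL's lease table (the real one lives in DynamoDB; `rebalance` is an illustrative name, not a KCL API):

```python
# Toy model of KCL lease rebalancing: shards are leases in a shared table;
# overloaded workers relinquish leases so an underloaded (e.g. newly
# launched) worker can acquire them.

def rebalance(held):
    """held: dict of worker -> list of shard leases; balanced in place."""
    while True:
        hi = max(held, key=lambda w: len(held[w]))   # busiest worker
        lo = min(held, key=lambda w: len(held[w]))   # idlest worker
        if len(held[hi]) - len(held[lo]) <= 1:
            return held                               # already balanced
        held[lo].append(held[hi].pop())               # hi relinquishes, lo acquires

# two instances own 4 shards; a third instance joins after a scale-out
held = {"i-a": ["sh1", "sh2"], "i-b": ["sh3", "sh4"], "i-c": []}
rebalance(held)
print(held)
```

After rebalancing, every shard still has exactly one owner and no instance holds more than one lease above any other — the new instance has taken over part of the stream.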
The grey box at the top right represents EC2 instances in your account. Inside it, in green, is the Kinesis worker process: your specific application and business logic. That logic leverages our libraries, which connect back to Kinesis (in pink on the left). There is also another VIP on the GET side.
Amazon Elasticsearch Service is a managed service for Elasticsearch.
Use the AWS Management Console or simple API calls to access a production-ready Amazon Elasticsearch cluster in minutes, without worrying about infrastructure provisioning or installing and maintaining Elasticsearch software.
Amazon Elasticsearch Service simplifies time-consuming management tasks such as ensuring high availability, patch management, failure detection and node replacement, backups, and monitoring.
With Firehose, you do not provision shards. Instead, Firehose will automatically scale up to meet demand.
Firehose will also aggregate, compress, and encrypt the messages before writing to S3, Redshift, or Elasticsearch.
Kinesis Analytics allows you to query a Kinesis Stream or Firehose using SQL.
The output of this can then be sent to another Stream or Firehose.