This presentation deck will cover specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It will also cover architectural designs for the optimal use of these services based on the dimensions of your data (structured or unstructured, volume, item size, and transfer rate) and application considerations such as latency, cost, and durability. It will also share customer success stories and resources to help you get started.
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
1. Managing Big Data in the AWS Cloud
Siva Raghupathy
Principal Solutions Architect
Amazon Web Services
2. Agenda
• Big data challenges
• AWS big data portfolio
• Architectural considerations
• Customer success stories
• Resources to help you get started
• Q&A
3. Data Volume, Velocity, & Variety
• 4.4 zettabytes (ZB) of data exists
in the digital universe today
– 1 ZB = 1 billion terabytes
• 450 billion transactions per day by 2020
• More unstructured data than structured data
[Chart: data volume growing from GB through TB, PB, and EB to ZB, 1990 to 2020]
4. Big Data
• Hourly server logs: how your systems were
misbehaving an hour ago
• Weekly / Monthly Bill: What you spent this
past billing cycle?
• Daily customer-preferences report from your
web-site’s click stream: tells you what deal
or ad to try next time
• Daily fraud reports: tells you if there was
fraud yesterday
Real-time Big Data
• Real-time metrics: what just went wrong
now
• Real-time spending alerts/caps:
guaranteeing you can’t overspend
• Real-time analysis: tells you what to offer
the current customer now
• Real-time detection: blocks fraudulent use
now
Big Data: Best Served Fresh
5. Data Analysis Gap
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
[Chart: generated data vs. data available for analysis; the gap keeps widening, 1990 to 2020]
6. Big Data
Potentially massive datasets
Iterative, experimental style of data
manipulation and analysis
Frequently not a steady-state workload;
peaks and valleys
Time to results is key
Hard to configure/manage
AWS Cloud
Massive, virtually unlimited capacity
Iterative, experimental style of
infrastructure deployment/usage
At its most efficient with highly
variable workloads
Parallel compute clusters from a single
data source
Managed services
7. AWS Big Data Portfolio
Collect / Ingest → Store → Process / Analyze → Visualize / Report
Services across the portfolio: Kinesis, Amazon SQS, Import/Export, Direct Connect, S3, DynamoDB, Glacier, RDS, Redshift, EMR, EC2, Data Pipeline
9. Why Data Ingest Tools?
• Data ingest tools convert random streams of data into a smaller set of sequential streams
– Sequential streams are easier to process
– Easier to scale
– Easier to persist
[Diagram: many producers feed Kafka or Kinesis, which fans out to a few processing streams]
10. Data Ingest Tools
• Facebook Scribe – data collector
• Apache Kafka – data collector
• Apache Flume – data movement and transformation
• Amazon Kinesis – data collector
11. Real-time processing of streaming data
High throughput
Elastic
Easy to use
Connectors for EMR, S3, Redshift, DynamoDB
Amazon
Kinesis
12. Amazon Kinesis Architecture
• Millions of sources producing 100s of terabytes per hour
• Front end handles authentication and authorization
• Durable, highly consistent storage replicates data across three data centers (Availability Zones)
• Ordered stream of events supports multiple readers
• Consumers: real-time dashboards and alarms; machine learning algorithms or sliding-window analytics; aggregate analysis in Hadoop or a data warehouse; aggregate and archive to S3
• Inexpensive: $0.028 per million puts
13. Kinesis Stream:
Managed ability to capture and store data
• Streams are made of Shards
• Each Shard ingests data up to
1MB/sec, and up to 1000 TPS
• Each Shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by adding or
removing Shards
• Replay data inside of 24Hr. Window
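As an illustration (not prescribed by the deck), a minimal sketch using the Python SDK (boto3) of scaling a stream by splitting a hot shard; the stream name and region are hypothetical:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Inspect the stream's current shards (stream name is hypothetical).
shards = kinesis.describe_stream(StreamName="clickstream")["StreamDescription"]["Shards"]

# Split one shard in half to double its write capacity
# (each shard ingests up to 1 MB/sec and 1000 records/sec).
hot = shards[0]
start = int(hot["HashKeyRange"]["StartingHashKey"])
end = int(hot["HashKeyRange"]["EndingHashKey"])
kinesis.split_shard(
    StreamName="clickstream",
    ShardToSplit=hot["ShardId"],
    NewStartingHashKey=str((start + end) // 2),  # midpoint of the shard's hash key range
)
```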
14. Simple Put interface to store data in Kinesis
• Producers use a PUT call to store data in a Stream
• PutRecord {Data, PartitionKey, StreamName}
• A Partition Key is supplied by the producer and used to
distribute the PUTs across Shards
• Kinesis MD5-hashes the supplied partition key to map each
record into a Shard's hash key range
• A unique Sequence # is returned to the Producer upon a
successful PUT call
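A minimal sketch of the PutRecord call via the Python SDK (boto3); the stream name and payload below are hypothetical:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# PutRecord {Data, PartitionKey, StreamName}: the partition key is MD5-hashed
# to pick a shard, and a sequence number comes back on success.
resp = kinesis.put_record(
    StreamName="clickstream",          # hypothetical stream name
    Data=json.dumps({"user": "abc", "page": "/deals"}).encode("utf-8"),
    PartitionKey="abc",                # e.g. a user or device id
)
print(resp["ShardId"], resp["SequenceNumber"])
```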
15. Building Kinesis Processing Apps: Kinesis Client Library
Client library for fault-tolerant, at-least-once, continuous processing
o Java client library, source available on GitHub
o Build & deploy your app with the KCL on your EC2 instance(s)
o The KCL is the intermediary between your application & the stream:
Automatically starts a Kinesis Worker for each shard
Simplifies reading by abstracting individual shards
Increases / decreases Workers as the # of shards changes
Checkpoints to keep track of a Worker's location in the
stream; restarts Workers if they fail
o Integrates with Auto Scaling groups to redistribute workers
to new instances
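The KCL itself is a Java library; purely as an illustration of what it automates for each shard, here is a hedged boto3 sketch of the low-level consumer loop (shard iterator plus GetRecords), without the checkpointing and worker rebalancing that the KCL adds:

```python
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Per shard, a worker gets an iterator and then polls GetRecords,
# checkpointing its position as it goes (checkpointing omitted here).
shard_id = kinesis.describe_stream(StreamName="clickstream")["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="clickstream", ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

while iterator:
    out = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in out["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = out.get("NextShardIterator")
    time.sleep(1)  # stay under the per-shard read limits
```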
20. Store anything
Object storage
Scalable
Designed for 99.999999999% durability
Amazon
S3
21. Why is Amazon S3 good for Big Data?
• No limit on the number of Objects
• Object size up to 5TB
• Central data storage for all systems
• High bandwidth
• 99.999999999% durability
• Versioning, Lifecycle Policies
• Glacier Integration
22. Amazon S3 Best Practices
• Use random hash prefix for keys
• Ensure a random access pattern
• Use Amazon CloudFront for high throughput GETs and PUTs
• Leverage the high durability, high throughput design of Amazon S3 for
backup and as a common storage sink
• Durable sink between data services
• Supports de-coupling and asynchronous delivery
• Consider RRS for lower cost, lower durability storage of derivatives or copies
• Consider parallel threads and multipart upload for faster writes
• Consider parallel threads and range get for faster reads
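For example, a hedged boto3 sketch of a parallel multipart upload and a range GET; the bucket, key, and thresholds are illustrative, not prescriptive:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Multipart, parallel upload: parts are sent on several threads at once.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,                    # parallel threads
)
s3.upload_file("backup-2014-09-24.tar.gz", "my-data-lake",
               "logs/backup-2014-09-24.tar.gz", Config=config)

# Range GET for parallel reads: fetch just the first 8 MB of an object.
part = s3.get_object(Bucket="my-data-lake",
                     Key="logs/backup-2014-09-24.tar.gz",
                     Range="bytes=0-8388607")
```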
23. Aggregate All Data in S3, Surrounded by a Collection of the Right Tools
EMR, Kinesis, Data Pipeline, Redshift, DynamoDB, RDS, Cassandra, Storm, and Spark Streaming all read from and write to Amazon S3
24. Fully-managed NoSQL database service
Built on solid-state drives (SSDs)
Consistent low latency performance
Any throughput rate
No storage limits
Amazon
DynamoDB
26. DynamoDB: Access and Query Model
• Two primary key options
• Hash key: Key lookups: “Give me the status for user abc”
• Composite key (Hash with Range): “Give me all the status updates for user ‘abc’
that occurred within the past 24 hours”
• Support for multiple data types
– String, number, binary… or sets of strings, numbers, or binaries
• Supports both strong and eventual consistency
– Choose your consistency level when you make the API call
– Different parts of your app can make different choices
• Global Secondary Indexes
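A small illustrative sketch (boto3, against a hypothetical status_updates table keyed on user_id/timestamp) of the composite-key query and per-call consistency choice described above:

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("status_updates")  # hypothetical table: hash key user_id, range key timestamp

# Composite-key query: all status updates for user 'abc' in the past 24 hours.
day_ago = int(time.time()) - 86400
resp = table.query(
    KeyConditionExpression=Key("user_id").eq("abc") & Key("timestamp").gt(day_ago),
    ConsistentRead=True,  # choose strong consistency per call; the default is eventual
)
for item in resp["Items"]:
    print(item)
```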
28. What does DynamoDB handle for me?
• Scaling without down-time
• Automatic sharding
• Security inspections, patches, upgrades
• Automatic hardware failover
• Multi-AZ replication
• Hardware configuration designed specifically for DynamoDB
• Performance tuning
…and a lot more
29. Amazon DynamoDB Best Practices
• Keep item size small
• Store metadata in Amazon DynamoDB and blobs in Amazon S3
• Use a table with a hash key for extremely high scale
• Use hash-range key to model
– 1:N relationships
– Multi-tenancy
• Avoid hot keys and hot partitions
• Use table per day, week, month etc. for storing time series data
• Use conditional updates
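One possible sketch of a conditional update against a per-month time-series table; the table name, keys, and the version attribute are hypothetical:

```python
import boto3
from botocore.exceptions import ClientError

# Table-per-month naming for time-series data (hypothetical).
table = boto3.resource("dynamodb", region_name="us-east-1").Table("status_updates_2014_09")

# Conditional update: only apply the change if the stored version matches,
# so concurrent writers cannot silently overwrite each other.
try:
    table.update_item(
        Key={"user_id": "abc", "timestamp": 1411574400},
        UpdateExpression="SET #s = :new, version = :v2",
        ConditionExpression="version = :v1",
        ExpressionAttributeNames={"#s": "status"},  # "status" is a reserved word
        ExpressionAttributeValues={":new": "shipped", ":v1": 3, ":v2": 4},
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("Lost the race; re-read and retry")
```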
32. Processing Frameworks
• Batch Processing
– Take a large amount (>100 TB) of cold data and ask questions
– Takes hours to get answers back
• Stream Processing (real-time)
– Take a small amount of hot data and ask questions
– Takes a short amount of time to get your answer back
34. Columnar data warehouse
ANSI SQL compatible
Massively parallel
Petabyte scale
Fully-managed
Very cost-effective
Amazon
Redshift
35. Amazon Redshift architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via
Amazon S3
– Parallel load from Amazon DynamoDB
• Hardware optimized for data processing
• Two hardware platforms
– DW1: HDD; scale from 2TB to 1.6PB
– DW2: SSD; scale from 160GB to 256TB
[Diagram: clients connect via JDBC/ODBC to the leader node; compute nodes communicate over 10 GigE (HPC); ingestion, backup, and restore go through Amazon S3]
36. Amazon Redshift Best Practices
• Use COPY command to load large data sets from Amazon S3, Amazon
DynamoDB, Amazon EMR/EC2/Unix/Linux hosts
– Split your data into multiple files
– Use GZIP or LZOP compression
– Use manifest file
• Choose proper sort key
– Range or equality on WHERE clause
• Choose proper distribution key
– Join column, foreign key or largest dimension, group by column
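A hedged example of driving a COPY load from Python via psycopg2; the cluster endpoint, table, manifest path, and credentials are placeholders:

```python
import psycopg2

# Connect to the cluster's leader node (endpoint, database, and credentials are placeholders).
conn = psycopg2.connect(
    host="examplecluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="masteruser", password="<password>",
)
with conn, conn.cursor() as cur:
    # COPY from S3 using a manifest of gzipped file parts, loaded in
    # parallel across the compute nodes.
    cur.execute("""
        COPY clickstream_events
        FROM 's3://my-data-lake/clickstream/2014-09-24.manifest'
        CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
        MANIFEST GZIP;
    """)
```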
37. Hadoop/HDFS clusters
Hive, Pig, Impala, HBase
Easy to use; fully managed
On-demand and spot pricing
Tight integration with S3,
DynamoDB, and Kinesis
Amazon
Elastic
MapReduce
38. How Does EMR Work?
1. Put the data into S3
2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
39. How Does EMR Work?
You can easily resize the cluster, and launch parallel clusters using the same data in S3
40. How Does EMR Work?
Use Spot nodes to save time and money
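Putting slides 38–40 together, a minimal boto3 run_job_flow sketch for a transient cluster with Spot task nodes; the instance types, counts, release label, and S3 paths are illustrative only:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a cluster with Hive/Pig installed, reading from and writing to S3;
# Spot task nodes add cheap capacity, and logs land in S3 too.
resp = emr.run_job_flow(
    Name="nightly-clickstream",
    ReleaseLabel="emr-4.7.0",                      # illustrative release label
    Applications=[{"Name": "Hive"}, {"Name": "Pig"}],
    LogUri="s3://my-data-lake/emr-logs/",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m3.xlarge", "InstanceCount": 4},
            # Spot task nodes: extra horsepower at a discount.
            {"InstanceRole": "TASK", "InstanceType": "c3.xlarge", "InstanceCount": 10,
             "Market": "SPOT", "BidPrice": "0.10"},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,      # transient cluster: shut down when done
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(resp["JobFlowId"])
```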
42. Amazon EMR Best Practices
• Balance transient vs persistent clusters
to get the best TCO
• Leverage Amazon S3 integration
– Consistent View for EMRFS
• Use Compression (LZO is a good pick)
• Avoid small files (< 100MB; s3distcp can help!)
• Size cluster to suit each job
• Use EC2 Spot Instances
43. Amazon EMR Nodes and Size
• Tuning cluster size can be more efficient than tuning Hadoop code
• Use m1 and c1 family for functional testing
• Use m3 and c3 xlarge and larger nodes for production workloads
• Use cc2/c3 for memory and CPU intensive jobs
• hs1, hi1, i2 instances for HDFS workloads
• Prefer a smaller cluster of larger nodes
49. Data Characteristics: Hot, Warm, Cold
Hot Warm Cold
Volume MB–GB GB–TB PB
Item size B–KB KB–MB KB–TB
Latency ms ms, sec min, hrs
Durability Low–High High Very High
Request rate Very High High Low
Cost/GB $$-$ $-¢¢ ¢
50. Comparing the options along six dimensions: Average latency | Data volume | Item size | Request rate | Cost ($/GB/month) | Durability
ElastiCache: ms | GB | B-KB | Very High | $$ | Low-Moderate
Amazon DynamoDB: ms | GB-TB (no limit) | B-KB (64 KB max) | Very High | ¢¢ | Very High
Amazon RDS: ms-sec | GB-TB (3 TB max) | KB (~row size) | High | ¢¢ | High
CloudSearch: ms-sec | GB-TB | KB (1 MB max) | High | $ | High
Amazon Redshift: sec-min | TB-PB (1.6 PB max) | KB (64 K max) | Low | ¢ | High
Amazon EMR (Hive): sec-min-hrs | GB-PB (~nodes) | KB-MB | Low | ¢ | High
Amazon S3: ms-sec-min (~size) | GB-PB (no limit) | KB-GB (5 TB max) | Low-Very High (no limit) | ¢ | Very High
Amazon Glacier: hrs | GB-PB (no limit) | GB (40 TB max) | Very Low (no limit) | ¢ | Very High
51. Cost Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project that will greatly increase
my team’s use of Amazon S3. Hoping you could answer
some questions. The current iteration of the design calls for
many small files, perhaps up to a billion during peak. The
total size would be on the order of 1.5 TB per month…”
Request rate: 300 writes/sec | Object size: 2,048 bytes | Total size: 1,483 GB/month | Objects per month: 777,600,000
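A quick check of where those figures come from, assuming a 30-day month:

```python
# How the slide's numbers follow from 300 writes/sec of 2,048-byte objects:
writes_per_sec = 300
object_bytes = 2048

objects_per_month = writes_per_sec * 60 * 60 * 24 * 30            # 777,600,000
total_gb_per_month = objects_per_month * object_bytes / 1024**3   # ~1,483 GB

print(objects_per_month, round(total_gb_per_month))
```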
52. DynamoDB or S3?
Request rate: 300 writes/sec | Object size: 2,048 bytes | Total size: 1,483 GB/month | Objects per month: 777,600,000
53. Scenario 1: 300 writes/sec | 2,048-byte objects | 1,483 GB/month | 777,600,000 objects/month → use Amazon DynamoDB
Scenario 2: 300 writes/sec | 32,768-byte objects | 23,730 GB/month | 777,600,000 objects/month → use Amazon S3
55. Putting it all together
De-coupled architecture
• Multi-tier data processing architecture
• Ingest & Store de-coupled from Processing
• Ingest tools write to multiple data stores
• Processing frameworks (Hadoop, Spark, etc.) read from data stores
• Consumers can decide which data store to read from depending on
their data processing requirement
56. Data Temperature: Hot → Cold
Processing (hot to cold): Spark Streaming / Storm; Redshift, Impala, Spark; EMR/Hadoop
Data stores (hot to cold): Kinesis / Kafka; NoSQL (DynamoDB) / Hadoop HDFS; S3
Latency of answers: low on the hot end, high on the cold end
60. A look at how it works
Data Analyzed Using EMR:
Months of user history → common misspellings (Weste, Winstin, Westa, Whenstin) → automatic spelling corrections
61. Yelp web site log data goes into Amazon S3
Months of user search data
Search terms
Misspellings
Final click throughs
Amazon S3
63. All 200 nodes of the cluster simultaneously look for
common misspellings
Hadoop Cluster
Amazon S3 Amazon EMR
Westen
Wistin
Westan
64. A map of common misspellings and suggested corrections
are loaded back into Amazon S3.
Hadoop Cluster
Amazon S3 Amazon EMR
Westen
Wistin
Westan
65. Then the cluster is shut down
Yelp only pays for the time they used it
Hadoop Cluster
Amazon S3 Amazon EMR
66. Each of Yelp’s 80 Engineers Can Do This Whenever
They Have a Big Data Problem
Yelp spins up over 250 Hadoop clusters per week in EMR.
68. Data Innovation Meets Action at Scale
at NASDAQ OMX
• NASDAQ’s technology powers more than 70 marketplaces in 50 countries
• NASDAQ’s global platform can handle more than 1 million messages/second at a median speed of sub-55 microseconds
• NASDAQ owns & operates 26 markets, including 3 clearinghouses & 5 central securities depositories
• More than 5,500 structured products are tied to NASDAQ’s global indexes, with a notional value of at least $1 trillion
• NASDAQ powers 1 in 10 of the world’s securities transactions
69. NASDAQ’s Big Data Challenge
• Archiving Market Data
– A classic “Big Data” problem
• Power Surveillance and Business Intelligence/Analytics
• Minimize Cost
– Not only infrastructure, but development/IT labor costs too
• Empower the business for self-service
70. SIP Total Monthly Message Volumes (OPRA, UQDF and CQS)
Market data is big data.
Charts courtesy of the Financial Information Forum (redistribution without permission from FIF prohibited; email: fifinfo@fif.com)
NASDAQ Exchange Daily Peak Messages

UQDF and CQS total monthly message volumes (with combined average daily volume):
Date | UQDF | CQS | Average Daily Volume
Aug-12 | 2,317,804,321 | 8,241,554,280 | 459,102,548
Sep-12 | 1,948,330,199 | 7,452,279,225 | 494,768,917
Oct-12 | 1,016,336,632 | 7,452,279,225 | 403,267,422
Nov-12 | 2,148,867,295 | 9,552,313,807 | 557,199,100
Dec-12 | 2,017,355,401 | 8,052,399,165 | 503,487,728
Jan-13 | 2,099,233,536 | 7,474,101,082 | 455,873,077
Feb-13 | 1,969,123,978 | 7,531,093,813 | 500,011,463
Mar-13 | 2,010,832,630 | 7,896,498,260 | 495,366,545
Apr-13 | 2,447,109,450 | 9,805,224,566 | 556,924,273
May-13 | 2,400,946,680 | 9,430,865,048 | 537,809,624
Jun-13 | 2,601,863,331 | 11,062,086,463 | 683,197,490
Jul-13 | 2,142,134,920 | 8,266,215,553 | 473,106,840
Aug-13 | 2,188,338,764 | 9,079,813,726 | 512,188,750

OPRA Annual Increase: 69%
CQS Annual Increase: 10%
UQDF Annual Decrease: 6%

OPRA total monthly message volume (with average daily volume):
Date | OPRA | Average Daily Volume
Aug-12 | 80,600,107,361 | 3,504,352,494
Sep-12 | 77,303,404,427 | 4,068,600,233
Oct-12 | 98,407,788,187 | 4,686,085,152
Nov-12 | 104,739,265,089 | 4,987,584,052
Dec-12 | 81,363,853,339 | 4,068,192,667
Jan-13 | 82,227,243,377 | 3,915,583,018
Feb-13 | 87,207,025,489 | 4,589,843,447
Mar-13 | 93,573,969,245 | 4,678,698,462
Apr-13 | 123,865,614,055 | 5,630,255,184
May-13 | 134,587,099,561 | 6,117,595,435
Jun-13 | 162,771,803,250 | 8,138,590,163
Jul-13 | 120,920,111,089 | 5,496,368,686
Aug-13 | 136,237,441,349 | 6,192,610,970
71. NASDAQ’s Legacy Solution
• On-premises MPP DB
– Relatively expensive, finite storage
– Required periodic additional expenses to add more storage
– Ongoing IT (administrative) human costs
• Legacy BI tool
– Requires developer involvement for new data sources, reports,
dashboards, etc.
72. New Solution: Amazon Redshift
• Cost Effective
– Redshift is 43% of the cost of legacy
• Assuming equal storage capacities
– Doesn’t include IT ongoing costs!
• Performance
– Outperforms NASDAQ’s legacy BI/DB solution
– Insert 550K rows/second on a 2 node 8XL cluster
• Elastic
– NASDAQ can add additional capacity on demand, easy to grow their cluster
73. New Solution: Pentaho BI/ETL
• Amazon Redshift partner
– http://aws.amazon.com/redshift/partn
ers/pentaho/
• Self Service
– Tools empower BI users to integrate
new data sources, create their own
analytics, dashboards, and reports
without requiring development
involvement
• Cost effective
74. Net Result
• New solution is cheaper, faster, and offers capabilities that NASDAQ
didn’t have before
– Empowers NASDAQ’s business users to explore data like they never
could before
– Reduces IT and development as bottlenecks
– Margin improvement (expense reduction and supports business
decisions to grow revenue)
Organized the deck so that the partner slide in each section closes that section.
2 x 2 Matrix
Structured
Level of query (from none to complex)
Draw down the slide
Transition Statement – RDBMS is still a viable and important component in Big Data Architecture
Traditional SQL Database
Fully managed which means zero admin
Most popular flavors
Binary compatible
Generally come in two major types
Batch
Streaming
Examples
Needs a transition statement – Looking at AWS Portfolio in context of Processing ….
Columnar data warehouse; massively parallel (MPP); petabyte scale; fully managed; $1,000/TB/Year (with Heavy RI)
Leader node
Compute Node
Hardware optimized
Two different hardware platforms (SSD and HDD)
Parallel Load
API (of course)
Copy
Split files into 1 to 2 GB compressed
Use manifest file
Sort keys
Distribution keys
System has option to make educated guess
Regular Hadoop/HDFS
Support for popular add-ons
Fully managed and easy to use
On demand and SPOT pricing
Integrated with other AWS services
S3
DDB
Kinesis
Bootstrap capabilities have most flexibility at the layer above core Hadoop/HDFS
Popular pattern
1-Customer puts data into S3
2-Make some decisions about what to run (type, number and other technologies to install)
3-Use CLI, SDK, Console or API to launch
4-Output is sent to S3
Call out S3 integration as an important innovation and addition
Time to resize is going to be a combination of EC2/AMI boot time + the bootstrap options.
Call out that the nodes that are added to a running cluster that are SPOT must be task nodes (details)
Additional nodes to a running cluster that are SPOT
S3DistCp to load/unload from HDFS
Shut down the cluster (stop being charged, except for S3 storage)
Core Hadoop is:
Map Reduce – Computational Model
HDFS – Hadoop Distributed File System
Additional Tools have entered the eco system
Tools to help get data into Hadoop
Tools to connect to Relational Systems
Monitoring
Machine Learning
This slide is a small slice
EMRFS
All of your files will be processed as intended when you run a chained series of MapReduce jobs. This is not a replacement file system; instead, it extends the existing file system with mechanisms designed to detect and react to inconsistencies. The detection and recovery process includes a retry mechanism. After it reaches a configurable limit on the number of retries (to allow S3 to return what EMRFS expects in the consistent view), it will either (your choice) raise an exception or log the issue and continue.
The EMRFS consistent view creates and uses metadata in an Amazon DynamoDB table to maintain a consistent view of your S3 objects. This table tracks certain operations but does not hold any of your data. The information in the table is used to confirm that the results returned from an S3 LIST operation are as expected, thereby allowing EMRFS to check list consistency and read-after-write consistency.
Compression
Always Compress Data Files On Amazon S3
Reduces Bandwidth Between Amazon S3 and Amazon EMR
Speeds Up Your Job
Compress Mappers and Reducer Output
Advise Compressing all files for an instance for a day
Do not use smaller nodes for production workloads unless you're 100% sure you know what you're doing. The majority of jobs I've seen require more CPU and memory than the smaller instances have to offer, and most of the time this causes job failures if the cluster is not fine-tuned. Instead of spending time fine-tuning small nodes, get a larger node and run your workload with peace of mind. Anything m1.xlarge and larger is a good candidate: m1.xlarge, c1.xlarge, m2.4xlarge, and all cluster compute instances are good choices.
To summarize the review of the AWS Big Data Portfolio
There’s no single tool that can do every job needed
Emphasize that this is an “aid” for the design process used to compare options.
In my role as an SA it helps to have a heuristic tool to think about the requirements
Is the data HOT, Warm or cold
As a designer – by asking various questions can slot the data into one of these buckets
Less of a rule and more of a guideline
This material in customer’s own words
http://www.youtube.com/watch?v=j7uZGgSxJGM&t=3m0s
Access to AWS expertise
Solution Architects and security experts
Architecture reviews and best practices
Case studies and sample architectures
http://aws.amazon.com/solutions/case-studies/
http://aws.amazon.com/architecture/
Training and Premium Support
Four support tiers, including Enterprise: http://aws.amazon.com/premiumsupport/
Trusted Advisor service for cost optimization
AWS Professional Services
Domain and product experts
APN Competency Program (to Customers)
What: The APN Competency Program is designed to provide AWS Customers with top quality APN Partners who have demonstrated technical proficiency and proven success in specialized solutions areas. Partners who’ve attained an APN Competency offer a variety of services, software, and solutions on the AWS Cloud.
- Big Data Specific:
o Who: Big Data Competency Partners help customers evaluate and use the tools, techniques, and technologies of working with data productively, at any scale. Learn More about Big Data Competency Partners: aws.amazon.com/partners/competencies/big-data/
- Announcement – Big Data Competency for APN Technology Partners
o Technology Partners for Launch: Microstrategy, SAP, Informatica, SnapLogic, Tableau Software, Attunity, MapR, Sumo Logic, Splunk
o Find out more information on these partners on the Big Data Competency Partners page(aws.amazon.com/partners/competencies/big-data)
- Advise customers to take advantage of these partners software and solutions in Big Data on AWS – they’ve been qualified by Partner Teams, Service Teams, have Marketplace Software, Built Test Drives, etc.
APN Competency Program (to Partners)
What: The APN Competency program is designed to highlight APN Partners who have demonstrated technical proficiency and proven customer success in specialized solution areas. Attaining an APN Competency allows partners to differentiate themselves to customers by showcasing expertise in a specific solution area.
- Banner Attached – Learn More about APN Competencies
- Learn More about the Program: APN Competency Program (aws.amazon.com/partners/competencies)
- Learn More about APN Competency Partners:
o SAP (aws.amazon.com/partners/competencies/sap)
o Oracle (aws.amazon.com/partners/competencies/oracle)
o Big Data (aws.amazon.com/partners/competencies/big-data)
o MSP (aws.amazon.com/partners/competencies/msp)
o Microsoft (aws.amazon.com/partners/competencies/Microsoft)
- Announcement – Big Data Competency for APN Technology Partners
o Technology Partners for Launch: Microstrategy, SAP, Informatica, SnapLogic, Tableau Software, Attunity, MapR, Sumo Logic, Splunk
o Find out more information on these partners on the Big Data Competency Partners page(aws.amazon.com/partners/competencies/big-data)
Life technologies
LinkedIn
DropCam
ICRAR
CDC
Channel4
Yelp
Nokia
AWS Marketplace is the AWS Online Software Store
Customer can find, research, buy software including a wide variety of big data options and software to help you manage your databases
With AWS Marketplace, the simple hourly pricing of most products aligns with EC2 usage model
You can find, purchase and 1-Click launch in minutes, making deployment easy
Marketplace billing integrated into your AWS account
1300+ product listings across 25 categories
Description: Attunity CloudBeam for Amazon Redshift (Express) enables organizations to simplify, automate, and accelerate bulk data loading from database sources (Oracle, Microsoft SQL Server, and MySQL) to Amazon Redshift. Attunity CloudBeam allows your team to avoid the heavy lifting of manually extracting data, transferring via API/script, chopping, staging, and importing.
We will provide researchers and professors of accredited schools and universities with free access to AWS to accelerate science and discovery.
With AWS in Education, educators, academic researchers, and students can apply to obtain free usage credits to tap into the on-demand infrastructure of the Amazon Web Services cloud to teach advanced courses, tackle research endeavors, and explore new projects – tasks that previously would have required expensive up-front and ongoing investments in infrastructure.
Microstrategy
Splunk
QlikView
EMR
Pig
MongoDB
Oracle BI, OBIEE 11g
SAP Hana
Yellowfin BI
Speaker Notes:
We have just released “Big Data on AWS”, a new technical training course for individuals who are responsible for implementing big data environments, namely Data Scientists, Data Analysts, and Enterprise Big Data Solution Architects. This course is designed to teach technical end users how to use Amazon EMR to process data using the broad ecosystem of Hadoop tools like Pig and Hive. We also cover how to create big data environments, work with Amazon DynamoDB and Amazon Redshift, understand the benefits of Amazon Kinesis, and leverage best practices to design big data environments for security and cost-effectiveness.
Upcoming classes include:
Audience
Individuals responsible for implementing big data environments: Data Scientists, Data Analysts, and Enterprise Big Data Solution Architects
Objectives
Understand the architecture of an Amazon EMR cluster
Choose appropriate AWS data storage options for use with Amazon EMR
Know your options for ingesting, transferring, and compressing data for use with Amazon EMR
Use common programming frameworks for Amazon EMR including Hive, Pig, and Streaming
Work with Amazon Redshift and Spark/Shark to implement big data solutions
Leverage big data visualization software
Choose appropriate security and cost management options for Amazon EMR
Understand the benefits of using Amazon Kinesis for big data
Prerequisites
Basic familiarity with big data technologies, including Apache Hadoop and HDFS
Knowledge of big data technologies such as Pig, Hive, and MapReduce helpful, but not required
Working knowledge of core AWS services and public cloud implementation
AWS Essentials course completion or equivalent experience
Basic understanding of data warehousing, relational database systems, and database design
Format
Instructor-Led & Hands-on Labs
Duration
3 days
Details
aws.amazon.com/training/course-descriptions/bigdata/
Big Data on AWS
Big Data on AWS introduces you to cloud-based big data solutions and Amazon Elastic MapReduce (EMR), the AWS big data platform. In this course, we show you how to use Amazon EMR to process data using the broad ecosystem of Hadoop tools like Pig and Hive. We also teach you how to create big data environments, work with Amazon DynamoDB and Amazon Redshift, understand the benefits of Amazon Kinesis, and leverage best practices to design big data environments for security and cost-effectiveness.
Intended Audience
This course is intended for:
Partners and customers responsible for implementing big data environments, including: Data Scientists
Data Analysts
Enterprise, Big Data Solution Architects
Prerequisites
We recommend that attendees of this course have:
Basic familiarity with big data technologies, including Apache Hadoop and HDFS.
Knowledge of big data technologies such as Pig, Hive, and MapReduce is helpful but not required
Working knowledge of core AWS services and public cloud implementation.
Students should complete the AWS Essentials course or have equivalent experience: http://aws.amazon.com/training/course-descriptions/essentials/
Basic understanding of data warehousing, relational database systems, and database design
Delivery Method
Instructor-Led Training (ILT)
Hands-on Labs on AWS
Hands-On Activity
This course allows you to test new skills and apply knowledge to your working environment through a variety of practical exercises.
Duration
3 days
Course Outline
Day 1
Overview of Big Data and Apache Hadoop
Benefits of Amazon EMR
Amazon EMR Architecture
Using Amazon EMR
Launching and Using an Amazon EMR Cluster
High-Level Apache Hadoop Programming Frameworks
Using Hive for Advertising Analytics
Day 2
Other Apache Hadoop Programming Frameworks
Using Streaming for Life Sciences Analytics
Overview: Spark and Shark for In-Memory Analytics
Using Spark and Shark for In-Memory Analytics
Managing Amazon EMR Costs
Overview of Amazon EMR Security
Exploring Amazon EMR Security
Data Ingestion, Transfer, and Compression
Day 3
Using Amazon Kinesis for Real-Time Big Data Processing
AWS Data Storage Options
Using DynamoDB with Amazon EMR
Overview: Amazon Redshift and Big Data
Using Amazon Redshift for Big Data
Visualizing and Orchestrating Big Data
Using Tableau Desktop or Jaspersoft BI to Visualize Big Data
By the end of this course, you will be able to:
Understand Apache Hadoop in the context of Amazon EMR
Understand the architecture of an Amazon EMR cluster
Launch an Amazon EMR cluster using an appropriate Amazon Machine Image and Amazon EC2 instance types
Choose appropriate AWS data storage options for use with Amazon EMR
Know your options for ingesting, transferring, and compressing data for use with Amazon EMR
Use common programming frameworks available for Amazon EMR including Hive, Pig, and Streaming
Work with Amazon Redshift to implement a big data solution
Leverage big data visualization software
Choose appropriate security options for Amazon EMR and your data
Perform in-memory data analysis with Spark and Shark on Amazon EMR
Choose appropriate options to manage your Amazon EMR environment cost-effectively
Understand the benefits of using Amazon Kinesis for big data
Sign Up
Big Data & HPC track with over 20 sessions on big data and high performance computing
Link to re:Invent