SlideShare ist ein Scribd-Unternehmen logo
1 von 160
Downloaden Sie, um offline zu lesen
Big Data Analytics
Abhishek Sinha
Business Development Manager,
AWS
@abysinha
sinhaar@amazon.com
An engineer’s definition
When your data sets become so large that you have to start
innovating how to collect, store, organize, analyze and
share it
What does big data look like ?
Volume
Velocity
Variety
3Vs
Where is this data coming from ?
Human generated
Machine generated
Tweet
Surf the internet
Buy and sell products
Upload images and videos
Play games
Check in at restaurants
Search for cafes
Find deals
Watch content online
Look for directions
Use social media
Human generated
Machine generated
Networks and security devices
Mobile phones
Cell phone towers
Smart grids
Smart meters
Telematics from cars
Sensors on machines
Videos from traffic and security
cameras
What are people using this for ?
Big Data Verticals and Use cases
Media/Advertising
Targeted
Advertising
Image and
Video
Processing
Oil & Gas
Seismic
Analysis
Retail
Recommendati
ons
Transactions
Analysis
Life Sciences
Genome
Analysis
Financial
Services
Monte Carlo
Simulations
Risk
Analysis
Security
Anti-virus
Fraud
Detection
Image
Recognition
Social
Network/Gaming
User
Demographi
cs
Usage
analysis
In-game
metrics
Why is big data hard ?
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Lower cost,
higher throughput
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Highly
constrained
Lower cost,
higher throughput
Big Gap in turning data into actionable
information
Amazon Web Services helps remove constraints
Big Data + Cloud = Awesome Combination
Big data:
• Potentially massive datasets
• Iterative, experimental style
of data manipulation and
analysis
• Frequently not a steady-state
workload; peaks and valleys
• Data is a combination of
structured and unstructured
data in many formats
AWS Cloud:
• Massive, virtually unlimited
capacity
• Iterative, experimental style of
infrastructure deployment/usage
• At its most efficient with highly
variable workloads
• Tools for managing structured
and unstructured data
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Data size
• Global reach
• Native app for almost every smartphone, SMS, web, mobile-web
• 10M+ users, 15M+ venues, ~1B check-ins
• Terabytes of log data
Stack
ApplicationStack
Scala/Liftweb API Machines WWW Machines Batch Jobs
Scala Application code
Mongo/Postgres/Flat
Files
Databases LogsDataStack
Amazon S3 Database Dumps Log Files
Hadoop Elastic Map Reduce
Hive/Ruby/Mahout Analytics Dashboard Map Reduce Jobs
mongoexport
postgres dump
Flume
Stack – Front end Application
ApplicationStack
Scala/Liftweb API Machines WWW Machines Batch Jobs
Scala Application code
Mongo/Postgres/Flat
Files
Databases LogsDataStack
Amazon S3 Database Dumps Log Files
Hadoop Elastic Map Reduce
Hive/Ruby/Mahout Analytics Dashboard Map Reduce Jobs
mongoexport
postgres dump
Flume
Stack – Collection and Storage
ApplicationStack
Scala/Liftweb API Machines WWW Machines Batch Jobs
Scala Application code
Mongo/Postgres/Flat
Files
Databases LogsDataStack
Amazon S3 Database Dumps Log Files
Hadoop Elastic Map Reduce
Hive/Ruby/Mahout Analytics Dashboard Map Reduce Jobs
mongoexport
postgres dump
Flume
Stack – analysis and sharing
ApplicationStack
Scala/Liftweb API Machines WWW Machines Batch Jobs
Scala Application code
Mongo/Postgres/Flat
Files
Databases LogsDataStack
Amazon S3 Database Dumps Log Files
Hadoop Elastic Map Reduce
Hive/Ruby/Mahout Analytics Dashboard Map Reduce Jobs
mongoexport
postgres dump
Flume
Users Overtime
“Who is using our
service?”
Identified early mobile usage
Invested heavily in mobile development
Finding signal in the noise of logs
9,432,061 unique mobile devices
used the Yelp mobile app.
4 million+ calls. 5 million+ directions.
In January 2013
Autocomplete Search
Recommendations
Automatic spelling
corrections
“What kind of movies do people
like ?”
More than 25 Million Streaming Members
50 Billion Events Per Day
30 Million plays every day
2 billion hours of video in 3 months
4 million ratings per day
3 million searches
Device location , time , day,
week etc.
Social data
10 TB of streaming data per day
Data consumed in multiple ways
S3
EMR
Prod Cluster
(EMR)
Recommendati
on Engine
Ad-hoc
Analysis
Personalization
AWS
Import/Export
Corporate
data center
Amazon
Elastic
MapReduce
Amazon
Simple
Storage
Service (S3)
BI Users
Clickstream data from
500+ websites and
VoD platform
“Who buys video games?”
Who is Razorfish
• Full service Digital Agency
• Developed an Ad-Serving Platform compatible with most browsers
• Clickstream analysis of data , current historical trends and segmentation of
users
• Segmentation is used to serve ads and cross sell
• 45TB of Log data
• Problems at scale
– Giant Datasets
– Building Infrastructure requires large continuous investment
– Build for peak holiday season
– Traditional Data stores are not scaling
3.5 billion records
13 TB of click stream logs
71 million unique cookies
Per day:
Previously in 2009
Today
Today
This happens in 8 hours everyday
Why AWS + EMR
• Prefect Clarity of Cost
• No upfront infrastructure investment
• No client processing contention
• Without EMR/Hadoop it takes 3 days , with EMR 8 hours
– Scalability 1 node x 100 hours = 100 nodes x 1 hour
• Meet SLA
Playfish improves in-game experience for its users
through data mining
Challenge:
Must understand player usage trends across
50M month users, multiple platforms, 10s of
games, and in the face of rapid growth. This
drives both in-game improvements and
defines what games to target next.
Solution:
EMR provides Playfish the flexibility to
experiment and rapidly ask new questions.
All usage data is stored in S3 and analysts
run ad-hoc hive queries that can slice the
data by time, game, and user.
Data Driven Game Design
Data is being used to understand what gamers are doing
inside the game (behavioral analysis)
- What features people like (rely on data instead of
forum posts)
- What features are abandoned
- A/B testing
- Monetization – In Game Analytics
Building a big data architecture
Design Patterns
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Getting your Data into AWS
Amazon S3
Corporate Data
Center
• Console Upload
• FTP
• AWS Import Export
• S3 API
• Direct Connect
• Storage Gateway
• 3rd Party Commercial Apps
• Tsunami UDP
1
Write directly to a data source
Your application Amazon S3
DynamoDB
Any other data
store
Amazon S3
Amazon EC2
2
Queue , pre-process and then write to data source
Amazon Simple
Queue Service
(SQS)
Amazon S3
DynamoDB
Any other data
store
3
Agency Customer: Video Analytics on AWS
Elastic Load
Balancer
Edge Servers
on EC2
Workers on
EC2
Logs Reports
HDFS Cluster
Amazon Simple Queue
Service (SQS)
Amazon Simple Storage Service
(S3)
Amazon Elastic MapReduce
Aggregate and write to data source
Flume running
on EC2
Amazon S3
Any other data
store
HDFS
4
What is Flume
• Collection, Aggregation of streaming Event Data
– Typically used for log data, sensor data , GPS data etc
• Significant advantages over ad-hoc solutions
– Reliable, Scalable, Manageable, Customizable and High Performance
– Declarative, Dynamic Configuration
– Contextual Routing
– Feature rich
– Fully extensible
Typical Aggregation Flow
[Client]+  Agent [ Agent]*  Destination
Flume uses a multi-tier approach where multiple agents can send data to
another agent which acts as a aggregator. For each agent , data can from
either an agent or a client or can be sent to another agent or a sink
Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
S3 as a “single source of truth”
S3
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Choose depending upon design
Choice of storage systems (Structure and Volume)
Structure
LowHigh
Large
Small
Size
S3
RDS
Dynamo DB
NoSQL
EBS
1
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Hadoop based Analysis
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
EMR is Hadoop in the Cloud
What is Amazon Elastic MapReduce (EMR)?
A framework
Splits data into pieces
Lets processing occur
Gathers the results
distributed computing
Difficulty
Number of Machines
1
1
Difficulty
Number of Machines
1
1
106
2
Difficulty
Number of Machines
1
1
106
2
distributed computing
is hard
distributed computing
requires god-like engineers
Innovation #1:
Hadoop is…
The MapReduce computational paradigm
Hadoop is…
The MapReduce computational paradigm
… implemented as an Open-source, Scalable,
Fault-tolerant, Distributed System
Person Start End
Bob 00:44:48 00:45:11
Charlie 02:16:02 02:16:18
Charlie 11:16:59 11:17:17
Charlie 11:17:24 11:17:38
Bob 11:23:10 11:23:25
Alice 16:26:46 16:26:54
David 17:20:28 17:20:45
Alice 18:16:53 18:17:00
Charlie 19:33:44 19:33:59
Bob 21:13:32 21:13:43
David 22:36:22 22:36:34
Alice 23:42:01 23:42:11
Person Start End Duration
Bob 00:44:48 00:45:11
Charlie 02:16:02 02:16:18
Charlie 11:16:59 11:17:17
Charlie 11:17:24 11:17:38
Bob 11:23:10 11:23:25
Alice 16:26:46 16:26:54
David 17:20:28 17:20:45
Alice 18:16:53 18:17:00
Charlie 19:33:44 19:33:59
Bob 21:13:32 21:13:43
David 22:36:22 22:36:34
Alice 23:42:01 23:42:11
Person Start End Duration
Bob 00:44:48 00:45:11 23
Charlie 02:16:02 02:16:18
Charlie 11:16:59 11:17:17
Charlie 11:17:24 11:17:38
Bob 11:23:10 11:23:25
Alice 16:26:46 16:26:54
David 17:20:28 17:20:45
Alice 18:16:53 18:17:00
Charlie 19:33:44 19:33:59
Bob 21:13:32 21:13:43
David 22:36:22 22:36:34
Alice 23:42:01 23:42:11
Person Start End Duration
Bob 00:44:48 00:45:11 23
Charlie 02:16:02 02:16:18 16
Charlie 11:16:59 11:17:17
Charlie 11:17:24 11:17:38
Bob 11:23:10 11:23:25
Alice 16:26:46 16:26:54
David 17:20:28 17:20:45
Alice 18:16:53 18:17:00
Charlie 19:33:44 19:33:59
Bob 21:13:32 21:13:43
David 22:36:22 22:36:34
Alice 23:42:01 23:42:11
Person Start End Duration
Bob 00:44:48 00:45:11 23
Charlie 02:16:02 02:16:18 16
Charlie 11:16:59 11:17:17 18
Charlie 11:17:24 11:17:38 14
Bob 11:23:10 11:23:25 15
Alice 16:26:46 16:26:54 8
David 17:20:28 17:20:45 17
Alice 18:16:53 18:17:00 7
Charlie 19:33:44 19:33:59 15
Bob 21:13:32 21:13:43 11
David 22:36:22 22:36:34 12
Alice 23:42:01 23:42:11 10
Person Duration
Bob 23
Charlie 16
Charlie 18
Charlie 14
Bob 15
Alice 8
David 17
Alice 7
Charlie 15
Bob 11
David 12
Alice 10
Person Duration
Bob 23
Charlie 16
Charlie 18
Charlie 14
Bob 15
Alice 8
David 17
Alice 7
Charlie 15
Bob 11
David 12
Alice 10
Person Start End
Bob 00:44:48 00:45:11
Charlie 02:16:02 02:16:18
Charlie 11:16:59 11:17:17
Charlie 11:17:24 11:17:38
Bob 11:23:10 11:23:25
Alice 16:26:46 16:26:54
David 17:20:28 17:20:45
Alice 18:16:53 18:17:00
Charlie 19:33:44 19:33:59
Bob 21:13:32 21:13:43
David 22:36:22 22:36:34
Alice 23:42:01 23:42:11
map
Person Duration
Bob 23
Charlie 16
Charlie 18
Charlie 14
Bob 15
Alice 8
David 17
Alice 7
Charlie 15
Bob 11
David 12
Alice 10
Person Duration
Alice 8
Alice 7
Alice 10
Bob 23
Bob 15
Bob 11
Charlie 16
Charlie 18
Charlie 14
Charlie 15
David 12
David 17
Person Total
Alice 25
Person Duration
Alice 8
Alice 7
Alice 10
Bob 23
Bob 15
Bob 11
Charlie 16
Charlie 18
Charlie 14
Charlie 15
David 12
David 17
Person Duration
Alice 8
Alice 7
Alice 10
Bob 23
Bob 15
Bob 11
Charlie 16
Charlie 18
Charlie 14
Charlie 15
David 12
David 17
Person Total
Bob 49
Alice 25
Person Total
Charlie 63
Bob 49
Alice 25
Person Duration
Alice 8
Alice 7
Alice 10
Bob 23
Bob 15
Bob 11
Charlie 16
Charlie 18
Charlie 14
Charlie 15
David 12
David 17
Person Total
David 29
Charlie 63
Bob 49
Alice 25
Person Duration
Alice 8
Alice 7
Alice 10
Bob 23
Bob 15
Bob 11
Charlie 16
Charlie 18
Charlie 14
Charlie 15
David 12
David 17
Person Total
David 29
Charlie 63
Bob 49
Alice 25
Person Total
Alice 25
Bob 49
Charlie 63
David 29
Person Duration
Alice 8
Alice 7
Alice 10
Bob 23
Bob 15
Bob 11
Charlie 16
Charlie 18
Charlie 14
Charlie 15
David 12
David 17
reduce
Person Start End
Bob 00:44:48 00:45:11
Charlie 02:16:02 02:16:18
Charlie 11:16:59 11:17:17
Charlie 11:17:24 11:17:38
Bob 11:23:10 11:23:25
Alice 16:26:46 16:26:54
David 17:20:28 17:20:45
Alice 18:16:53 18:17:00
Charlie 19:33:44 19:33:59
Bob 21:13:32 21:13:43
David 22:36:22 22:36:34
Alice 23:42:01 23:42:11
Person Duration
Alice 8
Alice 7
Alice 10
Bob 23
Bob 15
Bob 11
Charlie 16
Charlie 18
Charlie 14
Charlie 15
David 12
David 17
map
reduce
Works on one record. In this case it
does “end time minus start time”
In parallel over all the records
Group together common records
(e.g “Alice, Bob”) and add all the
results
Hadoop is…
The MapReduce computational paradigm
Hadoop is…
The MapReduce computational paradigm
… implemented as an Open-source, Scalable,
Fault-tolerant, Distributed System
distributed computing
requires god-like engineers
distributed computing (with Hadoop)
requires god-like talented engineers
Launch a Hadoop cluster from the CLI (
elastic-mapreduce --create --alive 
--instance-type m1.xlarge 
--num-instances 5
The Hadoop Ecosystem
EMR makes it easy to use Hive and Pig
Pig:
• High-level programming
language (Pig Latin)
• Supports UDFs
• Ideal for data flow/ETL
Hive:
• Data Warehouse for Hadoop
• SQL-like query language
(HiveQL)
R:
• Language and software
environment for statistical
computing and graphics
• Open source
EMR makes it easy to use other tools and applications
Mahout:
• Machine learning library
• Supports recommendation
mining, clustering,
classification, and frequent
itemset mining
Hive
Schema on read
Launch a Hive cluster from the CLI (step 1/1)
./elastic-mapreduce --create --alive 
--name "Test Hive" 
--hadoop-version 0.20 
--num-instances 5 
--instance-type m1.large 
--hive-interactive 
--hive-versions 0.7.1
SQL Interface for working with data
Simple way to use Hadoop
Create Table statement references data location on
S3
Language called HiveQL, similar to SQL
An example of a query could be:
SELECT COUNT(1) FROM sometable;
Requires to setup a mapping to the input data
Uses SerDe:s to make different input formats
queryable
Powerful data types (Array & Map..)
SQL HiveQL
Updates UPDATE, INSERT,
DELETE
INSERT, OVERWRITE
TABLE
Transactions Supported Not supported
Indexes Supported Not supported
Latency Sub-second Minutes
Functions Hundreds Dozens
Multi-table inserts Not supported Supported
Create table as select Not valid SQL-92 Supported
./elastic-mapreduce –create
--name "Hive job flow”
--hive-script
--args s3://myawsbucket/myquery.q
--args -d,INPUT=s3://myawsbucket/input,-
d,OUTPUT=s3://myawsbucket/output
HiveQL to execute
./elastic-mapreduce
--create
--alive
--name "Hive job flow”
--num-instances 5 --instance-type m1.large 
--hive-interactive
Interactive hive session
114
{
requestBeginTime: "19191901901",
requestEndTime: "19089012890",
browserCookie: "xFHJK21AS6HLASLHAS",
userCookie: "ajhlasH6JASLHbas8",
searchPhrase: "digital cameras" adId:
"jalhdahu789asashja",
impresssionId: "hjakhlasuhiouasd897asdh",
referrer: "http://cooking.com/recipe?id=10231",
hostname: "ec2-12-12-12-12.ec2.amazonaws.com",
modelId: "asdjhklasd7812hjkasdhl",
processId: "12901", threadId: "112121",
timers:
{ requestTime: "1910121", modelLookup: "1129101" }
counters:
{ heapSpace: "1010120912012" }
}
115
{
requestBeginTime: "19191901901",
requestEndTime: "19089012890",
browserCookie: "xFHJK21AS6HLASLHAS",
userCookie: "ajhlasH6JASLHbas8",
adId: "jalhdahu789asashja",
impresssionId:
hjakhlasuhiouasd897asdh",
clickId: "ashda8ah8asdp1uahipsd",
referrer: "http://recipes.com/",
directedTo: "http://cooking.com/" }
CREATE EXTERNAL TABLE impressions (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string
)
PARTITIONED BY (dt string)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='requestBeginTime,
adId, impressionId, referrer, userAgent,
userCookie, ip' )
LOCATION ‘s3://mybucketsource/tables/impressions' ;
CREATE EXTERNAL TABLE impressions (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string
)
PARTITIONED BY (dt string)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='requestBeginTime,
adId, impressionId, referrer, userAgent,
userCookie, ip' )
LOCATION ‘s3://mybucketsource/tables/impressions' ;
Table structure
to create
(happens fast as
just mapping to
source)
CREATE EXTERNAL TABLE impressions (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string
)
PARTITIONED BY (dt string)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='requestBeginTime,
adId, impressionId, referrer, userAgent,
userCookie, ip' )
LOCATION ‘s3://mybucketsource/tables/impressions' ;
Source data in S3
Hadoop lowers the cost of developing
a distributed system.
hive> select * from impressions limit 5;
Selecting from source
data directly via Hadoop
What about the cost of operating
a distributed system?
November traffic at amazon.com
November traffic at amazon.com
November traffic at amazon.com
76%
24%
Innovation #2:
EMR is Hadoop in the Cloud
What is Amazon Elastic MapReduce (EMR)?
1 instance x 100 hours = 100 instances x 1 hour
EMR Cluster
S3
Put the data
into S3
Choose: Hadoop distribution, # of
nodes, types of nodes, custom
configs, Hive/Pig/etc.
Get the output from
S3
Launch the cluster using the
EMR console, CLI, SDK, or
APIs
You can also store
everything in HDFS
How does EMR work ?
S3
What can you run on EMR…
EMR Cluster
Resize Nodes
EMR Cluster
You can easily add and
remove nodes
On and Off Fast Growth
Predictable peaksVariable peaks
WASTE
Fast GrowthOn and Off
Predictable peaksVariable peaks
Your choice of tools on Hadoop/EMR
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
SQL based processing
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift
Pre-processing
framework
Petabyte scale
Columnar Data -
warehouse
Massively Parallel Columnar Datawarehouses
• Columnar Data stores
• MPP
– Parallel Ingest
– Parallel Query
– Scale Out
– Parallel Backup
Columnar data stores
• Data alignment
and block size in
row stores vs.
column stores
• Compression
based on each
column
MPP Data warehouse parallelizes and distributes
everything
• Query
• Load
• Backup
• Restore
• Resize
10 GigE
(HPC)
Ingestion
Backup
Restore
JDBC/ODBC
But Data-warehouses are
• Hard to manage
• Very expensive
• Difficult to scale
• Difficult to get performance
Amazon Redshift is a fast and powerful, fully managed,
petabyte-scale data warehouse service in the AWS cloud
Amazon Redshift is a fast and powerful, fully managed,
petabyte-scale data warehouse service in the AWS cloud
Parallelize and Distribute Everything
Dramatically Reduce I/O
MPP
Load
Query
Resize
Backup
Restore
Amazon Redshift is a fast and powerful, fully managed,
petabyte-scale data warehouse service in the AWS cloud
Parallelize and Distribute Everything
Dramatically Reduce I/O
MPP
Load
Query
Resize
Backup
Restore
Direct-attached storage
Large data block sizes
Column data store
Data compression
Zone maps
Amazon Redshift is a fast and powerful, fully managed,
petabyte-scale data warehouse service in the AWS cloud
Protect Operations
Simplify Provisioning
Redshift data is encrypted
Continuously backed up to S3
Automatic node recovery
Transparent disk failure
Amazon Redshift is a fast and powerful, fully managed,
petabyte-scale data warehouse service in the AWS cloud
Protect Operations
Simplify Provisioning
Redshift data is encrypted
Continuously backed up to S3
Automatic node recovery
Transparent disk failure
Create a cluster in minutes
Automatic OS and software patching
Scale up to 1.6PB with a few clicks and no downtime
Amazon Redshift is a fast and powerful, fully managed,
petabyte-scale data warehouse service in the AWS cloud
Start Small and Grow Big
Extra Large Node (XL)
3 spindles, 2TB, 15GiB RAM
2 virtual cores, 10GigE
1 node (2TB)  2-32 node cluster (64TB)
8 Extra Large Node (8XL)
24 spindles, 16TB, 120GiB RAM
16 virtual cores, 10GigE
2-100 node cluster (1.6PB)
Amazon Redshift is a fast and powerful, fully managed,
petabyte-scale data warehouse service in the AWS cloud
Easy to provision and scale
No upfront costs, pay as you go
High performance at a low price
Open and flexible with support for popular BI tools
Amazon Redshift is priced to let you analyze all your data
Price Per Hour for HS1.XL
Single Node
Effective Hourly Price
Per TB
Effective Annual Price
per TB
On-Demand $ 0.850 $ 0.425 $ 3,723
1 Year Reservation $ 0.500 $ 0.250 $ 2,190
3 Year Reservation $ 0.228 $ 0.114 $ 999
Simple Pricing
Number of Nodes x Cost per Hour
No charge for Leader Node
No upfront costs
Pay as you go
Your choice of BI Tools on the cloud
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift
Pre-processing
framework
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Collaboration and Sharing insights
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift
Sharing results and visualizations
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift
Web App Server
Visualization tools
Sharing results and visualizations and scale
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift
Web App Server
Visualization tools
Sharing results and visualizations
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift Business
Intelligence Tools
Business
Intelligence Tools
Geospatial Visualizations
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift Business
Intelligence Tools
Business
Intelligence Tools
GIS tools on
hadoop
GIS tools
Visualization tools
Rinse Repeat every day or hour
Rinse and Repeat
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift
Visualization tools
Business
Intelligence Tools
Business
Intelligence Tools
GIS tools on
hadoop
GIS tools
Amazon data pipeline
The complete architecture
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift
Visualization tools
Business
Intelligence Tools
Business
Intelligence Tools
GIS tools on
hadoop
GIS tools
Amazon data pipeline
How do you start ?
Where do you start ?
• Where is your data ? (S3, SQL, NoSQL ?)
– Are you collecting all your data ?
– What is the format (structured or unstructured)
– How much is this data going to grow ?
• How do you want to process it ?
– SQL (HIVE), Scripts (Python/Ruby/Node.JS) On Hadoop ?
• How do you want to use this data
– Visualization tools
• Do you yourself or engage an AWS partner
• Write to me sinhaar@amazon.com
Thank You
sinhaar@amazon.com

Weitere ähnliche Inhalte

Was ist angesagt?

Welcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution OverviewWelcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution OverviewAmazon Web Services
 
16h00 globant - aws globant-big-data_summit2012
16h00   globant - aws globant-big-data_summit201216h00   globant - aws globant-big-data_summit2012
16h00 globant - aws globant-big-data_summit2012infolive
 
Using Big Data to Driving Big Engagement
Using Big Data to Driving Big EngagementUsing Big Data to Driving Big Engagement
Using Big Data to Driving Big EngagementAmazon Web Services
 
Turn Big Data Into Big Value On Informatica and Amazon
Turn Big Data Into Big Value On Informatica and AmazonTurn Big Data Into Big Value On Informatica and Amazon
Turn Big Data Into Big Value On Informatica and AmazonAmazon Web Services
 
Welcome and AWS Big Data Solution Overview
Welcome and AWS Big Data Solution OverviewWelcome and AWS Big Data Solution Overview
Welcome and AWS Big Data Solution OverviewAmazon Web Services
 
Explore Your Data Using Amazon QuickSight and Build Your First Machine Learni...
Explore Your Data Using Amazon QuickSight and Build Your First Machine Learni...Explore Your Data Using Amazon QuickSight and Build Your First Machine Learni...
Explore Your Data Using Amazon QuickSight and Build Your First Machine Learni...Amazon Web Services
 
Amazon big success using big data analytics
Amazon big success using big data analyticsAmazon big success using big data analytics
Amazon big success using big data analyticsKovid Academy
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesAmazon Web Services
 
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...Amazon Web Services
 
Modern Data Architectures for Business Insights at Scale
Modern Data Architectures for Business Insights at ScaleModern Data Architectures for Business Insights at Scale
Modern Data Architectures for Business Insights at ScaleAmazon Web Services
 
Cloud Computing for Data Professionals
Cloud Computing for Data ProfessionalsCloud Computing for Data Professionals
Cloud Computing for Data ProfessionalsAnkit Rathi
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarAmazon Web Services
 
Big Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS CloudBig Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS CloudAmazon Web Services
 
Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Amazon Web Services
 
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...Amazon Web Services
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924Amazon Web Services
 
Preparing Data for the Lake: Data Analytics Week SF
Preparing Data for the Lake: Data Analytics Week SFPreparing Data for the Lake: Data Analytics Week SF
Preparing Data for the Lake: Data Analytics Week SFAmazon Web Services
 

Was ist angesagt? (20)

Welcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution OverviewWelcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution Overview
 
16h00 globant - aws globant-big-data_summit2012
16h00   globant - aws globant-big-data_summit201216h00   globant - aws globant-big-data_summit2012
16h00 globant - aws globant-big-data_summit2012
 
Using Big Data to Driving Big Engagement
Using Big Data to Driving Big EngagementUsing Big Data to Driving Big Engagement
Using Big Data to Driving Big Engagement
 
Turn Big Data Into Big Value On Informatica and Amazon
Turn Big Data Into Big Value On Informatica and AmazonTurn Big Data Into Big Value On Informatica and Amazon
Turn Big Data Into Big Value On Informatica and Amazon
 
Welcome and AWS Big Data Solution Overview
Welcome and AWS Big Data Solution OverviewWelcome and AWS Big Data Solution Overview
Welcome and AWS Big Data Solution Overview
 
Explore Your Data Using Amazon QuickSight and Build Your First Machine Learni...
Explore Your Data Using Amazon QuickSight and Build Your First Machine Learni...Explore Your Data Using Amazon QuickSight and Build Your First Machine Learni...
Explore Your Data Using Amazon QuickSight and Build Your First Machine Learni...
 
Amazon big success using big data analytics
Amazon big success using big data analyticsAmazon big success using big data analytics
Amazon big success using big data analytics
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business Outcomes
 
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
 
Modern Data Architectures for Business Insights at Scale
Modern Data Architectures for Business Insights at ScaleModern Data Architectures for Business Insights at Scale
Modern Data Architectures for Business Insights at Scale
 
Cloud Computing for Data Professionals
Cloud Computing for Data ProfessionalsCloud Computing for Data Professionals
Cloud Computing for Data Professionals
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - Webinar
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
 
Big Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS CloudBig Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS Cloud
 
Deep Dive in Big Data
Deep Dive in Big DataDeep Dive in Big Data
Deep Dive in Big Data
 
Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS
 
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...
Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine ...
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
 
Preparing Data for the Lake: Data Analytics Week SF
Preparing Data for the Lake: Data Analytics Week SFPreparing Data for the Lake: Data Analytics Week SF
Preparing Data for the Lake: Data Analytics Week SF
 
2016 AWS Big Data Solution Days
2016 AWS Big Data Solution Days2016 AWS Big Data Solution Days
2016 AWS Big Data Solution Days
 

Andere mochten auch

AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)Amazon Web Services
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Amazon Web Services
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWSAmazon Web Services
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsAmazon Web Services
 
Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Amazon Web Services
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 

Andere mochten auch (10)

AWS Big Data Solution Days
AWS Big Data Solution DaysAWS Big Data Solution Days
AWS Big Data Solution Days
 
Big Data and Analytics on AWS
Big Data and Analytics on AWS Big Data and Analytics on AWS
Big Data and Analytics on AWS
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
Big Data Trends
Big Data TrendsBig Data Trends
Big Data Trends
 
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
 
Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
 

Ähnlich wie Big data on_aws in korea by abhishek sinha (lunch and learn)

AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...Amazon Web Services
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Riccardo Zamana
 
Big Data Analytics, Machine Learning e Inteligência Artificial
Big Data Analytics, Machine Learning e Inteligência ArtificialBig Data Analytics, Machine Learning e Inteligência Artificial
Big Data Analytics, Machine Learning e Inteligência ArtificialAmazon Web Services LATAM
 
Driving Business Insights with a Modern Data Architecture AWS Summit SG 2017
Driving Business Insights with a Modern Data Architecture  AWS Summit SG 2017Driving Business Insights with a Modern Data Architecture  AWS Summit SG 2017
Driving Business Insights with a Modern Data Architecture AWS Summit SG 2017Amazon Web Services
 
Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS AnalyticsFinding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS AnalyticsAmazon Web Services
 
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018Amazon Web Services
 
Emerging Prevalence of Data Streaming in Analytics and it's Business Signific...
Emerging Prevalence of Data Streaming in Analytics and it's Business Signific...Emerging Prevalence of Data Streaming in Analytics and it's Business Signific...
Emerging Prevalence of Data Streaming in Analytics and it's Business Signific...Amazon Web Services
 
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K..."Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...Provectus
 
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)Amazon Web Services
 
AWS Summit Stockholm 2014 – B4 – Business intelligence on AWS
AWS Summit Stockholm 2014 – B4 – Business intelligence on AWSAWS Summit Stockholm 2014 – B4 – Business intelligence on AWS
AWS Summit Stockholm 2014 – B4 – Business intelligence on AWSAmazon Web Services
 
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017Amazon Web Services
 
Modern Data Architectures for Business Insights at Scale
Modern Data Architectures for Business Insights at Scale Modern Data Architectures for Business Insights at Scale
Modern Data Architectures for Business Insights at Scale Amazon Web Services
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantageAmazon Web Services
 
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...Amazon Web Services
 
Value of Data Beyond Analytics by Darin Briskman
 Value of Data Beyond Analytics by Darin Briskman Value of Data Beyond Analytics by Darin Briskman
Value of Data Beyond Analytics by Darin BriskmanSameer Kenkare
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
Financial Services Analytics on AWS
Financial Services Analytics on AWSFinancial Services Analytics on AWS
Financial Services Analytics on AWSAmazon Web Services
 
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Amazon Web Services
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Amazon Web Services
 

Ähnlich wie Big data on_aws in korea by abhishek sinha (lunch and learn) (20)

AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
 
Big Data Analytics, Machine Learning e Inteligência Artificial
Big Data Analytics, Machine Learning e Inteligência ArtificialBig Data Analytics, Machine Learning e Inteligência Artificial
Big Data Analytics, Machine Learning e Inteligência Artificial
 
Driving Business Insights with a Modern Data Architecture AWS Summit SG 2017
Driving Business Insights with a Modern Data Architecture  AWS Summit SG 2017Driving Business Insights with a Modern Data Architecture  AWS Summit SG 2017
Driving Business Insights with a Modern Data Architecture AWS Summit SG 2017
 
Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS AnalyticsFinding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
 
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
Big Data on AWS - To infinity and beyond! - Tel Aviv Summit 2018
 
Emerging Prevalence of Data Streaming in Analytics and it's Business Signific...
Emerging Prevalence of Data Streaming in Analytics and it's Business Signific...Emerging Prevalence of Data Streaming in Analytics and it's Business Signific...
Emerging Prevalence of Data Streaming in Analytics and it's Business Signific...
 
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K..."Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...
 
Building your Datalake on AWS
Building your Datalake on AWSBuilding your Datalake on AWS
Building your Datalake on AWS
 
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
 
AWS Summit Stockholm 2014 – B4 – Business intelligence on AWS
AWS Summit Stockholm 2014 – B4 – Business intelligence on AWSAWS Summit Stockholm 2014 – B4 – Business intelligence on AWS
AWS Summit Stockholm 2014 – B4 – Business intelligence on AWS
 
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017
 
Modern Data Architectures for Business Insights at Scale
Modern Data Architectures for Business Insights at Scale Modern Data Architectures for Business Insights at Scale
Modern Data Architectures for Business Insights at Scale
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantage
 
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
 
Value of Data Beyond Analytics by Darin Briskman
 Value of Data Beyond Analytics by Darin Briskman Value of Data Beyond Analytics by Darin Briskman
Value of Data Beyond Analytics by Darin Briskman
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
Financial Services Analytics on AWS
Financial Services Analytics on AWSFinancial Services Analytics on AWS
Financial Services Analytics on AWS
 
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
 

Mehr von Amazon Web Services Korea

AWS Modern Infra with Storage Roadshow 2023 - Day 2
AWS Modern Infra with Storage Roadshow 2023 - Day 2AWS Modern Infra with Storage Roadshow 2023 - Day 2
AWS Modern Infra with Storage Roadshow 2023 - Day 2Amazon Web Services Korea
 
AWS Modern Infra with Storage Roadshow 2023 - Day 1
AWS Modern Infra with Storage Roadshow 2023 - Day 1AWS Modern Infra with Storage Roadshow 2023 - Day 1
AWS Modern Infra with Storage Roadshow 2023 - Day 1Amazon Web Services Korea
 
사례로 알아보는 Database Migration Service : 데이터베이스 및 데이터 이관, 통합, 분리, 분석의 도구 - 발표자: ...
사례로 알아보는 Database Migration Service : 데이터베이스 및 데이터 이관, 통합, 분리, 분석의 도구 - 발표자: ...사례로 알아보는 Database Migration Service : 데이터베이스 및 데이터 이관, 통합, 분리, 분석의 도구 - 발표자: ...
사례로 알아보는 Database Migration Service : 데이터베이스 및 데이터 이관, 통합, 분리, 분석의 도구 - 발표자: ...Amazon Web Services Korea
 
Amazon DocumentDB - Architecture 및 Best Practice (Level 200) - 발표자: 장동훈, Sr. ...
Amazon DocumentDB - Architecture 및 Best Practice (Level 200) - 발표자: 장동훈, Sr. ...Amazon DocumentDB - Architecture 및 Best Practice (Level 200) - 발표자: 장동훈, Sr. ...
Amazon DocumentDB - Architecture 및 Best Practice (Level 200) - 발표자: 장동훈, Sr. ...Amazon Web Services Korea
 
Amazon Elasticache - Fully managed, Redis & Memcached Compatible Service (Lev...
Amazon Elasticache - Fully managed, Redis & Memcached Compatible Service (Lev...Amazon Elasticache - Fully managed, Redis & Memcached Compatible Service (Lev...
Amazon Elasticache - Fully managed, Redis & Memcached Compatible Service (Lev...Amazon Web Services Korea
 
Internal Architecture of Amazon Aurora (Level 400) - 발표자: 정달영, APAC RDS Speci...
Internal Architecture of Amazon Aurora (Level 400) - 발표자: 정달영, APAC RDS Speci...Internal Architecture of Amazon Aurora (Level 400) - 발표자: 정달영, APAC RDS Speci...
Internal Architecture of Amazon Aurora (Level 400) - 발표자: 정달영, APAC RDS Speci...Amazon Web Services Korea
 
[Keynote] 슬기로운 AWS 데이터베이스 선택하기 - 발표자: 강민석, Korea Database SA Manager, WWSO, A...
[Keynote] 슬기로운 AWS 데이터베이스 선택하기 - 발표자: 강민석, Korea Database SA Manager, WWSO, A...[Keynote] 슬기로운 AWS 데이터베이스 선택하기 - 발표자: 강민석, Korea Database SA Manager, WWSO, A...
[Keynote] 슬기로운 AWS 데이터베이스 선택하기 - 발표자: 강민석, Korea Database SA Manager, WWSO, A...Amazon Web Services Korea
 
Demystify Streaming on AWS - 발표자: 이종혁, Sr Analytics Specialist, WWSO, AWS :::...
Demystify Streaming on AWS - 발표자: 이종혁, Sr Analytics Specialist, WWSO, AWS :::...Demystify Streaming on AWS - 발표자: 이종혁, Sr Analytics Specialist, WWSO, AWS :::...
Demystify Streaming on AWS - 발표자: 이종혁, Sr Analytics Specialist, WWSO, AWS :::...Amazon Web Services Korea
 
Amazon EMR - Enhancements on Cost/Performance, Serverless - 발표자: 김기영, Sr Anal...
Amazon EMR - Enhancements on Cost/Performance, Serverless - 발표자: 김기영, Sr Anal...Amazon EMR - Enhancements on Cost/Performance, Serverless - 발표자: 김기영, Sr Anal...
Amazon EMR - Enhancements on Cost/Performance, Serverless - 발표자: 김기영, Sr Anal...Amazon Web Services Korea
 
Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...
Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...
Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...Amazon Web Services Korea
 
Enabling Agility with Data Governance - 발표자: 김성연, Analytics Specialist, WWSO,...
Enabling Agility with Data Governance - 발표자: 김성연, Analytics Specialist, WWSO,...Enabling Agility with Data Governance - 발표자: 김성연, Analytics Specialist, WWSO,...
Enabling Agility with Data Governance - 발표자: 김성연, Analytics Specialist, WWSO,...Amazon Web Services Korea
 
Amazon Redshift Deep Dive - Serverless, Streaming, ML, Auto Copy (New feature...
Amazon Redshift Deep Dive - Serverless, Streaming, ML, Auto Copy (New feature...Amazon Redshift Deep Dive - Serverless, Streaming, ML, Auto Copy (New feature...
Amazon Redshift Deep Dive - Serverless, Streaming, ML, Auto Copy (New feature...Amazon Web Services Korea
 
From Insights to Action, How to build and maintain a Data Driven Organization...
From Insights to Action, How to build and maintain a Data Driven Organization...From Insights to Action, How to build and maintain a Data Driven Organization...
From Insights to Action, How to build and maintain a Data Driven Organization...Amazon Web Services Korea
 
[Keynote] Accelerating Business Outcomes with AWS Data - 발표자: Saeed Gharadagh...
[Keynote] Accelerating Business Outcomes with AWS Data - 발표자: Saeed Gharadagh...[Keynote] Accelerating Business Outcomes with AWS Data - 발표자: Saeed Gharadagh...
[Keynote] Accelerating Business Outcomes with AWS Data - 발표자: Saeed Gharadagh...Amazon Web Services Korea
 
Amazon DynamoDB - Use Cases and Cost Optimization - 발표자: 이혁, DynamoDB Special...
Amazon DynamoDB - Use Cases and Cost Optimization - 발표자: 이혁, DynamoDB Special...Amazon DynamoDB - Use Cases and Cost Optimization - 발표자: 이혁, DynamoDB Special...
Amazon DynamoDB - Use Cases and Cost Optimization - 발표자: 이혁, DynamoDB Special...Amazon Web Services Korea
 
LG전자 - Amazon Aurora 및 RDS 블루/그린 배포를 이용한 데이터베이스 업그레이드 안정성 확보 - 발표자: 이은경 책임, L...
LG전자 - Amazon Aurora 및 RDS 블루/그린 배포를 이용한 데이터베이스 업그레이드 안정성 확보 - 발표자: 이은경 책임, L...LG전자 - Amazon Aurora 및 RDS 블루/그린 배포를 이용한 데이터베이스 업그레이드 안정성 확보 - 발표자: 이은경 책임, L...
LG전자 - Amazon Aurora 및 RDS 블루/그린 배포를 이용한 데이터베이스 업그레이드 안정성 확보 - 발표자: 이은경 책임, L...Amazon Web Services Korea
 
KB국민카드 - 클라우드 기반 분석 플랫폼 혁신 여정 - 발표자: 박창용 과장, 데이터전략본부, AI혁신부, KB카드│강병억, Soluti...
KB국민카드 - 클라우드 기반 분석 플랫폼 혁신 여정 - 발표자: 박창용 과장, 데이터전략본부, AI혁신부, KB카드│강병억, Soluti...KB국민카드 - 클라우드 기반 분석 플랫폼 혁신 여정 - 발표자: 박창용 과장, 데이터전략본부, AI혁신부, KB카드│강병억, Soluti...
KB국민카드 - 클라우드 기반 분석 플랫폼 혁신 여정 - 발표자: 박창용 과장, 데이터전략본부, AI혁신부, KB카드│강병억, Soluti...Amazon Web Services Korea
 
SK Telecom - 망관리 프로젝트 TANGO의 오픈소스 데이터베이스 전환 여정 - 발표자 : 박승전, Project Manager, ...
SK Telecom - 망관리 프로젝트 TANGO의 오픈소스 데이터베이스 전환 여정 - 발표자 : 박승전, Project Manager, ...SK Telecom - 망관리 프로젝트 TANGO의 오픈소스 데이터베이스 전환 여정 - 발표자 : 박승전, Project Manager, ...
SK Telecom - 망관리 프로젝트 TANGO의 오픈소스 데이터베이스 전환 여정 - 발표자 : 박승전, Project Manager, ...Amazon Web Services Korea
 
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...Amazon Web Services Korea
 
LG 이노텍 - Amazon Redshift Serverless를 활용한 데이터 분석 플랫폼 혁신 과정 - 발표자: 유재상 선임, LG이노...
LG 이노텍 - Amazon Redshift Serverless를 활용한 데이터 분석 플랫폼 혁신 과정 - 발표자: 유재상 선임, LG이노...LG 이노텍 - Amazon Redshift Serverless를 활용한 데이터 분석 플랫폼 혁신 과정 - 발표자: 유재상 선임, LG이노...
LG 이노텍 - Amazon Redshift Serverless를 활용한 데이터 분석 플랫폼 혁신 과정 - 발표자: 유재상 선임, LG이노...Amazon Web Services Korea
 

Mehr von Amazon Web Services Korea (20)

AWS Modern Infra with Storage Roadshow 2023 - Day 2
AWS Modern Infra with Storage Roadshow 2023 - Day 2AWS Modern Infra with Storage Roadshow 2023 - Day 2
AWS Modern Infra with Storage Roadshow 2023 - Day 2
 
AWS Modern Infra with Storage Roadshow 2023 - Day 1
AWS Modern Infra with Storage Roadshow 2023 - Day 1AWS Modern Infra with Storage Roadshow 2023 - Day 1
AWS Modern Infra with Storage Roadshow 2023 - Day 1
 
사례로 알아보는 Database Migration Service : 데이터베이스 및 데이터 이관, 통합, 분리, 분석의 도구 - 발표자: ...
사례로 알아보는 Database Migration Service : 데이터베이스 및 데이터 이관, 통합, 분리, 분석의 도구 - 발표자: ...사례로 알아보는 Database Migration Service : 데이터베이스 및 데이터 이관, 통합, 분리, 분석의 도구 - 발표자: ...
사례로 알아보는 Database Migration Service : 데이터베이스 및 데이터 이관, 통합, 분리, 분석의 도구 - 발표자: ...
 
Amazon DocumentDB - Architecture 및 Best Practice (Level 200) - 발표자: 장동훈, Sr. ...
Amazon DocumentDB - Architecture 및 Best Practice (Level 200) - 발표자: 장동훈, Sr. ...Amazon DocumentDB - Architecture 및 Best Practice (Level 200) - 발표자: 장동훈, Sr. ...
Amazon DocumentDB - Architecture 및 Best Practice (Level 200) - 발표자: 장동훈, Sr. ...
 
Amazon Elasticache - Fully managed, Redis & Memcached Compatible Service (Lev...
Amazon Elasticache - Fully managed, Redis & Memcached Compatible Service (Lev...Amazon Elasticache - Fully managed, Redis & Memcached Compatible Service (Lev...
Amazon Elasticache - Fully managed, Redis & Memcached Compatible Service (Lev...
 
Internal Architecture of Amazon Aurora (Level 400) - 발표자: 정달영, APAC RDS Speci...
Internal Architecture of Amazon Aurora (Level 400) - 발표자: 정달영, APAC RDS Speci...Internal Architecture of Amazon Aurora (Level 400) - 발표자: 정달영, APAC RDS Speci...
Internal Architecture of Amazon Aurora (Level 400) - 발표자: 정달영, APAC RDS Speci...
 
[Keynote] 슬기로운 AWS 데이터베이스 선택하기 - 발표자: 강민석, Korea Database SA Manager, WWSO, A...
[Keynote] 슬기로운 AWS 데이터베이스 선택하기 - 발표자: 강민석, Korea Database SA Manager, WWSO, A...[Keynote] 슬기로운 AWS 데이터베이스 선택하기 - 발표자: 강민석, Korea Database SA Manager, WWSO, A...
[Keynote] 슬기로운 AWS 데이터베이스 선택하기 - 발표자: 강민석, Korea Database SA Manager, WWSO, A...
 
Demystify Streaming on AWS - 발표자: 이종혁, Sr Analytics Specialist, WWSO, AWS :::...
Demystify Streaming on AWS - 발표자: 이종혁, Sr Analytics Specialist, WWSO, AWS :::...Demystify Streaming on AWS - 발표자: 이종혁, Sr Analytics Specialist, WWSO, AWS :::...
Demystify Streaming on AWS - 발표자: 이종혁, Sr Analytics Specialist, WWSO, AWS :::...
 
Amazon EMR - Enhancements on Cost/Performance, Serverless - 발표자: 김기영, Sr Anal...
Amazon EMR - Enhancements on Cost/Performance, Serverless - 발표자: 김기영, Sr Anal...Amazon EMR - Enhancements on Cost/Performance, Serverless - 발표자: 김기영, Sr Anal...
Amazon EMR - Enhancements on Cost/Performance, Serverless - 발표자: 김기영, Sr Anal...
 
Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...
Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...
Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...
 
Enabling Agility with Data Governance - 발표자: 김성연, Analytics Specialist, WWSO,...
Enabling Agility with Data Governance - 발표자: 김성연, Analytics Specialist, WWSO,...Enabling Agility with Data Governance - 발표자: 김성연, Analytics Specialist, WWSO,...
Enabling Agility with Data Governance - 발표자: 김성연, Analytics Specialist, WWSO,...
 
Amazon Redshift Deep Dive - Serverless, Streaming, ML, Auto Copy (New feature...
Amazon Redshift Deep Dive - Serverless, Streaming, ML, Auto Copy (New feature...Amazon Redshift Deep Dive - Serverless, Streaming, ML, Auto Copy (New feature...
Amazon Redshift Deep Dive - Serverless, Streaming, ML, Auto Copy (New feature...
 
From Insights to Action, How to build and maintain a Data Driven Organization...
From Insights to Action, How to build and maintain a Data Driven Organization...From Insights to Action, How to build and maintain a Data Driven Organization...
From Insights to Action, How to build and maintain a Data Driven Organization...
 
[Keynote] Accelerating Business Outcomes with AWS Data - 발표자: Saeed Gharadagh...
[Keynote] Accelerating Business Outcomes with AWS Data - 발표자: Saeed Gharadagh...[Keynote] Accelerating Business Outcomes with AWS Data - 발표자: Saeed Gharadagh...
[Keynote] Accelerating Business Outcomes with AWS Data - 발표자: Saeed Gharadagh...
 
Amazon DynamoDB - Use Cases and Cost Optimization - 발표자: 이혁, DynamoDB Special...
Amazon DynamoDB - Use Cases and Cost Optimization - 발표자: 이혁, DynamoDB Special...Amazon DynamoDB - Use Cases and Cost Optimization - 발표자: 이혁, DynamoDB Special...
Amazon DynamoDB - Use Cases and Cost Optimization - 발표자: 이혁, DynamoDB Special...
 
LG전자 - Amazon Aurora 및 RDS 블루/그린 배포를 이용한 데이터베이스 업그레이드 안정성 확보 - 발표자: 이은경 책임, L...
LG전자 - Amazon Aurora 및 RDS 블루/그린 배포를 이용한 데이터베이스 업그레이드 안정성 확보 - 발표자: 이은경 책임, L...LG전자 - Amazon Aurora 및 RDS 블루/그린 배포를 이용한 데이터베이스 업그레이드 안정성 확보 - 발표자: 이은경 책임, L...
LG전자 - Amazon Aurora 및 RDS 블루/그린 배포를 이용한 데이터베이스 업그레이드 안정성 확보 - 발표자: 이은경 책임, L...
 
KB국민카드 - 클라우드 기반 분석 플랫폼 혁신 여정 - 발표자: 박창용 과장, 데이터전략본부, AI혁신부, KB카드│강병억, Soluti...
KB국민카드 - 클라우드 기반 분석 플랫폼 혁신 여정 - 발표자: 박창용 과장, 데이터전략본부, AI혁신부, KB카드│강병억, Soluti...KB국민카드 - 클라우드 기반 분석 플랫폼 혁신 여정 - 발표자: 박창용 과장, 데이터전략본부, AI혁신부, KB카드│강병억, Soluti...
KB국민카드 - 클라우드 기반 분석 플랫폼 혁신 여정 - 발표자: 박창용 과장, 데이터전략본부, AI혁신부, KB카드│강병억, Soluti...
 
SK Telecom - 망관리 프로젝트 TANGO의 오픈소스 데이터베이스 전환 여정 - 발표자 : 박승전, Project Manager, ...
SK Telecom - 망관리 프로젝트 TANGO의 오픈소스 데이터베이스 전환 여정 - 발표자 : 박승전, Project Manager, ...SK Telecom - 망관리 프로젝트 TANGO의 오픈소스 데이터베이스 전환 여정 - 발표자 : 박승전, Project Manager, ...
SK Telecom - 망관리 프로젝트 TANGO의 오픈소스 데이터베이스 전환 여정 - 발표자 : 박승전, Project Manager, ...
 
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
코리안리 - 데이터 분석 플랫폼 구축 여정, 그 시작과 과제 - 발표자: 김석기 그룹장, 데이터비즈니스센터, 메가존클라우드 ::: AWS ...
 
LG 이노텍 - Amazon Redshift Serverless를 활용한 데이터 분석 플랫폼 혁신 과정 - 발표자: 유재상 선임, LG이노...
LG 이노텍 - Amazon Redshift Serverless를 활용한 데이터 분석 플랫폼 혁신 과정 - 발표자: 유재상 선임, LG이노...LG 이노텍 - Amazon Redshift Serverless를 활용한 데이터 분석 플랫폼 혁신 과정 - 발표자: 유재상 선임, LG이노...
LG 이노텍 - Amazon Redshift Serverless를 활용한 데이터 분석 플랫폼 혁신 과정 - 발표자: 유재상 선임, LG이노...
 

Kürzlich hochgeladen

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Kürzlich hochgeladen (20)

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Big data on_aws in korea by abhishek sinha (lunch and learn)

  • 1. Big Data Analytics Abhishek Sinha Business Development Manager, AWS @abysinha sinhaar@amazon.com
  • 2.
  • 3. An engineer’s definition When your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share it
  • 4. What does big data look like ?
  • 6. Where is this data coming from ?
  • 7. Human generated Machine generated Tweet Surf the internet Buy and sell products Upload images and videos Play games Check in at restaurants Search for cafes Find deals Watch content online Look for directions Use social media
  • 8. Human generated Machine generated Networks and security devices Mobile phones Cell phone towers Smart grids Smart meters Telematics from cars Sensors on machines Videos from traffic and security cameras
  • 9. What are people using this for ?
  • 10. Big Data Verticals and Use cases Media/Advertising Targeted Advertising Image and Video Processing Oil & Gas Seismic Analysis Retail Recommendati ons Transactions Analysis Life Sciences Genome Analysis Financial Services Monte Carlo Simulations Risk Analysis Security Anti-virus Fraud Detection Image Recognition Social Network/Gaming User Demographi cs Usage analysis In-game metrics
  • 11. Why is big data hard ?
  • 12. Generation Collection & storage Analytics & computation Collaboration & sharing
  • 13. Generation Collection & storage Analytics & computation Collaboration & sharing Lower cost, higher throughput
  • 14. Generation Collection & storage Analytics & computation Collaboration & sharing Highly constrained Lower cost, higher throughput
  • 15. Big Gap in turning data into actionable information
  • 16. Amazon Web Services helps remove constraints
  • 17. Big Data + Cloud = Awesome Combination Big data: • Potentially massive datasets • Iterative, experimental style of data manipulation and analysis • Frequently not a steady-state workload; peaks and valleys • Data is a combination of structured and unstructured data in many formats AWS Cloud: • Massive, virtually unlimited capacity • Iterative, experimental style of infrastructure deployment/usage • At its most efficient with highly variable workloads • Tools for managing structured and unstructured data
  • 18. Generation Collection & storage Analytics & computation Collaboration & sharing
  • 19.
  • 20. Data size • Global reach • Native app for almost every smartphone, SMS, web, mobile-web • 10M+ users, 15M+ venues, ~1B check-ins • Terabytes of log data
  • 21. Stack ApplicationStack Scala/Liftweb API Machines WWW Machines Batch Jobs Scala Application code Mongo/Postgres/Flat Files Databases LogsDataStack Amazon S3 Database Dumps Log Files Hadoop Elastic Map Reduce Hive/Ruby/Mahout Analytics Dashboard Map Reduce Jobs mongoexport postgres dump Flume
  • 22. Stack – Front end Application ApplicationStack Scala/Liftweb API Machines WWW Machines Batch Jobs Scala Application code Mongo/Postgres/Flat Files Databases LogsDataStack Amazon S3 Database Dumps Log Files Hadoop Elastic Map Reduce Hive/Ruby/Mahout Analytics Dashboard Map Reduce Jobs mongoexport postgres dump Flume
  • 23. Stack – Collection and Storage ApplicationStack Scala/Liftweb API Machines WWW Machines Batch Jobs Scala Application code Mongo/Postgres/Flat Files Databases LogsDataStack Amazon S3 Database Dumps Log Files Hadoop Elastic Map Reduce Hive/Ruby/Mahout Analytics Dashboard Map Reduce Jobs mongoexport postgres dump Flume
  • 24. Stack – analysis and sharing ApplicationStack Scala/Liftweb API Machines WWW Machines Batch Jobs Scala Application code Mongo/Postgres/Flat Files Databases LogsDataStack Amazon S3 Database Dumps Log Files Hadoop Elastic Map Reduce Hive/Ruby/Mahout Analytics Dashboard Map Reduce Jobs mongoexport postgres dump Flume
  • 26. “Who is using our service?”
  • 27. Identified early mobile usage Invested heavily in mobile development Finding signal in the noise of logs
  • 28. 9,432,061 unique mobile devices used the Yelp mobile app. 4 million+ calls. 5 million+ directions. In January 2013
  • 30. “What kind of movies do people like ?”
  • 31. More than 25 Million Streaming Members 50 Billion Events Per Day 30 Million plays every day 2 billion hours of video in 3 months 4 million ratings per day 3 million searches Device location , time , day, week etc. Social data
  • 32.
  • 33.
  • 34. 10 TB of streaming data per day
  • 35. Data consumed in multiple ways S3 EMR Prod Cluster (EMR) Recommendati on Engine Ad-hoc Analysis Personalization
  • 37. “Who buys video games?”
  • 38. Who is Razorfish • Full service Digital Agency • Developed an Ad-Serving Platform compatible with most browsers • Clickstream analysis of data , current historical trends and segmentation of users • Segmentation is used to serve ads and cross sell • 45TB of Log data • Problems at scale – Giant Datasets – Building Infrastructure requires large continuous investment – Build for peak holiday season – Traditional Data stores are not scaling
  • 39. 3.5 billion records 13 TB of click stream logs 71 million unique cookies Per day:
  • 41. Today
  • 42. Today
  • 43.
  • 44.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49.
  • 50. This happens in 8 hours everyday
  • 51. Why AWS + EMR • Prefect Clarity of Cost • No upfront infrastructure investment • No client processing contention • Without EMR/Hadoop it takes 3 days , with EMR 8 hours – Scalability 1 node x 100 hours = 100 nodes x 1 hour • Meet SLA
  • 52. Playfish improves in-game experience for its users through data mining Challenge: Must understand player usage trends across 50M month users, multiple platforms, 10s of games, and in the face of rapid growth. This drives both in-game improvements and defines what games to target next. Solution: EMR provides Playfish the flexibility to experiment and rapidly ask new questions. All usage data is stored in S3 and analysts run ad-hoc hive queries that can slice the data by time, game, and user.
  • 53.
  • 54.
  • 55. Data Driven Game Design Data is being used to understand what gamers are doing inside the game (behavioral analysis) - What features people like (rely on data instead of forum posts) - What features are abandoned - A/B testing - Monetization – In Game Analytics
  • 56. Building a big data architecture Design Patterns
  • 57. Generation Collection & storage Analytics & computation Collaboration & sharing
  • 58. Generation Collection & storage Analytics & computation Collaboration & sharing
  • 59. Getting your Data into AWS Amazon S3 Corporate Data Center • Console Upload • FTP • AWS Import Export • S3 API • Direct Connect • Storage Gateway • 3rd Party Commercial Apps • Tsunami UDP 1
  • 60. Write directly to a data source Your application Amazon S3 DynamoDB Any other data store Amazon S3 Amazon EC2 2
  • 61. Queue , pre-process and then write to data source Amazon Simple Queue Service (SQS) Amazon S3 DynamoDB Any other data store 3
  • 62. Agency Customer: Video Analytics on AWS Elastic Load Balancer Edge Servers on EC2 Workers on EC2 Logs Reports HDFS Cluster Amazon Simple Queue Service (SQS) Amazon Simple Storage Service (S3) Amazon Elastic MapReduce
  • 63. Aggregate and write to data source Flume running on EC2 Amazon S3 Any other data store HDFS 4
  • 64. What is Flume • Collection, Aggregation of streaming Event Data – Typically used for log data, sensor data , GPS data etc • Significant advantages over ad-hoc solutions – Reliable, Scalable, Manageable, Customizable and High Performance – Declarative, Dynamic Configuration – Contextual Routing – Feature rich – Fully extensible
  • 65. Typical Aggregation Flow [Client]+  Agent [ Agent]*  Destination Flume uses a multi-tier approach where multiple agents can send data to another agent which acts as a aggregator. For each agent , data can from either an agent or a client or can be sent to another agent or a sink
  • 67. Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Choose depending upon design
  • 68. Choice of storage systems (Structure and Volume) Structure LowHigh Large Small Size S3 RDS Dynamo DB NoSQL EBS 1
  • 69. Generation Collection & storage Analytics & computation Collaboration & sharing
  • 70. Hadoop based Analysis Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR
  • 71. EMR is Hadoop in the Cloud What is Amazon Elastic MapReduce (EMR)?
  • 72. A framework Splits data into pieces Lets processing occur Gathers the results
  • 80. Hadoop is… The MapReduce computational paradigm
  • 81. Hadoop is… The MapReduce computational paradigm … implemented as an Open-source, Scalable, Fault-tolerant, Distributed System
  • 82. Person Start End Bob 00:44:48 00:45:11 Charlie 02:16:02 02:16:18 Charlie 11:16:59 11:17:17 Charlie 11:17:24 11:17:38 Bob 11:23:10 11:23:25 Alice 16:26:46 16:26:54 David 17:20:28 17:20:45 Alice 18:16:53 18:17:00 Charlie 19:33:44 19:33:59 Bob 21:13:32 21:13:43 David 22:36:22 22:36:34 Alice 23:42:01 23:42:11
  • 83. Person Start End Duration Bob 00:44:48 00:45:11 Charlie 02:16:02 02:16:18 Charlie 11:16:59 11:17:17 Charlie 11:17:24 11:17:38 Bob 11:23:10 11:23:25 Alice 16:26:46 16:26:54 David 17:20:28 17:20:45 Alice 18:16:53 18:17:00 Charlie 19:33:44 19:33:59 Bob 21:13:32 21:13:43 David 22:36:22 22:36:34 Alice 23:42:01 23:42:11
  • 84. Person Start End Duration Bob 00:44:48 00:45:11 23 Charlie 02:16:02 02:16:18 Charlie 11:16:59 11:17:17 Charlie 11:17:24 11:17:38 Bob 11:23:10 11:23:25 Alice 16:26:46 16:26:54 David 17:20:28 17:20:45 Alice 18:16:53 18:17:00 Charlie 19:33:44 19:33:59 Bob 21:13:32 21:13:43 David 22:36:22 22:36:34 Alice 23:42:01 23:42:11
  • 85. Person Start End Duration Bob 00:44:48 00:45:11 23 Charlie 02:16:02 02:16:18 16 Charlie 11:16:59 11:17:17 Charlie 11:17:24 11:17:38 Bob 11:23:10 11:23:25 Alice 16:26:46 16:26:54 David 17:20:28 17:20:45 Alice 18:16:53 18:17:00 Charlie 19:33:44 19:33:59 Bob 21:13:32 21:13:43 David 22:36:22 22:36:34 Alice 23:42:01 23:42:11
  • 86. Person Start End Duration Bob 00:44:48 00:45:11 23 Charlie 02:16:02 02:16:18 16 Charlie 11:16:59 11:17:17 18 Charlie 11:17:24 11:17:38 14 Bob 11:23:10 11:23:25 15 Alice 16:26:46 16:26:54 8 David 17:20:28 17:20:45 17 Alice 18:16:53 18:17:00 7 Charlie 19:33:44 19:33:59 15 Bob 21:13:32 21:13:43 11 David 22:36:22 22:36:34 12 Alice 23:42:01 23:42:11 10
  • 87. Person Duration Bob 23 Charlie 16 Charlie 18 Charlie 14 Bob 15 Alice 8 David 17 Alice 7 Charlie 15 Bob 11 David 12 Alice 10
  • 88. Person Duration Bob 23 Charlie 16 Charlie 18 Charlie 14 Bob 15 Alice 8 David 17 Alice 7 Charlie 15 Bob 11 David 12 Alice 10 Person Start End Bob 00:44:48 00:45:11 Charlie 02:16:02 02:16:18 Charlie 11:16:59 11:17:17 Charlie 11:17:24 11:17:38 Bob 11:23:10 11:23:25 Alice 16:26:46 16:26:54 David 17:20:28 17:20:45 Alice 18:16:53 18:17:00 Charlie 19:33:44 19:33:59 Bob 21:13:32 21:13:43 David 22:36:22 22:36:34 Alice 23:42:01 23:42:11 map
  • 89. Person Duration Bob 23 Charlie 16 Charlie 18 Charlie 14 Bob 15 Alice 8 David 17 Alice 7 Charlie 15 Bob 11 David 12 Alice 10
  • 90. Person Duration Alice 8 Alice 7 Alice 10 Bob 23 Bob 15 Bob 11 Charlie 16 Charlie 18 Charlie 14 Charlie 15 David 12 David 17
  • 91. Person Total Alice 25 Person Duration Alice 8 Alice 7 Alice 10 Bob 23 Bob 15 Bob 11 Charlie 16 Charlie 18 Charlie 14 Charlie 15 David 12 David 17
  • 92. Person Duration Alice 8 Alice 7 Alice 10 Bob 23 Bob 15 Bob 11 Charlie 16 Charlie 18 Charlie 14 Charlie 15 David 12 David 17 Person Total Bob 49 Alice 25
  • 93. Person Total Charlie 63 Bob 49 Alice 25 Person Duration Alice 8 Alice 7 Alice 10 Bob 23 Bob 15 Bob 11 Charlie 16 Charlie 18 Charlie 14 Charlie 15 David 12 David 17
  • 94. Person Total David 29 Charlie 63 Bob 49 Alice 25 Person Duration Alice 8 Alice 7 Alice 10 Bob 23 Bob 15 Bob 11 Charlie 16 Charlie 18 Charlie 14 Charlie 15 David 12 David 17
  • 95. Person Total David 29 Charlie 63 Bob 49 Alice 25
  • 96. Person Total Alice 25 Bob 49 Charlie 63 David 29 Person Duration Alice 8 Alice 7 Alice 10 Bob 23 Bob 15 Bob 11 Charlie 16 Charlie 18 Charlie 14 Charlie 15 David 12 David 17 reduce
  • 97. Person Start End Bob 00:44:48 00:45:11 Charlie 02:16:02 02:16:18 Charlie 11:16:59 11:17:17 Charlie 11:17:24 11:17:38 Bob 11:23:10 11:23:25 Alice 16:26:46 16:26:54 David 17:20:28 17:20:45 Alice 18:16:53 18:17:00 Charlie 19:33:44 19:33:59 Bob 21:13:32 21:13:43 David 22:36:22 22:36:34 Alice 23:42:01 23:42:11
  • 98. Person Duration Alice 8 Alice 7 Alice 10 Bob 23 Bob 15 Bob 11 Charlie 16 Charlie 18 Charlie 14 Charlie 15 David 12 David 17
  • 99. map reduce Works on one record. In this case it does “end time minus start time” In parallel over all the records Group together common records (e.g “Alice, Bob”) and add all the results
  • 100. Hadoop is… The MapReduce computational paradigm
  • 101. Hadoop is… The MapReduce computational paradigm … implemented as an Open-source, Scalable, Fault-tolerant, Distributed System
  • 103. distributed computing (with Hadoop) requires god-like talented engineers
  • 104. Launch a Hadoop cluster from the CLI ( elastic-mapreduce --create --alive --instance-type m1.xlarge --num-instances 5
  • 106. EMR makes it easy to use Hive and Pig Pig: • High-level programming language (Pig Latin) • Supports UDFs • Ideal for data flow/ETL Hive: • Data Warehouse for Hadoop • SQL-like query language (HiveQL)
  • 107. R: • Language and software environment for statistical computing and graphics • Open source EMR makes it easy to use other tools and applications Mahout: • Machine learning library • Supports recommendation mining, clustering, classification, and frequent itemset mining
  • 109. Launch a Hive cluster from the CLI (step 1/1) ./elastic-mapreduce --create --alive --name "Test Hive" --hadoop-version 0.20 --num-instances 5 --instance-type m1.large --hive-interactive --hive-versions 0.7.1
  • 110. SQL Interface for working with data Simple way to use Hadoop Create Table statement references data location on S3 Language called HiveQL, similar to SQL An example of a query could be: SELECT COUNT(1) FROM sometable; Requires to setup a mapping to the input data Uses SerDe:s to make different input formats queryable Powerful data types (Array & Map..)
  • 111. SQL HiveQL Updates UPDATE, INSERT, DELETE INSERT, OVERWRITE TABLE Transactions Supported Not supported Indexes Supported Not supported Latency Sub-second Minutes Functions Hundreds Dozens Multi-table inserts Not supported Supported Create table as select Not valid SQL-92 Supported
  • 112. ./elastic-mapreduce –create --name "Hive job flow” --hive-script --args s3://myawsbucket/myquery.q --args -d,INPUT=s3://myawsbucket/input,- d,OUTPUT=s3://myawsbucket/output HiveQL to execute
  • 113. ./elastic-mapreduce --create --alive --name "Hive job flow” --num-instances 5 --instance-type m1.large --hive-interactive Interactive hive session
  • 114. 114 { requestBeginTime: "19191901901", requestEndTime: "19089012890", browserCookie: "xFHJK21AS6HLASLHAS", userCookie: "ajhlasH6JASLHbas8", searchPhrase: "digital cameras" adId: "jalhdahu789asashja", impresssionId: "hjakhlasuhiouasd897asdh", referrer: "http://cooking.com/recipe?id=10231", hostname: "ec2-12-12-12-12.ec2.amazonaws.com", modelId: "asdjhklasd7812hjkasdhl", processId: "12901", threadId: "112121", timers: { requestTime: "1910121", modelLookup: "1129101" } counters: { heapSpace: "1010120912012" } }
  • 115. 115 { requestBeginTime: "19191901901", requestEndTime: "19089012890", browserCookie: "xFHJK21AS6HLASLHAS", userCookie: "ajhlasH6JASLHbas8", adId: "jalhdahu789asashja", impresssionId: hjakhlasuhiouasd897asdh", clickId: "ashda8ah8asdp1uahipsd", referrer: "http://recipes.com/", directedTo: "http://cooking.com/" }
  • 116. CREATE EXTERNAL TABLE impressions ( requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string ) PARTITIONED BY (dt string) ROW FORMAT serde 'com.amazon.elasticmapreduce.JsonSerde' with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' ) LOCATION ‘s3://mybucketsource/tables/impressions' ;
  • 117. CREATE EXTERNAL TABLE impressions ( requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string ) PARTITIONED BY (dt string) ROW FORMAT serde 'com.amazon.elasticmapreduce.JsonSerde' with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' ) LOCATION ‘s3://mybucketsource/tables/impressions' ; Table structure to create (happens fast as just mapping to source)
  • 118. CREATE EXTERNAL TABLE impressions ( requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string ) PARTITIONED BY (dt string) ROW FORMAT serde 'com.amazon.elasticmapreduce.JsonSerde' with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' ) LOCATION ‘s3://mybucketsource/tables/impressions' ; Source data in S3
  • 119. Hadoop lowers the cost of developing a distributed system.
  • 120. hive> select * from impressions limit 5; Selecting from source data directly via Hadoop
  • 121. What about the cost of operating a distributed system?
  • 122. November traffic at amazon.com
  • 123. November traffic at amazon.com
  • 124. November traffic at amazon.com 76% 24%
  • 126. EMR is Hadoop in the Cloud What is Amazon Elastic MapReduce (EMR)?
  • 127.
  • 128. 1 instance x 100 hours = 100 instances x 1 hour
  • 129. EMR Cluster S3 Put the data into S3 Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc. Get the output from S3 Launch the cluster using the EMR console, CLI, SDK, or APIs You can also store everything in HDFS How does EMR work ?
  • 130. S3 What can you run on EMR… EMR Cluster
  • 131. Resize Nodes EMR Cluster You can easily add and remove nodes
  • 132. On and Off Fast Growth Predictable peaksVariable peaks WASTE
  • 133. Fast GrowthOn and Off Predictable peaksVariable peaks
  • 134. Your choice of tools on Hadoop/EMR Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR
  • 135. SQL based processing Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Pre-processing framework Petabyte scale Columnar Data - warehouse
  • 136. Massively Parallel Columnar Datawarehouses • Columnar Data stores • MPP – Parallel Ingest – Parallel Query – Scale Out – Parallel Backup
  • 137. Columnar data stores • Data alignment and block size in row stores vs. column stores • Compression based on each column
  • 138. MPP Data warehouse parallelizes and distributes everything • Query • Load • Backup • Restore • Resize 10 GigE (HPC) Ingestion Backup Restore JDBC/ODBC
  • 139. But Data-warehouses are • Hard to manage • Very expensive • Difficult to scale • Difficult to get performance
  • 140. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud
  • 141. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud Parallelize and Distribute Everything Dramatically Reduce I/O MPP Load Query Resize Backup Restore
  • 142. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud Parallelize and Distribute Everything Dramatically Reduce I/O MPP Load Query Resize Backup Restore Direct-attached storage Large data block sizes Column data store Data compression Zone maps
  • 143. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud Protect Operations Simplify Provisioning Redshift data is encrypted Continuously backed up to S3 Automatic node recovery Transparent disk failure
  • 144. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud Protect Operations Simplify Provisioning Redshift data is encrypted Continuously backed up to S3 Automatic node recovery Transparent disk failure Create a cluster in minutes Automatic OS and software patching Scale up to 1.6PB with a few clicks and no downtime
  • 145. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud Start Small and Grow Big Extra Large Node (XL) 3 spindles, 2TB, 15GiB RAM 2 virtual cores, 10GigE 1 node (2TB)  2-32 node cluster (64TB) 8 Extra Large Node (8XL) 24 spindles, 16TB, 120GiB RAM 16 virtual cores, 10GigE 2-100 node cluster (1.6PB)
  • 146. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud Easy to provision and scale No upfront costs, pay as you go High performance at a low price Open and flexible with support for popular BI tools
  • 147. Amazon Redshift is priced to let you analyze all your data Price Per Hour for HS1.XL Single Node Effective Hourly Price Per TB Effective Annual Price per TB On-Demand $ 0.850 $ 0.425 $ 3,723 1 Year Reservation $ 0.500 $ 0.250 $ 2,190 3 Year Reservation $ 0.228 $ 0.114 $ 999 Simple Pricing Number of Nodes x Cost per Hour No charge for Leader Node No upfront costs Pay as you go
  • 148. Your choice of BI Tools on the cloud Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Pre-processing framework
  • 149. Generation Collection & storage Analytics & computation Collaboration & sharing
  • 150. Collaboration and Sharing insights Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift
  • 151. Sharing results and visualizations Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Web App Server Visualization tools
  • 152. Sharing results and visualizations and scale Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Web App Server Visualization tools
  • 153. Sharing results and visualizations Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Business Intelligence Tools Business Intelligence Tools
  • 154. Geospatial Visualizations Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Business Intelligence Tools Business Intelligence Tools GIS tools on hadoop GIS tools Visualization tools
  • 155. Rinse Repeat every day or hour
  • 156. Rinse and Repeat Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Visualization tools Business Intelligence Tools Business Intelligence Tools GIS tools on hadoop GIS tools Amazon data pipeline
  • 157. The complete architecture Amazon SQS Amazon S3 DynamoDB Any SQL or NO SQL Store Log Aggregation tools Amazon EMR Amazon Redshift Visualization tools Business Intelligence Tools Business Intelligence Tools GIS tools on hadoop GIS tools Amazon data pipeline
  • 158. How do you start ?
  • 159. Where do you start ? • Where is your data ? (S3, SQL, NoSQL ?) – Are you collecting all your data ? – What is the format (structured or unstructured) – How much is this data going to grow ? • How do you want to process it ? – SQL (HIVE), Scripts (Python/Ruby/Node.JS) On Hadoop ? • How do you want to use this data – Visualization tools • Do you yourself or engage an AWS partner • Write to me sinhaar@amazon.com