Data warehousing costs have continually risen with the explosion of Big Data. To explore the most cost-effective data warehousing techniques, learn from cloud experts at Amazon and Informatica.
Learn more: http://www.informaticacloud.com/amazon-redshift
Amazon Redshift is a petabyte-scale cloud-based data warehouse that lets you provision multiple database nodes on demand and offload raw data from on-premises databases for more cost-effective data warehousing. Getting this data into Redshift is easy with Informatica Cloud. In this interactive webinar, you’ll learn:
-How Amazon Redshift is changing the economics of data warehousing
-Why Big Data integration and management is a strategic imperative within enterprises
-How cloud integration makes cloud data warehousing even more cost effective
At Informatica, our goal is to unlock your information potential. Join us with featured guest speakers from Amazon for this interactive webinar.
3. Informatica: The Information Management Leader
B2B Data Exchange
Informatica supports the requirements of cross-organizational data exchange, so users can apply familiar and trusted data integration tools and techniques to the growing practice of B2B data integration.
Cloud Data Integration | Enterprise Data Integration
Complex Event Processing
Informatica received high praise from customers for its services. For deployments involving systems-monitoring use cases, Informatica offers a five-day stand-up of RulePoint.
Ultra Messaging
In spite of the new entrants, Informatica remains the market leader in this highly demanding part of the messaging market.
Data Quality | Master Data Management | Application ILM
4. Informatica Cloud: our fastest growing product line
Today’s Focus: Cloud Data Integration
5. Informatica Cloud and Amazon Redshift: Enabling cost-effective data warehousing
• Redshift Connector pre-release announced in February
• General availability this month (August)
InformaticaCloud.com/Amazon-Redshift
7. AWS Database Services
• Amazon RDS: Fully managed SQL database service for OLTP workloads
• Amazon DynamoDB: Fully managed NoSQL service for massively scalable, high-throughput, low-latency workloads
• Amazon Redshift: Fully managed, fast and powerful, petabyte-scale data warehouse service
• Amazon ElastiCache: Fully managed Memcached-compliant in-memory caching service
8. We set out to build…
A fast and powerful, petabyte-scale data warehouse that is:
A Lot Faster
A Lot Cheaper
A Lot Simpler
Amazon Redshift
9. Data warehousing done the AWS way
• Pay as you go, no up front costs
• Fast, cheap, easy to use
• SQL
• Easy to provision
10. Common Customer Use Cases
Traditional Enterprise DW:
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business; provision in minutes
Companies with Big Data:
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools
SaaS Companies:
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude
11. Progress Since Launch on Feb 14, 2013
• Fastest growing service in AWS history
• Well over 1,000 customers; adding over 100 per week
• Obtained SOC1 & SOC2 certification with more in progress
• Deployed in US East (N. Virginia), US West (Oregon), EU (Ireland), and Asia Pacific (Tokyo)
• Additional global regions coming soon
12. Amazon Redshift Customers
• Airbnb: 5x – 20x reduction in query times; 4x cost reduction over HIVE
• Accordant Media: 20x – 40x reduction in query times
• Meteor Entertainment: queries across millions of rows running in under 10 seconds
• Nokia: 50% reduction in costs, 2x improvement in query times
13. Amazon Redshift Customer: bit.ly
“When we want to answer a question with Redshift, we just write a SQL query and get an answer within a few minutes – if not seconds.”
- Sean O’Connor, Engineer at bit.ly
bit.ly provides social link-sharing analytics, managing over 300 million shortens and 5 billion clicks each month.
14. Amazon Redshift Customer: HasOffers
“Amazon Redshift introduces a major opportunity to improve the performance of our real-time reporting, allowing us to run queries up to 50 times faster than our current OLAP solution.”
- Niek Sanders, VP of Engineering, HasOffers
HasOffers records and reports billions of desktop and mobile interactions for performance marketers.
15. Amazon Redshift Customer: Infor
“This is the formula for fast and broad adoption, where customers can get consistent, accurate, and useful data fast - in weeks, not months or years.”
- Ali Shadman, SVP, Business Cloud & Upgrades, Infor
Infor is the world’s third largest ERP vendor, serving over 70,000 customers in 194 countries.
16. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375
• With row storage you do unnecessary I/O
• To get the total amount, you have to read everything
17. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
• With column storage, you only read the data you need
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375
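The row-versus-column comparison above can be sketched in a few lines of Python. This is an illustration of the principle using the slide's four-row table, not Redshift's actual storage engine: to answer SELECT SUM(Amount), a row store must scan every field of every row, while a column store touches only the one column it needs.

```python
# Illustrative sketch (not Redshift internals): compare bytes scanned
# to answer SELECT SUM(Amount) under row vs. column layout.
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]

# Row storage: every field of every row is read to reach "Amount".
row_bytes = sum(len(str(field)) for row in rows for field in row)

# Column storage: only the "Amount" column is read.
amounts = [row[3] for row in rows]
col_bytes = sum(len(str(a)) for a in amounts)

total = sum(amounts)
print(total, row_bytes, col_bytes)  # 1250 40 12
```

Even on four tiny rows the column scan touches less than a third of the bytes; on wide tables with billions of rows the gap is far larger.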
18. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
• Columnar compression saves space and reduces I/O
• Amazon Redshift analyzes and compresses your data
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
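The `analyze compression` output above suggests delta encoding for `listid`. A minimal sketch of the idea behind delta encoding (an illustration only, not Redshift's implementation): store the first value plus successive differences, which stay small for monotonically increasing IDs and therefore compress well.

```python
# Illustrative sketch of delta encoding, one of the encodings
# Redshift's ANALYZE COMPRESSION can recommend for a column.
def delta_encode(values):
    """Store the first value, then each value's difference from the last."""
    if not values:
        return []
    out = [values[0]]
    out.extend(b - a for a, b in zip(values, values[1:]))
    return out

def delta_decode(encoded):
    """Rebuild the original column by accumulating the deltas."""
    vals, running = [], 0
    for d in encoded:
        running += d
        vals.append(running)
    return vals

listids = [1000, 1001, 1002, 1005, 1006]   # hypothetical listid values
encoded = delta_encode(listids)
print(encoded)  # [1000, 1, 1, 3, 1]
```

The deltas fit in far fewer bits than the raw values, which is why sorted or sequential columns are good candidates for this encoding.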
19. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
• Keep track of the minimum and maximum value for each block
• Skip over blocks that don’t contain the data needed for a given query
• Minimize unnecessary I/O
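The zone-map idea above can be sketched as follows. This is an illustration only: the block size and column values are made up, and real Redshift blocks are far larger, but the mechanism is the same: consult per-block min/max before reading a block, and skip blocks that cannot match.

```python
# Illustrative sketch of zone maps: per-block min/max values let a
# scan skip blocks that cannot contain the target value.
BLOCK_ROWS = 4   # hypothetical tiny block size for the example

def build_zone_map(column):
    """Split a column into blocks and record (min, max, rows) per block."""
    blocks = [column[i:i + BLOCK_ROWS] for i in range(0, len(column), BLOCK_ROWS)]
    return [(min(b), max(b), b) for b in blocks]

def scan_equals(zone_map, target):
    """Return matching values, counting how many blocks were actually read."""
    blocks_read, hits = 0, []
    for lo, hi, block in zone_map:
        if lo <= target <= hi:          # otherwise the block is skipped entirely
            blocks_read += 1
            hits.extend(v for v in block if v == target)
    return hits, blocks_read

dateids = [110, 112, 115, 118, 203, 204, 208, 209, 301, 303, 307, 309]
zm = build_zone_map(dateids)
hits, read = scan_equals(zm, 204)
print(hits, read)  # [204] 1  -> two of three blocks skipped
```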
20. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
• Use direct-attached storage to maximize throughput
• Hardware optimized for high-performance data processing
• Large block sizes to make the most of each read
• Amazon Redshift manages durability for you
21. Amazon Redshift architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon DynamoDB
• Single node version available
[Diagram: JDBC/ODBC clients connect to the leader node; compute nodes communicate over a 10 GigE (HPC) network; ingestion, backup, and restore go through Amazon S3]
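Because the leader node is a standard SQL endpoint, ordinary PostgreSQL-compatible JDBC/ODBC/libpq drivers can connect to it. A minimal sketch of building a libpq-style connection string (the cluster endpoint and credentials below are hypothetical placeholders; 5439 is Redshift's default port):

```python
# Hedged sketch: the leader node speaks the PostgreSQL wire protocol,
# so any libpq-compatible driver can consume this connection string.
def redshift_dsn(host, dbname, user, password, port=5439):
    """Build a keyword/value connection string for a Redshift cluster."""
    return (f"host={host} port={port} dbname={dbname} "
            f"user={user} password={password} sslmode=require")

dsn = redshift_dsn(
    "examplecluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
    dbname="dev", user="masteruser", password="example-password",
)
print(dsn)
```

The resulting string can be passed to, for example, `psycopg2.connect(dsn)`; `sslmode=require` matches the SSL-in-transit point on the security slide below.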
22. Amazon Redshift runs on optimized hardware
HS1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed user storage, 2 GB/sec scan rate
HS1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed user storage
• Optimized for I/O intensive workloads
• High disk density
• Runs in HPC - fast network
• HS1.8XL available on Amazon EC2
23. Amazon Redshift lets you start small and grow big
Extra Large Node (HS1.XL): 3 spindles, 2 TB, 16 GB RAM, 2 cores
• Single Node (2 TB)
• Cluster of 2–32 nodes (4 TB – 64 TB)
Eight Extra Large Node (HS1.8XL): 24 spindles, 16 TB, 128 GB RAM, 16 cores, 10 GigE
• Cluster of 2–100 nodes (32 TB – 1.6 PB)
Note: nodes not to scale
24. Amazon Redshift is priced to let you analyze all your data
Simple Pricing: Number of Nodes x Cost per Hour
• No charge for the Leader Node
• No upfront costs
• Pay as you go

                     Price per Hour         Effective Hourly   Effective Annual
                     (HS1.XL Single Node)   Price per TB       Price per TB
On-Demand            $0.850                 $0.425             $3,723
1 Year Reservation   $0.500                 $0.250             $2,190
3 Year Reservation   $0.228                 $0.114             $999
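The table's two derived columns follow from simple arithmetic: an HS1.XL node holds 2 TB, so dividing the hourly node price by 2 gives the effective hourly price per TB, and multiplying that by the 8,760 hours in a year gives the annual figure. A quick check:

```python
# Sanity check of the pricing table's arithmetic.
TB_PER_XL_NODE = 2          # an HS1.XL node holds 2 TB of compressed storage
HOURS_PER_YEAR = 24 * 365   # 8,760

def effective_prices(node_price_per_hour):
    """Return (hourly $/TB, annual $/TB) for a single HS1.XL node."""
    per_tb_hour = node_price_per_hour / TB_PER_XL_NODE
    per_tb_year = per_tb_hour * HOURS_PER_YEAR
    return round(per_tb_hour, 3), round(per_tb_year)

print(effective_prices(0.850))  # (0.425, 3723)
print(effective_prices(0.500))  # (0.25, 2190)
print(effective_prices(0.228))  # (0.114, 999)
```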
25. Amazon Redshift is easy to use
• Provision in minutes
• Monitor query performance
• Point and click resize
• Built in security
• Automatic backups
Slides not intended for redistribution.
26. Amazon Redshift has security built-in
• SSL to secure data in transit
• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 encrypted
• No direct access to compute nodes
• Amazon VPC support
[Diagram: the cluster runs inside an internal security group within the customer VPC; clients connect via JDBC/ODBC; 10 GigE (HPC) network; ingestion, backup, and restore paths]
27. Amazon Redshift continuously backs up your data and recovers from failures
• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
• Backups to Amazon S3 are continuous, automatic, and incremental
– Designed for eleven nines of durability
• Continuous monitoring and automated recovery from failures of drives and nodes
• Able to restore snapshots to any Availability Zone within a region
28. Amazon Redshift works with your existing analysis tools
[Diagram: analysis and BI tool logos connecting to Amazon Redshift via JDBC/ODBC]
More coming soon…
29. Amazon Redshift integrates with multiple data sources
• Amazon Elastic MapReduce
• Amazon DynamoDB
• Amazon Elastic Compute Cloud (EC2)
• AWS Storage Gateway
• Amazon Simple Storage Service (S3)
• Corporate Data Center
• Amazon Relational Database Service (RDS)
34. Best practices to remember…
• The Amazon S3 bucket that holds the data files must be created in the same region as your cluster
• Files are deleted from the Amazon S3 bucket when the upload is complete
• Choose a batch size where the number of batches matches the number of slices in your cluster
– Each XL node has 2 slices; each 8XL node has 16
– If you have a 2-node XL cluster and 40,000 rows of data, choose a batch size of 10,000
• The Informatica Cloud Redshift connector can maximize Amazon’s parallel processing capabilities this way
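The batch-size rule above is simple division: total rows over total slices. A small sketch (per-node slice counts taken from the slide):

```python
# Sketch of the sizing rule: pick a batch size so the number of
# batches equals the number of slices in the cluster.
import math

SLICES_PER_NODE = {"XL": 2, "8XL": 16}   # from the slide

def batch_size(total_rows, node_type, node_count):
    """Rows per batch so that one batch lands on each slice."""
    slices = SLICES_PER_NODE[node_type] * node_count
    return math.ceil(total_rows / slices)

# The slide's example: a 2-node XL cluster loading 40,000 rows.
print(batch_size(40_000, "XL", 2))   # 10000
```

With one batch per slice, every slice does an equal share of the load in parallel instead of some slices sitting idle.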
35. Informatica Cloud Amazon Redshift demonstration
[Diagram: the Informatica Cloud Secure Agent runs behind the firewall; Informatica Cloud stores the metadata mappings]
1. Authenticate and retrieve the Data Synchronization Task
2. Retrieve Account Data
3. Perform lookup on SLA level
4. Put Account Data and SLA Level into a flat file
5. Transfer the compressed flat file to Amazon S3
6. Initiate load from Amazon S3
7. Load data into Amazon Redshift
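Steps 6–7 above correspond to issuing a Redshift COPY command against S3, which loads the compressed file in parallel across slices. A hedged sketch of constructing such a command (the table name, bucket path, and credentials are hypothetical placeholders, and in practice the Informatica Cloud connector generates this for you):

```python
# Hedged sketch: build the Redshift COPY statement that pulls a
# gzipped, comma-delimited flat file from Amazon S3.
def build_copy_command(table, s3_path, access_key, secret_key):
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"CREDENTIALS 'aws_access_key_id={access_key};"
        f"aws_secret_access_key={secret_key}' "
        "GZIP DELIMITER ',';"
    )

# All values below are hypothetical placeholders.
sql = build_copy_command(
    "account_sla", "s3://example-bucket/account_sla.csv.gz",
    "AKIAEXAMPLE", "secretexample",
)
print(sql)
```

The statement would then be executed through the same JDBC/ODBC endpoint as any other SQL; COPY is the parallel-load path, which is why staging through S3 beats row-by-row inserts.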
36. PowerCenter Mappings and Informatica Cloud
• If you want to reuse your existing PowerCenter mappings with Informatica Cloud and Redshift, you have two options:
1. Use the PowerCenter Repository Manager to export your existing workflows and import them into Informatica Cloud using the PowerCenter Tasks feature.
Or…
2. Keep your existing mappings in PowerCenter and stage the data. Create a DSS task in Informatica Cloud to move the data from the staging area to Redshift; this task can be managed from PowerCenter.
38. Next Steps
• Get started with Amazon Redshift
• Get started with Informatica Cloud
• InformaticaCloud.com
• Learn more about our Redshift Connector
• InformaticaCloud.com/Amazon-Redshift
Speaker notes:
• Announced Redshift: provision multiple database nodes on demand; start large petabyte-scale data warehousing projects sooner; offload raw data from on-premises databases for cost-effective processing.
• Use cases: use Amazon Redshift for easy scalability; migrate completely from existing DW systems to Amazon Redshift; analyze data that was previously too expensive to put into a DW; deploy Redshift because provisioning existing DW systems takes months; replace HIVE with Amazon Redshift to save money.
• Encryption enhancements.
• Customers: Airbnb, 5x – 20x reduction in query times and 4x reduction in cost over HIVE; Accordant Media, 20x – 40x reduction in query times; Meteor Entertainment, queries across millions of rows running in under 10 seconds; Nokia, 50% reduction in costs and 2x improvement in query times.
• bit.ly: queries across billions of rows running in under 1 minute.
• Infor: using Amazon Redshift to power its upcoming SkyVault product; fully managed by Infor to enable customers to run business analytics; chose Redshift for performance, cost, ease of use, and scalability.
• Slides 16–20: read only the data you need.
Informatica Cloud is powered by Vibe, the same technology that powers the virtual data machine behind the Secure Agent. You use Informatica Cloud to store the various metadata mappings, and at run time the data moves directly from source to target through the Vibe-powered Secure Agent.
Vibe is the industry’s first and only embeddable virtual data machine for accessing, aggregating, and managing data, regardless of data type, source, volume, compute platform, or user. It lets you map once and deploy anywhere: you can take logic that was defined on-premises, move it to the cloud, then move it to Hadoop or embed it in an application, all without recoding. This makes your architecture faster, more flexible, and future-proof.
Business benefits: five times faster turnaround from business idea to solution; adapt the technology to your business, not vice versa; utilize all your data, regardless of location, type, or volume.
IT benefits: five times faster project delivery; eliminate skills gaps when adopting new technologies and approaches; reduce the cost of maintaining a complex assortment of technologies.