The data warehouse is likely your largest CAPEX and OPEX line item -- and if you haven't checked your warehouse capacity & utilization, it's likely running low.
Thanks to Big Data & the advent of Hadoop, it no longer makes economic sense to process bulk data transformations (often called ELT -- Extract, Load & Transform) using data warehouse compute.
Join others who have already offloaded storage & processing from Teradata, Oracle, Netezza & DB2 onto Hadoop to save millions by avoiding upgrades!
Offloading makes your data warehouse run faster for critical end-user queries & frees up storage for Big Data -- but how do you make the jump? What transformations are costing you the most? What data in your warehouse are you not using?
Learn how you can:
Find dormant data. Up to 50% of the data in your data warehouse and data marts is never queried by business users -- but you need the right tools to find it.
Identify transformations to offload. Quickly find out which ELT transformations you should shift to Hadoop.
Manage data movement & processing to Hadoop. Easily collect, process & distribute data in Hadoop with an intuitive graphical user interface. No coding or scripting required.
Deliver faster Hadoop performance per node. Find out how capabilities in the Apache core can help you accelerate batch Hadoop processing by up to 30% on existing hardware with no code changes, & without risk.
Offload the Data Warehouse in the Age of Hadoop
1. Santosh Chitakki, Vice President, Products at Appfluent
schitakki@appfluent.com
Steve Totman, Director of Strategy at Syncsort
@steventotman, stotman@syncsort.com
Presentation + Demo
2. The Data Warehouse Vision: A Single Version of The Truth
[Diagram: sources (Oracle, File, XML, ERP, Mainframe, Real-Time) flow through ETL into the Enterprise Data Warehouse, and through ETL again into Data Marts.]
2
3. The Data Warehouse Reality:
• Small sample of structured data, somewhat available
• Takes months to make any changes/additions
• Costs millions every year
[Diagram: the same source-to-warehouse-to-marts architecture, now strained by ELT, new reports, new columns, granular history, dead data, and SLAs.]
3
4. ELT Processing Is Driving Exponential Database Costs
The True Cost of ELT
[Chart: queries (analytics) and transformations (ELT) competing for capacity as costs ($$$) climb.]
• Manual coding/scripting costs
• Ongoing manual tuning costs
• Higher storage costs
• Hurts query performance
• Hinders business agility
And what if…?
• The batch window is delayed or needs to be re-run?
• Demands increase, causing more overlap between queries & the batch window?
• A critical business requirement results in longer/heavier queries?
4
5. Dormant Data Makes the Problem Even Worse
[Diagram: hot, warm, and cold data tiers, with transformations (ELT) of unused data and storage capacity consumed by dormant data.]
• The majority of data in the data warehouse is unused/dormant
• ETL/ELT processes for unused data unnecessarily consume CPU capacity
• Dormant data consumes unnecessary storage capacity
• Eliminate batch loads that are not needed
• Load and store unused data in Hadoop for active archival
5
6. The Impact of ELT & Dormant Data
Missing SLAs
• Slow response times
• With 40-60% of capacity used for ELT, fewer resources and less storage are available for end-user reports
Data Retention Windows
• Only the freshest data is stored “on-line”
• Historical data is archived (retention as low as 3 months)
• Granularity is lost: Hot / Warm / Cold / Dead
Lack of Agility
• 6 months (average) to add a new data source/column & generate a new report
• Best resources spent on SQL tuning, not new SQL creation
Constant Upgrades
• Data volume growth absorbs all resources just to keep existing analysis running / perform upgrades
• Exploration of data becomes a wish-list item
6
7. Offloading The Data Warehouse to Hadoop
[Diagram: Before – data sources feed the data warehouse via ETL, with heavy ETL/ELT inside the warehouse serving business intelligence and analytic query & reporting. After – the ETL/ELT moves to Hadoop, which feeds the data warehouse and analytic query & reporting directly.]
7
8. 20% of ETL Jobs Can Consume up to 80% of Resources
ETL is “T” intensive
– Sort, Join, Merge, Aggregate, Partition
Mappings start simple
– Performance demands add complexity
– Business logic gets “distributed”
“Spaghetti” architecture
– Impossible to govern
– Prohibitively expensive to maintain
For high impact, start with the greatest pain: focus on the 20%
8
9. The Opportunity
Transform the economics of data
Cost of managing 1TB of data: $15,000 – $80,000 on the EDW vs. $2,000 – $6,000 on Hadoop
But there’s more…
• Scalability for longer data retention
• Performance SLAs
• Business agility
10. Why Appfluent?
Appfluent transforms the economics of Big Data and Hadoop. We are the only company that can completely analyze how data is used to reduce costs and optimize performance.
12. Why Syncsort?
For 40 years we have been helping companies solve their big data issues… even before they knew the name Big Data!
• Speed leader in Big Data processing
• Fastest sort technology in the market
• Powering 50% of mainframes’ sort
• First-to-market, fully integrated approach to Hadoop ETL
• A history of innovation: 25+ issued & pending patents
• Large global customer base: 15,000+ deployments in 68 countries
Our customers are achieving the impossible, every day!
Key Partners
12
13. Syncsort DMX-h – Enabling the Enterprise Data Hub
Blazing Performance. Iron Security. Disruptive Economics.
• Access – One tool to access all your data, even mainframe
• Offload – Migrate complex ELT workloads to Hadoop without coding
• Accelerate – Seamlessly optimize new & existing batch workloads in Hadoop
PLUS…
• Smarter Architecture – ETL engine runs natively within MapReduce
• Smarter Productivity – Use Case Accelerators for common ETL tasks
• Smarter Security – Enterprise-grade security
13
14. How to Offload Workload & Data
1.
• Identify costly transformations
• Identify dormant data
2.
• Rewrite transformations in DMX-h
• Identify performance opportunities
• Move dormant-data ELT to Hadoop
3.
• Run the costliest transformations
• Store and manage dormant data
4.
• Repeat regularly for maximum results
15. 1. Identify
[Diagram: expensive transformations, unused data, cold historical data, costly end-user activity.]
• Identify expensive transformations, such as ELT, to offload to Hadoop
• Identify unused tables to find the useless transformations loading them; move to Hadoop or purge
• Identify unused historical data (by the date functions used) and move the loading & data to Hadoop
• Discover costly end-user activity and redirect workloads to Hadoop
16. Costly End-User Activity
Find relevant resource-consuming end-user workloads and offload data-sets and activity to Hadoop.
Example: identify SAS data extracts (i.e. SAS queries with no WHERE clause)
• SAS data extracts identified consuming 300 hours of server time
• Identify data sets associated with the extracts; replicate the identified data in Hadoop and offload the associated SAS workload
16
17. Expensive Transformations
Identify expensive transformations such as ELT to offload to Hadoop.
• ELT process consuming 65% of CPU time and 66% of I/O
• Drill into the process to identify expensive transformations to offload
18. Unused Data
Identify unused tables to move to Hadoop and offload batch loads for unused data into Hadoop.
• 87% of tables unused
• Largest unused table: 2 billion records
• Unused columns within tables
19. 2. Access & Move Virtually Any Data
One tool to quickly and securely move all your data, big or small. No coding, no scripting.
Connect to any source & target
• RDBMS • Mainframe • Files • Cloud • Appliances • XML
Extract & load to/from Hadoop
• Extract data & load into the cluster natively from Hadoop, or execute “off-cluster” on the ETL server
• Load data warehouses directly from Hadoop; no need for temporary landing areas
PLUS… Mainframe connectivity
• Directly read mainframe data
• Parse & translate
• Load into HDFS
Pre-process & compress
• Cleanse, validate, and partition for parallel loading
• Compress for storage savings
19
20. 3. Offload Heavy Transformations to Hadoop
Easily replicate & optimize existing workloads in Hadoop. No coding. No scripting.
• Develop MapReduce ETL processes without writing code
• Leverage existing ETL skills
• Develop and test locally in Windows; deploy in Hadoop
• Use Case Accelerators to fast-track development (Sort, Join, Aggregate, Copy, Merge)
• File-based metadata: create once, reuse many times!
• Development accelerators for CDC and other common data flows
20
22. Appfluent Offload Success
Large Financial Organization
Situation
• IBM DB2 Enterprise Data Warehouse (EDW) growing too quickly
• DB2 EDW upgrade/expansion too expensive
• Found cost per terabyte of Hadoop is 5x less than DB2 (fully burdened)
Solution
• Created business program called ‘Data Warehouse Modernization’
• Deployed Cloudera to extend EDW capacity
• Used Appfluent to find migration candidates to move to Hadoop
Benefits
• Capped the DB2 EDW at 200TB capacity and has not expanded it since
• Saved $MM that would have been spent on additional DB2
• Positioned to handle faster rates of data growth in the future
23. Offloading the EDW at a Leading Financial Organization
• Offload ELT processing from Teradata into CDH using DMX-h
• Implement a flexible architecture for staging and change data capture
• Ability to pull data directly from the mainframe
• No coding; easier to maintain & reuse
• Enable developers with a broader set of skills to build complex ETL workflows
[Charts: elapsed time of 360 min with HiveQL vs. 15 min with DMX-h (24x faster); development effort of 12 man-weeks with HiveQL vs. 4 with DMX-h.]
Impact on the Loans Application project:
• Cut development time to one-third (from 12 man-weeks to 4)
• Reduced complexity: from 140 HiveQL scripts to 12 DMX-h graphical jobs
• Eliminated the need for Java user-defined functions
23
24. Three Quick Takeaways
1. ELT and dormant data are driving data warehouse cost and capacity constraints
2. Offloading heavy transformations and “cold” data to Hadoop provides fast savings at minimum risk
3. Follow these 3 steps:
a. Identify dormant data and pinpoint heavy ELT workloads; focus on the top 20%
b. Access and move data to Hadoop
c. Deploy new workloads in Hadoop
24
25. The Data Warehouse Vision: A Single Version of The Truth
[Diagram: the same vision architecture as slide 2 – sources flowing through ETL into the Enterprise Data Warehouse and out to Data Marts.]
25
26. Next Steps
Sign up for a Data Warehouse Offload assessment!
http://bit.ly/DW-assessment
Our experts will help you:
• Collect critical information about your EDW environment
• Identify migration candidates & determine feasibility
• Develop an offload plan & establish a business case
26
Jennifer to address this slide: announce the session and introduce the speakers. Instruct on Q&A format.
Back when I started my career in data warehousing in the ’90s, this is what the business was promised. An enterprise data warehouse would bring together data from every different source system across an organization to create a single, trusted source of information. Data would be extracted, transformed, and loaded into the warehouse using ETL tools. These would be used instead of hand-coded SQL, COBOL, or other scripts because they would provide a graphical user interface that allowed anyone to develop flows with no need for rocket scientists; scalability to handle the growing data volumes; metadata to enable re-use and sharing; and connectivity to the different sources and targets. ETL would then be used to move data from the EDW to marts and deliver it to reporting tools.
So here’s the reality of data warehouses today. As one customer recently described it to me, their data warehouse has become like a huge oil tanker: slow moving and incredibly difficult to change direction. Because of data volume growth, the majority of ETL tools, commercial and open source, were unable to handle the processing within the batch windows. As a result, the only engines capable of handling the data volumes were the database engines, thanks to their optimizers. So transformation was pushed into the source, target, and especially the enterprise data warehouse databases as hand-coded SQL or BTEQ. This so-called ELT meant that many ETL tools became little more than expensive schedulers. The use of ELT resulted in a spaghetti-like architecture, clearly visible to end users in the fact that requests for new reports or the addition of a new column involve, on average, a six-month delay from the warehouse team. With so much hand-coded SQL, adding a new column becomes incredibly complex: it requires an addition to the enterprise data model, an update to the warehouse schema, and modifications to all the existing ELT scripts, and SLAs get abandoned.
As you can see in the chart, as ELT has grown, end-user reporting and analytics have had to compete for database storage and capacity. Databases are great when you have the classic use for SQL: big data input, big data input, small result set, which is exactly what you want to create an aggregated view in a reporting tool. But SQL is not ideal for ETL, where it’s typically big input, big input, even bigger output. At first there was less contention, as the analysts and warehouse business users ran queries during the day and ELT could run at night during the overnight batch window. But as data volumes increased, the batch runs started spilling into the day. Today many companies have more ELT than can fit into their overnight batch window, so they are always trying to catch up, and if a load fails it can literally take months to recover. It also creates a death spiral: you move your best resources to tuning the ELT SQL to improve performance, so your less skilled resources hand-code new ELT, which then needs to be tuned by your best resources. Every step of this hinders agility and increases cost.
Steve, you certainly bring up excellent points on how ELT processes are driving up data warehousing costs. Our experience analyzing data usage at large organizations shows that a significant amount of data is not being used, yet is continuously loaded on a daily basis. Dormant data not only takes up storage capacity; the bigger impact is the processing capacity, in terms of CPU and I/O, wasted running ELT on the data warehouse to load data that the business does not actively use. Admittedly, in many situations organizations are required for regulatory reasons to maintain a history of data even if it is not being used. So the best approach to significantly cut data warehousing costs is to eliminate batch loads for data that is not used and not needed, and, more importantly, offload the ELT processes for unused data that must be maintained: do it all on Hadoop and actively archive that unused data on Hadoop. This way you can recover all the wasted capacity from your expensive data warehouse systems.
Thanks, Santosh. So just to summarize, there are four dimensions to this problem. First, you’ll see that you’re missing SLAs, as ELT competes with end-user queries and analytics in the warehouse. Next, the warehouse team implements a data retention window, because there’s not enough space and it’s not cost effective to store all the data people want; so instead of the entire history you keep a rolling retention window, sometimes as small as a few days or weeks. On average today it takes six months to add a new report or a column to the warehouse. Customers describe this as the onion effect: each layer gets added because nobody wants to change the layer beneath, but when you have to, it makes everyone involved cry. Then, finally, you have the constant upgrade cycle. Because of data growth, the second you’ve completed an upgrade you’re already planning for your next one. The tough part is selling this to your CFO: if you have to explain that you need to spend another $3 million on the warehouse, and he asks why, and the answer is so the same report that ran yesterday will still run tomorrow, that’s not a good business case.
So as we’ve discussed, there’s the reality of what happens today in most data warehouses: the “before” seen here, where ETL and ELT in the database are the norm. But as Teradata’s CEO Mike Koehler remarked on a recent earnings call, they have found that ETL consumes about 20 to 40% of the workload of their Teradata data warehouses, with some outliers below and above that range. Teradata thinks that 20% of that 20-40% ETL workload is a good candidate for moving to Hadoop. Now, I personally have been involved with ETL my entire career, over 15 years now, and in my experience the ELT workload of most data warehouse databases is at least double that, so between 40 and 60%. Many of the customers we’re working with aren’t looking to move 20% but rather 100% of that ELT into Hadoop. But even if you could free just 20% of your capacity, you could postpone any major multi-million-dollar upgrades of Teradata, DB2, Oracle, etc. for a long time. So we’re seeing more and more customers adopt an architecture where the staging area — the dirty secret of every data warehouse, where the drops from data sources get stored and a lot of the heavy lifting as ELT occurs — gets migrated to an enterprise data hub in Hadoop, and the result is moved to the existing data warehouse, now with more capacity, or direct to reporting tools.
Now what’s really interesting about ETL and ELT is that the workload tends to be very transformation intensive: sorts, joins, merges, aggregations, partitioning, compression, etc. But the 80/20 rule applies: 20% of your ETL and ELT consumes 80% of your batch window, resources, tuning, and so on. The screenshot on the right is from a real customer, and the diagram (which they called the “Battlestar Galactica Cylon mothership” diagram because of the way it looks from a distance) is their nightly batch run sequence; every box on it is a Teradata ELT SQL script of several thousand lines of code. They found that 10% of their flows consumed 90% of their batch window. So it’s not that you have to migrate everything; you just start with the 20% and you’ll see a huge amount of benefit immediately.
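As a back-of-the-envelope illustration of that 80/20 concentration: once you have per-job resource figures from the warehouse’s workload history, the check is a few lines. The job counts and CPU numbers below are invented for the sketch, not from any customer.

```python
# Synthetic per-job CPU minutes for a nightly batch run (illustration only;
# real figures would come from the warehouse's query/workload history).
job_cpu_minutes = [1300, 1100, 150, 120, 90, 80, 70, 60, 50, 40]

total = sum(job_cpu_minutes)
top_n = max(1, len(job_cpu_minutes) // 5)            # the top 20% of jobs
top = sum(sorted(job_cpu_minutes, reverse=True)[:top_n])
print(f"Top {top_n} of {len(job_cpu_minutes)} jobs consume {top / total:.0%} of CPU")
# prints: Top 2 of 10 jobs consume 78% of CPU
```

With numbers this skewed, migrating only the top fifth of jobs already frees most of the batch window, which is the point of starting with the 20%.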
At the end of the day, time and resources consumed by inefficient processes have significant tangible costs. But Hadoop is quickly becoming a disruptive technology that presents a tremendous opportunity for enterprises. The economics of Hadoop compared to the enterprise data warehouse are quite remarkable. Today, fully burdened costs on the data warehouse vary from $15k on the low end to more than $80k per terabyte. Enterprises are finding that the cost on Hadoop can be 10 times less than on the data warehouse. So the question is: how can you take advantage of this opportunity, and where and how do you begin? Appfluent and Syncsort together have the complete solution you need.
Before we discuss and demonstrate the solution, let us briefly introduce Appfluent and Syncsort. Appfluent is a software company whose mission is to transform the economics of Big Data and Hadoop. Appfluent is the only company that can completely analyze how data is used, enabling large enterprises across various vertical industries to reduce costs and optimize performance.
The Appfluent Visibility product gives you the ability to assess and analyze expensive transformations and workloads, as well as identify unused data, which can serve as the blueprint to begin the process of offloading your data warehouse to Hadoop. The product non-intrusively monitors and correlates users’ application activity and ELT processes with data usage and the associated resource consumption. The solution provides this visibility across multiple platforms, including Teradata, Oracle/Exadata, DB2/Netezza, and Hadoop.
So by now some of you may be wondering who Syncsort is. We are a leading Big Data company, dedicated to helping our customers collect, process, and distribute extreme data volumes. We provide the fastest sort technology and the fastest data processing engine in the market, and most recently we released the first truly integrated approach to extracting, transforming, and loading data with Hadoop, and even on the cloud. Now, if you have a mainframe in your organization, then you probably know Syncsort, because we run on nearly 50% of the world’s mainframes; we’re the most trusted third-party software for mainframes. Our customers have been using us for over 10 years to accelerate ETL and ELT processing. Our product has a unique optimizer (similar to a database SQL optimizer) designed specifically to accelerate ETL and ELT processing. Our customers deal with some of the largest and most sophisticated data volumes; that’s why they’ve come to us, because we solve data problems that no one else can.
Every organization is trying to build infrastructure, technically and yet economically, to keep up with modern data by storing and managing it in a single system regardless of its format. The name people are giving to this is an Enterprise Data Hub, and in most cases it’s based on Hadoop. But to deliver on the business requirements for data, an Enterprise Data Hub requires components to access, offload, and accelerate data, while also providing Extract, Transform and Load (ETL) functionality, user productivity that doesn’t require a rocket scientist for simple tasks, and complete enterprise-level security. Syncsort enables all of this whether you’re running on Hadoop, cloud, mainframes, Unix, Windows, or Linux, and thanks to its unique transformation optimizer it can scale with no manual tuning.
Now that you know a little about Appfluent and Syncsort, let’s look at the process for offloading the data warehouse. You begin by using Appfluent to identify expensive transformations, as well as dormant data that is loaded unnecessarily into the warehouse. Once you have identified what can be offloaded — keeping in mind the 80/20 rule, where you focus your efforts on the 20% of processing and data that is causing 80% of your capacity constraints — you can use Syncsort to rewrite the expensive transformations in DMX-h on Hadoop before loading the data into the data warehouse. You can also move the dormant data to Hadoop and use DMX-h for transforming and loading that data, if you need to keep updating it. This way you can eliminate all of the ELT related to unused data from the data warehouse, run it on Hadoop, and store that data on Hadoop. Finally, this is typically not a one-time event. You can view Hadoop as an extension of your data warehouse; the two will co-exist for the foreseeable future. You can repeat this process continually to maximize performance and minimize the costs of your overall infrastructure.
Before we go into a demonstration of the solution, let’s take a look at some of the features that Appfluent provides to get started. Appfluent’s software parses all the activity on your data warehouse at very granular levels of detail. This enables you to obtain actionable information through the Appfluent Visibility web application. First, you can identify the ELT processes that are most expensive on your system and can be offloaded. Second, since all the SQL activity is parsed, you can identify unused data at a table and column level of granularity over specified time periods. Appfluent also parses the date functions being used to query data, so you can assess the amount of history being queried by users and guide your data retention policies. And finally, in addition to expensive ELT transformations, you can identify end-user workloads and associated data sets that can run just as well on Hadoop, freeing up capacity on your data warehouse.
Let’s take a look at some real-world examples. In this example, Appfluent was used to identify expensive data extracts being performed by users running SAS on a high-end data warehouse system. As you can see, the Appfluent Visibility web app was used to select applications named ‘sas’ and focus on workloads that had no constraints, meaning pure data extracts. What we found was that the SAS activity came from 5 servers, and just 42 unique SQL statements were consuming over 300 hours of server time. You can then use Appfluent to easily drill down on this information and find details such as which data sets were involved and which users were associated with the activity. It turned out this activity was related to just 7 tables, accessed by a handful of SAS users that Appfluent identified. In this way you can identify data sets to offload to Hadoop and redirect the application activity to Hadoop, enabling you to recover wasted data warehouse capacity.
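To make the filter from this example concrete, here is a minimal sketch of the same idea over a generic query log. The log entries, field layout, and numbers are hypothetical, invented for illustration; a product like Appfluent Visibility works against the warehouse’s real query history, not a hand-built list.

```python
import re

# Hypothetical query-log entries: (application, SQL text, server minutes)
query_log = [
    ("SAS",    "SELECT * FROM sales_detail", 840),
    ("SAS",    "SELECT id, amt FROM orders WHERE dt >= DATE '2014-01-01'", 12),
    ("Cognos", "SELECT region, SUM(amt) FROM orders GROUP BY region", 30),
]

def is_full_extract(app, sql, app_filter="SAS"):
    """Flag queries from the target app that have no WHERE clause,
    i.e. full-table data extracts."""
    return app == app_filter and not re.search(r"\bWHERE\b", sql, re.IGNORECASE)

extracts = [(app, sql, mins) for app, sql, mins in query_log
            if is_full_extract(app, sql)]
hours = sum(mins for _, _, mins in extracts) / 60
print(f"{len(extracts)} SAS extract(s) consuming {hours:.0f} server hour(s)")
```

Drilling from the flagged statements to their tables and users then tells you which data sets to replicate in Hadoop.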
The next example shows expensive ELT transformations. In this case, the ELT processes constituted less than 2% of the query workload but were consuming over 60% of CPU and I/O capacity. Think about that skew for a moment! Appfluent can identify the most expensive ELT by both resource consumption and complexity — for example by number of joins, subqueries, and other inefficiencies — and provide details about the ELT so you can begin the offloading process.
Finally, here is an example of identifying unused or dormant data. You can identify unused databases, schemas, tables, and even specific fields within tables, over time periods that are relevant to you. In this case, large tables were not only unused, but more data was continuing to be loaded into them on a daily basis, wasting ELT processing capacity and consuming unnecessary storage. These three examples hopefully gave you a brief glimpse of how Appfluent provides the first step: exposing the relevant information that can be used as a blueprint to begin offloading your data warehouse. Syncsort will now discuss the next two steps in this process.
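At its simplest, the dormant-data check is a set difference between the catalog and the tables actually referenced by parsed SQL over a chosen window. The table names and the “queried” set below are made up for the sketch; in practice both come from the warehouse’s catalog and parsed query history.

```python
# Hypothetical catalog and usage data (illustration only). In practice the
# "queried" set is extracted by parsing the warehouse's SQL activity over
# a chosen time window, e.g. the last 90 days.
catalog = {"orders", "customers", "sales_detail",
           "clickstream_2011", "legacy_gl", "staging_tmp"}
queried_last_90_days = {"orders", "customers"}

dormant = sorted(catalog - queried_last_90_days)
print(f"{len(dormant)}/{len(catalog)} tables unused: {dormant}")
```

The same idea extends to columns: parse which fields each query touches, and any column never referenced in the window is a candidate for offload or purge.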
Thanks, Santosh. The second stage in the framework for offloading data and workloads into Hadoop is Access & Move. Once you’ve identified the data, you then have to move it. While Hadoop provides a number of different utilities to move data, the reality is you will need multiple tools, and they don’t have a graphical user interface, so you’ll end up manually coding all the scripts; and for many critical sources, e.g. the mainframe, Hadoop offers no connectivity. Syncsort provides one solution that can access data regardless of where it resides: for example, we have native high-performance connectors to Teradata, DB2, Oracle, IBM mainframes, cloud, Salesforce, etc. These connectors let you extract data and load it natively into the Hadoop cluster on each node, or load the data warehouse or marts directly in parallel from Hadoop. We also see a lot of customers pre-processing and compressing data before loading it into Hadoop. One customer, comScore, who loads 1.5 trillion events — about 40% of the internet’s page views — through our product DMX-h into Hadoop and Greenplum, literally saves terabytes of storage every month just by sorting the data prior to compression.
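The comScore anecdote rests on a general property: compressors find repeated values far more easily once sorting has made them adjacent. A small, self-contained demonstration, with synthetic records and zlib standing in for whatever codec the cluster actually uses:

```python
import random
import zlib

random.seed(0)
# 50,000 synthetic "log records", each drawn from 100 distinct keys
records = [f"key{random.randrange(100):03d}\n" for _ in range(50_000)]

as_loaded = "".join(records).encode()           # arrival (random) order
pre_sorted = "".join(sorted(records)).encode()  # sorted before compression

size_loaded = len(zlib.compress(as_loaded))
size_sorted = len(zlib.compress(pre_sorted))
print(f"compressed: {size_loaded} bytes unsorted vs {size_sorted} bytes sorted")
```

Both inputs are the same bytes in a different order, yet the sorted version compresses dramatically better because duplicates become long runs. At terabyte scale, that ordering step is the storage saving described above.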
Once the data is in Hadoop, you need a way to easily replicate the workloads that previously ran in the DWH — typically sorts, joins, CDC, and aggregations — but now in Hadoop. Sure, you can manually write tons of scripts in HiveQL, Pig, and Java, but that means re-training a lot of your staff to scale the development process; a steep learning curve awaits, so getting productive will take time. Besides, why re-invent the wheel when you can easily leverage your existing staff and skills? Syncsort helps you get results quickly and with minimum effort, with an intuitive graphical user interface where you can create sophisticated data flows without writing a single line of code. You can even develop and test locally in Windows before deploying into Hadoop. In addition, we provide a set of Use Case Accelerators for common ETL use cases such as CDC, connectivity, aggregations, and more. Finally, once you offload from expensive legacy systems and data warehouses, you need enterprise-grade tools to manage, secure, and operationalize the enterprise data hub. With Syncsort you have file-based metadata, which means you can build once and reuse many times. We also provide full integration with management tools such as Cloudera Manager and the Hadoop JobTracker, to easily deploy, monitor, and administer your Hadoop cluster. And of course, iron security with leading support for Kerberos. When you put all these pieces together, that is what really makes this solution enterprise-ready!
Now Santosh and Jeff from Syncsort will do a quick demo of the combined solution
Now that you have seen a brief demo of how you can use Appfluent and Syncsort to offload your data warehouse, let’s talk about some customers who have done this successfully in production. A large financial organization we worked with found that their data growth and business needs had begun to grow at a rate that made it economically unsustainable to keep adding capacity to their enterprise data warehouse. Once they determined that managing data on Hadoop would be more than 5 times cheaper than on their data warehouse, they decided to cap the existing capacity of the data warehouse and implemented a strategy to deploy Hadoop to extend it. They started a data warehouse modernization project and systematically began analyzing and identifying data sets and expensive transformations using Appfluent, then offloaded them to Cloudera. The result was that they successfully capped the existing capacity of the data warehouse. They estimated that if they had not, they would have had to spend in excess of $15 million on additional capacity over an 18-month period. Instead, the Hadoop environment, now an extension of their data warehouse, costs 6-8 times less in total cost of ownership per terabyte.
This is another financial institution, one of the largest in the world. The bank had a significant amount of data hosted and batch processed on Teradata, but for them, like many Teradata customers, the cost was becoming unsustainable and they faced yet another multi-million-dollar upgrade. Having heard about Hadoop and its significantly lower cost per GB of data, they decided to migrate a loan marketing application to Cloudera’s distribution of Hadoop. While this proved the viability and massive cost savings of the Hadoop platform, they have hundreds more applications that need to be migrated. The loan application they moved across initially used Hive and HiveQL; it met the SLA but had much slower performance than Teradata and many maintainability concerns. The bank sought tools that could leverage existing staff skills (ETL) to facilitate migrating the remaining applications and avoid adding significant staff with new skills (MapReduce). The results were striking: significantly less development time was required for the DMX-h implementation of the loan project — 4 man-weeks versus 12 for the HiveQL implementation; a simplified process, with over 140 HiveQL scripts replaced by twelve graphical DMX-h jobs; and, most importantly, processing time reduced from 6 hours to 15 minutes.
So there are three key takeaways. First, be aware of the warehouse cost and capacity impacts of ELT and dormant data, and the way they affect your end users. Second, offloading ELT and unused data from your EDW to Hadoop has been proven as the lowest-risk, highest-return first project for a new Hadoop cluster, and the cost savings can justify further Hadoop investment and more moon-shot-like projects. Third, it’s three simple steps: identify, access, and deploy.
By following these simple steps, you can use an Enterprise Data Hub based on Hadoop together with your enterprise data warehouse — with Syncsort and Appfluent — to deliver something even better than the original vision of the enterprise data warehouse.