Learn how Big Data solutions from Excelerate Systems are driving next-generation data warehouse optimization. In other words, if you have big data, come and talk to us.
1. Mobile, Big Data, Cloud, Security, Virtualization
http://www.exceleratesystems.com
David Bennett - CEO
2.
• Founded in 2008
• Excelerate Systems is a leading company in the Americas focusing on Big Data, Cloud, IT Operations and Security.
• With offices in the US, Mexico, Chile and France, as well as individual contributors in Brazil, Uruguay, Argentina, Canada, Spain, China and India, we have a global delivery capability.
• 125 customers in 25 countries
6. The Problems with Current Data Systems
Diagram: Instrumentation -> Collection -> Storage Only Grid (original raw data, mostly append) -> ETL Compute Grid -> RDBMS (aggregated data) -> BI Reports + Interactive Apps
1. Moving data to compute doesn’t scale.
2. Archiving = premature data death.
3. Can’t explore the original high-fidelity raw data.
7. The Solution: A Combined Storage/Compute Layer
Diagram: Instrumentation -> Collection -> Hadoop: Storage + Compute Grid (mostly append) -> RDBMS (aggregated data) -> BI Reports + Interactive Apps
1. Scalable throughput for ETL and aggregation (ETL acceleration).
2. Keep data alive forever (active archive).
3. Data exploration and advanced analytics.
8. So What Is Apache Hadoop?
• A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license).
• Core Hadoop has two main systems:
  • Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage.
  • MapReduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.
• Key business values:
  • Flexibility – Store any data, run any analysis.
  • Scalability – Start at 1 TB/3 nodes, grow to petabytes/1000s of nodes.
  • Economics – Cost per TB at a fraction of traditional options.
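The MapReduce programming abstraction mentioned above can be illustrated with a small, single-process sketch. This is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and only show the map, group-by-key, and reduce steps that a real cluster would distribute across nodes.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

Because each map call sees only one line and each reduce call sees only one key, the same logic can in principle be spread over many machines without changing the functions themselves.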
9. The Hadoop Big Bang
• Fastest sort of a TB: 62 seconds over 1,460 nodes
• Sorted a PB in 16.25 hours over 3,658 nodes
Hadoop World 2009, 500 attendees
10. The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
• The schema must be created before any data can be loaded.
• An explicit load operation has to take place, which transforms the data into the database's internal serialization format.
• New columns must be added explicitly before new data for those columns can be loaded into the database.
Pros:
• OLAP is fast
• Standards/governance
Schema-on-Read (Hadoop):
• Data is simply copied to the file store; no transformation is needed.
• A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding).
• New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it.
Pros:
• Load is fast
• Flexibility/agility
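The schema-on-read idea can be sketched in a few lines of Python. This is an illustration only: the JSON event format, the field names, and the functions `serde_v1`, `serde_v2` and `read_table` are assumptions made for the example and are not a real SerDe implementation.

```python
import json

# Raw events are stored exactly as they arrive; nothing is transformed at load time.
RAW_STORE = [
    '{"user": "ana", "action": "login"}',
    '{"user": "luis", "action": "view", "device": "mobile"}',
]


def serde_v1(raw: str) -> dict:
    """First reader: only knows about the user and action columns."""
    record = json.loads(raw)
    return {"user": record.get("user"), "action": record.get("action")}


def serde_v2(raw: str) -> dict:
    """Updated reader: also extracts the newer device column."""
    record = json.loads(raw)
    row = serde_v1(raw)
    row["device"] = record.get("device")  # None for older events without it
    return row


def read_table(serde) -> list[dict]:
    """Apply the schema at read time (late binding) over the untouched raw data."""
    return [serde(raw) for raw in RAW_STORE]


if __name__ == "__main__":
    print(read_table(serde_v1))
    print(read_table(serde_v2))
```

Switching from `serde_v1` to `serde_v2` makes the `device` column visible for events that were stored before the reader knew about it, with no reload of the raw data.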
11. Scalability: Scalable Software Development
Grows without requiring developers to
re-architect their algorithms/application.
AUTO SCALE
12. Economics: Return on Byte
• Return on Byte (ROB) = the value to be extracted from a byte divided by the cost of storing that byte.
• If ROB < 1, the data will be buried in the tape wasteland, so we need more economical active storage.
[Figure labels: Low ROB, High ROB]
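As a rough worked example of the ROB ratio, the sketch below computes it for one dataset under two storage costs. The function names and all numbers are hypothetical and chosen only to show how a lower storage cost can move the same data from ROB < 1 to ROB > 1.

```python
def return_on_byte(value: float, storage_cost: float) -> float:
    """ROB = value extracted from the data / cost of storing it."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value / storage_cost


def is_worth_keeping_active(value: float, storage_cost: float) -> bool:
    """Data with ROB below 1 costs more to store than it returns."""
    return return_on_byte(value, storage_cost) >= 1.0


if __name__ == "__main__":
    dataset_value = 500.0        # hypothetical value of a dataset
    expensive_storage = 2000.0   # hypothetical cost on a traditional system
    cheaper_storage = 250.0      # hypothetical cost on cheaper active storage
    print(return_on_byte(dataset_value, expensive_storage))
    print(return_on_byte(dataset_value, cheaper_storage))
```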
17. CDH in the Enterprise Data Stack
Diagram (labels only):
• Data sources: Logs, Files, Web Data, Relational Databases
• Data movement: Flume, Sqoop
• Connected systems: Enterprise Data Warehouse, Online Serving Systems, Web/Mobile Applications
• Tools: IDEs, BI/Analytics, Enterprise Reporting, Modeling Tools, Meta Data/ETL Tools, Cloudera Manager
• Interfaces: ODBC, JDBC, NFS, HTTP
• Users: System Operators, Engineers, Analysts, Business Users, Data Scientists, Data Architects, Customers
18. HBase versus HDFS
HDFS:
Optimized For:
• Large files
• Sequential access (high throughput)
• Append only
Use For:
• Fact tables that are mostly append-only and require sequential full table scans.
HBase:
Optimized For:
• Small records
• Random access (low latency)
• Atomic record updates
Use For:
• Dimension tables that are updated frequently and require random low-latency lookups.
Not Suitable For:
• Low-latency interactive OLAP.
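The two access patterns contrasted on this slide can be mimicked with plain Python data structures. This is a toy sketch only: `AppendOnlyLog` and `KeyedStore` are hypothetical classes, not HDFS or HBase APIs, and they say nothing about real performance.

```python
from typing import Optional


class AppendOnlyLog:
    """Toy stand-in for an append-only file: cheap appends, full sequential scans."""

    def __init__(self) -> None:
        self._rows: list[dict] = []

    def append(self, row: dict) -> None:
        self._rows.append(row)

    def scan_total(self, column: str) -> float:
        # Every query walks all rows in order, as in a full table scan.
        return sum(row[column] for row in self._rows)


class KeyedStore:
    """Toy stand-in for a keyed record store: point lookups and record updates."""

    def __init__(self) -> None:
        self._records: dict[str, dict] = {}

    def put(self, key: str, record: dict) -> None:
        # Writing to an existing key replaces that record.
        self._records[key] = record

    def get(self, key: str) -> Optional[dict]:
        return self._records.get(key)


if __name__ == "__main__":
    facts = AppendOnlyLog()
    facts.append({"amount": 10.0})
    facts.append({"amount": 5.0})
    print(facts.scan_total("amount"))

    dims = KeyedStore()
    dims.put("c1", {"name": "Ana"})
    dims.put("c1", {"name": "Ana M."})
    print(dims.get("c1"))
```

The fact-table style workload only ever appends and scans, while the dimension-table style workload reads and overwrites individual records by key.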
20. Core Benefits of the Platform for Big Data
1. FLEXIBILITY
STORE ANY DATA
RUN ANY ANALYSIS
KEEPS PACE WITH THE RATE OF CHANGE OF INCOMING DATA
2. SCALABILITY
PROVEN GROWTH TO PBs/1,000s OF NODES
NO NEED TO REWRITE QUERIES, AUTOMATICALLY SCALES
KEEPS PACE WITH THE RATE OF GROWTH OF INCOMING DATA
3. ECONOMICS
COST PER TB AT A FRACTION OF OTHER OPTIONS
KEEP ALL OF YOUR DATA ALIVE IN AN ACTIVE ARCHIVE
POWERING THE DATA BEATS ALGORITHM MOVEMENT
21. How do I start?
4 Options
I. Cloudera cluster up and running in the cloud in 24 hours.
II. Use an Excelerate Systems Data Scientist to set the customer’s data strategy.
III. Get an on-premise Cloudera cluster up and running in 5 days with 5 nodes and up to 10 TB of data.
IV. Training: customers who invest in training are generally more successful than those who do not.
22. Cloudera from Excelerate Systems
There is a worldwide shortage of Big Data skills, especially in Latin America. Excelerate Systems has invested heavily in building a global network of certified Cloudera specialists who can design, implement, configure, develop and support Big Data solutions. No other company in the region has these skills yet.
Excelerate Systems is Cloudera’s primary partner in the region.
23.
24. Excelerate Systems Big Data Resources
• 8 Certified Cloudera Developers
• 6 Certified Cloudera Administrators
• 2 HBase Developers
• 2 Hadoop Developers
• 2 Data Scientists
25. Questions and next steps
David Bennett, CEO, David.bennett@exceleratesystems.net
Victor Pichardo, President, Victor.pichardo@exceleratesystems.net
Alex Campos, Systems Engineer, alex.campos@exceleratesystems.net
Plus consulting resources in various countries