This document summarizes a webinar presented by Talend and Caserta Concepts on the big data ecosystem. The webinar discussed how Talend provides an open source integration platform that scales to handle large data volumes and complex processes. It also gave an overview of Caserta Concepts' expertise in data management, big data analytics, and focus industries such as financial services. The webinar covered topics such as traditional vs. big data, Hadoop and NoSQL technologies, and common integration patterns between traditional data warehouses and big data platforms.
Scaling Integration with Talend's Open Source Platform
1. The Big Data Ecosystem
Talend & Caserta Concepts Webinar
Ciaran Dynes
Director, Product Management & Product Marketing, Talend
Joe Caserta
Founder & President, Caserta Concepts
2. Integration at Any Scale
Talend is the only integration vendor that enables your business to scale through:
• An open source-based solution supported by a vast community and enterprise-class services
• An innovative, unified platform that scales data, application and business processes of any complexity
• A usage-based subscription model delivering a fast return on investment
3. Talend - Integration at Any Scale
Talend offers true scalability for:
• Any integration challenge
• Any data volume
• Any project size
Talend enables integration convergence
4. Working with Leading Vendors
Partner categories: Platforms/Hadoop, Appliance, NoSQL, Data Management, Analytics, System Integrators.
System Integrators play a vital role in providing expertise.
5. The Big Data Ecosystem
Talend & Caserta Concepts Webinar
Joe Caserta
Founder & President, Caserta Concepts
Ciaran Dynes
Director, Product Management & Product Marketing, Talend
6. Joe Caserta Timeline
2012: Partnered with Big Data vendors Cloudera, Hortonworks, Datameer, and more; laser focus on Big Data solutions for the Financial Sector & eCommerce
2010: Formalized Talend Alliance Partnership – System Integrators
2009: Launched Big Data practice
2004: Co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley); launched Training practice, teaching data concepts world-wide
2001: Founded Caserta Concepts in NYC; web log analytics solution published in Intelligent Enterprise
1996: Began consulting career as programmer/data modeler; dedicated to Data Warehousing and Business Intelligence since 1996
1986: 25+ years of hands-on experience building database solutions
7. Caserta Concepts
• Technology services company with expertise in data analysis:
• Data Management
• Big Data & Analytics
• With core focus in the following industries:
• Financial Services
• Insurance / Healthcare
• eCommerce / Higher Education
• Established in 2001:
• Year-over-year growth
• Industry-recognized workforce
• Consulting, Writing, Education
8. Expertise & Offerings
• Strategic Roadmap/Assessment/Consulting
• Big Data Analytics
• Data Warehousing/ETL/Data Integration
• BI/Visualization/Analytics
• Master Data Management
10. The Good Old Days: Traditional Data Warehousing
[Architecture diagram] Source systems (web logs, external data sources, relational systems/ERP, legacy systems) feed an Extract-Transform-Load process into an optimized Data Warehouse governed by metadata. Data Marts (the data warehouse?) serve the front end: standard reports, ad-hoc query tools, data mining, MDD/OLAP, and analytical applications with closed-loop feedback.
11. What is “Big Data”?
• A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
• Challenges include capture, storage, search, sharing, transfer, analysis, and visualization.
• Relational databases were designed for applications; we use only a small fraction of their capabilities in analytics applications.
• Enforcing a relational structure upon our data is not always what we want.
12. What’s the Difference?
Traditional Data vs. Big Data:
• Very accurate transactional data, analyzed by humans → lots of data whose value can only be attained by deep analytics
• Measured in terabytes → measured in petabytes
• Structured data → structured and unstructured data
• Input by human “system users” → created by everybody, plus all of our machine friends
• Oracle, SAP, etc. → open source, Hadoop
• HW/SW investment measured in $10M → HW/SW investment measured in $10K
• Recording facts → harvesting insights
13. Try to keep up: This slide is already obsolete
14. So where does the data warehouse come in?
• Will Big Data replace the data warehouse?
• Yes – however, there is much evolution ahead: real-time integrations, interactive queries
• Data Warehousing principles still apply to Big Data
• Data Quality
• Master Data
• Data architecture
• How do we leverage our existing investment?
15. Enterprise Technical Ecosystem
[Architecture diagram] Traditional side: ERP, Finance, and Legacy sources feed a Traditional EDW via ETL, serving Traditional BI and ad-hoc/canned reporting. Big Data side: a Big Data Cluster (the Hadoop Distributed File System, HDFS, across nodes N1-N5, with Mahout, MapReduce, and Pig/Hive on top) plus a NoSQL database (Cassandra), serving Big Data BI, search/data analytics, and canned reporting. The cluster is a horizontally scalable environment optimized for analytics.
16. Extending EDW with Hadoop
• Eliminate the barrier of imposing relational structure on data.
• Storage is fast, durable and cheap: don’t throw away data that can be valuable in the future.
• Processing power: Hadoop scales linearly, so don’t worry about the data set getting too big.
• Machine learning.
• Ad-hoc reporting by non-technical users requires traditional methods or an additional application.
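The linear-scaling claim comes from Hadoop's map/reduce model: work is split into independent map and reduce steps that can run on as many nodes as the data demands. A minimal local sketch of those two steps in plain Python (the word-count classic, with invented sample data, not the Hadoop API itself):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map step: emit a (word, 1) pair per word, as a Hadoop mapper would."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: after the shuffle/sort, sum the counts for each word."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

# Invented sample input standing in for files on HDFS.
logs = ["error timeout", "error retry timeout", "ok"]
counts = dict(reduce_phase(map_phase(logs)))
# counts == {"error": 2, "ok": 1, "retry": 1, "timeout": 2}
```

Because each mapper sees only its own lines and each reducer only its own keys, adding nodes adds capacity without changing the program, which is where the "don't worry about the data set getting too big" claim comes from.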
17. Design Pattern #1: Hadoop Staging/Warehouse feeds relational EDW (Composite Warehouse)
• Hadoop serves as the staging ground for all data
- Eliminate the barrier of imposing relational structure on data.
- Storage is fast, durable and cheap: don’t throw away data that can be valuable in the future.
• Data scientists work in the Hadoop environment to analyze and mine structured and unstructured data using Pig, Hive, and Mahout (machine learning)
• Data required for interactive reporting and traditional ad-hoc analysis is sent to the downstream relational EDW
[Diagram] Source systems land in the Hadoop Distributed File System (HDFS, nodes N1-N5), with Mahout, MapReduce, and Pig/Hive running on top; refined data flows on to the traditional DW.
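The staging idea can be sketched in a few lines: raw events are kept exactly as they arrive, and relational structure is imposed only on the slice headed for the EDW. A plain-Python illustration with invented sample records (no Hadoop APIs involved):

```python
import json

# Raw events land in the staging area untouched: no schema imposed on write.
staged = [
    '{"user": "u1", "action": "view", "ts": "2013-01-05"}',
    'not-json free text line',   # kept anyway; it may be valuable later
    '{"user": "u2", "action": "buy", "ts": "2013-01-06"}',
]

def to_warehouse_rows(raw_lines):
    """Impose structure only at read time, for the subset sent downstream."""
    rows = []
    for line in raw_lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # unparseable data stays in staging for future analysis
        rows.append((rec["user"], rec["action"], rec["ts"]))
    return rows

rows = to_warehouse_rows(staged)
# rows == [("u1", "view", "2013-01-05"), ("u2", "buy", "2013-01-06")]
```

Note that nothing is thrown away: the free-text line survives in staging even though it fits no schema, which is exactly the "storage is cheap" argument above.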
18. Design Pattern #2: NoSQL Enhanced EDW
• Not all structured data lends itself to being stored relationally:
• Relationships: graph databases
• Sparse data: columnar databases
• Very large datasets:
• NoSQL databases are capable of scaling far beyond relational databases while maintaining performance
• Ultra-performant key-value stores and columnar databases can be very useful for storing certain types of high-volume data for analytic purposes
• Just don’t expect the ad-hoc flexibility of a relational database!
[Diagram] Alongside the Hadoop cluster (HDFS nodes N1-N5 with Mahout, MapReduce, and Pig/Hive), Cassandra (columnar) holds web analytics and ad impressions, Titan (graph) holds networks, recommenders, and path optimization, and results feed the traditional DW.
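The sparse-data case deserves a concrete picture: a columnar store keeps only the cells a row actually has, so a schema with millions of potential columns costs nothing for the absent values. A toy in-memory sketch of that idea (not a real Cassandra or HBase client):

```python
from collections import defaultdict

class SparseColumnStore:
    """Toy wide-column store: each row keeps only the columns it actually
    has, so a vast, mostly-empty column space costs nothing to represent."""

    def __init__(self):
        self.rows = defaultdict(dict)  # row key -> {column: value}

    def put(self, row_key, column, value):
        self.rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # Absent cells are simply missing, not stored as NULLs.
        return self.rows[row_key].get(column, default)

store = SparseColumnStore()
store.put("user:1", "page:home", 3)
store.put("user:2", "page:checkout", 1)
# store.get("user:1", "page:home") == 3
# store.get("user:1", "page:checkout") is None (never written, never stored)
```

In a relational table the same data would need a column per page (mostly NULL) or a tall junction table; the columnar layout trades the ad-hoc flexibility of SQL for exactly this kind of high-volume, per-key access.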
19. Design Pattern #3: Add analytics to your NoSQL cluster
• If your application is already based on a NoSQL technology, consider building an analytics site.
• The analytics site is constantly streamed fresh transactions, leveraging Cassandra's native replication.
• Aggregates and analytic views are materialized with Pig/Hive map/reduce; since the work is done on the cluster, no load is placed on the applications. This analytic data is in turn replicated throughout the cluster.
[Diagram] Application sites 1 and 2 (Cassandra) replicate transactions to a dedicated analytics site (Cassandra), where Pig/Hive and MapReduce materialize views for canned reporting and the traditional DW.
Remember, NoSQL schemas are “optimized to a query”, not ad-hoc.
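The materialized-view step can be illustrated with a toy group-by: the analytics side pre-computes the one aggregate its canned report needs, so the application-facing nodes do no reporting work. A plain-Python sketch with invented transactions (Pig or Hive would express the same aggregate declaratively on the cluster):

```python
from collections import Counter

# Invented transactions, standing in for rows replicated to the analytics site.
transactions = [
    {"product": "A", "amount": 10.0},
    {"product": "B", "amount": 5.0},
    {"product": "A", "amount": 7.5},
]

def materialize_sales_view(txns):
    """Pre-aggregate revenue per product: the 'optimized to a query' view
    that the canned report reads directly, with no work at query time."""
    totals = Counter()
    for t in txns:
        totals[t["product"]] += t["amount"]
    return dict(totals)

view = materialize_sales_view(transactions)
# view == {"A": 17.5, "B": 5.0}
```

The view answers its one query instantly, but any other question (say, revenue per day) requires materializing a second view; that is the trade-off the "not ad-hoc" caveat refers to.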
20. Emerging Tools
Hive, although an excellent tool for data analysis, is too slow for interactive queries. Recent projects have increased speed dramatically, by 10-100x.
• Google Dremel
• Apache/MapR Drill
• Hortonworks Stinger
• Cloudera Impala
21. Commonly Used Technologies
• Amazon Elastic MapReduce (EMR): Web service to access EC2/S3; pay-as-you-go hosted Hadoop infrastructure
• Hadoop Distribution: Cloudera; MapR; Hortonworks
• Apache Projects
• Whirr: Used to launch/kill computing clusters
• Kafka: Publish-subscribe messaging system
• Mahout: Distributed machine learning
• Hive: Map data to structures and use SQL-like queries
• HBase: NoSQL/non-relational database, real-time read/write
• Cassandra: Like HBase, no single point of failure
• Chukwa/Flume: Large-scale log collection
• Pig: Procedural programming language, from Yahoo
• Sqoop: “SQL-to-Hadoop”, like BCP for Hadoop
• Zookeeper: Used to manage & administer Hadoop
• Solr: Full-text/Faceted Search
• MongoDB: Document-oriented database
• Languages: Python, SciPy, Java
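Several entries above rest on one core idea each; Kafka's publish-subscribe model, for example, reduces to an append-only log per topic that each subscriber reads at its own offset. A hypothetical in-memory toy of that model (not the real broker API):

```python
from collections import defaultdict

class ToyPubSub:
    """In-memory sketch of publish-subscribe: producers append to a topic's
    log, and each subscriber independently tracks how far it has read."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic -> ordered message log
        self.offsets = defaultdict(int)   # (topic, subscriber) -> next index

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic, subscriber):
        """Return all messages this subscriber has not yet seen."""
        start = self.offsets[(topic, subscriber)]
        messages = self.topics[topic][start:]
        self.offsets[(topic, subscriber)] = len(self.topics[topic])
        return messages

bus = ToyPubSub()
bus.publish("clicks", {"page": "home"})
bus.publish("clicks", {"page": "cart"})
first = bus.poll("clicks", "analytics")   # both messages
second = bus.poll("clicks", "analytics")  # empty: already consumed
```

Because the log is retained and offsets are per subscriber, a second consumer (say, the log-collection path served by Flume) could replay the same topic from the beginning without disturbing the first.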
23. Parting Thought
Polyglot Persistence – “where any decent sized
enterprise will have a variety of different data storage
technologies for different kinds of data. There will still
be large amounts of it managed in relational stores,
but increasingly we'll be first asking how we want to
manipulate the data and only then figuring out what
technology is the best bet for it.”
-- Martin Fowler
Purpose of the slide: Mission / Vision Statement
Key themes: Talend’s mission is to enable our customers to innovate faster at a lower cost. We are disrupting the traditional integration market by delivering an open source-based solution, an innovative unified platform, and a usage-based subscription model.
More from the Talend boilerplate: Talend provides integration that truly scales. From small projects to enterprise-wide implementations, Talend’s highly scalable data, application and business process integration platform maximizes the value of an organization’s information assets and optimizes return on investment through a usage-based subscription model. Ready for big data environments, Talend’s flexible architecture easily adapts to future IT platforms. And a common set of easy-to-use tools implemented across all Talend products enables teams to scale developer skillsets, too.
Purpose of the slide: Introduce Talend’s solution – Integration at Any Scale
Talking points: Talend is disrupting the integration market to address these integration challenges by providing a differentiated solution that delivers “Integration at Any Scale”. With Talend, your business can scale to meet any integration challenge, any data volume, or any project size. We will discuss HOW this is done in a moment, but the main point here is what we call “Integration Convergence”: the ability to address data, application and process integration needs with the same platform. The benefit to you is that your resources are more efficient and you lower your cost of operations. Talend provides integration that truly scales. From small projects to enterprise-wide implementations, Talend’s highly scalable data, application and business process integration platform maximizes the value of an organization’s information assets and optimizes return on investment through a usage-based subscription model. Ready for big data environments, Talend’s flexible architecture easily adapts to future IT platforms.
Endeca bought by Oracle – “agile information management”
SPSS bought by IBM
Radian6 bought by Salesforce
DataStax – Cassandra
Karmasphere – data analysis platform for Hadoop
Couchbase – NoSQL – Membase and Couchbase
Clarabridge – text analytics
Alternative NoSQL: Hbase, Cassandra, Druid, VoltDB