1. Accelerated Analytics for the Big Data Fabric
Bay Area Hadoop User Group
© 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555
2. AGENDA
The Big Data Fabric
Big Data Preparation – An Everyday Challenge
Use-Case Scenario – Call Volume Analysis
Solution Requirements
Solution Workflow
Phase I - Data Preparation & Visualization
Phase II - Pentaho MapReduce & Orchestration
Summary
3. The Big Data Fabric
[Diagram: Data Integration and Big Analytics across the Big Data Fabric]
• Pentaho Business Analytics: Visualization, Dashboards, Interactive Analysis, Reports
• 3rd Party Tools: R, 3rd Party BI Tools, Applications
• Data Integration: Scheduling, Job Orchestration, High Performance, Workflow, Visual IDE
• Big Data Mgmt: Hadoop, Analytic Databases, NoSQL Databases
4. Preparing Big Data for Analysis is an Everyday Challenge
• Highly technical skills required
• Divide between M-R developers & analysts
• Beyond the reach of many organizations
5. Pentaho Visual MapReduce
• Accessible by any ETL developer, business analyst or data scientist
• Executes inside Hadoop as a native Java MapReduce task
6. Pentaho Reporting & Analytics
• Batch Reporting and Ad Hoc Query
• Data Visualization, Discovery and Analysis
• Hadoop, NoSQL, Hybrid
7. Use Case Scenario – Call Volume Analysis
• A VoIP service provider has excess capacity and is considering expansion to consumer markets
• Business Analyst: what are the top 10 states for inbound calls on Fridays, Saturdays and Sundays?
• Research data available:
– Call records – date/timestamp & destination phone #
– NANP (North American Numbering Plan) data – area code by country, state & time zone
8. Solution Requirements
• Data Preparation
– Access the call records in HDFS
– Extract the destination area code for each call
– Read the area code reference data
– Look up country, state and time zone by area code and append them to each record
– Filter out non-U.S. calls and calls made Monday through Thursday
– Load to a relational database
– Generate metadata
• Analysis
– Explore data multi-dimensionally
– Find the top-10 states by inbound call volume
– Navigate via a geospatial interface
• Deployment
– Deploy in MapReduce to handle larger data volumes
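The data-preparation and analysis requirements above can be sketched in plain Python. This is an illustration only, not Pentaho's implementation: the record layout, the sample rows, the tiny NANP lookup table and every function name are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical NANP reference rows: area code -> (country, state, time zone).
NANP = {
    "415": ("US", "CA", "Pacific"),
    "212": ("US", "NY", "Eastern"),
    "416": ("CA", "ON", "Eastern"),
}

# Hypothetical call records: (ISO timestamp, destination phone number).
CALLS = [
    ("2012-03-02T10:15:00", "415-555-0101"),  # Friday
    ("2012-03-03T11:00:00", "212-555-0102"),  # Saturday
    ("2012-03-04T09:30:00", "415-555-0103"),  # Sunday
    ("2012-03-05T14:00:00", "212-555-0104"),  # Monday, filtered out
    ("2012-03-03T16:45:00", "416-555-0105"),  # Canadian number, filtered out
]

KEEP_WEEKDAYS = {4, 5, 6}  # datetime.weekday(): Friday=4, Saturday=5, Sunday=6


def area_code(phone):
    """Return the area code: the first 3 of the last 10 digits."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    return digits[-10:][:3]


def prepare(calls, nanp):
    """Yield enriched records for U.S. calls placed Friday through Sunday."""
    for stamp, phone in calls:
        when = datetime.fromisoformat(stamp)
        ref = nanp.get(area_code(phone))
        if ref is None:
            continue  # area code not in the reference data
        country, state, time_zone = ref
        if country != "US" or when.weekday() not in KEEP_WEEKDAYS:
            continue
        yield {
            "timestamp": when,
            "phone": phone,
            "country": country,
            "state": state,
            "time_zone": time_zone,
        }


def top_states(records, n=10):
    """Return the n states with the most inbound calls, largest first."""
    return Counter(r["state"] for r in records).most_common(n)


print(top_states(prepare(CALLS, NANP)))
```

In the deck these steps are built visually as pipeline components and the result is loaded into a relational database; the sketch only mirrors the logic of the extract, lookup, filter and top-N steps.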
9. Solution Workflow
• Phase I - Business Analysts
– Use a data extract to prepare and validate their analyses
– Iterate over requirements with executives and stakeholders
• Phase II - MapReduce Developers/Analysts
– Create production Pentaho MapReduce transformations
– Manage the deployment and orchestration between the Hadoop cluster and the production database
10. Data Preparation (Phase I)
• The data pipeline implements the data preparation logic
• Each component has a “personality”: access, calculate, join, filter …
• Free-form design
– As many or as few inputs, transformations and outputs as needed
• Schema contract exists only between connected components
• Pipelined, multi-threaded for performance
• 100% Java-based for deployment flexibility
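One way to picture a schema contract that exists only between connected components is the sketch below. It is a simplified Python illustration, not the PDI API: `Component`, `connect` and the field names are hypothetical, and it streams rows one at a time without the multi-threading the slide mentions.

```python
class Component:
    """A pipeline step that declares the fields it needs and the fields it adds."""

    def __init__(self, name, requires, adds, fn):
        self.name = name
        self.requires = set(requires)
        self.adds = set(adds)
        self.fn = fn  # maps one row dict to a row dict, or None to drop the row


def connect(input_fields, components):
    """Check the contract at each hop, then return a row-streaming function."""
    available = set(input_fields)
    for comp in components:
        missing = comp.requires - available
        if missing:
            raise ValueError(f"{comp.name} is missing fields: {sorted(missing)}")
        available |= comp.adds

    def run(rows):
        for row in rows:
            for comp in components:
                row = comp.fn(row)
                if row is None:
                    break  # a filter dropped the row
            else:
                yield row

    return run


# Hypothetical components modeled on the "calculate" and "filter" personalities.
calculator = Component(
    "calculator", ["phone"], ["area_code"],
    lambda r: {**r, "area_code": r["phone"][:3]},
)
row_filter = Component(
    "row filter", ["area_code"], [],
    lambda r: r if r["area_code"] == "415" else None,
)

pipeline = connect(["phone"], [calculator, row_filter])
print(list(pipeline([{"phone": "4155550101"}, {"phone": "2125550102"}])))
```

Connecting the components in the wrong order (the filter before the calculator) fails at connection time, because the filter's required `area_code` field is not yet available at that hop.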
11. Data Pipeline – Input from HDFS
12. Data Pipeline – Calculator
13. Data Pipeline – Stream Lookup
14. Data Pipeline – Row Filter
15. Data Pipeline – Table Output
19. Deployment to Hadoop (Phase II)
• To process a larger set of data, we can deploy the data pipeline via MapReduce
– Input and output streams are encoded as key-value pairs
– Two specialized components provide the interface to those streams
– A special job component deploys the data pipeline to the Hadoop cluster
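The key-value encoding can be illustrated with a small in-process simulation of map, shuffle and reduce. This is a hedged sketch rather than Pentaho MapReduce itself: the input line format, the lookup table and the names `mapper`, `reducer` and `run_local` are assumptions, and on a real cluster the Hadoop framework performs the grouping between the two phases.

```python
from collections import defaultdict

STATE_BY_AREA_CODE = {"415": "CA", "212": "NY"}  # hypothetical lookup data


def mapper(key, value):
    """Input pair: (line index, 'timestamp,phone'). Emits (state, 1) pairs."""
    _timestamp, phone = value.split(",")
    state = STATE_BY_AREA_CODE.get(phone[:3])
    if state is not None:
        yield state, 1


def reducer(key, values):
    """Input pair: (state, list of counts). Emits (state, total calls)."""
    yield key, sum(values)


def run_local(lines):
    """Simulate map -> shuffle -> reduce in a single process."""
    groups = defaultdict(list)
    for index, line in enumerate(lines):
        for k, v in mapper(index, line):
            groups[k].append(v)  # "shuffle": group mapper output by key
    result = {}
    for k in sorted(groups):
        for out_key, out_value in reducer(k, groups[k]):
            result[out_key] = out_value
    return result


lines = [
    "2012-03-02T10:15:00,4155550101",
    "2012-03-03T11:00:00,2125550102",
    "2012-03-04T09:30:00,4155550103",
]
print(run_local(lines))
```

The per-row logic inside `mapper` is the same kind of lookup used in a local pipeline; only the way rows arrive and leave (as key-value pairs) differs.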
20. Pentaho MapReduce – Inputs/Outputs
The core logic of the data pipeline is identical … only the ends change
21. Pentaho MapReduce – Orchestration
22. Instant Analytics (Roadmap)
• Choose a Big Data Source, Answer a Few Questions, Publish to Pentaho
• Report, Explore and Analyze
• Customize Model (Optional)
23. SUMMARY
1. The Big Data Fabric encompasses a large collection of Hadoop distributions, NoSQL and analytical databases
2. A component-based approach to data access and integration can:
– Allow business analysts and data scientists to perform their own data preparation
– Result in more rapid validation of business requirements & metrics
– Be used to create data pipelines that can be deployed directly to a cluster, enabling analytics against much larger data sets
– Support orchestration across environments
24. Summary
25. Thank You
Join the conversation. You can find us on:
http://blog.pentaho.com
@Pentaho
Facebook.com/Pentaho
Pentaho Business Analytics
Editor's notes
Leveraging PDI to incorporate Big Data into your data fabric provides immediate access to analytics. Examples:
• Batch and ad hoc reporting directly against Big Data sources using familiar BI tools with no coding (Report Designer, Interactive Reporting)
• An agile framework to quickly generate, house and manage data marts for interactive analysis, data discovery, etc.