Big Data Meetup 
October 29, 2014
C-BAG 
Chennai
C-BAG 
Chennai Big Data Analytic Group 
C-BAG is an open group formed in the interest of creating a good BIG
DATA environment.
C-BAG conducts free weekly and monthly online/offline sessions,
creating awareness of BIG DATA technologies and supporting BIG DATA
initiatives. C-BAG's aim is to be a one-stop place for all BIG DATA queries,
discussions and support!
Contact Us : 
chennaibigdataanalyticgroup@gmail.com 
Speakers 
About Dhruv Kumar 
Solutions Architect 
Concurrent Inc. 
Dhruv Kumar has over six years of 
diverse software development 
experience in Big Data, Web and High 
Performance Computing applications. 
Prior to joining Concurrent, he worked at 
Terracotta as a Software Engineer. He
has an MS degree in Computer
Engineering from the University of 
Massachusetts-Amherst. 
About Vinay Shukla 
Director of Product Management 
Hortonworks 
Vinay Shukla is a seasoned Enterprise 
Software professional with extensive 
experience in Product management, 
Product development and Project 
management. Prior to Hortonworks, Vinay
worked as security architect, product
manager, developer and project manager. 
Vinay admits to being a caffeine addict 
and spends his free time on a Yoga mat 
and on Hikes.
Hortonworks enables adoption of Apache Hadoop with Hortonworks Data Platform 
• Founded in 2011 
• Original 24 architects, developers, 
operators of Hadoop from Yahoo! 
• Leaders in Hadoop community 
• 500+ employees 
Customer Momentum 
• 300+ customers in seven quarters, growing at 75+/quarter 
• Two thirds of customers come from F1000 
Partner Momentum 
• Over 1000 Partners, Hundreds of Certified Solutions 
• Some key 
partners include: 
Hortonworks and Hadoop at Scale 
• HDP in production on the largest clusters on the planet
• Most 1000+ node clusters
The Forrester Wave™ 
Big Data Hadoop Solutions 
Q1 2014 
A Leader in Hadoop 
“Hortonworks loves and lives 
open source innovation” 
World Class Support and Services. 
Hortonworks' Customer Support received a 
maximum score and was significantly higher than 
both Cloudera and MapR
HDP IS Apache Hadoop 
There is ONE Enterprise Hadoop: everything else is a vendor derivation 
[Figure: Hortonworks Data Platform release matrix listing Apache component versions for HDP 2.0 (October 2013), HDP 2.1 (April 2014) and HDP 2.2 (October 2014). Components: Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Tez, Slider, Falcon, Ranger, Spark, Kafka, Solr. Categories: Data Management, Data Access, Governance & Integration, Operations, Security.]
* version numbers are targets and subject to change at time of general availability in accordance with ASF release process
The Modern Data Architecture w/ HDP 
Enterprise Goals for the Modern Data Architecture 
• Consolidate siloed data sets, structured
and unstructured
• Central data set on a single cluster
• Multiple workloads across batch,
interactive and real time
• Central services for security, governance 
and operation 
• Preserve existing investment in current 
tools and platforms 
• Single view of the customer, product, 
supply chain 
[Diagram: SOURCES (existing systems such as CRM, ERP and other; clickstream; web & social; geolocation; sensor & machine; server logs; unstructured) feed the DATA SYSTEM (RDBMS, EDW, MPP, and Hadoop with YARN: Data Operating System running Batch, Interactive and Real-Time workloads on HDFS, the Hadoop Distributed File System, across nodes 1 to N), which serves APPLICATIONS (Business Analytics, Custom Applications, Packaged Applications)]
1. Unlock New Applications from New Types of Data 
INDUSTRY / USE CASE, by data type: Sentiment & Web | Clickstream & Behavior | Machine & Sensor | Geographic | Server Logs | Structured & Unstructured
Financial Services 
New Account Risk Screens ✔ ✔ 
Trading Risk ✔ 
Insurance Underwriting ✔ ✔ ✔ 
Telecom 
Call Detail Records (CDR) ✔ ✔ 
Infrastructure Investment ✔ ✔ 
Real-time Bandwidth Allocation ✔ ✔ ✔ 
Retail 
360° View of the Customer ✔ ✔ ✔ 
Localized, Personalized Promotions ✔ 
Website Optimization ✔ 
Manufacturing 
Supply Chain and Logistics ✔ 
Assembly Line Quality Assurance ✔ 
Crowd-sourced Quality Assurance ✔ 
Healthcare 
Use Genomic Data in Medical Trials ✔ ✔ ✔
Monitor Patient Vitals in Real-Time 
Pharmaceuticals 
Recruit and Retain Patients for Drug Trials ✔ ✔ 
Improve Prescription Adherence ✔ ✔ ✔ ✔ 
Oil & Gas 
Unify Exploration & Production Data ✔ ✔ ✔ ✔ 
Monitor Rig Safety in Real-Time ✔ ✔ ✔ 
Government 
ETL Offload/Federal Budgetary Pressures ✔ ✔ 
Sentiment Analysis for Government Programs ✔
..to shift from reactive to proactive interactions
HDP and Hadoop allow organizations to shift interactions from Reactive (Post Transaction) to Proactive (Pre Decision)
A shift in Advertising
From mass branding …to 1x1 Targeting
A shift in Financial Services
From Educated Investing …to Automated Algorithms
A shift in Healthcare
From mass treatment …to Designer Medicine
A shift in Retail
From static branding …to Real-time Personalization
A shift in Telco
From break then fix …to repair before break
2. Or to realize a dramatic cost savings…
EDW Optimization
Current Reality
EDW workload: OPERATIONS 50%, ETL PROCESS 30%, ANALYTICS 20%
• EDW at capacity: some usage from low value workloads
• Older data archived, unavailable for ongoing exploration
• Source data often discarded
Augment w/ Hadoop
EDW workload: OPERATIONS 50%, ANALYTICS 50%
Hadoop: Parse, Cleanse, Apply Structure, Transform
• Free up EDW resources from low value tasks
• Keep 100% of source data and historical data for ongoing exploration
• Mine data for value after loading it because of schema-on-read
2. Or to realize a dramatic cost savings… 
EDW Optimization 
Commodity Compute & Storage
Hadoop Enables Scalable Compute & Storage at a Compelling Cost Structure
[Chart: Fully-loaded Cost Per Raw TB of Data (Min–Max Cost), on a $0 to $180,000 axis, for Cloud Storage, Engineered System, MPP, SAN, HADOOP and NAS]
Storage Costs/Compute Costs from $19/GB to $0.23/GB
3. Data Lake: An architectural shift 
Unlocking the Data Lake
Data Lake, Enabled by YARN
• Single data repository, shared infrastructure
• Multiple biz apps accessing all the data
• Enable a shift from reactive to proactive interactions
• Gain new insight across the entire enterprise
[Diagram: SCALE and SCOPE axes; New Analytic Apps or IT Optimization on RDBMS, MPP and EDW lead to the Data Lake on HDP 2.1 (Data Management, Data Access, Governance & Integration, Security, Operations, YARN)]
Case Study: 12 month Hadoop evolution at TrueCar 
Data Platform Capabilities: 12 months execution plan
June 2013: Begin Hadoop Execution
July 2013: Hortonworks Partnership
Aug 2013: Training & Dev Begins
Nov 2013: Production Cluster, 60 Nodes, 2 PB
Dec 2013: Three Production Apps (3 total)
Jan 2014: 40% Dev Staff Perficient
Feb 2014: Three More Production Apps (6 total)
May ‘14: IPO
12 Month Results at TrueCar
• Six Production Hadoop Applications 
• Sixty nodes/2PB data 
• Storage Costs/Compute Costs 
from $19/GB to $0.23/GB 
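As a quick check on the size of that change, the two per-GB figures quoted above can be compared directly. This is plain arithmetic on the quoted numbers only; the slide does not say how the costs were measured.

```latex
\[
\frac{\$19/\text{GB}}{\$0.23/\text{GB}} \approx 82.6,
\qquad
1 - \frac{0.23}{19} \approx 0.988
\]
```

That is roughly an 83-fold reduction, or about 98.8% lower cost per GB.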
“We addressed our data platform capabilities 
strategically as a pre-cursor to IPO.”
DRIVING INNOVATION THROUGH DATA
REDUCING DEVELOPMENT TIME FOR
PRODUCTION-GRADE HADOOP APPLICATIONS
Dhruv Kumar 
Solutions Architect, Concurrent Inc
GET TO KNOW CONCURRENT
Leader in Application Infrastructure for Big Data
• Building enterprise software to simplify Big Data application
development and management
Products and Technology
• CASCADING
Open Source - The most widely used application infrastructure for
building Big Data apps with over 200,000 downloads each month.
8000 deployments worldwide.
• DRIVEN
Enterprise data application management for Big Data apps
Proven: Simple, Reliable, Robust
• Thousands of enterprises rely on Concurrent to provide their data
application infrastructure.
Founded: 2008
HQ: San Francisco, CA
CEO: Gary Nakamura
CTO, Founder: Chris Wensel
www.concurrentinc.com
BIG DATA APPLICATION INFRASTRUCTURE
“It’s all about the apps”
There needs to be a comprehensive solution for building, deploying, running
and managing this new class of enterprise applications.
Business Strategy Connecting Business and Data
Data & Technology
Challenges
Skill sets, systems integration,
standard op procedure and
operational visibility
DATA APPLICATIONS - ENTERPRISE NEEDS
Enterprise Data Application Infrastructure
• Need reliable, reusable tooling to quickly build and consistently deliver
data products
• Need the degrees of freedom to solve problems ranging from simple to
complex with existing skill sets
• Need the flexibility to easily adapt an application to meet business needs
(latency, scale, SLA), without having to rewrite the application
• Need operational visibility for entire data application lifecycle
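WORD COUNT EXAMPLE WITH CASCADING
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args ) {
    // configuration: input path, output path, and the jar that holds this class
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // integration: create source and sink taps (tab-delimited text with a header row)
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // processing: specify a regex to split "document" text lines into token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // scheduling: connect the taps, pipes, etc., into a flow definition
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );
    // create the Flow
    Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
    wcFlow.complete(); // <<-- Runs jobs on Cluster
  }
}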
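For readers without a cluster at hand, the split, group and count steps of the flow above can be sketched in plain Java. This is only an analogy that runs in memory and does not use Cascading: the class name LocalWordCount and its count method are hypothetical, and dropping empty tokens is a choice made for this sketch, not something the slide does.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical in-memory analogue of the word-count flow; it does not use Cascading.
final class LocalWordCount {
  // Same delimiter character class as the RegexSplitGenerator on the slide.
  private static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

  // Split each line into tokens, group equal tokens, and count each group.
  static Map<String, Long> count(String... lines) {
    return Arrays.stream(lines)
        .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))
        .filter(token -> !token.isEmpty()) // assumption: skip empty tokens
        .collect(Collectors.groupingBy(
            Function.identity(), TreeMap::new, Collectors.counting()));
  }

  public static void main(String[] args) {
    // Prints the sorted token counts for two sample lines.
    System.out.println(count("rain shadow, rain", "shadow (rain)"));
  }
}
```

The flatMap step plays the role of the Each pipe with the splitter, and groupingBy with counting stands in for GroupBy followed by Every with Count.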
SOME COMMON PROCESSING PATTERNS 
• Functions 
• Filters 
• Joins 
‣ Inner / Outer / Mixed 
‣ Asymmetrical / Symmetrical 
• Merge (Union) 
• Grouping 
‣ Secondary Sorting 
‣ Unique (Distinct) 
• Aggregations 
‣ Count, Average, etc 
[Diagram: data flows through a Pipeline of filters and functions; Split, Join and Merge combine pipelines into a Topology]
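A few of the patterns listed above can be illustrated with ordinary Java collections. This is a minimal sketch for intuition only: it does not use the Cascading API, the class PatternSketch and its method names are hypothetical, and the join assumes at most one value per key on each side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical in-memory analogue of the listed patterns; it does not use Cascading.
final class PatternSketch {
  // Filter: keep only non-blank lines.
  static List<String> filter(List<String> lines) {
    return lines.stream().filter(l -> !l.trim().isEmpty()).collect(Collectors.toList());
  }

  // Function: turn each input value into one output value.
  static List<String> function(List<String> lines) {
    return lines.stream().map(String::toLowerCase).collect(Collectors.toList());
  }

  // Merge (union): concatenate two streams that share a layout.
  static List<String> merge(List<String> left, List<String> right) {
    return Stream.concat(left.stream(), right.stream()).collect(Collectors.toList());
  }

  // Inner join: emit "key,leftValue,rightValue" for keys present on both sides.
  static List<String> innerJoin(Map<String, String> left, Map<String, String> right) {
    List<String> joined = new ArrayList<>();
    for (Map.Entry<String, String> entry : left.entrySet()) {
      String match = right.get(entry.getKey());
      if (match != null) {
        joined.add(entry.getKey() + "," + entry.getValue() + "," + match);
      }
    }
    return joined;
  }

  public static void main(String[] args) {
    Map<String, String> users = new HashMap<>();
    users.put("u1", "alice");
    users.put("u2", "bob");
    Map<String, String> clicks = new HashMap<>();
    clicks.put("u1", "home");
    System.out.println(function(filter(Arrays.asList("Rain", " ", "Shadow"))));
    System.out.println(innerJoin(users, clicks));
  }
}
```

Chaining filter and function gives a pipeline; merge and innerJoin are the points where separate pipelines meet, which is what the slide calls a topology.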
CASCADING API 
• Java API 
• Separates business logic from integration 
• Testable at every lifecycle stage 
• Works with any JVM language 
• Many integration adapters 
[Diagram: Cascading stack. Enterprise Java and Scripting (Scala, Clojure, JRuby, Jython, Groovy) use the Processing API, Integration API and Scheduler API; the Process Planner and Scheduler run on Apache Hadoop and connect to Data Stores]
FRAMEWORK AND PROGRAMMING LANGUAGE INDEPENDENCE
• Any JVM language can use the Cascading API
• Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and …
Cascading Domain Specific Languages (DSLs): SQL, Clojure, Ruby
New Fabrics: Tez, Storm
Supported Fabrics and Data Stores: Mainframe, DB / DW, In-Memory, Data Stores, Hadoop
THE STANDARD FOR DATA APPLICATION DEVELOPMENT
www.cascading.org
Proven application development framework for building data apps
Application platform that addresses:
• Build data apps that are scale-free: Design principles ensure best practices at any scale
• Test-Driven Development: Efficiently test code and process local files before deploying on a cluster
• Staffing Bottleneck: Use existing Java, SQL, modeling skill sets
• Application Portability: Write once, then run on different computation fabrics
• Operational Complexity: Simple - Package up into one jar and hand to operations
• Systems Integration: Hadoop never lives alone. Easily integrate with existing systems
CASCADING DATA APPLICATIONS 
• Enterprise IT: Extract Transform Load, Log File Analysis, Systems Integration, Operations Analysis
• Corporate Apps: HR Analytics, Employee Behavioral Analysis, Customer Support | eCRM, Business Reporting
• Telecom: Data processing of Open Data, Geospatial Indexing, Consumer Mobile Apps, Location based services
• Marketing / Retail: Mobile, Social, Search Analytics; Funnel Analysis; Revenue Attribution; Customer Experiments; Ad Optimization; Retail Recommenders
• Consumer / Entertainment: Music Recommendation, Comparison Shopping, Restaurant Rankings, Real Estate, Rental Listings, Travel Search & Forecast
• Finance: Fraud and Anomaly Detection, Fraud Experiments, Customer Analytics, Insurance Risk Metric
• Health / Biotech: Aggregate Metrics For Govt, Person Biometrics, Veterinary Diagnostics, Next-Gen Genomics, Agronomics, Environmental Maps
STRONG ORGANIC GROWTH
200,000+ downloads / month
8000+ Deployments
BUSINESSES DEPEND ON US
TWITTER
• 30000 Jobs per day
• Makes complex analysis of very large data sets simple
• Machine learning, linear algebra to improve
‣ User experience
‣ Ad quality (matching users and ad effectiveness)
• All revenue applications are running on Cascading/Scalding
BUSINESSES DEPEND ON US
• Cascading Java API
• Data normalization and cleansing of search and click-through logs for
use by analytics tools, Hive analysts
• Easy to operationalize heavy lifting of data in one framework
BUSINESSES DEPEND ON US
• Cascalog (Clojure)
• Weather pattern modeling to protect growers against loss
• ETL against 20+ datasets daily
• Machine learning to create models
• Purchased by Monsanto for $930M US
BROAD SUPPORT
Hadoop ecosystem supports Cascading
… AND INCLUDES RICH SET OF EXTENSIONS
http://www.cascading.org/extensions/
WORD COUNT DEMO ON HDP
SUMMARY - BUILD ROBUST DATA APPS RIGHT THE FIRST TIME
WITH CASCADING
• Cascading framework enables developers to intuitively create data applications that scale
and are robust and future-proof, supporting new execution fabrics without requiring a code rewrite
• Driven, an application visualization product, provides rich insights into how your
applications execute, improving developer productivity by 10x
• Cascading 3.0 opens up the query planner: write apps once, run on any fabric
Concurrent offers training classes for Cascading & Scalding
CONTACT INFORMATION
Dhruv Kumar
Solutions Architect
Concurrent Inc.
dkumar@concurrentinc.com
DRIVING INNOVATION THROUGH DATA
THANK YOU
Dhruv Kumar
APPENDIX 
USE LINGUAL TO MIGRATE ITERATIVE ETL TASKS TO SPARK
• Lingual is an extension to Cascading that
executes ANSI SQL queries as Cascading apps
• Supports integrating with any data source that can
be accessed through JDBC: a Cascading Tap
can be created for any source supporting JDBC
• Great for migration of data, integrating with non-Big Data
assets; extends life of existing IT
assets in an organization
[Diagram: Lingual stack. CLI / Shell and Enterprise Java use the Provider API, JDBC API and Lingual API; the Lingual Query Planner and Catalog sit on Cascading, Apache Hadoop and Data Stores]
SCALDING
• Scalding is a language binding to Cascading for Scala
• The name Scalding comes from the combining of SCALa and cascaDING
• Scalding is great for Scala developers; can crisply write constructs for matrix
math…
• Scalding has very large commercial deployments at:
‣ Twitter - Use cases such as the revenue quality team, ad targeting and traffic quality
‣ eBay - Use cases include search analytics and other production data pipelines
PATTERN ENABLES MIGRATING YOUR MODELS TO SPARK
• Pattern is an open source project that takes Predictive Model
Markup Language (PMML) models and translates them into Cascading
apps.
‣ PMML is a popular XML-based format that lets applications describe data mining and
machine learning models
• PMML models from popular analytics frameworks can be reused and
deployed within Cascading workflows
‣ Vendor frameworks - SAS, IBM SPSS, MicroStrategy, Oracle
‣ Open source frameworks - R, Weka, KNIME, RapidMiner
• Pattern is great for migrating your model scoring to Hadoop from your
decision systems
PATTERN: ALGOS IMPLEMENTED
• Hierarchical Clustering
• K-Means Clustering
• Linear Regression
• Logistic Regression
• Random Forest
algorithms extended based on customer use cases
BUILDING AND RUNNING PMML MODELS
[Diagram: Model Producer: ETL and prepare data with LINGUAL, explore data and build a model using regression, clustering, etc. (Training), then export a PMML model. Model Consumer: ETL and prepare new data with LINGUAL, score it with PATTERN using the PMML model (Scoring), then post-process the scores. Measure and improve model.]

Eliminating the Challenges of Big Data Management Inside Hadoop
 
2015 02 12 talend hortonworks webinar challenges to hadoop adoption
2015 02 12 talend hortonworks webinar challenges to hadoop adoption2015 02 12 talend hortonworks webinar challenges to hadoop adoption
2015 02 12 talend hortonworks webinar challenges to hadoop adoption
 
Apache Hadoop and its role in Big Data architecture - Himanshu Bari
Apache Hadoop and its role in Big Data architecture - Himanshu BariApache Hadoop and its role in Big Data architecture - Himanshu Bari
Apache Hadoop and its role in Big Data architecture - Himanshu Bari
 
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataCombine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Meetup oslo hortonworks HDP
Meetup oslo hortonworks HDPMeetup oslo hortonworks HDP
Meetup oslo hortonworks HDP
 
Hortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User GroupHortonworks Hadoop @ Oslo Hadoop User Group
Hortonworks Hadoop @ Oslo Hadoop User Group
 
Hadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - Jaspersoft
 
Ramesh kutumbaka resume
Ramesh kutumbaka resumeRamesh kutumbaka resume
Ramesh kutumbaka resume
 
Hortonworks and HP Vertica Webinar
Hortonworks and HP Vertica WebinarHortonworks and HP Vertica Webinar
Hortonworks and HP Vertica Webinar
 
Transform You Business with Big Data and Hortonworks
Transform You Business with Big Data and HortonworksTransform You Business with Big Data and Hortonworks
Transform You Business with Big Data and Hortonworks
 
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
 
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
 
Big Data Made Easy: A Simple, Scalable Solution for Getting Started with Hadoop
Big Data Made Easy:  A Simple, Scalable Solution for Getting Started with HadoopBig Data Made Easy:  A Simple, Scalable Solution for Getting Started with Hadoop
Big Data Made Easy: A Simple, Scalable Solution for Getting Started with Hadoop
 
Revolution in Business Analytics-Zika Virus Example
Revolution in Business Analytics-Zika Virus ExampleRevolution in Business Analytics-Zika Virus Example
Revolution in Business Analytics-Zika Virus Example
 

Mehr von Hortonworks

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyHortonworks
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakHortonworks
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsHortonworks
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysHortonworks
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's NewHortonworks
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerHortonworks
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsHortonworks
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeHortonworks
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidHortonworks
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleHortonworks
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATAHortonworks
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Hortonworks
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseHortonworks
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseHortonworks
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationHortonworks
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementHortonworks
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHortonworks
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCHortonworks
 

Mehr von Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Último

Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Muhammad Tiham Siddiqui
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Libraryshyamraj55
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024Brian Pichman
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
 
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechWebinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechProduct School
 
Scenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosScenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosErol GIRAUDY
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud DataEric D. Schabell
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1DianaGray10
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...DianaGray10
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTopCSSGallery
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationKnoldus Inc.
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Alkin Tezuysal
 
From the origin to the future of Open Source model and business
From the origin to the future of  Open Source model and businessFrom the origin to the future of  Open Source model and business
From the origin to the future of Open Source model and businessFrancesco Corti
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc
 
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxEmil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxNeo4j
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.IPLOOK Networks
 
EMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarEMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarThousandEyes
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applicationsnooralam814309
 

Último (20)

Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Library
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechWebinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
 
SheDev 2024
SheDev 2024SheDev 2024
SheDev 2024
 
Scenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosScenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenarios
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development Companies
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its application
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
 
From the origin to the future of Open Source model and business
From the origin to the future of  Open Source model and businessFrom the origin to the future of  Open Source model and business
From the origin to the future of Open Source model and business
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
 
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxEmil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.
 
EMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarEMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? Webinar
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applications
 

C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Cascading

  • 1. Big Data Meetup, October 29, 2014. C-BAG Chennai
  • 2. C-BAG: Chennai Big Data Analytic Group. C-BAG is an open group formed to foster a strong Big Data environment. C-BAG conducts free weekly and monthly online and offline sessions, creating awareness of Big Data technologies and supporting Big Data initiatives. C-BAG's aim is to be a one-stop place for all Big Data queries, discussions and support. Contact us: chennaibigdataanalyticgroup@gmail.com
  • 3. Speakers. About Dhruv Kumar, Solutions Architect, Concurrent Inc.: Dhruv Kumar has over six years of diverse software development experience in Big Data, Web and High Performance Computing applications. Prior to joining Concurrent, he worked at Terracotta as a Software Engineer. He has an MS degree in Computer Engineering from the University of Massachusetts-Amherst. About Vinay Shukla, Director of Product Management, Hortonworks: Vinay Shukla is a seasoned enterprise software professional with extensive experience in product management, product development and project management. Prior to Hortonworks, Vinay worked as a security architect, product manager, developer and project manager. Vinay admits to being a caffeine addict and spends his free time on a yoga mat and on hikes.
  • 4. Hortonworks enables adoption of Apache Hadoop with Hortonworks Data Platform • Founded in 2011 • Original 24 architects, developers, operators of Hadoop from Yahoo! • Leaders in Hadoop community • 500+ employees. Customer Momentum • 300+ customers in seven quarters, growing at 75+/quarter • Two thirds of customers come from F1000. Partner Momentum • Over 1000 partners, hundreds of certified solutions • Some key partners include: Hortonworks and Hadoop at Scale • HDP in production on largest clusters on planet • Most +1000 node clusters
  • 5. The Forrester Wave™: Big Data Hadoop Solutions, Q1 2014. A Leader in Hadoop. “Hortonworks loves and lives open source innovation.” World-class support and services: Hortonworks' Customer Support received a maximum score and was significantly higher than both Cloudera and MapR.
  • 6. HDP IS Apache Hadoop. There is ONE Enterprise Hadoop: everything else is a vendor derivation. [Figure: Hortonworks Data Platform component versions across HDP 2.0 (October 2013), HDP 2.1 (April 2014) and HDP 2.2 (October 2014), covering Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Tez, Slider, Falcon, Ranger, Spark, Kafka and Solr, grouped under Data Management, Data Access, Governance & Integration, Operations and Security.] * Version numbers are targets and subject to change at time of general availability in accordance with ASF release process.
  • 7. The Modern Data Architecture w/ HDP
  • 8. Enterprise Goals for the Modern Data Architecture • Consolidate siloed data sets, structured and unstructured • Central data set on a single cluster • Multiple workloads across batch, interactive and real time • Central services for security, governance and operation • Preserve existing investment in current tools and platforms • Single view of the customer, product, supply chain. [Diagram: APPLICATIONS (Business Analytics, Custom Applications, Packaged Applications); DATA SYSTEM (RDBMS, EDW, MPP, plus Batch, Interactive and Real-Time workloads on YARN: Data Operating System, over HDFS (Hadoop Distributed File System) on nodes 1 to N); SOURCES (Existing Systems: CRM, ERP, Other; Clickstream; Web & Social; Geolocation; Sensor & Machine; Server Logs; Unstructured)]
  • 9. 1. Unlock New Applications from New Types of Data. Data types: Sentiment & Web, Clickstream & Behavior, Machine & Sensor, Geographic, Server Logs, Structured & Unstructured. INDUSTRY / USE CASE: Financial Services: New Account Risk Screens ✔ ✔; Trading Risk ✔; Insurance Underwriting ✔ ✔ ✔. Telecom: Call Detail Records (CDR) ✔ ✔; Infrastructure Investment ✔ ✔; Real-time Bandwidth Allocation ✔ ✔ ✔. Retail: 360° View of the Customer ✔ ✔ ✔; Localized, Personalized Promotions ✔; Website Optimization ✔. Manufacturing: Supply Chain and Logistics ✔; Assembly Line Quality Assurance ✔; Crowd-sourced Quality Assurance ✔. Healthcare: Use Genomic Data in Medical Trials ✔ ✔ ✔; Monitor Patient Vitals in Real-Time. Pharmaceuticals: Recruit and Retain Patients for Drug Trials ✔ ✔; Improve Prescription Adherence ✔ ✔ ✔ ✔. Oil & Gas: Unify Exploration & Production Data ✔ ✔ ✔ ✔; Monitor Rig Safety in Real-Time ✔ ✔ ✔. Government: ETL Offload/Federal Budgetary Pressures ✔ ✔; Sentiment Analysis for Government Programs ✔
  • 10. …to shift from reactive to proactive interactions. HDP and Hadoop allow organizations to shift interactions from reactive (post transaction) to proactive (pre decision). A shift in Advertising: from mass branding to 1x1 targeting. A shift in Financial Services: from educated investing to automated algorithms. A shift in Healthcare: from mass treatment to designer medicine. A shift in Retail: from static branding to real-time personalization. A shift in Telco: from break then fix to repair before break.
  • 11. 2. Or to realize a dramatic cost savings… EDW Optimization. Current Reality (Operations 50%, Analytics 20%, ETL Process 30%): EDW at capacity, with some usage from low-value workloads; older data archived, unavailable for ongoing exploration; source data often discarded. Augment w/ Hadoop (Operations 50%, Analytics 50%; Hadoop: parse, cleanse, apply structure, transform): free up EDW resources from low-value tasks; keep 100% of source data and historical data for ongoing exploration; mine data for value after loading it because of schema-on-read.
  • 12. 2. Or to realize a dramatic cost savings… EDW Optimization. Current Reality (Operations 50%, Analytics 20%, ETL Process 30%): EDW at capacity, with some usage from low-value workloads; older data archived, unavailable for ongoing exploration; source data often discarded. Augment w/ Hadoop (Operations 50%, Analytics 50%; Hadoop: parse, cleanse, apply structure, transform): free up EDW resources from low-value tasks; keep 100% of source data and historical data for ongoing exploration; mine data for value after loading it because of schema-on-read. Commodity Compute & Storage: Hadoop enables scalable compute & storage at a compelling cost structure. [Chart: fully-loaded cost per raw TB of data (min–max cost), axis from $0 to $180,000, comparing Cloud Storage, Engineered System, MPP, SAN, Hadoop and NAS.] Storage costs/compute costs from $19/GB to $0.23/GB
  • 13. 3. Data Lake: An architectural shift. Unlocking the Data Lake [Diagram: RDBMS, MPP, EDW and Data Lake plotted by SCALE and SCOPE]. Data Lake Enabled by YARN • Single data repository, shared infrastructure • Multiple biz apps accessing all the data • Enable a shift from reactive to proactive interactions • Gain new insight across the entire enterprise. New Analytic Apps or IT Optimization. HDP 2.1: Governance & Integration, Security, Operations, Data Access, YARN, Data Management
  • 14. Case Study: 12 month Hadoop evolution at TrueCar. Data Platform Capabilities, 12 months execution plan: June 2013, begin Hadoop execution; July 2013, Hortonworks partnership; Aug 2013, training & dev begins; Nov 2013, production cluster (60 nodes, 2 PB); Dec 2013, three production apps (3 total); Jan 2014, 40% Dev Staff Perficient; Feb 2014, three more production apps (6 total); May '14, IPO. 12 Month Results at TrueCar • Six production Hadoop applications • Sixty nodes/2 PB data • Storage costs/compute costs from $19/GB to $0.23/GB. “We addressed our data platform capabilities strategically as a pre-cursor to IPO.”
  • 15. DRIVING INNOVATION THROUGH DATA: REDUCING DEVELOPMENT TIME FOR PRODUCTION-GRADE HADOOP APPLICATIONS. Dhruv Kumar, Solutions Architect, Concurrent Inc
  • 16. GET TO KNOW CONCURRENT. Leader in Application Infrastructure for Big Data • Building enterprise software to simplify Big Data application development and management. Products and Technology • CASCADING: open source, the most widely used application infrastructure for building Big Data apps, with over 200,000 downloads each month and 8000 deployments worldwide • DRIVEN: enterprise data application management for Big Data apps. Proven (Simple, Reliable, Robust) • Thousands of enterprises rely on Concurrent to provide their data application infrastructure. Founded: 2008. HQ: San Francisco, CA. CEO: Gary Nakamura. CTO, Founder: Chris Wensel. www.concurrentinc.com
  • 17. BIG DATA APPLICATION INFRASTRUCTURE. “It’s all about the apps”: there needs to be a comprehensive solution for building, deploying, running and managing this new class of enterprise applications. Business Strategy; Connecting Business and Data; Data & Technology. Challenges: skill sets, systems integration, standard op procedure and operational visibility
  • 18. DATA APPLICATIONS - ENTERPRISE NEEDS. Enterprise Data Application Infrastructure • Need reliable, reusable tooling to quickly build and consistently deliver data products • Need the degrees of freedom to solve problems ranging from simple to complex with existing skill sets • Need the flexibility to easily adapt an application to meet business needs (latency, scale, SLA), without having to rewrite the application • Need operational visibility for entire data application lifecycle
  • 19. WORD COUNT EXAMPLE WITH CASCADING
      // slide callout: configuration
      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // slide callout: integration
      // create source and sink taps (tab-delimited text with a header line)
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // slide callout: processing
      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // slide callout: scheduling
      // connect the taps, pipes, etc., into a flow definition
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );
      // create the Flow
      Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
      wcFlow.complete(); // <<-- Runs jobs on Cluster
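What the word count flow computes can be checked locally without a cluster. The following is a minimal sketch using only the JDK, not the Cascading API: it applies the same delimiter regex as the slide, then groups by token and counts. The class name `WordCountSketch`, the method `countWords`, and the step that drops empty tokens are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Same character class as the slide's splitter: space, brackets, parentheses, comma, period.
    private static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    /** Splits every line into tokens, groups identical tokens and counts each group. */
    public static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
            // corresponds to the Each + RegexSplitGenerator step
            .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))
            // assumption of this sketch: skip empty tokens produced by adjacent delimiters
            .filter(token -> !token.isEmpty())
            // corresponds to the GroupBy + Every(Count) steps; TreeMap keeps output sorted
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("rain shadow (dry)", "rain, rain.");
        System.out.println(countWords(lines)); // {dry=1, rain=3, shadow=1}
    }
}
```

The in-memory version is only useful for small inputs; the point of the Cascading flow is that the same three steps are planned into cluster jobs.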
  • 20. SOME COMMON PROCESSING PATTERNS • Functions • Filters • Joins ‣ Inner / Outer / Mixed ‣ Asymmetrical / Symmetrical • Merge (Union) • Grouping ‣ Secondary Sorting ‣ Unique (Distinct) • Aggregations ‣ Count, Average, etc. [Diagrams: a pipeline of filters and functions applied to data; a topology with split, join and merge.]
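The patterns named on this slide are ordinary collection operations at small scale. Below is a minimal sketch in plain Java, not the Cascading API, showing one example each of a filter, a function, a merge, a unique, an inner join and a count aggregation. All class and method names (`PatternSketch`, `filterNonBlank`, and so on) are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PatternSketch {
    /** Filter: keep only records that pass a test (here, non-blank strings). */
    public static List<String> filterNonBlank(List<String> rows) {
        return rows.stream().filter(r -> !r.trim().isEmpty()).collect(Collectors.toList());
    }

    /** Function: transform each record (here, lower-casing). */
    public static List<String> toLower(List<String> rows) {
        return rows.stream().map(String::toLowerCase).collect(Collectors.toList());
    }

    /** Merge (union): concatenate two inputs that share the same shape. */
    public static List<String> merge(List<String> a, List<String> b) {
        return Stream.concat(a.stream(), b.stream()).collect(Collectors.toList());
    }

    /** Unique (distinct): drop duplicates, keeping first-seen order. */
    public static List<String> unique(List<String> rows) {
        return new ArrayList<>(new LinkedHashSet<>(rows));
    }

    /** Inner join: emit "key:left:right" only for keys present on both sides. */
    public static List<String> innerJoin(Map<String, String> left, Map<String, String> right) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : left.entrySet()) {
            String r = right.get(e.getKey());
            if (r != null) {
                out.add(e.getKey() + ":" + e.getValue() + ":" + r);
            }
        }
        return out;
    }

    /** Aggregation: group identical records and count each group. */
    public static Map<String, Long> countBy(List<String> rows) {
        return rows.stream()
            .collect(Collectors.groupingBy(r -> r, LinkedHashMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> merged = merge(Arrays.asList("A", " "), Arrays.asList("a", "b"));
        List<String> cleaned = toLower(filterNonBlank(merged));
        System.out.println(unique(cleaned));
        System.out.println(countBy(cleaned));
    }
}
```

Chaining these methods mirrors the pipeline picture on the slide: each step consumes the previous step's output.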
  • 21. CASCADING API • Java API • Separates business logic from integration • Testable at every lifecycle stage • Works with any JVM language • Many integration adapters. [Diagram layers: Enterprise Java; Scripting (Scala, Clojure, JRuby, Jython, Groovy); Cascading (Processing API, Integration API, Scheduler API, Process Planner, Scheduler); Apache Hadoop; Data Stores.]
  • 22. FRAMEWORK AND PROGRAMMING LANGUAGE INDEPENDENCE. Cascading Domain Specific Languages (DSLs): SQL, Clojure, Ruby. New Fabrics: Tez, Storm. Supported Fabrics and Data Stores: Mainframe, DB / DW, In-Memory, Data Stores, Hadoop • Any JVM language can use Cascading API • Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and …
  • 23. THE STANDARD FOR DATA APPLICATION DEVELOPMENT (www.cascading.org). Proven application development framework for building data apps. Application platform that addresses: • Build data apps that are scale-free: design principles ensure best practices at any scale • Test-Driven Development: efficiently test code and process local files before deploying on a cluster • Staffing Bottleneck: use existing Java, SQL, modeling skill sets • Application Portability: write once, then run on different computation fabrics • Operational Complexity: simple; package up into one jar and hand to operations • Systems Integration: Hadoop never lives alone; easily integrate to existing systems
  • 24. CASCADING DATA APPLICATIONS • Enterprise IT: Extract Transform Load Log File Analysis Systems Integration Operations Analysis • Corporate Apps: HR Analytics Employee Behavioral Analysis Customer Support | eCRM Business Reporting • Telecom: Data processing of Open Data Geospatial Indexing Consumer Mobile Apps Location based services • Marketing / Retail: Mobile, Social, Search Analytics Funnel Analysis Revenue Attribution Customer Experiments Ad Optimization Retail Recommenders • Consumer / Entertainment: Music Recommendation Comparison Shopping Restaurant Rankings Real Estate Rental Listings Travel Search & Forecast • Finance: Fraud and Anomaly Detection Fraud Experiments Customer Analytics Insurance Risk Metric • Health / Biotech: Aggregate Metrics For Govt Person Biometrics Veterinary Diagnostics Next-Gen Genomics Agronomics Environmental Maps
  • 25. STRONG ORGANIC GROWTH: 200,000+ downloads / month, 8000+ deployments
  • 26. BUSINESSES DEPEND ON US: TWITTER • 30000 Jobs per day • Makes complex analysis of very large data sets simple • Machine learning, linear algebra to improve ‣ User experience ‣ Ad quality (matching users and ad effectiveness) • All revenue applications are running on Cascading/Scalding
  • 27. BUSINESSES DEPEND ON US • Cascading Java API • Data normalization and cleansing of search and click-through logs for use by analytics tools, Hive analysts • Easy to operationalize heavy lifting of data in one framework
  • 28. BUSINESSES DEPEND ON US • Cascalog (Clojure) • Weather pattern modeling to protect growers against loss • ETL against 20+ datasets daily • Machine learning to create models • Purchased by Monsanto for $930M US
  • 29. BROAD SUPPORT: Hadoop ecosystem supports Cascading
  • 30. … AND INCLUDES RICH SET OF EXTENSIONS: http://www.cascading.org/extensions/
  • 31. WORD COUNT DEMO ON HDP
  • 32. SUMMARY - BUILD ROBUST DATA APPS RIGHT THE FIRST TIME WITH CASCADING • The Cascading framework lets developers intuitively create data applications that scale, are robust, and are future-proof, supporting new execution fabrics without requiring a code rewrite • Driven, an application visualization product, provides rich insights into how your applications execute, improving developer productivity by 10x • Cascading 3.0 opens up the query planner: write apps once, run on any fabric. Concurrent offers training classes for Cascading & Scalding
  • 33. CONTACT INFORMATION: Dhruv Kumar, Solutions Architect, Concurrent Inc., dkumar@concurrentinc.com
  • 34. DRIVING INNOVATION THROUGH DATA: THANK YOU. Dhruv Kumar
  • 36. USE LINGUAL TO MIGRATE ITERATIVE ETL TASKS TO SPARK • Lingual is an extension to Cascading that executes ANSI SQL queries as Cascading apps • Supports integrating with any data source that can be accessed through JDBC; a Cascading Tap can be created for any source supporting JDBC • Great for migration of data and for integrating with non-Big Data assets; extends the life of existing IT assets in an organization. [Diagram layers: CLI / Shell; Enterprise Java; Lingual (Catalog, Provider API, JDBC API, Lingual API, Query Planner); Cascading; Apache Hadoop; Data Stores.]
  • 37. SCALDING • Scalding is a language binding to Cascading for Scala • The name Scalding comes from combining SCALa and cascaDING • Scalding is great for Scala developers; can crisply write constructs for matrix math… • Scalding has very large commercial deployments at: ‣ Twitter: use cases such as the revenue quality team, ad targeting and traffic quality ‣ eBay: use cases include search analytics and other production data pipelines
  • 38. PATTERN ENABLES MIGRATING YOUR MODELS TO SPARK • Pattern is an open source project that takes Predictive Model Markup Language (PMML) models and translates them into Cascading apps • PMML is a popular XML-based format that lets applications describe data mining and machine learning models • PMML models from popular analytics frameworks can be reused and deployed within Cascading workflows ‣ Vendor frameworks: SAS, IBM SPSS, MicroStrategy, Oracle ‣ Open source frameworks: R, Weka, KNIME, RapidMiner • Pattern is great for migrating your model scoring to Hadoop from your decision systems
  • 39. PATTERN: ALGOS IMPLEMENTED • Hierarchical Clustering • K-Means Clustering • Linear Regression • Logistic Regression • Random Forest. Algorithms extended based on customer use cases.
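Scoring with an exported model means evaluating the model's function on each new record. As an illustration for one of the listed algorithms, here is a minimal sketch of logistic regression scoring in plain Java. It does not parse PMML and does not use the Pattern API; the class name `LogisticScoreSketch`, the method `score`, and the coefficients in `main` are hypothetical values chosen for the example.

```java
public class LogisticScoreSketch {
    /**
     * Returns P(y = 1 | x) = 1 / (1 + exp(-(intercept + w . x)))
     * for a binary logistic regression model.
     */
    public static double score(double intercept, double[] weights, double[] features) {
        if (weights.length != features.length) {
            throw new IllegalArgumentException("weights and features must have the same length");
        }
        double z = intercept;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i]; // linear predictor
        }
        return 1.0 / (1.0 + Math.exp(-z)); // logistic (sigmoid) link
    }

    public static void main(String[] args) {
        // Hypothetical coefficients; a real deployment would read them from the trained model.
        double intercept = -1.0;
        double[] weights = {2.0, 0.5};
        double[] record = {0.5, 0.0};
        System.out.println(score(intercept, weights, record)); // linear predictor is 0, so 0.5
    }
}
```

In a scoring workflow this function would be applied once per input record, with the resulting probability written out as a new field.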
  • 40. BUILDING AND RUNNING PMML MODELS. [Diagram: Model Producer: data; ETL, prepare data (LINGUAL); training: explore data and build model using regression, clustering, etc.; output is a PMML model. Model Consumer: new data; ETL, prepare data (LINGUAL); scoring with the PMML model (PATTERN); scores; post processing. Measure and improve model.]