The story of how solving one problem the open source way
opened doors to so much more. Talk presented by Pranav Prakash and Hari Prasanna at OSDConf 2014, New Delhi.
How an open source project led to new opportunities and an entire ecosystem
1. TO INFINITY AND BEYOND
Pranav Prakash
in.linkedin.com/in/prakashpranav
Search @LinkedIn
Hari Prasanna
in.linkedin.com/in/mostlycached
BigData @LinkedIn
13. From a single tool to an ecosystem
• Breaking away from the initial problem statement
• The Google factor - GFS (2003), Bigtable (2006), Pregel (2009) leading to
HDFS, HBase and Giraph
• The thrill and chaos of working with alpha software - from dealing with
compatibility issues to being a part of active development
• Interoperability between various systems
• Ever-widening scope of the project and leveraging other tools in the
ecosystem
16. • Features:
• Column-oriented storage
• Horizontal scalability
• Low latency reads
• MapReduce support
• SQL support with Phoenix
• Coprocessors and secondary indexes
• RDBMS vs HBase
• Use cases
• Facebook messages
• Monitoring with OpenTSDB
HBase
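To make the storage and read-path bullets above concrete, here is a minimal in-memory sketch of the kind of data model HBase exposes: a sorted map from row key to column family and qualifier to value. It is a toy, not the HBase client API. The class `ToyColumnStore`, its methods, and the row keys are hypothetical names chosen for illustration, and versions (timestamps) are left out for brevity.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy in-memory model of a column-family table; this is NOT the HBase client API.
class ToyColumnStore {
    // row key -> "family:qualifier" -> value, with both levels kept sorted
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    // Write one cell; rows are sparse, so each row may hold different columns.
    void put(String rowKey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(family + ":" + qualifier, value);
    }

    // Point read by row key and column; returns null if the cell is absent.
    String get(String rowKey, String family, String qualifier) {
        NavigableMap<String, String> columns = rows.get(rowKey);
        return columns == null ? null : columns.get(family + ":" + qualifier);
    }

    // Range scan over the sorted row keys: startRow inclusive, stopRow exclusive.
    NavigableMap<String, NavigableMap<String, String>> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        ToyColumnStore table = new ToyColumnStore();
        table.put("user1#0001", "msg", "body", "hello");
        table.put("user1#0002", "msg", "body", "world");
        table.put("user2#0001", "msg", "body", "other inbox");
        // Every row whose key starts with "user1#" ('$' sorts right after '#')
        System.out.println(table.scan("user1#", "user1$").keySet());
        // prints [user1#0001, user1#0002]
    }
}
```

Because row keys are kept sorted, reading one user's messages becomes a contiguous range scan rather than a full table scan. That is the general idea behind row-key design for use cases like messaging or time series.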
17. Vanilla MapReduce
Higher Abstractions
• Pig - data flow language
• Hive - SQL-to-MapReduce adapter
• Cascading - Pipeline primitives and other powerful abstractions
• Even higher abstractions with Cascalog (Cascading + Datalog), PigPen (Clojure for Pig) and Pig libraries like
DataFu
Java MapReduce
Having run through how the MapReduce program works, the next step is to express it
in code. We need three things: a map function, a reduce function, and some code to
run the job. The map function is represented by the Mapper class, which declares an
abstract map() method. Example 2-3 shows the implementation of our map method.
Example 2-3. Mapper for maximum temperature example
import java.io.IOException;
Figure 2-1. MapReduce logical data flow
Data Processing
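The three pieces named in the excerpt above (a map function, a reduce function, and some code to run the job) can be sketched in plain Java without a cluster. This is not the book's Example 2-3 and does not use the Hadoop API. The class `MaxTemperatureSketch` and the `year,temperature` input format are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the map -> shuffle -> reduce flow; it does NOT use the Hadoop API.
class MaxTemperatureSketch {

    // Map step: turn one input line ("year,temperature", a hypothetical format)
    // into a (year, temperature) pair.
    static Map.Entry<String, Integer> map(String line) {
        String[] parts = line.split(",");
        return Map.entry(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    // Reduce step: collapse all temperatures seen for one year into their maximum.
    static int reduce(List<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int t : temperatures) {
            max = Math.max(max, t);
        }
        return max;
    }

    // Driver: map every line, group the pairs by key (the "shuffle"), then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> pair = map(line);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("1949,111", "1950,0", "1950,22", "1949,78", "1950,-11");
        System.out.println(run(lines)); // prints {1949=111, 1950=22}
    }
}
```

In a real Hadoop job the framework performs the grouping step and runs many map and reduce tasks in parallel across machines. The sketch only shows the shape of the logic a programmer writes in "vanilla" MapReduce, which is what the higher abstractions listed above aim to shorten.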
18. • Data collection, aggregation and forwarding with
Kafka, Flume and Scribe
• Real-time stream processing with Storm to enable
online machine learning and real-time analytics at
Twitter and Groupon
• Graph processing of a trillion edges at Facebook with
Apache Giraph
19. • Quickstarting with the Cloudera distribution
• Getting one step through the door - SlideShare’s journey
• Can your app survive without it? - Raising your bar
• Programmer, Administrator, DBA, Data Scientist - what
hat are you wearing today?
• The road ahead
• Keeping track of the developments and giving back
Leveraging “Big Data”
20. • Scientific Research - SciHadoop, decoding DNA
• Finance - Fraud Detection, Algorithmic trading, Risk
Management
• Web - Network Analysis, Recommendation Engines,
Personalization
• Government - Election campaigns, intelligence
systems
• Supply chain optimization, Weather forecasting
In the Wild