Emerging Technologies/Frameworks in Big Data
Rahul Jain
@rahuldausa
Meetup Sep 2015
About Me
• Independent Big Data/Search consultant
• 8+ years of learning experience
• Got a chance to work on high-volume distributed applications
• Still a learner (and beginner)
Quick Questionnaire
How many people know or have heard of Apache Parquet?
How many people know or have heard of Apache Drill?
How many people know or have heard of Apache Flink?
What are we going to learn/see today?
• Columnar Storage (overview)
• Apache Parquet (with Demo)
• Dremel (Basic overview)
• Apache Drill (with Demo)
• Apache Flink (with Demo)
Let’s discuss
Columnar Storage
Let’s say we have an Employee table
RowId  EmpId  Lastname  Firstname  Salary
001    10     Smith     Joe        40000
002    12     Jones     Mary       50000
003    11     Johnson   Cathy      44000
004    22     Jones     Bob        55000
Table storage in a row-oriented system
In row-oriented systems, it will be stored as:
001:10,Smith,Joe,40000;
002:12,Jones,Mary,50000;
003:11,Johnson,Cathy,44000;
004:22,Jones,Bob,55000;
Table storage in a column-oriented system
In column-oriented systems, it will be stored as:
10:001,12:002,11:003,22:004;
Smith:001,Jones:002,Johnson:003,Jones:004;
Joe:001,Mary:002,Cathy:003,Bob:004;
40000:001,50000:002,44000:003,55000:004;
Row vs Column Storage
Row-oriented storage
001:10,Smith,Joe,40000;002:12,Jones,Mary,50000;003:11,Johnson,Cathy,44000;004:22,Jones,Bob,55000;
Column-oriented storage
10:001,12:002,11:003,22:004;Smith:001,Jones:002,Johnson:003,Jones:004;Joe:001,Mary:002,Cathy:003,Bob:004;40000:001,50000:002,44000:003,55000:004;
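The two layouts above can be reproduced with a few lines of Python. This is only a sketch of the idea using the Employee table from the slides; the function names are made up for illustration, and real columnar formats add encoding and compression on top of this.

```python
# Employee table from the slides: (RowId, EmpId, Lastname, Firstname, Salary)
rows = [
    ("001", 10, "Smith", "Joe", 40000),
    ("002", 12, "Jones", "Mary", 50000),
    ("003", 11, "Johnson", "Cathy", 44000),
    ("004", 22, "Jones", "Bob", 55000),
]


def to_row_oriented(rows):
    """Keep each record together: rowid:v1,v2,...;"""
    return "".join(
        f"{row_id}:{','.join(str(v) for v in values)};"
        for row_id, *values in rows
    )


def to_column_oriented(rows):
    """Keep each column together: value:rowid,value:rowid,...;"""
    row_ids = [row[0] for row in rows]
    columns = zip(*(row[1:] for row in rows))  # transpose the value columns
    return "".join(
        ",".join(f"{value}:{row_id}" for value, row_id in zip(column, row_ids)) + ";"
        for column in columns
    )


print(to_row_oriented(rows))
print(to_column_oriented(rows))
```

With the column layout, a query such as "sum of all salaries" only has to read the last segment instead of every record.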
Apache Parquet
(Columnar Storage for Hadoop ecosystem)
About Apache Parquet
• Columnar storage format
• Initially started by Twitter and Cloudera
• Stores nested data structures in a flat columnar format using a technique
outlined in the Dremel paper from Google
• Can store very large datasets with a very high compression rate
• Compression means less I/O and faster processing
• Provides high-level APIs in Java
• Integrates with Hadoop and its ecosystem
• http://parquet.apache.org
Parquet Design
• required: exactly one occurrence
• optional: 0 or 1 occurrence
• repeated: 0 or more occurrences
For example, an address book schema:
message AddressBook {
  required string owner;
  repeated string ownerPhoneNumbers;
  repeated group contacts {
    required string name;
    optional string phoneNumber;
  }
}
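To make the three repetition rules concrete, here is a small sketch that checks a record against the AddressBook schema. The dictionary encoding of the schema and the validate function are assumptions made for this illustration, not part of any Parquet API.

```python
# Repetition rules from the slide as (min, max) occurrences; None = unbounded.
REPETITION = {"required": (1, 1), "optional": (0, 1), "repeated": (0, None)}

# The AddressBook schema as plain Python data (hypothetical encoding).
# Each field: name -> (repetition, type); a dict type is a nested group.
ADDRESS_BOOK = {
    "owner": ("required", str),
    "ownerPhoneNumbers": ("repeated", str),
    "contacts": ("repeated", {
        "name": ("required", str),
        "phoneNumber": ("optional", str),
    }),
}


def validate(record, schema, path=""):
    """Return a list of rule violations; an empty list means the record conforms."""
    errors = []
    for name, (repetition, field_type) in schema.items():
        low, high = REPETITION[repetition]
        value = record.get(name)
        # Normalise to a list of occurrences so all three rules are checked alike.
        if value is None:
            occurrences = []
        elif isinstance(value, list):
            occurrences = value
        else:
            occurrences = [value]
        count = len(occurrences)
        if count < low or (high is not None and count > high):
            errors.append(f"{path}{name}: {repetition} field has {count} occurrence(s)")
        for item in occurrences:
            if isinstance(field_type, dict):
                errors.extend(validate(item, field_type, f"{path}{name}."))
            elif not isinstance(item, field_type):
                errors.append(f"{path}{name}: expected {field_type.__name__}")
    return errors


good = {
    "owner": "Alice",
    "ownerPhoneNumbers": ["555 123 4567", "555 666 1337"],
    "contacts": [{"name": "Bob", "phoneNumber": "555 987 6543"}, {"name": "Carol"}],
}
bad = {"contacts": [{"phoneNumber": "555 000 0000"}]}

print(validate(good, ADDRESS_BOOK))
print(validate(bad, ADDRESS_BOOK))
```

The second record breaks two rules: the required owner is missing, and the contact has no required name.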
Size Comparison
$ du -sch test.*
407M test.csv (1 million records, 4 columns)
70M test.csv.gz (~83% reduction)
35M test.parquet (~91% reduction)
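The reduction figures follow directly from the sizes reported by du. A quick check, assuming the rounded megabyte values shown above are close enough to the real file sizes:

```python
def reduction_percent(original_mb, compressed_mb):
    """Share of the original size that was saved, in percent."""
    return (1 - compressed_mb / original_mb) * 100


csv_mb, gzip_mb, parquet_mb = 407, 70, 35  # sizes from the du output

print(f"gzip:    {reduction_percent(csv_mb, gzip_mb):.1f}% smaller")
print(f"parquet: {reduction_percent(csv_mb, parquet_mb):.1f}% smaller")
```

With these rounded inputs the gzip file comes out roughly 83% smaller than the CSV and the Parquet file roughly 91% smaller.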
Let’s discuss first
Dremel: Interactive Analysis of
Web-Scale Datasets
What is Dremel
• A paper published by Google in 2010
• Interactive analysis of web-scale datasets
– Ad hoc queries on very large datasets (petabytes)
– Near real time
– MapReduce (MR) works, but it is meant for batch processing
• SQL-like query interface
• Nested data (with a columnar storage representation)
• Paper:
– http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf
• Projects (Implementation):
– Google BigQuery (cloud based)
– Apache Drill (Open source)
Why Dremel: Speed Matters
Credit: http://www.slideshare.net/robertlz/dremel-interactive-analysis-of-webscale-datasets
Widely used inside Google
Credit: http://www.slideshare.net/robertlz/dremel-interactive-analysis-of-webscale-datasets
Tree based structure
Credit: http://www.alberton.info/images/articles/papers/dremel1.png
Column striped representation
Credit: http://www.alberton.info/images/articles/papers/dremel2.png
Query Processing
Credit: http://farm9.staticflickr.com/8426/7843420938_9cb23a4cb0_b.jpg
Let’s move to
Apache Drill
About Apache Drill
• Based on Google’s Dremel Paper
• Supports data-intensive distributed applications for
interactive analysis of large-scale datasets
• Has a datastore-aware optimizer
– which constructs the query plan based on the datastore’s
processing capabilities
• Supports data locality
• http://drill.apache.org/
So Why Drill?
• Flexible Data Model
• Fixed schema (Avro)/Dynamic schema (JSON)/Schema-less SQL
• Schema can be discovered on the Fly
• Built-in optimistic query execution engine.
• Doesn’t require a particular storage or execution system
(Map-Reduce, Spark, Tez)
• Better Performance and Manageability
• Cluster of commodity servers
• Daemon (drillbit) on each data node
• Works with Hadoop, CSV, JSON, Avro/Parquet, MongoDB, HBase,
Solr etc.
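"Schema can be discovered on the fly" means the reader learns the field names and types while scanning the data, instead of requiring a table definition up front. The sketch below shows that idea on JSON lines in plain Python; it is an illustration only and says nothing about how Drill implements it internally. The function name and sample records are made up.

```python
import json


def discover_schema(json_lines):
    """Map each field name to the set of Python type names seen for it."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


lines = [
    '{"name": "Joe", "salary": 40000}',
    '{"name": "Mary", "salary": 50000.5, "dept": "R&D"}',
]

print(discover_schema(lines))
```

The second record introduces a new field (dept) and a second type for salary. Neither needed to be declared before reading.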
Query any non-relational datastore
Distributed SQL query engine
Credit: http://www.slideshare.net/MapRTechnologies/drill-highperformancesqlenginewithjsondatamodel
Designed to support a wide set of
use cases
Credit: http://www.slideshare.net/MapRTechnologies/drill-highperformancesqlenginewithjsondatamodel
Querying
CSV:
0: jdbc:drill:> select count(*) from dfs.`/tmp/test.csv`;
+-----------+
| EXPR$0 |
+-----------+
| 10000001 |
+-----------+
1 row selected (5.771 seconds)
Parquet:
0: jdbc:drill:> select count(*) from dfs.`/tmp/test.parquet`;
+-----------+
| EXPR$0 |
+-----------+
| 10000001 |
+-----------+
1 row selected (0.257 seconds)
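The same queries can also be sent from a program instead of the shell. The sketch below builds a request for Drill's REST endpoint (POST /query.json). It assumes a Drillbit whose web port is the default 8047 on localhost; the helper names are made up, and the exact shape of the JSON response should be checked against the Drill version in use.

```python
import json
import urllib.request


def build_drill_query(sql, base_url="http://localhost:8047"):
    """Build (but do not send) a POST request for Drill's /query.json endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_drill_query(sql, base_url="http://localhost:8047"):
    """Send the query and return the decoded JSON response (needs a running Drillbit)."""
    with urllib.request.urlopen(build_drill_query(sql, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_drill_query("select count(*) from dfs.`/tmp/test.parquet`")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Only build_drill_query runs without a server; run_drill_query needs a Drillbit to be up.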
Drill Shell
./bin/drill-embedded
It will start Drill in Embedded Mode. You will see output like this,
org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29
01:25:26...
apache drill 1.0.0
"say hello to my little drill"
0: jdbc:drill:zk=local>
For Windows, this will start the shell with Drill in embedded mode:
./bin/sqlline.bat -u "jdbc:drill:schema=dfs;zk=local"
Terminology
• Drillbit
– A Drillbit runs on each data node in the cluster. Drill
maximizes data locality during query execution:
movement of data over the network or between
nodes is minimized or eliminated when possible.
Drill Configuration
drill.exec: {
  cluster-id: "<cluster_name>",
  zk.connect: "<zkhostname1>:<port>,<zkhostname2>:<port>,<zkhostname3>:<port>"
}
Configuration: $DRILL_HOME/conf/drill-override.conf
Default configuration:
drill.exec: {
cluster-id: "drillbits1",
zk.connect: "localhost:2181"
}
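When the same override file has to be written for several nodes, generating it avoids quoting mistakes in the zk.connect string. A minimal sketch, assuming all ZooKeeper hosts share one port; the function name and host names are hypothetical.

```python
def render_drill_override(cluster_id, zk_hosts, zk_port=2181):
    """Render the drill.exec block for conf/drill-override.conf."""
    zk_connect = ",".join(f"{host}:{zk_port}" for host in zk_hosts)
    return (
        "drill.exec: {\n"
        f'  cluster-id: "{cluster_id}",\n'
        f'  zk.connect: "{zk_connect}"\n'
        "}\n"
    )


# Hypothetical three-node ZooKeeper ensemble.
print(render_drill_override("drillbits1", ["zk1", "zk2", "zk3"]))
```

Called with a single host "localhost", it reproduces the default configuration shown above.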
Starting Drill in Distributed Mode
./bin/drillbit.sh restart
This restarts the Drillbit service. General usage:
./bin/drillbit.sh [--config <conf-dir>] (start|stop|status|restart|autorestart)
Tip:
Check the hostname the Drillbit is listening on, for example:
2015-09-05 03:21:20,070 [main] INFO o.apache.drill.exec.server.Drillbit - Drillbit environment: host.name=192.168.0.101
Start the shell:
./bin/drill-localhost (if the Drillbit is listening on localhost)
This starts the Drill shell on the local machine based on the configuration provided in
drill-override.conf.
Otherwise:
./bin/sqlline -u "jdbc:drill:drillbit=192.168.0.101"
Verify it once, then try a sample query
0: jdbc:drill:zk=local> select * from sys.drillbits;
+----------------+------------+---------------+------------+----------+
| hostname | user_port | control_port | data_port | current |
+----------------+------------+---------------+------------+----------+
| 192.168.0.101 | 31010 | 31011 | 31012 | true |
+----------------+------------+---------------+------------+----------+
0: jdbc:drill:zk=local> select count(*) from `dfs`.`$DRILL_HOME/sample-data/nation.parquet`;
+---------+
| EXPR$0 |
+---------+
| 25 |
+---------+
1 row selected (1.752 seconds)
Drill – Web Client
A Storage Plugin can be added/Enabled
Let’s move to
Apache Flink
About Apache Flink
• Open source framework for Big Data analytics
• Distributed streaming dataflow engine
• Runs computations in memory
• Executes programs in a data-parallel and pipelined manner
• Best known for stream data processing
• Provides high-level APIs in
• Java
• Scala
• Python
• Integrates with Hadoop and its ecosystem and can read existing data from HDFS or
HBase
• https://flink.apache.org
So Why Flink?
Credit: Compiled from several articles, blogs and Stack Overflow posts listed on the References page.
• Shares a lot of similarities with relational DBMSs
• Data is serialized into byte buffers and mostly processed in binary representation
• This allows fine-grained memory control
• Uses a pipeline-based processing model with a cost-based optimizer to choose
the execution strategy
• Optimized for cyclic or iterative processes by using iterative transformations
on collections
• This is achieved by optimizing join algorithms, operator chaining and
reuse of partitioning and sorting
• Flink streaming processes data streams as true streams, i.e., data elements
are immediately "pipelined" through a streaming program as soon as they
arrive
• Also has its own memory management system, separate from Java’s garbage
collector
Credit: http://www.slideshare.net/stephanewen1/apache-flink-overview
Flink vs Spark
(they look pretty similar)
Apache Flink:
case class Word(word: String, frequency: Int)
val counts = text
  .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
Apache Spark:
val counts = text
  .flatMap(line => line.split(" ")).map(word => (word, 1))
  .reduceByKey { case (x, y) => x + y }
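Both snippets express the same three steps: split lines into words, pair each word with 1, and add up the ones per word. The plain-Python sketch below spells those steps out without either framework. It is meant only to show the logic; it runs on one machine, and the function name is made up.

```python
from collections import defaultdict


def word_count(lines):
    # flatMap + map: one (word, 1) pair per word in every line
    pairs = [(word, 1) for line in lines for word in line.split(" ")]
    # groupBy("word").sum("frequency") / reduceByKey: add the ones per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


text = ["to be or", "not to be"]
print(word_count(text))
```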
But….
Apache Spark is a batch processing framework that can
approximate stream processing (called micro-batching).
Apache Flink is primarily a stream processing framework that
can look like a batch processor.
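The difference between the two models can be sketched in a few lines. This is a simplification with made-up function names: real micro-batching usually cuts batches by time interval, whereas the sketch cuts them by element count to stay deterministic.

```python
def process_streaming(events, handle):
    """Hand each element to the handler as soon as it arrives (true streaming)."""
    for event in events:
        handle([event])


def process_micro_batches(events, handle, batch_size):
    """Buffer elements and hand them over in small batches (micro-batching)."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            handle(batch)
            batch = []
    if batch:  # flush the final, possibly partial batch
        handle(batch)


streamed, batched = [], []
process_streaming(iter([1, 2, 3]), streamed.append)
process_micro_batches(iter([1, 2, 3]), batched.append, batch_size=2)
print(streamed)
print(batched)
```

In the streaming version the handler sees every element immediately. In the micro-batch version an element waits until its batch is full or the input ends.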
Credit: http://www.slideshare.net/stephanewen1/apache-flink-overview
Flink – Web Client
Arguments to program separated by spaces
References
• https://flink.apache.org/
• https://www.quora.com/What-are-the-differences-between-Apache-Spark-and-Apache-Flink
• http://stackoverflow.com/questions/28082581/what-is-the-differences-between-apache-spark-and-apache-flink
• http://statrgy.com/2015/06/01/best-data-processing-engine-flink-vs-spark/
• http://stackoverflow.com/questions/29780747/apache-flink-vs-apache-spark-as-platforms-for-large-scale-machine-learning
• http://www.infoworld.com/article/2919602/hadoop/flink-hadoops-new-contender-for-mapreduce-spark.html
• http://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.html
Thanks!
@rahuldausa on twitter and slideshare
http://www.linkedin.com/in/rahuldausa

08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Kürzlich hochgeladen (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

Emerging Technologies/Frameworks in Big Data

  • 1. Emerging Technologies/Frameworks in Big Data Rahul Jain @rahuldausa Meetup Sep 2015
  • 2. About Me • Independent Big data/Search Consultant • 8+ years of learning experience. • Worked (got a chance) on High volume distributed applications. • Still a learner (and beginner)
  • 3. Quick Questionnaire How many people know/have heard of Apache Parquet? How many people know/have heard of Apache Drill? How many people know/have heard of Apache Flink?
  • 4. What are we going to learn/see today? • Columnar Storage (overview) • Apache Parquet (with Demo) • Dremel (Basic overview) • Apache Drill (with Demo) • Apache Flink (with Demo)
  • 6. Let's say we have an Employee table
      RowId  EmpId  Lastname  Firstname  Salary
      001    10     Smith     Joe        40000
      002    12     Jones     Mary       50000
      003    11     Johnson   Cathy      44000
      004    22     Jones     Bob        55000
  • 7. Table storage in a row-oriented system. In row-oriented systems, the Employee table will be stored as:
      001:10,Smith,Joe,40000;
      002:12,Jones,Mary,50000;
      003:11,Johnson,Cathy,44000;
      004:22,Jones,Bob,55000;
  • 8. Table storage in a column-oriented system. In row-oriented systems, the Employee table will be stored as:
      001:10,Smith,Joe,40000;
      002:12,Jones,Mary,50000;
      003:11,Johnson,Cathy,44000;
      004:22,Jones,Bob,55000;
    But in column-oriented systems, it will be stored as:
      10:001,12:002,11:003,22:004;
      Smith:001,Jones:002,Johnson:003,Jones:004;
      Joe:001,Mary:002,Cathy:003,Bob:004;
      40000:001,50000:002,44000:003,55000:004;
  • 9. Row vs Column Storage
    Row-oriented storage:
      001:10,Smith,Joe,40000;002:12,Jones,Mary,50000;003:11,Johnson,Cathy,44000;004:22,Jones,Bob,55000;
    Column-oriented storage:
      10:001,12:002,11:003,22:004;Smith:001,Jones:002,Johnson:003,Jones:004;Joe:001,Mary:002,Cathy:003,Bob:004;40000:001,50000:002,44000:003,55000:004;
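The two layouts can be reproduced with a few lines of Python. This is only a minimal sketch of the idea on the slides: the function names `row_layout` and `column_layout` are made up for illustration, and real columnar formats use binary encodings rather than delimited text.

```python
# Employee table from the slides: (RowId, EmpId, Lastname, Firstname, Salary)
rows = [
    ("001", 10, "Smith", "Joe", 40000),
    ("002", 12, "Jones", "Mary", 50000),
    ("003", 11, "Johnson", "Cathy", 44000),
    ("004", 22, "Jones", "Bob", 55000),
]


def row_layout(rows):
    """Serialize record by record: rowid:v1,v2,...;"""
    return "".join(
        f"{rid}:{','.join(str(v) for v in values)};" for rid, *values in rows
    )


def column_layout(rows):
    """Serialize column by column: value:rowid,value:rowid,...;"""
    row_ids = [r[0] for r in rows]
    columns = zip(*(r[1:] for r in rows))  # transpose rows into columns
    return "".join(
        ",".join(f"{v}:{rid}" for v, rid in zip(col, row_ids)) + ";"
        for col in columns
    )


print(row_layout(rows))
print(column_layout(rows))

# A query that only needs salaries can read a single contiguous segment.
salary_segment = column_layout(rows).split(";")[3]
print(salary_segment)
```

In the column layout all salaries sit next to each other, so an aggregate over one column does not have to scan the names and IDs stored with every record.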
  • 10. Apache Parquet (Columnar Storage for Hadoop ecosystem)
  • 11. About Apache Parquet • Columnar storage format • Initially started by Twitter and Cloudera • Stores nested data structures in a flat columnar format using a technique outlined in the Dremel paper from Google. • Can store very large datasets with a very high compression rate. • Thanks to compression, less I/O and faster processing. • Provides high-level APIs in Java • Integration with Hadoop and its ecosystem • http://parquet.apache.org
  • 12. Parquet Design • required: exactly one occurrence • optional: 0 or 1 occurrence • repeated: 0 or more occurrences. For example, an address book schema:
      message AddressBook {
        required string owner;
        repeated string ownerPhoneNumbers;
        repeated group contacts {
          required string name;
          optional string phoneNumber;
        }
      }
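To make the three occurrence rules concrete, here is a minimal Python sketch that checks a record against the address book schema. The dictionary encoding of the schema and the `validate` helper are hypothetical illustrations, not part of any Parquet API, and the sample names and phone numbers are invented.

```python
# Occurrence rules from the slide. This schema encoding is a hypothetical
# illustration; it is not how Parquet itself represents schemas.
ADDRESS_BOOK = {
    "owner": ("required", str),
    "ownerPhoneNumbers": ("repeated", str),
    "contacts": ("repeated", {
        "name": ("required", str),
        "phoneNumber": ("optional", str),
    }),
}


def validate(record, schema):
    """Return a list of rule violations (empty if the record conforms)."""
    errors = []
    for field, (rule, ftype) in schema.items():
        value = record.get(field)
        if rule == "required" and value is None:
            errors.append(f"{field}: required field is missing")
            continue
        if rule == "repeated":
            values = value or []                       # 0 or more occurrences
        else:
            values = [] if value is None else [value]  # exactly 1, or 0 or 1
        for v in values:
            if isinstance(ftype, dict):                # nested group
                errors += [f"{field}.{e}" for e in validate(v, ftype)]
            elif not isinstance(v, ftype):
                errors.append(f"{field}: expected {ftype.__name__}")
    return errors


good = {
    "owner": "Alice",
    "ownerPhoneNumbers": ["555 0100"],
    "contacts": [{"name": "Bob", "phoneNumber": "555 0101"}, {"name": "Carol"}],
}
bad = {"contacts": [{"phoneNumber": "555 0102"}]}

print(validate(good, ADDRESS_BOOK))
print(validate(bad, ADDRESS_BOOK))
```

The second contact in `good` omits `phoneNumber`, which is allowed because the field is optional; `bad` is rejected because both `owner` and a contact's `name` are required.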
  • 13. Size Comparison
      $ du -sch test.*
      407M test.csv (1 million records, 4 columns)
      70M test.csv.gz (~83% reduction)
      35M test.parquet (~91% reduction)
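The reduction figures follow directly from the file sizes. A small Python sketch of the arithmetic, using the rounded megabyte values printed by `du` (so the results are approximate); the helper name `reduction_pct` is just for illustration:

```python
def reduction_pct(original_mb, compressed_mb):
    """Percentage by which the compressed file is smaller than the original."""
    return 100.0 * (1 - compressed_mb / original_mb)


csv_mb = 407  # test.csv, as reported by du
sizes_mb = {"test.csv.gz": 70, "test.parquet": 35}

for name, size in sizes_mb.items():
    print(f"{name}: {reduction_pct(csv_mb, size):.1f}% smaller than test.csv")
```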
  • 14. Let’s discuss first Dremel: Interactive Analysis of Web-Scale Datasets
  • 15. What is Dremel • A paper published by Google in 2010 • Interactive Analysis of Web-Scale Datasets – Ad hoc queries on very large datasets (petabytes) – Near real time – MR (Map-Reduce) works, but it is meant for batch processing • SQL-like query interface • Nested data (with a columnar storage representation) • Paper: – http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf • Projects (Implementation): – Google BigQuery (Cloud based) – Apache Drill (Open source)
  • 16. Why Dremel: Speed Matters Credit: http://www.slideshare.net/robertlz/dremel-interactive-analysis-of-webscale-datasets
  • 17. Widely used inside Google Credit: http://www.slideshare.net/robertlz/dremel-interactive-analysis-of-webscale-datasets
  • 18. Tree based structure Credit: http://www.alberton.info/images/articles/papers/dremel1.png
  • 19. Column striped representation Credit: http://www.alberton.info/images/articles/papers/dremel2.png
  • 22. About Apache Drill • Based on Google’s Dremel paper • Supports data-intensive distributed applications for interactive analysis of large-scale datasets • Has a datastore-aware optimizer, which constructs the query plan based on the datastore’s processing capabilities. • Supports data locality. • http://drill.apache.org/
  • 23. So Why Drill? • Flexible data model • Fixed schema (Avro) / dynamic schema (JSON) / schema-less SQL • Schema can be discovered on the fly • Built-in optimistic query execution engine. • Doesn’t require a particular storage or execution system (Map-Reduce, Spark, Tez) • Better performance and manageability • Cluster of commodity servers • Daemon (drillbit) on each data node • Works with Hadoop, CSV, JSON, Avro/Parquet, MongoDB, HBase, Solr etc.
  • 25. Distributed SQL query engine Credit: http://www.slideshare.net/MapRTechnologies/drill-highperformancesqlenginewithjsondatamodel
  • 26. Designed to support wide set of use-cases Credit: http://www.slideshare.net/MapRTechnologies/drill-highperformancesqlenginewithjsondatamodel
  • 27. Querying
    CSV:
      0: jdbc:drill:> select count(*) from dfs.`/tmp/test.csv`;
      +-----------+
      | EXPR$0    |
      +-----------+
      | 10000001  |
      +-----------+
      1 row selected (5.771 seconds)
    Parquet:
      0: jdbc:drill:> select count(*) from dfs.`/tmp/test.parquet`;
      +-----------+
      | EXPR$0    |
      +-----------+
      | 10000001  |
      +-----------+
      1 row selected (0.257 seconds)
  • 28. Drill Shell
      ./bin/drill-embedded
    This starts Drill in embedded mode. You will see output like this:
      org.glassfish.jersey.server.ApplicationHandler initialize
      INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
      apache drill 1.0.0
      "say hello to my little drill"
      0: jdbc:drill:zk=local>
    For Windows, this starts the shell with Drill in embedded mode:
      ./bin/sqlline.bat -u "jdbc:drill:schema=dfs;zk=local"
  • 29. Terminology • Drillbit – A Drillbit runs on each data node in the cluster. Drill maximizes data locality during query execution: movement of data over the network or between nodes is minimized or eliminated when possible.
  • 30. Drill Configuration
    Configuration file: $DRILL_HOME/conf/drill-override.conf
      drill.exec: {
        cluster-id: "<cluster_name>",
        zk.connect: "<zkhostname1>:<port>,<zkhostname2>:<port>,<zkhostname3>:<port>"
      }
    Default configuration:
      drill.exec: {
        cluster-id: "drillbits1",
        zk.connect: "localhost:2181"
      }
  • 31. Starting Drill in Distributed Mode
      ./bin/drillbit.sh restart
    This restarts the Drillbit service. General usage:
      ./bin/drillbit.sh [--config <conf-dir>] (start|stop|status|restart|autorestart)
    Tip: Check the hostname the Drillbit is listening on, e.g.
      2015-09-05 03:21:20,070 [main] INFO o.apache.drill.exec.server.Drillbit - Drillbit environment: host.name=192.168.0.101
    Start the shell (this starts the Drill shell on the local machine based on the configuration provided in drill-override.conf):
      ./bin/drill-localhost (if the Drillbit is listening on localhost)
    otherwise
      ./bin/sqlline -u "jdbc:drill:drillbit=192.168.0.101"
  • 32. Verify it once, and try a sample
      0: jdbc:drill:zk=local> select * from sys.drillbits;
      +----------------+------------+---------------+------------+----------+
      | hostname       | user_port  | control_port  | data_port  | current  |
      +----------------+------------+---------------+------------+----------+
      | 192.168.0.101  | 31010      | 31011         | 31012      | true     |
      +----------------+------------+---------------+------------+----------+
      0: jdbc:drill:zk=local> select count(*) from `dfs`.`$DRILL_HOME/sample-data/nation.parquet`;
      +---------+
      | EXPR$0  |
      +---------+
      | 25      |
      +---------+
      1 row selected (1.752 seconds)
  • 33. Drill – Web Client A Storage Plugin can be added/Enabled
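Besides the shell and the web client, queries can also be submitted over HTTP. The sketch below only builds the request and does not send it. It assumes Drill's REST query endpoint (`/query.json`) on the default web port 8047 of a local Drillbit; the helper name `build_drill_query` and the host are assumptions for illustration and would need adjusting for a real cluster.

```python
import json
import urllib.request


def build_drill_query(sql, host="localhost", port=8047):
    """Build (but do not send) a POST request for Drill's REST query endpoint.

    host and port are assumptions: 8047 is the default web port, and the
    Drillbit is assumed to be reachable on localhost.
    """
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_drill_query("select count(*) from dfs.`/tmp/test.parquet`")
print(request.get_method(), request.full_url)

# To run it against a live Drillbit (not executed here):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["rows"])
```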
  • 35. About Apache Flink • Open source framework for Big Data analytics • Distributed streaming dataflow engine • Runs computations in memory. • Executes programs in a data-parallel and pipelined manner. • Most popular for stream data processing. • Provides high-level APIs in • Java • Scala • Python • Integrates with Hadoop and its ecosystem and can read existing data from HDFS or HBase. • https://flink.apache.org
  • 36. So Why Flink? Credit: Compiled from several articles, blogs, and Stack Overflow posts listed on the references page. • Shares a lot of similarities with relational DBMSs • Data is serialized in byte buffers and largely processed in binary representation • This allows fine-grained memory control • Uses a pipeline-based processing model with a cost-based optimizer to choose the execution strategy. • Optimized for cyclic or iterative processes by using iterative transformations on collections • Achieved by an optimization of join algorithms, operator chaining and reuse of partitioning and sorting. • Flink streaming processes data streams as true streams, i.e., data elements are immediately "pipelined" through a streaming program as soon as they arrive • Also has its own memory management system, separate from Java’s garbage collector.
  • 38. Flink vs Spark (they look pretty similar)
    Apache Flink:
      case class Word (word: String, frequency: Int)
      val counts = text
        .flatMap {line => line.split(" ").map(word => Word(word,1))}
        .groupBy("word").sum("frequency")
    Apache Spark:
      val counts = text
        .flatMap(line => line.split(" ")).map(word => (word, 1))
        .reduceByKey{case (x, y) => x + y}
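Both snippets express the same three steps: split lines into words, pair each word with a count of 1, and sum the counts per word. A plain-Python sketch of that pipeline, with no Flink or Spark involved and made-up input lines, shows what each step computes:

```python
from collections import Counter
from itertools import chain

# Sample input; in the slides `text` is a distributed dataset of lines.
text = ["to be or not to be", "to stream or to batch"]

# flatMap: split every line into words and flatten the result
words = chain.from_iterable(line.split(" ") for line in text)

# map: pair each word with an initial frequency of 1
pairs = ((word, 1) for word in words)

# groupBy + sum (Flink) / reduceByKey (Spark): add up frequencies per word
counts = Counter()
for word, frequency in pairs:
    counts[word] += frequency

print(counts.most_common())
```

The frameworks run the same logic, but partition the grouping step across many machines.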
  • 39. But…. Apache Spark: is a batch processing framework that can approximate stream processing (called micro-batching) Apache Flink: is primarily a stream processing framework that can look like a batch processor.
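The difference can be illustrated with a toy running sum in Python. This is a simplified model under stated assumptions, not how either engine is implemented: real micro-batching cuts batches by time interval rather than by element count, and the function names are hypothetical.

```python
def micro_batched_totals(events, batch_size):
    """Micro-batching: buffer events, update the total once per batch."""
    total = 0
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            total += sum(batch)
            yield total
            batch = []
    if batch:  # flush the final, partially filled batch
        total += sum(batch)
        yield total


def streamed_totals(events):
    """Streaming: update and emit the total as each event arrives."""
    total = 0
    for event in events:
        total += event
        yield total


events = [1, 2, 3, 4, 5]
print(list(micro_batched_totals(events, batch_size=2)))
print(list(streamed_totals(events)))
```

Both versions end at the same total; the micro-batched one simply reports fewer intermediate results, each delayed until its batch is complete.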
  • 42. Flink – Web Client Arguments to program separated by spaces
  • 43. Flink – Web Client
  • 44. References • https://flink.apache.org/ • https://www.quora.com/What-are-the-differences-between-Apache-Spark-and-Apache-Flink • http://stackoverflow.com/questions/28082581/what-is-the-differences-between-apache-spark-and-apache-flink • http://statrgy.com/2015/06/01/best-data-processing-engine-flink-vs-spark/ • http://stackoverflow.com/questions/29780747/apache-flink-vs-apache-spark-as-platforms-for-large-scale-machine-learning • http://www.infoworld.com/article/2919602/hadoop/flink-hadoops-new-contender-for-mapreduce-spark.html • http://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.html
  • 45. Thanks! @rahuldausa on twitter and slideshare http://www.linkedin.com/in/rahuldausa