An Overview of Big data & Hadoop
Prepared & presented by
Tony Nguyen
July 2014
Presentation outline
This presentation introduces Big Data concepts and gives an
overview of different Big Data technologies
Understand the different tools and use the right tools for
DW and ETL
How does current BI/DW fit into the Big Data context?
How do Microsoft BI and Hadoop get married?
What is big data?
Big data refers to any collection of data sets so large
and complex (e.g. hundreds of petabytes) that traditional
data processing tools cannot handle it
Why does Big Data matter?
• 2 billion internet users in the world today
• 7.3 billion active cell phones in 2014
• 7 TB of data is processed by Twitter every day
• 500 TB of data is processed by Facebook every day
• With such massive quantities of data, businesses need fast,
reliable, deeper data insight
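To put the daily volumes above in perspective, they can be converted into a sustained per-second rate. This is only a back-of-the-envelope sketch: it assumes decimal terabytes (10^12 bytes) and an even load across the day, neither of which the slide states.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds
TB = 10**12                     # assumption: decimal terabytes


def bytes_per_second(tb_per_day: float) -> float:
    """Average throughput needed to process `tb_per_day` within one day."""
    return tb_per_day * TB / SECONDS_PER_DAY


if __name__ == "__main__":
    # Daily volumes quoted on the slide.
    for name, tb_per_day in [("Twitter", 7), ("Facebook", 500)]:
        rate_mb = bytes_per_second(tb_per_day) / 10**6
        print(f"{name}: about {rate_mb:,.0f} MB/s sustained")
```

Under these assumptions, the 500 TB/day figure works out to several gigabytes every second, which is well beyond what a single conventional server is built to ingest.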
Big Data Technologies
What is Hadoop?
• Hadoop refers to an ecosystem that includes a large-scale
distributed file system, used to store
and process big data across multiple storage
servers.
• Core Hadoop technologies include MapReduce and the
Hadoop Distributed File System (HDFS)
Who are the major Hadoop vendors?
IBM InfoSphere BigInsights: IBM packages Hadoop with
its products, including Text Analytics, Social Data
Analytics Accelerator, Big SQL and Big R
Cloudera: packages the Hadoop core components with its
well-known analytic SQL product, Impala, and
provides enterprise support. Current Cloudera Hadoop
versions include CDH 4.7 and CDH 5.1
Hortonworks: a company formed by Yahoo and
Benchmark Capital. Hortonworks makes Hadoop
ready for the enterprise; its latest version is HDP 2.1
Microsoft: contributes HDInsight, which is Hadoop on
the Windows platform
HDFS
• The Hadoop Distributed File System
(HDFS) is a distributed, scalable, and
portable file system written in Java for the
Hadoop framework.
• It is designed to run across low-cost
commodity hardware
MapReduce
• MapReduce is a programming model and an
associated implementation for processing
and generating large data sets with a
parallel, distributed algorithm on a cluster.
• Hadoop 2 introduced YARN (Yet Another
Resource Negotiator), which moves cluster resource
management out of MapReduce.
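The programming model can be illustrated with a single-process word count. This is a minimal sketch of the map, shuffle and reduce steps in plain Python, not Hadoop's Java API; the function names are illustrative only, and a real cluster would run the map and reduce calls in parallel on different nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big hadoop"]))
```

The shuffle step is where a cluster moves data between machines, which is why later slides describe it as the expensive part of MapReduce.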
Core components on top of Hadoop
1. Hive (Facebook)
2. Pig (Yahoo)
3. HBase
4. HCatalog
5. Knox
6. ZooKeeper
7. Sqoop
Pig
1. Originally developed by Yahoo
2. Best used for ETL on large data sets
3. Its dataflow scripting language, Pig Latin, is a high-level
language designed to remove the complexities of coding
MapReduce applications.
4. Pig converts its operators into MapReduce code.
5. Instead of needing Java programming skills and an
understanding of the MapReduce coding infrastructure,
people with little programming experience can simply invoke
ORDER BY or FILTER operators without having to code a
MapReduce application to accomplish those tasks.
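The filter-then-sort dataflow described above can be mimicked in a few lines of Python. This is an analogy only: it is not Pig Latin and does not run on Hadoop, and the record fields and helper names are made up for illustration. It shows the idea of chaining high-level operators instead of writing the underlying jobs by hand.

```python
from operator import itemgetter
from typing import Callable, Dict, List

Record = Dict[str, object]

# Hypothetical input records, standing in for a file loaded from HDFS.
PAGE_VIEWS: List[Record] = [
    {"user": "ann", "views": 42},
    {"user": "bob", "views": 7},
    {"user": "cid", "views": 19},
]


def filter_rows(rows: List[Record], keep: Callable[[Record], bool]) -> List[Record]:
    """Keep only the rows for which the predicate is true."""
    return [row for row in rows if keep(row)]


def order_rows(rows: List[Record], field: str, descending: bool = False) -> List[Record]:
    """Return the rows sorted by one field."""
    return sorted(rows, key=itemgetter(field), reverse=descending)


def active_users(rows: List[Record]) -> List[str]:
    """Dataflow: load -> filter -> order -> project."""
    busy = filter_rows(rows, lambda row: row["views"] >= 10)
    ranked = order_rows(busy, "views", descending=True)
    return [row["user"] for row in ranked]


if __name__ == "__main__":
    print(active_users(PAGE_VIEWS))
```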
Hive
• Originally developed by Facebook in 2007
• Hive is a data warehouse built on top of the
Hadoop file system (HDFS) that allows
developers to use SQL-like scripts (called Hive
SQL or HQL) to create databases and tables.
• Hive translates the SQL-like scripts into
MapReduce jobs to store and process large
data sets.
• Short learning curve, as BI developers use
familiar SQL-like scripts
Hive (Cont’d)
UPDATE or DELETE of a record isn't allowed in Hive (as of version 0.13),
but INSERT INTO is acceptable.
A way to work around this limitation is to use
partitions: if you're getting different batches of ids
separately, you could redesign your table so that it is
partitioned by id, and then you would be able to
easily drop partitions for the ids you want to get rid
of.
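The partition workaround can be pictured with a toy in-memory model. This is a sketch of the idea only, not Hive itself: the class and method names are hypothetical, and each dictionary entry stands in for one partition of the table. Rows are only ever appended, and "deleting" an id means dropping its whole partition.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, int]


class PartitionedTable:
    """Append-only table whose rows are grouped by a partition key."""

    def __init__(self) -> None:
        self._partitions: Dict[str, List[Row]] = defaultdict(list)

    def insert_into(self, batch_id: str, rows: List[Row]) -> None:
        """Appending rows is allowed, as with INSERT INTO."""
        self._partitions[batch_id].extend(rows)

    def drop_partition(self, batch_id: str) -> None:
        """Removes every row of one batch without touching the others."""
        self._partitions.pop(batch_id, None)

    def select_all(self) -> List[Row]:
        return [row for rows in self._partitions.values() for row in rows]


if __name__ == "__main__":
    table = PartitionedTable()
    table.insert_into("batch-1", [("ann", 1), ("bob", 2)])
    table.insert_into("batch-2", [("cid", 3)])
    table.drop_partition("batch-1")  # stands in for deleting batch-1's records
    print(table.select_all())
```

The trade-off is granularity: a partition is dropped as a unit, so this only works when the rows to remove line up with the partition key.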
HBase
• HBase is a column-oriented database management system that
runs on top of HDFS
• It is modelled after Google’s Bigtable
technology. HBase was created for hosting very large tables
with billions of rows and millions of columns.
• An HBase system comprises a set of tables. Each table contains
rows and columns, much like a traditional database
• HBase provides random, real-time access to your Big Data.
• Does not support a structured query language like SQL
• Referred to as a NoSQL technology (NoSQL means Not Only SQL),
as HBase is not intended to replace your traditional RDBMS
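The access pattern described above, looking values up by row key and column rather than through SQL, can be sketched with nested dictionaries. This is a toy model, not the HBase client API: the class name is hypothetical, and the "family:qualifier" column naming is used only to echo how HBase labels columns.

```python
from typing import Dict, Optional


class TinyWideColumnTable:
    """Sparse table: each row key maps to only the columns it actually has."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}

    def put(self, row_key: str, column: str, value: str) -> None:
        """Write one cell; rows need not share the same set of columns."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key: str, column: str) -> Optional[str]:
        """Random access to a single cell by row key and column name."""
        return self._rows.get(row_key, {}).get(column)

    def get_row(self, row_key: str) -> Dict[str, str]:
        """All cells stored for one row key (empty if the row is absent)."""
        return dict(self._rows.get(row_key, {}))


if __name__ == "__main__":
    users = TinyWideColumnTable()
    users.put("user#1", "info:name", "Ann")
    users.put("user#1", "info:city", "Hanoi")
    users.put("user#2", "info:name", "Bob")  # no city stored for this row
    print(users.get("user#1", "info:city"))
    print(users.get_row("user#2"))
```

Because a missing cell is simply absent rather than stored as NULL, tables with millions of possible columns stay cheap when most rows use only a few of them.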
HCatalog
1. HCatalog is a table and storage management layer
for Hadoop that enables users with different data
processing tools – Apache Pig, Apache MapReduce,
and Apache Hive – to more easily read and write data
on the grid
2. Frees the user from having to know where the data is
stored, with the table abstraction
3. Enables notifications of data availability
4. Provides visibility for data cleaning and archiving
tools
Knox
A system that provides a single point of
authentication and access for Apache Hadoop
services in a cluster. The goal of the project is
to simplify Hadoop security for users who
access the cluster data and execute jobs, and
for operators who control access and manage
the cluster.
ZooKeeper
Apache ZooKeeper provides
operational services for a Hadoop
cluster, including high availability,
a naming service, notifications and
message queues.
Sqoop
Sqoop provides a way to import and export data
between relational database tables (for example, SQL
Server) and HDFS.
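The kind of transfer Sqoop automates can be imitated locally. In this sketch SQLite stands in for the relational source and a local directory stands in for HDFS; both are assumptions made so the example runs anywhere, and the function name is hypothetical. It does not use Sqoop or its command line.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_table_to_csv(conn: sqlite3.Connection, table: str, target_dir: str) -> Path:
    """Dump every row of `table` into <target_dir>/<table>.csv with a header row.

    `table` is interpolated into the SQL text, so it must come from trusted code.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    out_path = Path(target_dir) / f"{table}.csv"
    with out_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column[0] for column in cursor.description])
        writer.writerows(cursor)
    return out_path


def demo() -> str:
    """Build a tiny source table, export it, and return the file contents."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
    with tempfile.TemporaryDirectory() as landing_zone:  # stand-in for an HDFS directory
        path = export_table_to_csv(conn, "customers", landing_zone)
        contents = path.read_text()
    conn.close()
    return contents


if __name__ == "__main__":
    print(demo())
```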
Nine Hadoop SQL databases
Apache Hive
Impala
Presto (Facebook)
Shark
Apache Drill
EMC/Pivotal HAWQ
Big SQL by IBM
Apache Phoenix (for HBase)
Apache Tajo
Three popular open source Hadoop-based
SQL databases
1. Impala (Cloudera)
2. Stinger (Hortonworks), aka Hive 0.11, Hive
0.12, Hive 0.13 or Hive-on-Tez
3. Presto (Facebook)
Impala
1. Developed by Cloudera in 2012
2. SQL query engine that runs natively in Apache Hadoop
3. Queries data using SELECT, JOIN, and aggregate
functions in near real time
4. Accesses HDFS directly and uses MPP computation
instead of MapReduce, and therefore provides nearly
real-time data access
5. Processing happens in memory, which
eliminates the latency of the disk I/O that occurs extensively
during MapReduce jobs
MPP vs MapReduce
Both are distributed data processing systems, but they differ as follows:
MPP | MapReduce
Used on expensive, specialized hardware tuned for CPU, storage and network performance | Deployed to clusters of commodity servers that in turn use commodity disks
Faster | Slower
In-memory computation | Disk I/O computation
Queried with SQL | Programmed in Java code
Declarative query | Imperative code
SQL is easier and more productive | More difficult for IT professionals
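The declarative versus imperative contrast in the comparison above can be shown on one small aggregation. This sketch uses SQLite purely as a convenient local SQL engine (an assumption for illustration, not an MPP system) and a hand-written Python loop in place of Java MapReduce code; the sample data is made up.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical sales records: (region, amount).
SALES: List[Tuple[str, int]] = [("north", 120), ("south", 80), ("north", 30)]


def total_by_region_sql(rows: List[Tuple[str, int]]) -> Dict[str, int]:
    """Declarative: state WHAT is wanted and let the engine plan the work."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
    conn.close()
    return result


def total_by_region_imperative(rows: List[Tuple[str, int]]) -> Dict[str, int]:
    """Imperative: spell out HOW to group and add, step by step."""
    totals: Dict[str, int] = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


if __name__ == "__main__":
    print(total_by_region_sql(SALES))
    print(total_by_region_imperative(SALES))
```

Both functions return the same totals; the difference is who decides the execution steps, the query engine or the programmer.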
Stinger
1. Refers to the new versions of Hive (versions
0.11 - 0.13) that aim to overcome the performance
barrier of MapReduce computation
2. More SQL compliance for Hive SQL
http://hortonworks.com/labs/stinger/
Stinger’s Hive SQL new features
Presto
1. In response to Cloudera Impala, Facebook started
Presto in 2012 and open-sourced it in 2013
2. Presto is similar in approach to Impala in that it is
designed to provide an interactive experience whilst
still using your existing datasets stored in Hadoop.
It provides:
• JDBC drivers
• ANSI SQL syntax support (presumably ANSI-92)
• A set of ‘connectors’ used to read data from existing data sources. Connectors
include: HDFS, Hive, and Cassandra.
• Interop with the Hive metastore for schema sharing
How do Hive, Impala and Presto work?
Comparison of Hive, Impala, Presto and Stinger
Attribute | Hive | Impala | Presto | Stinger
Year | 2007 | 2012 | In development | In development
Original developer | Facebook | Cloudera | Facebook | Hortonworks
Main purpose | Data warehouse | Enable analysts and data scientists to directly interact with any data stored in Hadoop; offload self-service business intelligence to Hadoop | RDBMS | RDBMS
Computation approach | MapReduce | Massively parallel processing (MPP) architecture | MPP | MPP
Performance | Low | Fast | Fast | Fast
Latency | High | Low | Low | Low
Language | SQL-like script | ANSI-92 SQL support with user-defined functions (UDFs) | SQL including RANK, LEAD, LAG | SQL-like script
Interfaces | CLI, Web, ODBC, JDBC | ODBC, JDBC, impala-shell, web | JDBC | JDBC
High availability | Hadoop 2.0/CDH4 has HA at the HDFS level | Yes | Hadoop 2.0/CDH4 has HA at the HDFS level | Hadoop 2.0/CDH4 has HA at the HDFS level
Replication | Yes | Supported between two CDH 5 clusters | Unknown | Unknown
Hive pros and cons
Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
Advantages
• It has been around for 5 years. You could say it
is a mature and proven solution.
• Runs on the proven MapReduce framework
• Good support for user-defined functions
• It can be mapped to HBase and other
systems easily
Disadvantages
• Since it uses MapReduce, it carries
all the drawbacks that MapReduce has,
such as an expensive shuffle phase as well
as huge I/O operations
• Hive still does not support multiple reducers, which
makes queries like GROUP BY and ORDER BY
a lot slower
• A lot slower compared to its competitors
Impala pros and cons
Advantages
• Lightning speed, and it promises near real-time
ad hoc query processing
• The computation happens in memory,
which removes an enormous amount of
latency and disk I/O
• Latest version supports UDFs
• Open source, Apache licensed
Disadvantages
• No fault tolerance for running queries.
If a query fails on a node, the query
has to be reissued; it can't resume
from where it failed.
Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
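The "no fault tolerance" drawback means the client, not the engine, has to recover from a node failure. The sketch below models that behaviour in plain Python: every name in it is hypothetical and it does not use any Impala client library. One failing fragment discards all partial results, and the whole query is simply run again from the start.

```python
from typing import Callable, List, Tuple


class NodeFailure(Exception):
    """Raised when one node cannot finish its fragment of the query."""


def run_query_once(fragments: List[Callable[[], int]]) -> List[int]:
    """Run every fragment; a single NodeFailure aborts the whole query."""
    partial_results = []
    for fragment in fragments:
        partial_results.append(fragment())  # partial results are lost on failure
    return partial_results


def run_with_reissue(build_fragments: Callable[[], List[Callable[[], int]]],
                     max_attempts: int = 3) -> Tuple[int, List[int]]:
    """Reissue the full query until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt, run_query_once(build_fragments())
        except NodeFailure:
            continue  # nothing to resume from: start over
    raise RuntimeError(f"query failed after {max_attempts} attempts")


def make_flaky_cluster() -> Callable[[], List[Callable[[], int]]]:
    """Two simulated nodes; the second one fails the first time it is used."""
    calls = {"count": 0}

    def healthy_node() -> int:
        return 10

    def flaky_node() -> int:
        calls["count"] += 1
        if calls["count"] == 1:
            raise NodeFailure("node 2 went down")
        return 32

    return lambda: [healthy_node, flaky_node]


if __name__ == "__main__":
    print(run_with_reissue(make_flaky_cluster()))
```

For short interactive queries a full rerun is cheap; for long-running batch jobs it is the main reason MapReduce-style checkpointing to disk can still be preferable.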
Presto pros and cons
Advantages
• Lightning fast, and it promises near real-time
interactive querying
• Used extensively at Facebook, so it is proven
and stable
• Open source, and there has been strong momentum
behind it ever since it was open-sourced
• It also uses a distributed query processing
engine, so it avoids the latency and
disk I/O issues of traditional MapReduce
• Well documented. Perhaps this is the first open
source software from Facebook that got a
dedicated website from day 1.
Disadvantages
• It's a newborn baby. We need to wait and watch,
since there are some interesting active
developments going on.
• As of now it supports only Hive managed tables.
Though the website claims one can also query
HBase, that feature is still under
development.
• No UDF support yet. This is the most
requested feature to be added.
Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
Performance comparison
Performance Test by Justin Erickson, Marcel Kornacker, and Dileep Kumar
May 29, 2014
Comments on Impala
Among Impala, Hive and Presto, Impala seems to be
the most mature SQL-on-Hadoop engine
Impala appears to be the winner in terms of
performance and maturity
Hadoop DW/BI Solutions
Combining Hadoop and SQL Server tools
Both Hadoop and SQL Server have strengths and
weaknesses
Combining Hadoop and SQL Server tools lets the
strengths of each technology offset the weaknesses
of the other
SQL Server vs SQL on Hadoop
SQL Server | SQL on Hadoop
Enforces data quality and consistency better (unique indexes, keys and foreign keys) | Lacks data quality enforcement
Has a scalability limit | Better for scaling and processing massive data
Deployment options
Hadoop on premises
Hadoop in the cloud
1. Infrastructure as a Service (IaaS) – providers of IaaS
offer computers – physical or (more often) virtual
machines
2. Platform as a Service (PaaS) – includes the operating
system, programming language execution environment,
database, and web server
3. Software as a Service (SaaS) – provides access to
application software and databases
Deployment options scorecard
Why move Hadoop to cloud?
Save time and money
Scalability
Microsoft BI gets married with Hadoop
Move Microsoft BI to the cloud
Use the right ETL tools
SSIS – use when the skills already exist in the organisation, the data
needs transformation, and performance tuning is important
Pig – use when the data set is very large, you want to take advantage
of the scalability of Hadoop, and IT staff are comfortable
learning a new language
Sqoop – use when there is little need to transform the data, ease of
use matters, IT staff aren't comfortable with SSIS or Pig, or you want to load
SQL tables directly into Hadoop
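The selection criteria above can be written down as a small decision helper. This is one possible reading of the slide, not an official rule: the function name, its parameters and the order in which the checks are applied are all assumptions made for illustration.

```python
def choose_etl_tool(needs_transformation: bool,
                    very_large_data: bool,
                    has_ssis_skills: bool,
                    will_learn_pig: bool) -> str:
    """Suggest SSIS, Pig or Sqoop from the criteria listed on the slide."""
    if not needs_transformation:
        # Little or no transformation: load the SQL tables straight into Hadoop.
        return "Sqoop"
    if very_large_data and will_learn_pig:
        # Transformation at scale, and staff are willing to learn Pig Latin.
        return "Pig"
    if has_ssis_skills:
        # Transformation needed and the organisation already knows SSIS.
        return "SSIS"
    # Staff are comfortable with neither SSIS nor Pig.
    return "Sqoop"


if __name__ == "__main__":
    print(choose_etl_tool(needs_transformation=True, very_large_data=True,
                          has_ssis_skills=True, will_learn_pig=True))
    print(choose_etl_tool(needs_transformation=True, very_large_data=False,
                          has_ssis_skills=True, will_learn_pig=False))
    print(choose_etl_tool(needs_transformation=False, very_large_data=True,
                          has_ssis_skills=False, will_learn_pig=False))
```

In practice the three tools are often combined, for example Sqoop to land the raw tables and Pig or SSIS to transform them afterwards.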
SQL Server Parallel Data Warehouse:
a high-performance but expensive solution
SQL Server Parallel Data Warehouse (PDW) is the MPP edition of SQL
Server.
Unlike the Standard, Enterprise or Datacenter editions, PDW is
actually a hardware and software bundle rather than just a piece of
software. Microsoft calls it a database "appliance".
It isn't a substitute for SSIS, SSAS and SSRS. It's Microsoft's
answer for customers needing to process tens or hundreds of terabytes
who want the ability to scale out large workloads across multiple
servers, large storage arrays and many processors.
It includes:
◦ Microsoft PolyBase
◦ Microsoft Analytics Platform System (APS)
◦ Runs on top of Hadoop
References
Microsoft Big Data Solutions, Wiley, February 2014
Microsoft SQL Server 2012 with Hadoop, Debarchan
Sarkar, Packt Publishing Ltd, 2013
Cloudera.com
Hortonworks.com
Hadoop.apache.org
Microsoft.com/bigdata
Impala.io
Prestodb.io
Hive.apache.org
Q & A

  • 1. An Overview of Big data & Hadoop Prepared & presented by Tony Nguyen July 2014
  • 2. Presentation outline This presentation gives Big data concepts and an overview of different Big Data technologies Understand different tools and use the right tools for DW and ETL How does current BI/DW fit to the Big Data context? How do Microsoft BI and Hadoop get married?
  • 3. What is big data? Refers to any collection of data sets so large and complex i.e. hundreds of Petabytes
  • 4. Why is Big Data concerned? • 2 billion internet users in the world today, • 7.3 billion active cell phones in 2014 • 7TB of data is processed by Twitter everyday • 500TB of data is processed by Facebook everyday • With massive quantity of data, businesses need fast, reliable, deeper data insight
  • 6. What is Hadoop?  refers an ecosystem which includes large scale distributed filesystem in order to store and process big data across multiple storage servers.  Hadoop technologies include MapReduce & Hadoop Distributed Filesytem (HDFS)
  • 7. Who are the major Hadoop vendors? IBM InfoSphere BigInsights : IBM packs Hadoop with its products including Text analytics, Social Data Analytics Accelerator, Big SQL, Big R Clourera: pack Hadoop core components with its well- known analytic SQL product named Impala and provides enterprise support. Current Clourera Hadoop versions includes CDH4.7 and CDH5.1 Hortonworks: a company is formed by Yahoo and Benchmark Capital, Hortonworks makes Hadoop ready for enterprise with the latest version of HDP 2.1 Microsoft: contributes HDInsight as Hadoop on Windows platform
  • 8. HDFS  The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file-system written in Java for the Hadoop framework.  It is designed to run across low-cost commodity hardware
  • 9. MapReduce  MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.  From Hadoop version 2.1, Yet Another MapReduce (YARN) was introduced.
  • 10.
  • 11. Core components on the top of Hadoop 1. Hive (Facebook) 2. Pig (Yahoo) 3. Hbase 4. HCatalog 5. Knox 6. ZooKeeper 7. Sqoop
  • 12. Pig 1. Originally developed by Yahoo 2. Best used for large data set ETL 3. Dataflow scripting language called PigLatin, a High-level language designed to remove the complexities of coding MapReduce applications. 4. Pig converts its operators into MapReduce code. 5. Instead of needing Java programming skills and an understanding of the MapReduce coding infrastructure, people with little programming skills, can simply invoke SORT or FILTER operators without having to code a MapReduce application to accomplish those tasks.
  • 13. Hive  Originally developed by facebook in 2007  Hive is a data warehouse built on the top of Hadoop file system (HDFS) and allowing developers use SQL-like scripts (called Hive SQL or HQL) to create databases & tables.  Hive translates the SQL-like scripts into the MapReduce algorithm to store and process large data sets.  The short learning curve as BI developers use familiar SQL-like scripts
  • 14. Hive (Cont’d) UPDATE or DELETE a record isn't allowed in Hive, but INSERT INTO is acceptable. A way to work around this limitation is to use partitions: if you're getting different batches of ids separately, you could redesign your table so that it is partitioned by id, and then you would be able to easily drop partitions for the ids you want to get rid of.
  • 15. Hbase  HBase is a column-oriented database management system that runs on top of HDFS  The database that is modelled after Google’s BigTable technology. HBase was created for hosting very large tables with billions of rows and millions of columns.  An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database  HBase provides random, real time access to your Big Data.  Does not support a structured query language like SQL  Referred as NoSQL technology (NoSQL means Not Only SQL) as HBase is not intended to replace your traditional RDBMS
  • 16. HCatalog 1. HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data on the grid 2. Frees the user from having to know where the data is stored, with the table abstraction 3. Enables notifications of data availability 4. Provides visibility for data cleaning and archiving tools
  • 17. Knox A system that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access and manage the cluster.
  • 18. Zookeeper Apache ZooKeeper provides operational services for a Hadoop cluster, including high availability, naming service, notifying system, message queue.
  • 19. Sqoop Sqoop provides a way to import and export data to and from relational database tables (for example, SQL Server) and HDFS.
  • 20. Eight Hadoop SQL databases Apache Hive Impala Presto (Facebook) Shark Apache Drill EMC/Pivotal HAWQ BigSQL by IBM Apache Pheonix (for HBase) Apache Tajo
  • 21. Three popular open source Hadoop-based SQL databases 1. Impala (Cloudera) 2. Stinger (Hortonworks) –(aka Hive 11, Hive 12, Hive 13 or Hive-on-Tez) 3. Presto (Facebook)
  • 22. Impala 1. Developed by Cloudera in 2012 2. SQL query engine that runs natively in Apache Hadoop 3. Query data uses SELECT, JOIN, and aggregate functions – in real time 4. Access directly to HDFS and use MPP computation instead of MapReduce. Therefore, provide nearly real time data access 5. The entire process happen on memory, therefore it eliminates the latency of Disk IO that happen extensively during MapReduce job.
  • 23. MPP vs MapReduce Both are distributed data processing systems but difference are as follows: MPP MapReduce used on expensive, specialized hardware tuned for CPU, storage and network performance deployed to clusters of commodity servers that in turn use commodity disks Faster Slower In memory computation Disk I/O computation Queried with SQL Java code Declarative query Imperative code (machine code) SQL is easier and more productive More difficult for IT processional
  • 24. Stinger 1. Refers to new versions of Hive (versions 0.11 - 0.13) to overcome the performance barrier of MapReduce computation 2. More SQL compliance for Hive SQL http://hortonworks.com/labs/stinger/
  • 25. Stinger’s Hive SQL new features
  • 26. Presto 1. Respond to Cloudera Impala, Facebook introduced Presto in 2012 2. Presto is similar in approach to Impala in that it is designed to provide an interactive experience whilst still using your existing datasets stored in Hadoop. It provides:  JDBC Drivers  ANSI-SQL syntax support (presumably ANSI-92)  A set of ‘connectors’ used to read data from existing data sources. Connectors include: HDFS, Hive, and Cassandra.  Interop with the Hive metastore for schema sharing
  • 27. How do Hive, Impala and Presto work?
  • 28. Comparison of Hive, Impala, Presto and Stinger
      ◦ Year: Hive 2007; Impala 2012; Presto in development; Stinger in development.
      ◦ Original developer: Hive, Facebook; Impala, Cloudera; Presto, Facebook; Stinger, Hortonworks.
      ◦ Main purpose: Hive, data warehouse; Impala, enable analysts and data scientists to directly interact with any data stored in Hadoop and offload self-service business intelligence to Hadoop; Presto, RDBMS; Stinger, RDBMS.
      ◦ Computation approach: Hive, MapReduce; Impala, massively parallel processing (MPP) architecture; Presto, MPP; Stinger, MPP.
      ◦ Performance: Hive, low; Impala, fast; Presto, fast; Stinger, fast.
      ◦ Latency: Hive, high; Impala, low; Presto, low; Stinger, low.
      ◦ Language: Hive, SQL-like script; Impala, ANSI-92 SQL support with user-defined functions (UDFs); Presto, SQL including RANK, LEAD, LAG; Stinger, SQL-like script.
      ◦ Interfaces: Hive, CLI, web, ODBC, JDBC; Impala, ODBC, JDBC, impala-shell, web; Presto, JDBC; Stinger, JDBC.
      ◦ High availability: Hive, Hadoop 2.0/CDH4 has HA at the HDFS level; Impala, yes; Presto, Hadoop 2.0/CDH4 has HA at the HDFS level; Stinger, Hadoop 2.0/CDH4 has HA at the HDFS level.
      ◦ Replication: Hive, yes; Impala, supported between two CDH 5 clusters; Presto, unknown; Stinger, unknown.
  • 29. Hive pros and cons
      Advantages:
      ◦ It has been around for 5 years, so it can be considered a mature and proven solution.
      ◦ Runs on the proven MapReduce framework.
      ◦ Good support for user-defined functions.
      ◦ It can be mapped to HBase and other systems easily.
      Disadvantages:
      ◦ Because it uses MapReduce, it carries all of MapReduce's drawbacks, such as an expensive shuffle phase and heavy I/O operations.
      ◦ Hive still does not support multiple reducers, which makes queries such as GROUP BY and ORDER BY much slower.
      ◦ Much slower than its competitors.
      Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
  • 30. Impala pros and cons
      Advantages:
      ◦ Lightning speed, with the promise of near-real-time ad hoc query processing.
      ◦ Computation happens in memory, which greatly reduces latency and disk I/O.
      ◦ The latest version supports UDFs.
      ◦ Open source, Apache licensed.
      Disadvantages:
      ◦ No fault tolerance for running queries. If a query fails on a node, it has to be reissued; it cannot resume from where it failed.
      Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
  • 31. Presto pros and cons
      Advantages:
      ◦ Lightning fast, with the promise of near-real-time interactive querying.
      ◦ Used extensively at Facebook, so it is proven and stable.
      ◦ Open source, with strong momentum behind it ever since it was open-sourced.
      ◦ It also uses a distributed query processing engine, so it avoids the latency and disk I/O issues of traditional MapReduce.
      ◦ Well documented. Perhaps this is the first open source software from Facebook that got a dedicated website from day 1.
      Disadvantages:
      ◦ It is a newborn. It is worth waiting and watching, since it is under active development.
      ◦ As of now it supports only Hive managed tables. Although the website claims one can also query HBase, that feature is still under development.
      ◦ No UDF support yet. This is the most requested feature.
      Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
  • 32. Performance comparison
      Performance test by Justin Erickson, Marcel Kornacker, and Dileep Kumar, May 29, 2014
  • 33. Performance comparison (cont’d)
      Performance test by Justin Erickson, Marcel Kornacker, and Dileep Kumar, May 29, 2014
  • 34. Performance comparison (cont’d)
      Performance test by Justin Erickson, Marcel Kornacker, and Dileep Kumar, May 29, 2014
  • 35. Performance comparison (cont’d)
      Performance test by Justin Erickson, Marcel Kornacker, and Dileep Kumar, May 29, 2014
  • 36. Comments on Impala
      ◦ Among Impala, Hive and Presto, Impala seems to be the most mature SQL-on-Hadoop engine.
      ◦ Impala appears to be the winner in terms of performance and maturity.
  • 38. Combining Hadoop and SQL Server tools
      ◦ Both Hadoop and SQL Server have strengths and weaknesses.
      ◦ Combining Hadoop and SQL Server tools lets the strengths of each technology offset the weaknesses of the other.
  • 39. SQL Server vs SQL on Hadoop
      ◦ Data quality: SQL Server enforces data quality and consistency better (unique indexes, keys and foreign keys); SQL on Hadoop lacks data quality enforcement.
      ◦ Scalability: SQL Server has a scalability limit; SQL on Hadoop is better for scaling and processing massive data.
  • 40. Deployment options
      ◦ Hadoop on premises
      ◦ Hadoop in the cloud
          1. Infrastructure as a Service (IaaS): providers of IaaS offer computers, either physical or (more often) virtual machines.
          2. Platform as a Service (PaaS): includes the operating system, programming language execution environment, database, and web server.
          3. Software as a Service (SaaS): provides access to application software and databases.
  • 42. Why move Hadoop to the cloud?
      ◦ Saves time and money
      ◦ Scalability
  • 43. Microsoft BI gets married to Hadoop
  • 44. Move Microsoft BI to the cloud
  • 45. Use the right ETL tools
      ◦ SSIS: use when the organisation already has the skills, the data needs transformation, and performance tuning is important.
      ◦ Pig: use when the data set is very large, you want to take advantage of the scalability of Hadoop, and IT staff are comfortable learning a new language.
      ◦ Sqoop: use when there is little need to transform the data, ease of use matters, IT staff are not comfortable with SSIS or Pig, or SQL tables are loaded directly into Hadoop.
  • 46. SQL Server Parallel Data Warehouse: a high-performance and expensive solution
      SQL Server Parallel Data Warehouse (PDW) is the MPP edition of SQL Server. Unlike the Standard, Enterprise or Datacenter editions, PDW is a hardware and software bundle rather than just a piece of software. Microsoft calls it a database "appliance". It is not a substitute for SSIS, SSAS and SSRS. It is Microsoft's answer for customers who need to process tens or hundreds of terabytes and want the ability to scale out large workloads across multiple servers, large storage arrays and many processors. It includes:
      ◦ Microsoft PolyBase
      ◦ Microsoft Analytics Platform System (APS)
      ◦ Runs on top of Hadoop
  • 47. SQL Server Parallel Data Warehouse (cont’d)
  • 48. SQL Server Parallel Data Warehouse (cont’d)
  • 49. References
      ◦ Microsoft Big Data Solutions, Wiley, February 2014
      ◦ Debarchan Sarkar, Microsoft SQL Server 2012 with Hadoop, Packt Publishing Ltd, 2013
      ◦ Cloudera.com
      ◦ Hortonworks.com
      ◦ Hadoop.apache.org
      ◦ Microsoft.com/bigdata
      ◦ Impala.io
      ◦ Prestodb.io
      ◦ Hive.apache.org
  • 50. Q & A
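
The "declarative query" versus "imperative code" contrast from the MPP vs MapReduce slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `sqlite3` database stands in for a SQL-on-Hadoop engine, plain Python functions stand in for a Java MapReduce job, and the `sales` table, its rows, and all function names are invented for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, sale amount). Invented for illustration.
SALES = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]


def totals_declarative(rows):
    """Declarative style: state the result wanted and let the engine plan it."""
    conn = sqlite3.connect(":memory:")  # stand-in for a SQL-on-Hadoop engine
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    result = dict(conn.execute(query))
    conn.close()
    return result


def map_phase(rows):
    """Map: emit one (key, value) pair per input record."""
    for region, amount in rows:
        yield region, amount


def shuffle_phase(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def totals_imperative(rows):
    """Imperative style: the programmer spells out every processing step."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(totals_declarative(SALES))
    print(totals_imperative(SALES))
```

Both functions compute the same per-region totals. The SQL version is a single statement, while the map, shuffle and reduce version has to be written and maintained by hand, which is the productivity difference the slide points to.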