1. An Overview of Big data & Hadoop
Prepared & presented by
Tony Nguyen
July 2014
2. Presentation outline
This presentation introduces Big Data concepts and an
overview of different Big Data technologies
Understand the different tools and use the right tools for
DW and ETL
How does current BI/DW fit into the Big Data context?
How do Microsoft BI and Hadoop get married?
3. What is big data?
Refers to any collection of data sets so large
and complex (e.g. hundreds of petabytes) that
traditional data-processing tools cannot handle them
4. Why does Big Data matter?
• 2 billion internet users in the world today
• 7.3 billion active cell phones in 2014
• 7TB of data is processed by Twitter every day
• 500TB of data is processed by Facebook every day
• With such massive quantities of data, businesses need fast,
reliable, deeper data insight
6. What is Hadoop?
Hadoop refers to an ecosystem built around a
large-scale distributed file system that stores
and processes big data across multiple storage
servers.
Core Hadoop technologies include MapReduce and
the Hadoop Distributed File System (HDFS)
7. Who are the major Hadoop vendors?
IBM InfoSphere BigInsights: IBM packages Hadoop with
its own products, including Text Analytics, Social Data
Analytics Accelerator, Big SQL, and Big R
Cloudera: packages the Hadoop core components with its
well-known analytic SQL engine, Impala, and
provides enterprise support. Current Cloudera Hadoop
versions include CDH 4.7 and CDH 5.1
Hortonworks: formed by Yahoo and
Benchmark Capital, Hortonworks makes Hadoop
enterprise-ready; its latest version is HDP 2.1
Microsoft: contributes HDInsight, Hadoop on the
Windows platform
8. HDFS
The Hadoop distributed file system
(HDFS) is a distributed, scalable, and
portable file-system written in Java for the
Hadoop framework.
It is designed to run across low-cost
commodity hardware
9. MapReduce
MapReduce is a programming model and an
associated implementation for processing
and generating large data sets with a
parallel, distributed algorithm on a cluster.
With Hadoop 2.x, YARN (Yet Another
Resource Negotiator) was introduced to separate
resource management from the MapReduce
programming model.
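The model above can be sketched in a few lines of Python. This is a minimal single-process simulation of the map, shuffle, and reduce phases using word count, the canonical MapReduce example; a real Hadoop job would distribute these same steps across a cluster.

```python
# Minimal sketch of the MapReduce model: word count in one process.
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big insight", "big cluster"]
pairs = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(shuffle(pairs))
# result == {"big": 3, "data": 1, "insight": 1, "cluster": 1}
```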
11. Core components on top of Hadoop
1. Hive (Facebook)
2. Pig (Yahoo)
3. HBase
4. HCatalog
5. Knox
6. ZooKeeper
7. Sqoop
12. Pig
1. Originally developed by Yahoo
2. Best used for large data set ETL
3. Dataflow scripting language called PigLatin, a High-level
language designed to remove the complexities of coding
MapReduce applications.
4. Pig converts its operators into MapReduce code.
5. Instead of needing Java programming skills and an
understanding of the MapReduce coding infrastructure,
people with little programming experience can simply invoke
SORT or FILTER operators without having to code a
MapReduce application to accomplish those tasks.
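As a sketch of point 3, a short Pig Latin script might look like the following (the file name, schema, and output path are hypothetical): load a tab-separated log file, keep only the error records, and sort them by timestamp. Pig compiles each operator into MapReduce jobs behind the scenes.

```pig
-- Hypothetical Pig Latin script; file, schema, and paths are assumptions.
logs   = LOAD 'logs.tsv' USING PigStorage('\t')
         AS (ts:chararray, level:chararray, msg:chararray);
errors = FILTER logs BY level == 'ERROR';   -- keep only error records
sorted = ORDER errors BY ts;                -- sort by timestamp
STORE sorted INTO 'errors_by_time';         -- write results back to HDFS
```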
13. Hive
Originally developed by Facebook in 2007
Hive is a data warehouse built on top of the
Hadoop file system (HDFS); it lets
developers use SQL-like scripts (called Hive
SQL or HQL) to create databases and tables.
Hive translates the SQL-like scripts into
MapReduce jobs to store and process large
data sets.
Short learning curve, since BI developers already
know familiar SQL-like scripts
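A minimal HQL sketch (the database, table, and column names are all hypothetical) shows how familiar the syntax is to a SQL developer; Hive turns the SELECT below into MapReduce jobs automatically.

```sql
-- Hypothetical HQL; database, table, and column names are assumptions.
CREATE DATABASE IF NOT EXISTS sales;

CREATE TABLE IF NOT EXISTS sales.orders (
  order_id  BIGINT,
  customer  STRING,
  amount    DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Hive compiles this aggregate into one or more MapReduce jobs.
SELECT customer, SUM(amount) AS total
FROM sales.orders
GROUP BY customer;
```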
14. Hive (Cont’d)
UPDATE and DELETE of a record aren't allowed in Hive,
but INSERT INTO is.
A way to work around this limitation is to use
partitions: if you're receiving different batches of ids
separately, you can redesign your table so that it is
partitioned by id, and then you can
easily drop the partitions for the ids you want to get rid
of.
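The workaround above might look like this in HQL (the table, column, and partition values are hypothetical):

```sql
-- Hypothetical HQL; table, columns, and partition values are assumptions.
CREATE TABLE events (
  ts      STRING,
  payload STRING
)
PARTITIONED BY (batch_id STRING);

-- Load a new batch into its own partition.
INSERT INTO TABLE events PARTITION (batch_id = '2014-07-01')
SELECT ts, payload FROM staging_events;

-- "Deleting" a batch amounts to dropping its partition.
ALTER TABLE events DROP PARTITION (batch_id = '2014-07-01');
```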
15. HBase
HBase is a column-oriented database management system that
runs on top of HDFS
It is modelled after Google’s BigTable
technology and was created for hosting very large tables
with billions of rows and millions of columns.
An HBase system comprises a set of tables. Each table contains
rows and columns, much like a traditional database
HBase provides random, real-time access to your Big Data.
It does not support a structured query language like SQL
Referred to as a NoSQL technology (NoSQL means Not Only SQL),
as HBase is not intended to replace your traditional RDBMS
16. HCatalog
1. HCatalog is a table and storage management layer
for Hadoop that enables users with different data
processing tools – Apache Pig, Apache MapReduce,
and Apache Hive – to more easily read and write data
on the grid
2. Frees the user from having to know where the data is
stored, with the table abstraction
3. Enables notifications of data availability
4. Provides visibility for data cleaning and archiving
tools
17. Knox
A system that provides a single point of
authentication and access for Apache Hadoop
services in a cluster. The goal of the project is
to simplify Hadoop security for users who
access the cluster data and execute jobs, and
for operators who control access and manage
the cluster.
21. Three popular open source Hadoop-based
SQL engines
1. Impala (Cloudera)
2. Stinger (Hortonworks) –(aka Hive 11, Hive
12, Hive 13 or Hive-on-Tez)
3. Presto (Facebook)
22. Impala
1. Developed by Cloudera in 2012
2. SQL query engine that runs natively in Apache Hadoop
3. Queries data using SELECT, JOIN, and aggregate
functions – in real time
4. Accesses HDFS directly and uses MPP computation
instead of MapReduce, providing near-real-time
data access
5. The entire process happens in memory, eliminating
the disk IO latency that occurs extensively
during a MapReduce job.
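A hypothetical Impala query (the tables and columns are assumptions) illustrating point 3 – plain SELECT, JOIN, and aggregate functions, answered interactively:

```sql
-- Hypothetical tables and columns; run from impala-shell or over ODBC/JDBC.
SELECT c.region,
       COUNT(*)      AS orders,
       SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.region
ORDER BY revenue DESC;
```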
23. MPP vs MapReduce
Both are distributed data processing systems, but they differ as follows:
MPP:
• Runs on expensive, specialized hardware tuned for CPU, storage
and network performance
• Faster; computation happens in memory
• Queried with SQL – declarative queries
• SQL is easier and more productive
MapReduce:
• Deployed to clusters of commodity servers that in turn use
commodity disks
• Slower; computation is disk-IO bound
• Programmed in Java – imperative code
• More difficult for IT professionals
24. Stinger
1. Refers to the new versions of Hive (versions
0.11 – 0.13) that overcome the performance
barrier of MapReduce computation
2. Improved SQL compliance for Hive SQL
http://hortonworks.com/labs/stinger/
26. Presto
1. In response to Cloudera Impala, Facebook introduced
Presto in 2012
2. Presto is similar in approach to Impala in that it is
designed to provide an interactive experience whilst
still using your existing datasets stored in Hadoop.
It provides:
JDBC Drivers
ANSI-SQL syntax support (presumably ANSI-92)
A set of ‘connectors’ used to read data from existing data sources. Connectors
include: HDFS, Hive, and Cassandra.
Interop with the Hive metastore for schema sharing
28. Comparison of Hive, Impala, Presto and
Stinger
Year: Hive – 2007; Impala – 2012; Presto – in development; Stinger – in development
Original developer: Hive – Facebook; Impala – Cloudera; Presto – Facebook; Stinger – Hortonworks
Main purpose: Hive – data warehouse; Impala – enable analysts and data
scientists to interact directly with any data stored in Hadoop, and
offload self-service business intelligence to Hadoop; Presto – RDBMS;
Stinger – RDBMS
Computation approach: Hive – MapReduce; Impala – massively parallel
processing (MPP) architecture; Presto – MPP; Stinger – MPP
Performance: Hive – low; Impala – fast; Presto – fast; Stinger – fast
Latency: Hive – high; Impala – low; Presto – low; Stinger – low
Language: Hive – SQL-like script; Impala – ANSI-92 SQL support with
user-defined functions (UDFs); Presto – SQL including RANK, LEAD,
LAG; Stinger – SQL-like script
Interfaces: Hive – CLI, web, ODBC, JDBC; Impala – ODBC, JDBC,
impala-shell, web; Presto – JDBC; Stinger – JDBC
High availability: Hive – Hadoop 2.0/CDH4 has HA at the HDFS level;
Impala – yes; Presto – Hadoop 2.0/CDH4 has HA at the HDFS level;
Stinger – Hadoop 2.0/CDH4 has HA at the HDFS level
Replication: Hive – yes; Impala – supported between two CDH 5
clusters; Presto – unknown; Stinger – unknown
29. Hive pros and cons
Advantages:
• It’s been around about 5 years; you could say it
is a mature and proven solution.
• Runs on the proven MapReduce framework
• Good support for user-defined functions
• It can be mapped to HBase and other
systems easily
Disadvantages:
• Since it uses MapReduce, it carries
all the drawbacks that MapReduce has,
such as the expensive shuffle phase and
huge IO operations
• Hive still does not support multiple reducers, which
makes queries like GROUP BY and ORDER BY a
lot slower
• A lot slower compared to its competitors.
Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
30. Impala pros and cons
Advantages:
• Lightning speed; promises near-real-time
ad hoc query processing.
• The computation happens in memory,
which removes an enormous amount of
latency and disk IO
• The latest version supports UDFs
• Open source, Apache licensed
Disadvantages:
• No fault tolerance for running queries:
if a query fails on a node, the query
has to be reissued; it can’t resume
from where it failed.
Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
31. Presto pros and cons
Advantages:
• Lightning fast; promises near-real-time
interactive querying.
• Used extensively at Facebook, so it is proven
and stable.
• Open source, with strong momentum
behind it ever since it was open-sourced.
• Also uses a distributed query processing
engine, so it eliminates the latency and
disk-IO issues of traditional MapReduce.
• Well documented; perhaps the first open
source software from Facebook to get a
dedicated website from day one.
Disadvantages:
• A newcomer: need to wait and watch,
since there is interesting active
development going on.
• As of now supports only Hive-managed tables.
Though the website claims one can also query
HBase, that feature is still under
development.
• Still no UDF support; this is the most
requested feature to be added.
Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
36. Comments on Impala
Among Impala, Hive and Presto, Impala
appears to be the most mature SQL-on-Hadoop
engine, and the winner in terms of both
performance and maturity
38. Combining Hadoop and SQL Server tools
Both Hadoop and SQL Server have strengths and
weaknesses
Combining Hadoop and SQL Server tools lets the
strengths of each technology offset the
weaknesses of the other
39. SQL Server vs SQL on Hadoop
SQL Server:
• Enforces data quality and consistency better
(unique indexes, primary keys and foreign keys)
• Has a scalability limit
SQL on Hadoop:
• Lacks data quality enforcement
• Better for scaling and processing massive data
40. Deployment options
Hadoop on Premise
Hadoop in the Cloud
1. Infrastructure as a Service (IaaS) – providers of IaaS
offer computers – physical or (more often) virtual
machines
2. Platform as a Service (PaaS) – including the operating
system, programming-language execution environment,
database, and web server.
3. Software as a Service (SaaS) – provides access to
application software and databases
45. Use the right ETL tools
SSIS – existing skills in the organisation, transformation
is needed, performance tuning is important
Pig – use for very large data sets, to take advantage
of the scalability of Hadoop, when IT staff are comfortable
learning a new language
Sqoop – little need to transform the data, easy to
use, IT staff aren’t comfortable with SSIS or Pig, loads
SQL tables directly into Hadoop.
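As a sketch, a typical Sqoop import of a SQL Server table into HDFS looks like the following command (the server, database, credentials, table, and target directory are all hypothetical):

```shell
# Hypothetical Sqoop import; server, database, table and paths are assumptions.
sqoop import \
  --connect "jdbc:sqlserver://dbserver;databaseName=Sales" \
  --username etl_user \
  --password-file /user/etl/.sqlserver.pw \
  --table Orders \
  --target-dir /data/raw/orders \
  --num-mappers 4
```

The `--num-mappers` flag controls how many parallel map tasks perform the copy, which is how Sqoop takes advantage of the cluster even for a simple table load.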
46. SQL Server Parallel Data Warehouse –
- A high performance & expensive solution
SQL Server Parallel Data Warehouse is the MPP edition of SQL
Server.
Unlike the Standard, Enterprise or Data Center editions, PDW is
actually a hardware and software bundle rather than just a piece of
software. Microsoft calls it a database "appliance".
It isn’t a substitute for SSIS, SSAS and SSRS. It’s Microsoft’s
answer for customers needing to process 10s or 100s of terabytes
who want the ability to scale out large workloads across multiple
servers, large storage arrays and many processors.
It includes:
◦ Microsoft PolyBase
◦ Microsoft Analytics Platform System (APS)
◦ Integration with Hadoop
49. References
Microsoft Big Data Solutions, Wiley, February 2014
Microsoft SQL 2012 Server with Hadoop, Debarchan
Sarkar, published by Packt Publishing Ltd 2013
Cloudera.com
Hortonworks.com
Hadoop.apache.org
Microsoft.com/bigdata
Impala.io
Prestodb.io
Hive.apache.org