1. Greenplum & Hadoop
Why do such a thing?
Donald Miner
Solutions Architect
Advanced Technologies Group
Donald.Miner@emc.com
© Copyright 2012 EMC Corporation. All rights reserved. 1
3. GREENPLUM DATABASE
Greenplum Database Basics
Massively Parallel Processing (MPP) Database
Uses commodity hardware
Data is distributed by a user-defined “distribution key”
Master node delegates queries to segments
1:1 segment and master mirroring for redundancy
[Diagram: two Master nodes above four Segment nodes]
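To make the idea of a distribution key concrete, here is a minimal Python sketch of rows being spread across segments. It assumes a simple hash-modulo placement; the slides do not describe Greenplum's actual hash function or segment mapping, and the names `segment_for`, `distribute`, and `NUM_SEGMENTS` are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # assumption: four segments, as drawn on the slide


def segment_for(distribution_key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a distribution key to a segment index (illustrative hash-modulo scheme)."""
    # crc32 is used only because it is deterministic across runs;
    # it is not Greenplum's real hash function.
    return zlib.crc32(distribution_key.encode("utf-8")) % num_segments


def distribute(rows, key_column):
    """Group rows by the segment their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(str(row[key_column]))].append(row)
    return dict(segments)


# Made-up rows keyed by householdID.
rows = [{"householdID": i, "income": 1000 * i} for i in range(20)]
placement = distribute(rows, "householdID")
```

Rows with the same key always land on the same segment, which is why the choice of key affects how evenly data and query work are spread.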
4. GREENPLUM DATABASE
Greenplum Database Features
Full SQL support based on PostgreSQL 8.2
Columnar or row-oriented storage with compression
Multi-level table partitioning with query time partition pruning
B-tree and bitmap indexes
JDBC, ODBC, OLEDB, etc. interfaces
High speed, parallel bulk ingest
Parallel query optimizer
External tables
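As an illustration of query-time partition pruning, the sketch below picks only the partitions whose range can overlap a query predicate. The partition names, the monthly ranges, and the `prune` helper are all hypothetical; this is a sketch of the general idea, not Greenplum's planner.

```python
from datetime import date

# Hypothetical range partitions: (name, inclusive start, exclusive end).
PARTITIONS = [
    ("sales_2012_01", date(2012, 1, 1), date(2012, 2, 1)),
    ("sales_2012_02", date(2012, 2, 1), date(2012, 3, 1)),
    ("sales_2012_03", date(2012, 3, 1), date(2012, 4, 1)),
    ("sales_2012_04", date(2012, 4, 1), date(2012, 5, 1)),
]


def prune(partitions, query_start, query_end):
    """Return names of partitions whose range overlaps [query_start, query_end)."""
    return [
        name
        for name, start, end in partitions
        if start < query_end and query_start < end
    ]


# Corresponds to a predicate such as:
#   WHERE sale_date >= '2012-02-15' AND sale_date < '2012-03-10'
to_scan = prune(PARTITIONS, date(2012, 2, 15), date(2012, 3, 10))
```

Only the February and March partitions are left to scan; the other two are skipped without reading any of their rows.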
5. GREENPLUM DATABASE
MADlib Analytics with Greenplum
Scalable and in-database
Mathematical, statistical, machine learning
Active open source project
> SELECT householdID, variables
  FROM households
  ORDER BY RANDOM()
  LIMIT 100000;
> SELECT run_univariate_analysis (
  'households_training',
  'variables');
  WHERE pvalue<.01 AND r2>.01;
> SELECT run_regression(
  'univariate_results',
  'households_training');
> SELECT householdID,
  madlib.array_dot(
  coef::REAL[],
  xmatrix::REAL[])
  FROM coefficients, households;
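The final query scores each household by taking the dot product of the fitted coefficients with that household's feature vector. The Python sketch below shows the same arithmetic outside the database. The local `array_dot` is a stand-in for `madlib.array_dot`, and the coefficients, the leading intercept term, and the household values are made up for illustration.

```python
def array_dot(coef, xmatrix):
    """Dot product of two equal-length numeric sequences."""
    if len(coef) != len(xmatrix):
        raise ValueError("coefficient and feature vectors must have equal length")
    return sum(c * x for c, x in zip(coef, xmatrix))


# Made-up coefficients: an intercept followed by two variable weights.
coef = [1.5, 0.25, -2.0]

# Made-up households; each feature vector starts with 1.0 for the intercept.
households = {
    101: [1.0, 4.0, 0.5],
    102: [1.0, 8.0, 0.25],
}

# One prediction per household, like one output row per householdID.
predictions = {hid: array_dot(coef, x) for hid, x in households.items()}
```

Running the scoring inside the database avoids exporting the household rows just to do this multiplication elsewhere.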
6. GREENPLUM DATABASE
MADlib In-Database Analytical Functions
Descriptive Statistics
Quantile
Profile
CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
FM (Flajolet-Martin) Sketch-based Estimator
MFV (Most Frequent Values) Sketch-based Estimator
Frequency
Histogram
Bar Chart
Box Plot Chart
Modeling
Correlation Matrix
Association Rule Mining
K-Means Clustering
Naïve Bayes Classification
Linear Regression
Logistic Regression
Support Vector Machines
SVD Matrix Factorisation
Decision Trees/CART
Latent Dirichlet Allocation Topic Modeling
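For readers unfamiliar with sketch-based estimators, here is a minimal Python sketch of the CountMin idea: several hashed counter rows are updated per item, and the minimum across rows is the estimate. It is a generic illustration and not MADlib's implementation; the class name, the `width` and `depth` defaults, and the SHA-256-based hashing are assumptions.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


sketch = CountMinSketch()
words = ["iphone"] * 50 + ["love"] * 30 + ["hate"] * 20
for word in words:
    sketch.add(word)
```

The memory used is fixed by `width * depth` regardless of how many distinct items arrive, which is what makes this kind of estimator practical over very large tables.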
7. GREENPLUM DATABASE
PostGIS Support in Greenplum DB
PostGIS adds support for geographic objects in PostgreSQL
Example: find all records within 25 miles of hurricane path
http://postgis.refractions.net/
select customer_id, ST_AsText(lat_lon), phone_num
from clients
where ST_DWithin(lat_lon, ST_GeometryFromText('LINESTRING(
-79.3 17, -79.3 17.1, -79.3 17.3, -79.7 17.6, -79.6 17.4, -79.6 16.8, -79.9 15.8, -80.2 15.8,
-80 15.7, -80 15.7, -80.2 15.9, -80.6 16.5, -81.1 16.7, -81.8 16.7, -82.1 16.8, -82.5 17.2,
-83.9 17.9, -85.2 18.3, -85.5 18.4)', 4326), 25.0/3959.0 * 180.0/PI());
customer_id | st_astext | phone_num
------------+-----------------------------+-------------
493140 | POINT(-80.040397 26.570613) | 1231231234
192401 | POINT(-81.820933 26.242611) | 2342342345
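The distance argument in the query, `25.0/3959.0 * 180.0/PI()`, converts 25 miles into degrees of arc, because geometries in SRID 4326 are measured in degrees. The Python sketch below reproduces that arithmetic; the function name `miles_to_degrees` is hypothetical. Note that this treats a degree as the same length in every direction, which is only an approximation because degrees of longitude shrink away from the equator.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # the radius used in the query


def miles_to_degrees(miles: float) -> float:
    """Convert a great-circle distance in miles to degrees of arc."""
    radians = miles / EARTH_RADIUS_MILES  # arc length divided by radius
    return radians * 180.0 / math.pi


# The tolerance passed to ST_DWithin in the query, roughly 0.36 degrees.
tolerance_deg = miles_to_degrees(25.0)
```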
8. GREENPLUM DATABASE
Solr integration with GPDB
Solr is an open source enterprise search engine
Enable in-database text indexing and search
select
  t.id,
  q.score,
  t.message_text
from
  message t,
  gptext.search(
    'twitter.public.message',
    '(iphone and (hate or love))',
    'author_lang:en',
    100
  ) q
where
  t.id=q.id
order by score desc;
id        | score            | message_text
----------+------------------+-------------------------------------------
71552856  | 5.43078422546387 | Hates BB's Love IPhones!
91373993  | 4.06371879577637 | Its a love hate relationship with iPhone spellcheck
25444233  | 4.05911064147949 | #iPhone autocorrect is a love/hate relationship...
120166038 | 3.39410924911499 | Love the new iPhone 4s, hate @ATT service #Verizonhereicome
117498183 | 3.39181470870972 | I got a love-hate relationship for my iPhone!!!
86416378  | 3.39180779457092 | Absolutely love the new iPhone, but Siri seems to hate me..
10. GREENPLUM HADOOP
Greenplum “HD”
• Bundled open source
• HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Mahout
11. GREENPLUM HADOOP
Greenplum “MR”
• Bundled MapR, a commercial version of Hadoop
• API compatible with traditional Hadoop
• MapR improvements over Hadoop:
– Improved control system
– Major portions of HDFS re-implemented in C++
– HDFS is NFS mountable
– Improved shuffle and sort
– Distributed NameNode
– Supports large number of files
– Mirroring, snapshot capability
12. Why do such a thing?
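Greenplum DB
[Diagram: Greenplum DB capabilities arranged along an axis from STRUCTURED through SEMISTRUCTURED to UNSTRUCTURED]
Capabilities shown: Tables and Schemas, RDBMS, Indexing, SQL, Partitioning, MADlib, GPMapReduce, PostGIS, GP Solr/Lucene, Text objects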
13. Why do such a thing?
Hadoop
[Diagram: Hadoop components arranged along an axis from STRUCTURED through SEMISTRUCTURED to UNSTRUCTURED]
Items shown: Schema on load, Hive, Pig, MapReduce, XML, JSON, …, Flat files
14. Why do such a thing?
HBase
[Diagram: HBase components arranged along an axis from STRUCTURED through SEMISTRUCTURED to UNSTRUCTURED]
Items shown: Row keys, Flexible schema, HBase Tables, Hive, Pig, MapReduce
15. Why do such a thing?
Hybrid architecture with all three (or two…)
[Diagram: Greenplum DB, Hadoop, and HBase capabilities combined along an axis from STRUCTURED through SEMISTRUCTURED to UNSTRUCTURED]
Items shown: Tables and Schemas, RDBMS, Indexing, SQL, Partitioning, MADlib, GPMapReduce, PostGIS, GP Solr/Lucene, Text objects, Schema on load, Hive, Pig, MapReduce, XML, JSON, …, Flat files, Row keys, Flexible schema, HBase Tables
17. Hadoop External Tables in GPDB
External tables bring external data into the database.
Native support for HDFS with parallelized loading.
Can write to HDFS or read from HDFS.
> CREATE EXTERNAL TABLE hdfs_document_feature (
docid integer,
term text,
freq integer)
LOCATION ('gphdfs://namenode:9000/user/don/docs/part-*')
FORMAT 'text' (delimiter '|');
> SELECT COUNT(*) FROM hdfs_document_feature h, gpdb_words g WHERE
h.term = g.word;
> INSERT INTO hdfs_export SELECT * FROM gpdb_source;
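To show what the external table definition and the join query amount to, here is a minimal Python sketch that parses pipe-delimited rows into the three declared columns and counts the rows whose term appears in a word list. The file contents and the `gpdb_words` set are made up, and `read_rows` is a hypothetical helper; the real work happens in parallel across Greenplum segments.

```python
import io

# Made-up contents of one HDFS part file, using the '|' delimiter from the DDL.
part_file = io.StringIO("1|hadoop|3\n1|greenplum|2\n2|hadoop|1\n2|isilon|4\n")

# Made-up stand-in for the gpdb_words table (each word appears once).
gpdb_words = {"hadoop", "greenplum"}


def read_rows(handle, delimiter="|"):
    """Yield (docid, term, freq) tuples typed like the external table's columns."""
    for line in handle:
        docid, term, freq = line.rstrip("\n").split(delimiter)
        yield int(docid), term, int(freq)


# Equivalent of SELECT COUNT(*) ... WHERE h.term = g.word for this sample data.
match_count = sum(1 for _, term, _ in read_rows(part_file) if term in gpdb_words)
```

The point of the external table is that the HDFS files never need to be loaded into a regular table first; they are read and typed at query time.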
18. Why do such a thing?
Many of the same use cases as an HBase/Hadoop environment
Use Hadoop as a data groomer
Do rollups in Hadoop and store results in GPDB
Use the best tool for the job (structured vs. unstructured)
Use GPDB to host data sets in a more real-time layer for ad-hoc analytics
19. EMC Isilon
Hardware appliance for scale-out network-attached storage (NAS)
Stripes data across all nodes
Uses InfiniBand for intra-cluster communication
Up to 15.5PB total storage
3 different hardware configurations to handle different workloads
Uses “OneFS”, Isilon’s operating system and file system
Interfaces with iSCSI, NFS, CIFS, HTTP, HDFS, and a few more.
20. Isilon HDFS interface
Isilon is able to “pretend” to be an HDFS cluster: it mimics the NameNode and DataNode protocols to host data.
The underlying system is OneFS and does not follow the traditional HDFS scheme.
Point HDFS clients (MapReduce, command line, etc.) to any IP in the Isilon cluster.
21. Pros & Cons
Pros:
Isilon is denser than HDFS
Isilon can be mounted via a number of protocols
– Easier ingest / egress
– Raw data accessible by applications
Isilon is easy to manage
Free of certain HDFS limitations
Cons:
Isilon loses data locality (~250MB/sec throughput per node over network)
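To put the ~250MB/sec per-node figure in perspective, the Python sketch below estimates how long one full read of a dataset takes over the network. It assumes throughput scales linearly with node count and uses decimal units; the function name and the 10 TB, 5-node example are hypothetical.

```python
def scan_seconds(
    dataset_tb: float, nodes: int, mb_per_sec_per_node: float = 250.0
) -> float:
    """Estimate seconds to read a dataset once at the aggregate network rate."""
    dataset_mb = dataset_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return dataset_mb / (nodes * mb_per_sec_per_node)


# Hypothetical example: 10 TB read from a 5-node cluster.
seconds = scan_seconds(10, 5)
hours = seconds / 3600
```

Under these assumptions the scan takes a little over two hours, which is the kind of cost to weigh against the density and manageability advantages.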
22. Why do such a thing?
Hadoop backup or archive
– More dense than HDFS, more accessible than tape, no need for compute
Complete HDFS replacement
– More dense, more accessible, utilize existing Isilon, slower per terabyte of storage
Hot/warm storage
– Use HDFS as primary, but Isilon as secondary
Storage for original content
– Use MapReduce to extract metadata from original content, and leave original content in place
23. HBase External Tables in GPDB
Project in development
Load data in parallel from HBase by specifying table name and column qualifiers
> CREATE EXTERNAL TABLE hbase_document_feature (
"HBASEROWKEY" text,
"term" text,
"freq" integer)
LOCATION ('gphbase://docfeatures')
FORMAT 'CUSTOM' (formatter='gpdbwriteable_import');
> SELECT COUNT(*) FROM hbase_document_feature h, gpdb_words g WHERE
h.term = g.word;
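The mapping from HBase rows to relational tuples can be pictured with a small Python sketch: each row key becomes the first column, and each requested column qualifier becomes a further column. The sample data, the `to_tuples` and `typed` helpers, and the choice to return `None` for missing cells are assumptions for illustration; the slides do not say how the in-development connector handles sparse rows.

```python
# Made-up HBase-style data: row key -> {column qualifier: value as bytes}.
docfeatures = {
    b"doc1-hadoop": {b"term": b"hadoop", b"freq": b"3"},
    b"doc1-isilon": {b"term": b"isilon", b"freq": b"1"},
    b"doc2-hadoop": {b"term": b"hadoop"},  # sparse row: no freq qualifier
}


def to_tuples(table, qualifiers):
    """Flatten rows into (rowkey, *qualifier values); missing cells become None."""
    for rowkey in sorted(table):  # HBase stores rows sorted by key
        cells = table[rowkey]
        values = [
            cells[q].decode("utf-8") if q in cells else None for q in qualifiers
        ]
        yield (rowkey.decode("utf-8"), *values)


def typed(row):
    """Cast freq to an integer, mirroring the external table's column types."""
    rowkey, term, freq = row
    return rowkey, term, int(freq) if freq is not None else None


rows = [typed(r) for r in to_tuples(docfeatures, [b"term", b"freq"])]
```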
24. HBase External Tables in GPDB
Possible TODO list:
Specify range of rowkeys
Support writes into HBase
Specify filter criteria on the external table
select * from hbase_external where ROWKEY='abc'
Accumulo?
25. Why do such a thing?
Have HBase store semi-structured data
Exploit the strengths of each
Use HBase for very wide tables
Use HBase as a scalable archive of raw records
Leverage existing HBase applications
26. Greenplum On HDFS
Get Greenplum Database to run natively on HDFS
Underlying Greenplum Database data is stored in HDFS
Unifies the two platforms further – no need for external tables
Fully supports Greenplum’s append-only tables
Early project in R&D
Talk will be given by Chang Lei at Yahoo Summit
27. Greenplum On HDFS
[Diagram: a Master host connects over the Interconnect to five Segment hosts, each running Segments and Segment (Mirror) instances. The Segment hosts send Meta Ops to the Namenode and Read/Write tables in an HDFS filespace on Datanodes, with Datanode replication across Rack1 and Rack2.]
28. Why do such a thing?
Covers many of the same use cases as Hive
Run Hadoop MapReduce over data managed by Greenplum DB
Initial results show it is faster than Hive
You only have to store your data in one system