Keynote: Getting Serious about MySQL and Hadoop at Continuent
3. ©Continuent 2014
What is a Hadoop?
Hadoop Distributed File System (HDFS)
MapReduce
Spark
Hive
Storm
Pig
Shark
Mahout
HBase
Oozie
Avro
HCatalog
Scalding
Stinger
Impala
Sqoop
Ambari
Cassandra
Zookeeper
5. ©Continuent 2014
Hadoop analyzes any type of data
Server logs
Social media feeds
Geolocation data
Clickstreams
Sensor readings
Business transactions
Analytic reports
6. ©Continuent 2014
Hadoop data loading is simple
mysql> select * into
    -> outfile '/tmp/sakila.rental.csv'
    -> fields terminated by ','
    -> lines terminated by '\n'
    -> from sakila.rental;
Query OK, 16044 rows affected (0.03 sec)

mysql> quit
Bye
$ hadoop fs -put /tmp/sakila.rental.csv
7. ©Continuent 2014
Hadoop exploits the falling cost of storing and processing data
[Chart: Disk Storage, Average Cost Per Gigabyte, 1990 to 2014; logarithmic cost axis from $0.01 to $10,000.00]
(Source: John McCallum, http://www.jcmit.com)
8. ©Continuent 2014
Hadoop is shifting from batch to real-time analytics
[Chart: Cycle time for different iterative algorithms (Page Rank, K-Means Clustering, Logistic Regression), Core Hadoop vs. Spark; axis 0 to 160; data labels 0.96, 4.1, 14, 110, 155, 80]
(Source: Pat McDonough, http://spark-summit.org/2013)
11. ©Continuent 2014
Three integration problems
1. Continuous, high-performance loading
2. Meaningful analytics on Hadoop
3. Optimized operation for large-scale deployment
16. ©Continuent 2014
We can implement that!
[Diagram: MySQL (binlog_format=row), MySQL binlog, Tungsten 3.0 master (hadoop), Tungsten 3.0 slave (hadoop), buffered CSV, CSV files, Apache Sqoop/ETL]
• Low impact replication from the binlog
• Fast data filtering
• Programmable load scripts
• Parallel apply
• Parallel table dumps
17. ©Continuent 2014
How do you like your data?
(Your data stored in MySQL)
+---------+--------------------+-------------+--------+
| film_id | title              | rental_rate | length |
+---------+--------------------+-------------+--------+
|     556 | MALTESE HOPE       |        4.99 |    127 |
|     557 | MANCHURIAN CURTAIN |        2.99 |    177 |
|     558 | MANNEQUIN WORST    |        2.99 |     71 |
|     559 | MARRIED GO         |        2.99 |    114 |
+---------+--------------------+-------------+--------+
18. ©Continuent 2014
Does it really look better like this?
(Your data stored in Hadoop)
556,MALTESE HOPE,4.99,127\n
557,MANCHURIAN CURTAIN,3.99,177\n
558,MANNEQUIN WORST,2.99,71\n
559,MARRIED GO,2.99,114\n
field separator
record separator
file partitioning
compression
type conversions
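The concerns listed on this slide (field separator, record separator, type conversions) can be made concrete with a small sketch. The following is not the format any particular loader uses; it assumes a comma field separator, a newline record separator, `\N` as the NULL marker, and UTC timestamps, and the helper names `to_field` and `to_csv` are hypothetical.

```python
from datetime import datetime, timezone
from decimal import Decimal

FIELD_SEP = ","       # assumed field separator
RECORD_SEP = "\n"     # assumed record separator
NULL_MARKER = r"\N"   # assumed NULL marker

def to_field(value):
    """Convert one Python value to its text form (hypothetical helper)."""
    if value is None:
        return NULL_MARKER
    if isinstance(value, datetime):
        if value.tzinfo is None:
            raise ValueError("expected a timezone-aware datetime")
        # Type conversion: normalize timestamps to UTC before writing.
        return value.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    text = str(value)
    # This sketch refuses values that would break the record layout
    # instead of choosing a quoting or escaping scheme.
    if FIELD_SEP in text or RECORD_SEP in text:
        raise ValueError(f"value contains a separator: {text!r}")
    return text

def to_csv(rows):
    """Serialize rows (iterables of values) into one CSV string."""
    return "".join(
        FIELD_SEP.join(to_field(v) for v in row) + RECORD_SEP for row in rows
    )

example = to_csv([
    (556, "MALTESE HOPE", Decimal("4.99"), 127),
    (557, "MANCHURIAN CURTAIN", None, 177),
])
```

Rejecting embedded separators keeps the sketch short; a real loader would need an agreed escaping rule that the reader on the Hadoop side also understands.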
19. ©Continuent 2014
Or this?
(INSERT)
I,57,556,2014-03-27 21:04:24.000,556,MALTESE HOPE,4.99,127\n

(UPDATE)
D,57,557,2014-03-27 21:04:24.000,557,\N,\N,\N\n
I,57,558,2014-03-27 21:04:24.000,557,MANCHURIAN CURTAIN,2.99,177\n

(DELETE)
D,57,559,2014-03-27 21:04:24.000,558,\N,\N,\N\n
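To make the change-record layout easier to follow, here is a minimal parsing sketch. It assumes the first four fields are an operation code (`I` for insert, `D` for delete), a transaction sequence number, a row id within the load, and a commit timestamp, followed by the table's own columns, and that NULL columns are written as `\N`. The field names and the `parse_change` helper are illustrative guesses, not names taken from the slides.

```python
from typing import List, NamedTuple, Optional

class Change(NamedTuple):
    opcode: str                     # "I" = insert, "D" = delete (assumed meaning)
    seqno: int                      # transaction sequence number (assumed)
    row_id: int                     # row counter within the load (assumed)
    commit_time: str                # commit timestamp, kept as text
    values: List[Optional[str]]     # table columns; None stands for NULL

def parse_change(line: str) -> Change:
    """Split one change record into metadata and column values."""
    # A plain split is enough here because the sample rows contain no
    # embedded commas; real data would need a proper CSV reader.
    fields = line.rstrip("\n").split(",")
    opcode, seqno, row_id, commit_time = fields[:4]
    values = [None if f == r"\N" else f for f in fields[4:]]
    return Change(opcode, int(seqno), int(row_id), commit_time, values)

delete = parse_change("D,57,557,2014-03-27 21:04:24.000,557,\\N,\\N,\\N\n")
insert = parse_change(
    "I,57,558,2014-03-27 21:04:24.000,557,MANCHURIAN CURTAIN,2.99,177\n"
)
```

Read this way, an UPDATE appears as a delete of the old row followed by an insert of the new one, both carrying the same key (557) and the same sequence number.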
20. ©Continuent 2014
One more thing to replicate...
[Diagram: dump/load and replication paths; labels: binlog, buffered transactions, CSV files, table metadata]
21. ©Continuent 2014
A more civilized view of data
(Your data viewed through Hive)
556   MALTESE HOPE         4.99   127
557   MANCHURIAN CURTAIN   3.99   177
558   MANNEQUIN WORST      2.99   71
559   MARRIED GO           2.99   114
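Viewing CSV files through Hive requires a table definition that matches the file layout. The sketch below generates one such definition from a list of MySQL column names and types. The type mapping, the `hive_ddl` helper, and the HDFS location are all hypothetical and chosen only for illustration; this is not how any specific tool generates its DDL.

```python
# Illustrative MySQL-to-Hive type mapping; a real mapping would cover
# many more types and is an assumption here.
MYSQL_TO_HIVE = {
    "smallint": "SMALLINT",
    "int": "INT",
    "varchar": "STRING",
    "decimal": "DOUBLE",
}

def hive_ddl(table, columns, location):
    """Build a CREATE EXTERNAL TABLE statement for comma-separated text files."""
    cols = ",\n".join(
        f"  {name} {MYSQL_TO_HIVE[mysql_type]}" for name, mysql_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE {table} (\n{cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "LINES TERMINATED BY '\\n'\n"
        f"STORED AS TEXTFILE LOCATION '{location}';"
    )

ddl = hive_ddl(
    "film",
    [("film_id", "smallint"), ("title", "varchar"),
     ("rental_rate", "decimal"), ("length", "smallint")],
    "/user/example/sakila/film",   # hypothetical HDFS directory
)
print(ddl)
```

An external table leaves the CSV files where the loader put them, so the same files stay readable by other tools.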
23. ©Continuent 2014
Introducing a useful MapReduce trick...
MAP: Transaction logs UNION ALL Snapshot
SHUFFLE: Sort by key(s), transaction order
REDUCE: Emit last row per key if not a delete
Result: Materialized view including all updates
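The trick on this slide can be sketched in a few lines of plain Python, standing in for an actual MapReduce job. Each row is assumed to be a tuple `(opcode, seqno, row_id, key, payload)`, with snapshot rows treated as inserts at sequence number 0; that tuple layout and the `materialize` function are assumptions made for this example only.

```python
from itertools import groupby
from operator import itemgetter

def materialize(snapshot_rows, change_rows):
    """Merge a snapshot with transaction-log rows into a current view."""
    # MAP: take the union of both inputs (the UNION ALL on the slide).
    combined = list(snapshot_rows) + list(change_rows)
    # SHUFFLE: sort by key, then by transaction order (seqno, row_id).
    combined.sort(key=lambda r: (r[3], r[1], r[2]))
    # REDUCE: keep the last row per key unless that row is a delete.
    view = {}
    for key, group in groupby(combined, key=itemgetter(3)):
        last = list(group)[-1]
        if last[0] != "D":
            view[key] = last[4]
    return view

snapshot = [
    ("I", 0, 0, 557, ("MANCHURIAN CURTAIN", "3.99", "177")),
    ("I", 0, 1, 558, ("MANNEQUIN WORST", "2.99", "71")),
    ("I", 0, 2, 559, ("MARRIED GO", "2.99", "114")),
]
changes = [
    ("I", 57, 556, 556, ("MALTESE HOPE", "4.99", "127")),        # insert
    ("D", 57, 557, 557, None),                                   # update, part 1
    ("I", 57, 558, 557, ("MANCHURIAN CURTAIN", "2.99", "177")),  # update, part 2
    ("D", 57, 559, 558, None),                                   # delete
]
view = materialize(snapshot, changes)
```

Because the result depends only on the inputs and their ordering, the view can be rebuilt at any time from the same files, without locking or pausing the loader.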
24. ©Continuent 2014
...With some amazing properties
[Diagram: Apache Sqoop, Tungsten Replication, CSV files, buffered CSV files, table metadata]
• No replication failures due to consistency
• Reconstruct consistent views at will
• No locks
• No transactions
• No need to pause processing
• Reprovision any table at will
25. ©Continuent 2014
We can implement that too!!
Continuent Hadoop Tools
https://github.com/continuent/continuent-tools-hadoop
• Schema creation
• Materialized view generation
• Data comparison
• Apache 2.0 licensing
26. ©Continuent 2014
Optimizing large scale deployments
[Diagram: Replicators connecting m1, m2, m3 (master) to m1, m2, m3 (slave) over RBR links]
29. ©Continuent 2014
Tungsten 3.0 Roadmap for Hadoop
Q1 2014 Q2 2014
Features
• Parallel extractor
• Polished MapReduce tools
• Improved schema change handling
• Binary data conversion
• Hortonworks 2.0
Features
• Scripted load
• Better block commit
• Hive CSV format
• Hive DDL generation
• Partitioned files
• Auto-recovery
• Parallel batch apply
• Sqoop integration
• Cloudera 4.x/5.0
31. ©Continuent 2014
Users can prepare...
• Use Unicode/UTF8
• Standardize on UTC for time
• Enable row replication
• Cluster your data in a way that supports restarts
33. ©Continuent 2014
The MySQL community can prepare...
• Fast heterogeneous replication and loading
• Innovative projects to make relational data easy to consume on Hadoop
• Competing solutions that improve life for users
34. ©Continuent 2014
Conclusion
• Hadoop is for real and the MySQL community needs to adapt
• The challenge is to move data to Hadoop and make it easy to integrate into analytics
• MySQL can be *the* preferred RDBMS to use with Hadoop
36. ©Continuent 2014
Wed 2:20pm Ballroom B - Hadoop for MySQL People
Thurs 1pm Ballroom D - From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication
We’re Hiring!
http://www.continuent.com