Hive and Hadoop Usage at Square
1. Hadoop and Hive at Square
Nicolas Thiébaud
nicothieb@
nicolas@squareup.com
Data Engineering at Square
July 2014
2. Square: Make commerce easy
Remove crappy POSes from the counter
Building the best register for small businesses.
Started with card processing and bringing more value to merchants using the point of sale.
Merchant and Buyer facing products
Square Register, Square Cash, Pickup, Feedback
Data products
Merchant Analytics, Capital
6. Data at Square
Internal Data
Produced on app servers (~200+ services), MySQL or PostgreSQL
Logging and tracing from apps and web to public endpoint
Example: payment data, user data, ledger entries
External Data
Payment processing partners ship flat files to us
Offline Data usage at Square
BI/Analysis/Reporting: ~200 MySQL users, ~100 Hadoop users
ML: Risk detection, recommendation
Apps: A/B testing, Commercial support, Capital
7. Data Architecture at Square: Kafka
Historical: most of our users still use this
App DB -> Analytical DB, stripping out PII, cursoring, looking at binlog replication
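The cursoring and PII stripping step can be illustrated with a minimal sketch. It assumes an auto-incrementing `id` column serves as the cursor, and it uses in-memory SQLite databases as stand-ins for the app DB and the analytical DB. The table name, the column names and the `PII_COLUMNS` set are hypothetical; the deck does not describe the actual implementation.

```python
import sqlite3

# Hypothetical PII columns; the deck does not list the real ones.
PII_COLUMNS = {"email", "card_number"}


def copy_batch(src, dst, cursor_id, batch_size=2):
    """Copy rows with id > cursor_id from src to dst, dropping PII columns.

    Returns the new cursor position (the highest id copied so far).
    """
    src.row_factory = sqlite3.Row
    rows = src.execute(
        "SELECT * FROM payments WHERE id > ? ORDER BY id LIMIT ?",
        (cursor_id, batch_size),
    ).fetchall()
    for row in rows:
        # Keep only the columns that are not flagged as PII.
        safe = {k: row[k] for k in row.keys() if k not in PII_COLUMNS}
        cols = ", ".join(safe)
        marks = ", ".join("?" for _ in safe)
        dst.execute(
            f"INSERT INTO payments ({cols}) VALUES ({marks})",
            list(safe.values()),
        )
        cursor_id = row["id"]
    dst.commit()
    return cursor_id


# In-memory databases standing in for the app DB and the analytical DB.
app_db = sqlite3.connect(":memory:")
app_db.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, amount_cents INTEGER, "
    "email TEXT, card_number TEXT)"
)
app_db.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?)",
    [
        (1, 500, "a@example.com", "4111"),
        (2, 1250, "b@example.com", "4242"),
        (3, 99, "c@example.com", "5555"),
    ],
)
analytics_db = sqlite3.connect(":memory:")
analytics_db.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, amount_cents INTEGER)"
)

# Advance the cursor batch by batch until no new rows are found.
cursor_id = 0
while True:
    new_cursor = copy_batch(app_db, analytics_db, cursor_id)
    if new_cursor == cursor_id:
        break
    cursor_id = new_cursor

copied = analytics_db.execute(
    "SELECT id, amount_cents FROM payments ORDER BY id"
).fetchall()
```

Persisting `cursor_id` between runs lets the copy resume where it stopped. The same loop shape would apply when the destination is Kafka rather than an analytical DB.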
Hadoop: Kafka as a backbone
App DB -> Kafka using cursoring and PII stripping
App Server -> Kafka (e.g. tracing) in proto format
Feed consumption -> Kafka
Kafka written to HDFS using offsets; dupes are written when the consumer restarts
Raw data is deduped and extracted from protos to RCFiles in daily batches. Everything is exposed in Hive
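A restarted consumer re-reads from its last committed offset, so the same message can land in HDFS twice. The sketch below shows one way a daily batch could drop those duplicates. It assumes each raw record keeps its Kafka topic, partition and offset alongside the payload. The deck does not say how the dedup is actually keyed, and the record layout here is hypothetical.

```python
def dedupe_by_offset(records):
    """Keep the first record seen for each (topic, partition, offset) key."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["topic"], rec["partition"], rec["offset"])
        if key in seen:
            continue  # duplicate written after a consumer restart
        seen.add(key)
        unique.append(rec)
    return unique


raw = [
    {"topic": "payments", "partition": 0, "offset": 41, "payload": b"a"},
    {"topic": "payments", "partition": 0, "offset": 42, "payload": b"b"},
    # The consumer restarted here and re-read offset 42.
    {"topic": "payments", "partition": 0, "offset": 42, "payload": b"b"},
    {"topic": "payments", "partition": 0, "offset": 43, "payload": b"c"},
]
deduped = dedupe_by_offset(raw)
```

Keying on the offset rather than on the payload keeps two legitimately identical messages distinct, since they would carry different offsets.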
8. Most datasets don’t fit in MySQL. Most queries cannot run anymore
Analysts broke down their jobs to run on single day windows. The query sniper keeps hitting them.
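Breaking a job into single-day windows can be sketched as follows. The `payments` table, the `created_at` column and the query template are hypothetical, chosen only to show the windowing.

```python
from datetime import date, timedelta


def day_windows(start, end):
    """Yield (window_start, window_end) pairs, one per day, covering [start, end)."""
    current = start
    while current < end:
        nxt = current + timedelta(days=1)
        yield current, nxt
        current = nxt


def windowed_queries(start, end):
    """Build one query per day instead of a single query over the whole range."""
    template = (
        "SELECT COUNT(*) FROM payments "
        "WHERE created_at >= '{lo}' AND created_at < '{hi}'"
    )
    return [
        template.format(lo=lo.isoformat(), hi=hi.isoformat())
        for lo, hi in day_windows(start, end)
    ]


queries = windowed_queries(date(2014, 7, 1), date(2014, 7, 4))
```

Each query touches a single day of data, which keeps it short. As the slide notes, that was still not always enough to escape the query sniper.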
MySQL no longer supported as source of truth for offline data. Tables are windowed
We keep revisiting the amount of data stored in MySQL
Everyone must migrate to Hive (users and apps)
MySQL analytical DBs will now be an export location for data reduced in Hadoop
All datasets must be present in Hadoop
Even small ones :)
9. Transitioning to Hive
Stability
Hive 10 + Hue 2.5 as starting point + many patches -> 2 restarts a day with small load
Decided to go to Hive 12 and patch the bugs affecting us in an internal build
Two major tasks: 10 -> 12 and building Hive internally
Reliability
Sentinel, data validation daemon
Conduit, Hive ETLs
Customer defined SLAs
Education
Office hours, trainings, mailing list
11. Project Babar: Building a stable Hive 12
Patch open source Hive to address Square specific issues
Set up integration tests in Kochiku, no performance tests
HiveServer only, no CLI. Staging and production envs
Push and pull changes to Apache JIRA
Build and deploy Hive artifacts
Makefile
Metastore, HiveServer (staging and prod), CLI tools (Beeline), hivesandbox
Package configuration
Misc
Hue 3.5
hive-udfs
12. Internal Hive Build
cdh5-0.12.0_5.0.1 branch + 9 commits
3 test fixes, 2 Square specific changes (pom + ci)
DATAPLAT-436: Beeline should return non-zero on invalid statements
HIVE-5799: session/operation timeout for hiveserver2
HIVE-5707: Validate values for ConfVar
HIVE-7040: Allow TCP keep alive on Hive Server 2
(merged in cdh5-0.12.0_5.0.1) HIVE-6893: out of sequence error in HiveMetastore
13. Story of HIVE-7040 + HIVE-5799
HIVE-7040: Allow TCP keep alive on Hive Server 2
F5 stateful firewall kills open connections
HIVE-5799: session/operation timeout for hiveserver2
Beeline interrupt does not close sessions
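TCP keep-alive makes the operating system send periodic probes on an otherwise idle connection. A stateful firewall then keeps seeing traffic and is less likely to drop the connection as stale. The sketch below shows the socket option on a plain Python socket. It is not the HIVE-7040 patch, which is a change inside HiveServer2, and the timing values are arbitrary assumptions.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keep-alive for sock and, where supported, tune its timing."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning constants below are platform-specific (available on Linux),
    # so each one is only set when the socket module exposes it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Keep-alive only protects connections that should stay open. Sessions left behind by an interrupted Beeline client still need a server-side timeout, which is what HIVE-5799 covers.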
15. Next Steps
Figure out the best way to contribute back patches
HIVE-668{3,4}: Beeline comments suck
HIVE-7200: Beeline output displays column heading even if --showHeader=false is set
HIVE-4924: Support JDBC query timeouts
HIVE-5232: Use async interface for jdbc
Hive HA
Shark
Tez?