Large Scale Data Warehousing
          at Yahoo!
              Bohan Chen
        (bchen@yahoo-inc.com)
      Database Architect at Yahoo!
         Oracle Certified Master
Agenda

•   Project Requirements
•   POC Candidates
•   Goals
•   Tests
•   Architecture and Configuration
    – Database Server
    – Network/Cluster Interconnects
    – Storage
•   Critical Success Factors
•   Parallel Query on RAC
•   Lessons Learned and Challenges
•   Future Plans
“Pie DB” Project Requirements
• Yahoo Product Intelligence Engineering – Pie DB
   – Several billion page views per day
  – A unified data warehouse that could support click
    streams, page views, and link views data
• Main requirements:
  – Support > 1PB of data
  – Linear scalability when adding storage or CPU
  – Store data in a compressed format
  – Standard SQL access
  – Integrate with 3rd-party BI tools
  – Support ~60 concurrent queries
  – Resource management
  – Reasonable and affordable cost
POC Candidates

• Oracle
• Greenplum
• Netezza
• DATAllegro
• Hadoop
• And others…


Goals
• High data compression rate
   – Hadoop pre-processing improves compression rate
     to 4-5x!
• ~4GB/s of reads (sustained)
   – ~20GB/s effective read rate, based on 5x
     compression rate
• Load 10TB in 3 hours
   – 3.5TB/hr load rate; that is ~1GB/s writes
• No Indexes for queries
   – Avoid additional space needed for indexes
   – Avoid index build/rebuild time after data loading

Goals
• No SQL Hints!
• Standard Hardware / Software stack
   – Avoid proprietary solutions as much as possible
   – Easily repurpose if necessary
• Delete / Expire / Rolloff old data
  – Truncate / drop old partitions (see the sketch after this list)
  – No vacuum process
• Leverage hardware investment before deciding on
  ETL tools
   – Use database as transformation engine in the initial
     phase (ELT instead of ETL)
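For illustration, a minimal sketch of both ideas in Oracle SQL; the table,
partition, and external-table names (pageviews, p_2008_q1, clicks_ext) are
hypothetical, not from the POC:

    -- Rolloff: aging out old data is a metadata operation, no vacuum needed
    ALTER TABLE pageviews DROP PARTITION p_2008_q1;

    -- ELT: land raw files behind an external table, transform inside the DB.
    -- APPEND is a direct-path load directive rather than an optimizer hint.
    INSERT /*+ APPEND */ INTO pageviews
    SELECT pvid, TRUNC(event_ts) AS dt, url
    FROM   clicks_ext;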


Tests
• Load 3 months of clicks, page views, and link views
  historical data
   – Almost 100TB of raw data
   – 21TB in database (due to compression)
• Load and transform data
   – Load raw data
   – Create dimension tables and merge with existing
     dimensions (see the MERGE sketch below)
• 20 base queries to test system
   – Typical queries we will see in the production
   – Run queries serially and concurrently
   – Concurrent test has to finish faster than serial
     test
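On the dimension-merge step above, a minimal MERGE sketch (dim_page and
stg_page are hypothetical names, not from the POC):

    MERGE INTO dim_page d
    USING stg_page s
    ON (d.page_id = s.page_id)
    WHEN MATCHED THEN
      UPDATE SET d.page_title = s.page_title
    WHEN NOT MATCHED THEN
      INSERT (page_id, page_title)
      VALUES (s.page_id, s.page_title);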
Tests
   •   Scalability
        – Performance increases close to linearly as we add RAC
          nodes
   •   Deep analytical queries
   •   Ad hoc queries
        – Allow users to submit random queries to the system and
          see if it breaks!
------------------------------------------------------------------------------------
| Id  | Operation                 | Rows | Bytes | TempSpc | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |  16M | 7980M |         |  610K (16) | 02:02:03 |
|*  1 |  VIEW                     |  16M | 7980M |         |  610K (16) | 02:02:03 |
|  .. |  ...                      |      |       |         |            |          |
|  10 |  PX PARTITION HASH ALL    |  16M | 3959M |         |  610K (16) | 02:02:03 |
|* 11 |   HASH JOIN RIGHT OUTER   |  16M | 3959M |    932M |  610K (16) | 02:02:03 |
|* 12 |    TABLE ACCESS FULL      |  11G |  804G |         |  25036 (7) | 00:05:01 |
|* 13 |    HASH JOIN              |  16M | 2794M |         |  543K (17) | 01:48:43 |
|* 14 |     TABLE ACCESS FULL     |  16M | 1894M |         |  69951 (1) | 00:14:00 |
|* 15 |     TABLE ACCESS FULL     | 597G |   31T |         |  471K (19) | 01:34:13 |
------------------------------------------------------------------------------------
System Requirements
•   Network/Cluster Interconnects
     – GigE does not meet the bandwidth requirement
     – 10GigE is still too expensive
     – InfiniBand is chosen (up to 20Gb/s)
•   Storage
     – Block based storage / SAN solution
     – Price/performance is justified for warehouse workloads
•   Oracle 10.2.0.3 x86_64 (RAC)
     – Native IB support
     – Many improvements and fixes on “warehousing” features
     – Latest 10.2 patch set at that time
•   Oracle Automatic Storage Management
     – Provides LVM style striping of data
     – Supports clustered access (required for RAC)

Overall System Topology

[Diagram] 16 IBM x3850 M2 database nodes (Node 1 to Node 16, two GigE NICs
per server) sit between a private LAN to the NAS storage holding raw data
and a public LAN serving the applications. A redundant InfiniBand network
provides the cluster interconnect, and a dual-fabric Storage Area Network
(4x4Gb FCP to SP-A and 4x4Gb FCP to SP-B) attaches 6 EMC CX3-40 arrays.
Legend: 1000TX Public (primary), 20Gb Full Duplex IB, 4Gb FCP (Switch 1),
4Gb FCP (Switch 2).
Database Server Configuration

•   IBM x3850 M2
     – 64GB RAM (DDR2 SDRAM)
     – 4 x Intel Xeon E7330 @ 2.40GHz (quad core)
         • 4 x 4 = 16 cores per node
     – One of the fastest servers in its class; power efficient
•   3 x QLogic QLE 2462 HBA (dual port)
         • 4Gb FCP per port (for EMC SAN)
•   2 x QLogic 7104-HCA-128LPX-DDR
         • 20Gb (for InfiniBand)
•   RHEL4 Update 6
     – Large SMP Kernel for x86_64 (2.6.9-67.ELlargesmp x86_64)
•   Oracle 10.2.0.3 X86_64 Clusterware/ASM/RDBMS (with patches)



Database Server
        Hardware Configuration (Simplified)

[Diagram] Each IBM x3850 M2 fans out over three paths: GigE ports to a
Cisco 4948 Ethernet switch (public network / Oracle VIP), Fibre Channel
HBAs to a Brocade 4900 SAN switch in front of the EMC CX3-40 arrays, and
HCAs to a QLogic/SilverStorm 9024 IB switch carrying RDS over IB or IP
over IB.
Database Server
        Software Architecture (Simplified)

[Diagram] Oracle ASM, Oracle Clusterware, and the Oracle RDBMS run on top
of the operating system services (SCSI multipath, IP, IP/IB, and RDS/IB),
which in turn drive the HBA, GigE NIC, and HCA hardware.
Database Server
          Init Parameters

•   _PX_use_large_pool = TRUE
•   db_block_size = 8192
•   db_cache_size = 8048M
•   db_file_multiblock_read_count = 128
•   large_pool_size = 4G
•   parallel_adaptive_multi_user = FALSE
•   parallel_execution_message_size = 16384
•   parallel_max_servers = 32
•   parallel_threads_per_cpu = 2
•   pga_aggregate_target = 38G
•   sga_max_size = 18512M
•   shared_pool_size = 6G
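For reference, one quick way to confirm the documented parallel settings on
a running instance (a standard V$PARAMETER query; hidden parameters such as
_PX_use_large_pool are not returned by this filter):

    SELECT name, value
    FROM   v$parameter
    WHERE  name LIKE 'parallel%'
    ORDER  BY name;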

Network/Cluster Interconnects
         InfiniBand Architecture

[Diagram] Two protocol stacks compared. IP over IB: the RAC database and
IPC library in user space go through UDP and IP in the kernel, then IPoIB,
down to the NIC/HCA hardware. RDS over IB: the IPC library calls into the
kernel IB/RDS layer directly, bypassing UDP/IP entirely.
Network/Cluster Interconnects
               InfiniBand Architecture
•     InfiniBand Switch is required
•     HCA is required
       – Run INSTALL script to provide IP and netmask
•     Relink Oracle
       – cd $ORACLE_HOME/rdbms/lib
       – make -f ins_rdbms.mk ipc_rds ioracle
•     Oracle patch 6643259 – Intermittent hang for inter-instance parallel
      query using RDS over IB
       – Patch available for 10.2.0.3 and 11.1.0.6
•     Kernel panic on an idle system/IB hang at reboot
       – Fixed by upgrading the HCA driver


    $ cat /proc/iba/mt25218/config
    SilverStorm Technologies Inc. MT25218/MT25204 Verbs Provider Driver, version 4.2.0.5.2
    for SilverStorm Technologies Inc. InfiniBand(tm) Transport Driver, version 4.2.0.5.2
    Built for Linux Kernel 2.6.9-67.ELlargesmp


Network/Cluster Interconnects
           InfiniBand Architecture
•   Oracle Verification
     – “cluster interconnect IPC version: Oracle RDS/IP (generic)”
       in alert log
•   Linux Verification
     – cat /proc/driver/rds/info
     – cat /proc/driver/rds/stats
     – cat /proc/driver/rds/config

       $ cat /proc/driver/rds/stats
       Rds Statistics:
         Sockets open:      205
         End Nodes connected: 15

         Performance Counters: ON
         Transmit:
          Xmit bytes          268914077203
          Xmit packets         250454334

Storage
EMC SAN Architecture

[Diagram] Physical layout of the EMC SAN; the component details follow on
the next two slides.
Storage
         EMC SAN Details
• 6 x CX3-40F arrays
   – 900 x 400GB 10K drives (150 drives @ RAID5 4+1 =
     40TB usable per array)
   – 96GB cache (16GB per array)
   – 48 x 4Gb ports (8 per array)
   – Capable of ~7.5GB/s read throughput (1.25GB/s
     per array)
• 240TB usable storage capacity
   – 200TB for Oracle data (1PB logical with 5:1 Oracle
     compression)
   – 40TB additional storage required for Oracle TEMP
     space
Storage
        EMC SAN Details
• 2 x EMC Brocade 4900 Departmental Switches
   – 128 x 4Gb Ports (64 per Switch)
   – Simple Dual-Fabric Design
• Ability to expand by adding drives and/or
  arrays
• Linear scaling with 6 arrays
• Oracle ASM to rebalance data when adding storage
• Best price/performance at the time




Storage
             Oracle Automatic Storage Management

• Only stores metadata about where data lives – an LVM for Oracle data
• Stripe size is 1MB (_asm_stripesize=1048576)
• Stripes each datafile evenly across all storage arrays to use all spindles
• Vendor agnostic; can add / remove storage as needed




[Diagram] The ASM software layer lays each datafile out as 1MB stripes
placed round-robin across the SAN-based storage (iSCSI / FCP), so every
array serves every file.
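As a sketch of the corresponding setup (diskgroup and device names are
illustrative), external redundancy is the natural choice here because the
arrays already provide RAID5 protection:

    CREATE DISKGROUP pie_data EXTERNAL REDUNDANCY
      DISK '/dev/emcpowera1', '/dev/emcpowerb1';  -- LUNs from the CX3-40s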
Critical Success Factors (Oracle)

• gzip support for external tables (sketch after this list)
   – Feature added by Oracle to make the POC succeed
   – Patch 6522622: External tables need to read
     compressed files
• Compression
   – Reduce required disk space
   – More effective throughput (5x)
• Automatic Storage Management
   – Distribute IO evenly; scale IO linearly
• Features and Enhancements for Data warehouse
   – Partitioning and composite partitioning
   – Patch 6402957: Adaptive aggregation push-down
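To make the first and last items concrete, a hedged sketch (the 10.2.0.3 POC
used patch 6522622 to read gzip files; the PREPROCESSOR clause below is the
equivalent that later patch sets document, and all object names are
hypothetical):

    -- External table over gzipped click files, decompressed on the fly
    CREATE TABLE clicks_ext (
      pvid     VARCHAR2(32),
      event_ts DATE,
      url      VARCHAR2(4000)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY raw_dir
      ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        PREPROCESSOR exec_dir:'zcat'     -- pipes each .gz file through zcat
        FIELDS TERMINATED BY ','
      )
      LOCATION ('clicks_0601.csv.gz')
    )
    PARALLEL;

    -- Compressed, composite-partitioned fact table; in 10g the COMPRESS
    -- attribute applies to direct-path loads
    CREATE TABLE pageviews (
      dt   DATE,
      pvid VARCHAR2(32),
      url  VARCHAR2(4000)
    )
    COMPRESS
    PARTITION BY RANGE (dt)
    SUBPARTITION BY HASH (pvid) SUBPARTITIONS 128 (
      PARTITION p200806 VALUES LESS THAN (DATE '2008-07-01')
    );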
Critical Success Factors
• InfiniBand Interconnect
   – Provide bandwidth needed
   – Reduce latency/cluster wait
   – Highest utilization is 7Gb/s but only for a brief
     period (when using RDS over IB)
   – 1~2Gb/s is more typical under load
• EMC SAN solution
   – IO throughput to support the full table scan
   – Max 1.25GB/s per array




Oracle Parallel Query
            (Simplified)

select * from table …

[Diagram] The query coordinator (QC) drives producer/consumer pairs of
parallel execution servers (Px), one pair per partition (P1 to P4) of the
Link Views table.
PQ and RAC

[Diagram] The same QC and producer/consumer pairs as the previous slide,
now running across RAC instances while a single QC still collects all
results.
PQ and RAC scaling issue

• All architectures, including parallel shared
  nothing systems, eventually need a funnel
  point (query coordinator)
   – Lots of “select * from petabyte_table order
     by 1” will kill everyone
• During POC, we had to ensure that Oracle
  could parallelize ALL operations, otherwise
  parallel query becomes useless
   – This is a common source of PQ scaling
     problems as it requires too much data to
     traverse the interconnect
Scaling PQ on RAC

• A large number of sub-partitions is required to achieve a
  high degree of parallelism and performance
• Reduce interconnect traffic
• Need an interconnect that can support the
  throughput requirements of the QC
• Avoid “broadcast” redistribution of PQ results
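One 10g RAC lever for these points is instance groups, which pin a session's
PQ slaves to chosen nodes so producers and consumers stay local (group names
are illustrative, not from the POC):

    -- init.ora on each node, e.g. on node 1:
    --   instance_groups = 'ALLNODES', 'NODE1'
    ALTER SESSION SET parallel_instance_group = 'NODE1';
    SELECT COUNT(*) FROM pageviews;  -- slaves spawn on node 1 only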




Oracle Parallel Query
           (More Realistic)

select … from table pageviews, linkviews where pageviews.pvid = ... group
by date;

[Diagram] Table-scan Px slaves read the PVID partitions (P1, P2) of Link
Views and Page Views, feed a hash-join layer of Px slaves, which feeds a
group-by layer, which funnels into the QC.
Need to Avoid

[Diagram] The same plan spread across Node 1 and Node 2 such that scan
slaves feed hash-join and group-by slaves on the other node, forcing each
PVID partition's rows across the interconnect.
Best Scenario

[Diagram] Each node owns both sides of a partition pair (LVS P1 and PVS P1
on Node 1, LVS P2 and PVS P2 on Node 2), so the partition-wise join and
group-by complete locally and only final results flow to the QC.
How PQ Survives in RAC
        Environment
• Node Affinity to avoid interconnect traffic
   – The consumer / producer pair always lives
     on the same node
• Joining tables that have the same partition
  key and the same number of partitions results
  in a partition-wise join
   – This is the key to scaling!
   – Queries that join large tables that are not
     partitioned on the same key will require
     “brute force” interconnects to survive
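A sketch of the good case, assuming pageviews and linkviews are
hash-partitioned identically on pvid (the plan comment describes the
expected shape, not captured output):

    SELECT l.dt, COUNT(*)
    FROM   pageviews p
    JOIN   linkviews l ON l.pvid = p.pvid
    GROUP  BY l.dt;

    -- Expected shape: a PX PARTITION HASH ALL step above the HASH JOIN,
    -- i.e. each slave joins matching partitions locally and no join input
    -- is redistributed over the interconnect.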

Lessons Learned and Challenges

•   Parallel Shared Nothing does not always scale linearly
•   Although most data warehouse technologies did very well
    below 25TB, things started to change quickly at 100TB
•   At this data volume, do not expect any commercial solution to
    work without some growing pains
     – Expect to see bugs!
•   Avoiding proprietary solutions and staying open means
    multiple vendors may be involved
     – Working with multiple vendors/teams might be challenging
     – Select vendors with quality support and knowledge
        transfer
     – Dedication from the Oracle support and development
        teams helped make the POC successful



Backup and Restore Challenges

• Web logs/events (the fact tables) can be
  reloaded; no need to back up
• Aggregation/summary is backed up
  – Range-partitioned by date
  – Set read-only for historical partitions
  – Only back up new partitions; skip RO
    partitions
• Backup and Restore
  – Oracle RMAN: 6 Channels; level 0
  – NetVault with 6 Tapes
  – 300+ MB/s backup and 200+ MB/s restore
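A minimal RMAN sketch of that policy (the channel count comes from this
slide; everything else is generic RMAN syntax):

    RMAN> CONFIGURE DEVICE TYPE sbt PARALLELISM 6;
    RMAN> BACKUP INCREMENTAL LEVEL 0 DATABASE SKIP READONLY;

SKIP READONLY leaves the read-only tablespaces holding the historical
partitions out of the backup.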
Challenges for Oracle
• Degree of parallelism (DOP) is fixed at query
  startup
• AWR report has no aggregation for parallel executions
  yet
• ORA-12805: parallel query server died unexpectedly
   – Once that happens, all work is abandoned, and
     resubmit is the only solution so far
   – Hope to see “auto-recovery” feature in the future!
• No DOP information is available in the execution plan
   – Improved in 11g (AUTOTRACE can see the DOP!)
• Lacking detailed information on parallel server activity
  and progress
   – Improved in 11g (GV$SQL_MONITOR)
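For reference, the kind of query GV$SQL_MONITOR enables in 11g (a sketch):

    SELECT inst_id, sid, process_name, sql_id, status
    FROM   gv$sql_monitor
    WHERE  status = 'EXECUTING';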
Major Oracle Enhancements /
      Patches for Data Warehouse

• 6522622 – External tables need to read
  compressed files
• 6643259 – Intermittent hang for inter-
  instance parallel query using RDS over IB
• 6748058 – Transformed query does not
  parallelize
• 6402957 – Predicate pushdown not working
  with window functions for some cases
• 6808773 – Sub optimal hash distribution
  when join on highly skewed columns
• 6471770 – Parallel servers die unexpectedly
Future Plans
• Near future:
   – ETL Tool
   – Backup/Restore throughput enhancement
   – Resource plans for different users and workloads (see sketch below)
• Further collaboration/integration with Hadoop
• Oracle 11g evaluation and upgrade
• EMC CX4-960
   – Up to 2x IO and 2x capacity (vs CX3)
   – Upgrade without migrating data
• Intel 7400 series 6-core CPU “Dunnington”
   – Up to 50% more performance and 10% less power
     consumption vs 7300 series
• 10 GigE evaluation
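On the resource-plans item above, a minimal Database Resource Manager sketch
(plan and group names are illustrative):

    BEGIN
      DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA;
      DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP('ADHOC', 'ad hoc users');
      DBMS_RESOURCE_MANAGER.CREATE_PLAN('PIE_PLAN', 'Pie DB workloads');
      DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
        plan => 'PIE_PLAN', group_or_subplan => 'ADHOC',
        comment => 'cap ad hoc queries', cpu_p1 => 30);
      DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
        plan => 'PIE_PLAN', group_or_subplan => 'OTHER_GROUPS',
        comment => 'everything else', cpu_p1 => 70);
      DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA;
    END;
    /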
Next Stop

10 Petabytes!

Thank You!
