Large Scale Data Warehousing
          at Yahoo!
              Bohan Chen
        (bchen@yahoo-inc.com)
      Database Architect at Yahoo!
         Oracle Certified Master
Agenda

•   Project Requirements
•   POC Candidates
•   Goals
•   Tests
•   Architecture and Configuration
    – Database Server
    – Network/Cluster Interconnects
    – Storage
•   Critical Success Factors
•   Parallel Query on RAC
•   Lessons Learned and Challenges
•   Future Plans
“Pie DB” Project Requirements
• Yahoo Product Intelligence Engineering – Pie DB
   – Several billion page views per day
  – A unified data warehouse that could support click
    streams, page views, and link views data
• Main requirements:
  – Support > 1PB of data
  – Linear scalability when adding storage or CPU
  – Store data in a compressed format
  – Standard SQL access
  – Integrate with 3rd-party BI tools
  – Support ~60 concurrent queries
  – Resource management
  – Reasonable and affordable cost
POC Candidates

• Oracle
• Greenplum
• Netezza
• DATAllegro
• Hadoop
• And others…


Goals
• High data compression rate
   – Hadoop pre-processing improves compression rate
     to 4-5x!
• ~4GB/s of reads (sustained)
   – ~20GB/s effective read rate, based on 5x
     compression rate
• Load 10TB in 3 hours
   – 3.5TB/hr load rate; that is ~1GB/s writes
• No Indexes for queries
   – Avoid additional space needed for indexes
   – Avoid index build/rebuild time after data loading

Goals
• No SQL Hints!
• Standard Hardware / Software stack
   – Avoid proprietary solutions as much as possible
   – Easily repurpose if necessary
• Delete / Expire / Rolloff old data
  – Truncate / drop old partitions (see the sketch after this list)
  – No vacuum process
• Leverage hardware investment before deciding on
  ETL tools
   – Use database as transformation engine in the initial
     phase (ELT instead of ETL)
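For illustration, a minimal sketch of both ideas in Oracle SQL; the table,
partition, and external-table names (pageviews, p_2008_q1, clicks_ext) are
hypothetical, not from the POC:

    -- Rolloff: aging out old data is a metadata operation, no vacuum needed
    ALTER TABLE pageviews DROP PARTITION p_2008_q1;

    -- ELT: land raw files behind an external table, transform inside the DB.
    -- APPEND is a direct-path load directive rather than an optimizer hint.
    INSERT /*+ APPEND */ INTO pageviews
    SELECT pvid, TRUNC(event_ts) AS dt, url
    FROM   clicks_ext;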


Tests
• Load 3 months of clicks, page views, and link views
  historical data
   – Almost 100TB of raw data
   – 21TB in database (due to compression)
• Load and transform data
   – Load raw data
   – Create dimension tables and merge with existing
     dimensions (see the MERGE sketch below)
• 20 base queries to test system
   – Typical queries we will see in the production
   – Run queries serially and concurrently
   – Concurrent test has to finish faster than serial
     test
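On the dimension-merge step above, a minimal MERGE sketch (dim_page and
stg_page are hypothetical names, not from the POC):

    MERGE INTO dim_page d
    USING stg_page s
    ON (d.page_id = s.page_id)
    WHEN MATCHED THEN
      UPDATE SET d.page_title = s.page_title
    WHEN NOT MATCHED THEN
      INSERT (page_id, page_title)
      VALUES (s.page_id, s.page_title);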
Tests
   •   Scalability
        – Performance increases close to linearly as we add RAC
          nodes
   •   Deep analytical queries
   •   Ad hoc queries
        – Allow users to submit random queries to the system and
          see if it breaks!
------------------------------------------------------------------------------------
| Id  | Operation                 | Rows | Bytes | TempSpc | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |  16M | 7980M |         |  610K (16) | 02:02:03 |
|*  1 |  VIEW                     |  16M | 7980M |         |  610K (16) | 02:02:03 |
|  .. |  ...                      |      |       |         |            |          |
|  10 |  PX PARTITION HASH ALL    |  16M | 3959M |         |  610K (16) | 02:02:03 |
|* 11 |   HASH JOIN RIGHT OUTER   |  16M | 3959M |    932M |  610K (16) | 02:02:03 |
|* 12 |    TABLE ACCESS FULL      |  11G |  804G |         |  25036 (7) | 00:05:01 |
|* 13 |    HASH JOIN              |  16M | 2794M |         |  543K (17) | 01:48:43 |
|* 14 |     TABLE ACCESS FULL     |  16M | 1894M |         |  69951 (1) | 00:14:00 |
|* 15 |     TABLE ACCESS FULL     | 597G |   31T |         |  471K (19) | 01:34:13 |
------------------------------------------------------------------------------------
System Requirements
•   Network/Cluster Interconnects
     – GigE does not meet the bandwidth requirement
     – 10GigE is still too expensive
     – InfiniBand is chosen (up to 20Gb/s)
•   Storage
     – Block based storage / SAN solution
     – Price/performance is justified for warehouse workloads
•   Oracle 10.2.0.3 x86_64 (RAC)
     – Native IB support
     – Many improvements and fixes on “warehousing” features
     – Latest 10.2 patch set at that time
•   Oracle Automatic Storage Management
     – Provides LVM style striping of data
     – Supports clustered access (required for RAC)

Overall System Topology

[Diagram] 16 IBM x3850 M2 database nodes (Node 1 to Node 16, two GigE NICs
per server) sit between a private LAN to the NAS storage holding raw data
and a public LAN serving the applications. A redundant InfiniBand network
provides the cluster interconnect, and a dual-fabric Storage Area Network
(4x4Gb FCP to SP-A and 4x4Gb FCP to SP-B) attaches 6 EMC CX3-40 arrays.
Legend: 1000TX Public (primary), 20Gb Full Duplex IB, 4Gb FCP (Switch 1),
4Gb FCP (Switch 2).
Database Server Configuration

•   IBM x3850 M2
     – 64GB RAM (DDR2 SDRAM)
     – 4 x Intel Xeon E7330 @ 2.40GHz (quad core)
         • 4 x 4 = 16 cores per node
     – One of the fastest servers in its class; power efficient
•   3 x QLogic QLE 2462 HBA (dual port)
         • 4Gb FCP per port (for EMC SAN)
•   2 x QLogic 7104-HCA-128LPX-DDR
         • 20Gb (for InfiniBand)
•   RHEL4 Update 6
     – Large SMP Kernel for x86_64 (2.6.9-67.ELlargesmp x86_64)
•   Oracle 10.2.0.3 X86_64 Clusterware/ASM/RDBMS (with patches)



Database Server
        Hardware Configuration (Simplified)

[Diagram] Each IBM x3850 M2 fans out over three paths: GigE ports to a
Cisco 4948 Ethernet switch (public network / Oracle VIP), Fibre Channel
HBAs to a Brocade 4900 SAN switch in front of the EMC CX3-40 arrays, and
HCAs to a QLogic/SilverStorm 9024 IB switch carrying RDS over IB or IP
over IB.
Database Server
        Software Architecture (Simplified)

[Diagram] Oracle ASM, Oracle Clusterware, and the Oracle RDBMS run on top
of the operating system services (SCSI multipath, IP, IP/IB, and RDS/IB),
which in turn drive the HBA, GigE NIC, and HCA hardware.
Database Server
          Init Parameters

•   _PX_use_large_pool = TRUE
•   db_block_size = 8192
•   db_cache_size = 8048M
•   db_file_multiblock_read_count = 128
•   large_pool_size = 4G
•   parallel_adaptive_multi_user = FALSE
•   parallel_execution_message_size = 16384
•   parallel_max_servers = 32
•   parallel_threads_per_cpu = 2
•   pga_aggregate_target = 38G
•   sga_max_size = 18512M
•   shared_pool_size = 6G
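For reference, one quick way to confirm the documented parallel settings on
a running instance (a standard V$PARAMETER query; hidden parameters such as
_PX_use_large_pool are not returned by this filter):

    SELECT name, value
    FROM   v$parameter
    WHERE  name LIKE 'parallel%'
    ORDER  BY name;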

Network/Cluster Interconnects
         InfiniBand Architecture

[Diagram] Two protocol stacks compared. IP over IB: the RAC database and
IPC library in user space go through UDP and IP in the kernel, then IPoIB,
down to the NIC/HCA hardware. RDS over IB: the IPC library calls into the
kernel IB/RDS layer directly, bypassing UDP/IP entirely.
Network/Cluster Interconnects
               InfiniBand Architecture
•     InfiniBand Switch is required
•     HCA is required
       – Run INSTALL script to provide IP and netmask
•     Relink Oracle
       – cd $ORACLE_HOME/rdbms/lib
       – make -f ins_rdbms.mk ipc_rds ioracle
•     Oracle patch 6643259 – Intermittent hang for inter-instance parallel
      query using RDS over IB
       – Patch available for 10.2.0.3 and 11.1.0.6
•     Kernel panic on an idle system/IB hang at reboot
       – Fixed by upgrading the HCA driver


    $ cat /proc/iba/mt25218/config
    SilverStorm Technologies Inc. MT25218/MT25204 Verbs Provider Driver, version 4.2.0.5.2
    for SilverStorm Technologies Inc. InfiniBand(tm) Transport Driver, version 4.2.0.5.2
    Built for Linux Kernel 2.6.9-67.ELlargesmp


Network/Cluster Interconnects
           InfiniBand Architecture
•   Oracle Verification
     – “cluster interconnect IPC version: Oracle RDS/IP (generic)”
       in alert log
•   Linux Verification
     – cat /proc/driver/rds/info
     – cat /proc/driver/rds/stats
     – cat /proc/driver/rds/config

       $ cat /proc/driver/rds/stats
       Rds Statistics:
         Sockets open:      205
         End Nodes connected: 15

         Performance Counters: ON
         Transmit:
          Xmit bytes          268914077203
          Xmit packets         250454334

Storage
EMC SAN Architecture

[Diagram] Physical layout of the EMC SAN; the component details follow on
the next two slides.
Storage
         EMC SAN Details
• 6 x CX3-40F arrays
   – 900 x 400GB 10K drives (150 drives @ RAID5 4+1 =
     40TB usable per array)
   – 96GB cache (16GB per array)
   – 48 x 4Gb ports (8 per array)
   – Capable of ~7.5GB/s read throughput (1.25GB/s
     per array)
• 240TB usable storage capacity
   – 200TB for Oracle data (1PB logical with 5:1 Oracle
     compression)
   – 40TB additional storage required for Oracle TEMP
     space
Storage
        EMC SAN Details
• 2 x EMC Brocade 4900 Departmental Switches
   – 128 x 4Gb Ports (64 per Switch)
   – Simple Dual-Fabric Design
• Ability to expand by adding drives and/or
  arrays
• Linear scaling with 6 arrays
• Oracle ASM to rebalance data when adding storage
• Best price/performance at the time




Storage
             Oracle Automatic Storage Management

• Only stores metadata about where data lives – an LVM for Oracle data
• Stripe size is 1MB (_asm_stripesize=1048576)
• Stripes each datafile evenly across all storage arrays to use all spindles
• Vendor agnostic; can add / remove storage as needed




[Diagram] The ASM software layer lays each datafile out as 1MB stripes
placed round-robin across the SAN-based storage (iSCSI / FCP), so every
array serves every file.
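As a sketch of the corresponding setup (diskgroup and device names are
illustrative), external redundancy is the natural choice here because the
arrays already provide RAID5 protection:

    CREATE DISKGROUP pie_data EXTERNAL REDUNDANCY
      DISK '/dev/emcpowera1', '/dev/emcpowerb1';  -- LUNs from the CX3-40s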
Critical Success Factors (Oracle)

• gzip support for external tables (sketch after this list)
   – Feature added by Oracle to make the POC succeed
   – Patch 6522622: External tables need to read
     compressed files
• Compression
   – Reduce required disk space
   – More effective throughput (5x)
• Automatic Storage Management
   – Distribute IO evenly; scale IO linearly
• Features and Enhancements for Data warehouse
   – Partitioning and composite partitioning
   – Patch 6402957: Adaptive aggregation push-down
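To make the first and last items concrete, a hedged sketch (the 10.2.0.3 POC
used patch 6522622 to read gzip files; the PREPROCESSOR clause below is the
equivalent that later patch sets document, and all object names are
hypothetical):

    -- External table over gzipped click files, decompressed on the fly
    CREATE TABLE clicks_ext (
      pvid     VARCHAR2(32),
      event_ts DATE,
      url      VARCHAR2(4000)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY raw_dir
      ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        PREPROCESSOR exec_dir:'zcat'     -- pipes each .gz file through zcat
        FIELDS TERMINATED BY ','
      )
      LOCATION ('clicks_0601.csv.gz')
    )
    PARALLEL;

    -- Compressed, composite-partitioned fact table; in 10g the COMPRESS
    -- attribute applies to direct-path loads
    CREATE TABLE pageviews (
      dt   DATE,
      pvid VARCHAR2(32),
      url  VARCHAR2(4000)
    )
    COMPRESS
    PARTITION BY RANGE (dt)
    SUBPARTITION BY HASH (pvid) SUBPARTITIONS 128 (
      PARTITION p200806 VALUES LESS THAN (DATE '2008-07-01')
    );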
Critical Success Factors
• InfiniBand Interconnect
   – Provide bandwidth needed
   – Reduce latency/cluster wait
   – Highest utilization is 7Gb/s but only for a brief
     period (when using RDS over IB)
   – 1~2Gb/s is more typical under load
• EMC SAN solution
   – IO throughput to support the full table scan
   – Max 1.25GB/s per array




Oracle Parallel Query
            (Simplified)

select * from table …

[Diagram] The query coordinator (QC) drives producer/consumer pairs of
parallel execution servers (Px), one pair per partition (P1 to P4) of the
Link Views table.
PQ and RAC

[Diagram] The same QC and producer/consumer pairs as the previous slide,
now running across RAC instances while a single QC still collects all
results.
PQ and RAC scaling issue

• All architectures, including parallel shared
  nothing systems, eventually need a funnel
  point (query coordinator)
   – Lots of “select * from petabyte_table order
     by 1” will kill everyone
• During POC, we had to ensure that Oracle
  could parallelize ALL operations, otherwise
  parallel query becomes useless
   – This is a common source of PQ scaling
     problems as it requires too much data to
     traverse the interconnect
Scaling PQ on RAC

• A large number of sub-partitions is required to achieve a
  high degree of parallelism and performance
• Reduce interconnect traffic
• Need an interconnect that can support the
  throughput requirements of the QC
• Avoid “broadcast” redistribution of PQ results
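One 10g RAC lever for these points is instance groups, which pin a session's
PQ slaves to chosen nodes so producers and consumers stay local (group names
are illustrative, not from the POC):

    -- init.ora on each node, e.g. on node 1:
    --   instance_groups = 'ALLNODES', 'NODE1'
    ALTER SESSION SET parallel_instance_group = 'NODE1';
    SELECT COUNT(*) FROM pageviews;  -- slaves spawn on node 1 only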




Oracle Parallel Query
           (More Realistic)

select … from table pageviews, linkviews where pageviews.pvid = ... group
by date;

[Diagram] Table-scan Px slaves read the PVID partitions (P1, P2) of Link
Views and Page Views, feed a hash-join layer of Px slaves, which feeds a
group-by layer, which funnels into the QC.
Need to Avoid

[Diagram] The same plan spread across Node 1 and Node 2 such that scan
slaves feed hash-join and group-by slaves on the other node, forcing each
PVID partition's rows across the interconnect.
Best Scenario

[Diagram] Each node owns both sides of a partition pair (LVS P1 and PVS P1
on Node 1, LVS P2 and PVS P2 on Node 2), so the partition-wise join and
group-by complete locally and only final results flow to the QC.
How PQ Survives in RAC
        Environment
• Node Affinity to avoid interconnect traffic
   – The consumer / producer pair always lives
     on the same node
• Joining tables that have the same partition
  key and the same number of partitions results
  in a partition-wise join
   – This is the key to scaling!
   – Queries that join large tables that are not
     partitioned on the same key will require
     “brute force” interconnects to survive
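A sketch of the good case, assuming pageviews and linkviews are
hash-partitioned identically on pvid (the plan comment describes the
expected shape, not captured output):

    SELECT l.dt, COUNT(*)
    FROM   pageviews p
    JOIN   linkviews l ON l.pvid = p.pvid
    GROUP  BY l.dt;

    -- Expected shape: a PX PARTITION HASH ALL step above the HASH JOIN,
    -- i.e. each slave joins matching partitions locally and no join input
    -- is redistributed over the interconnect.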

Lessons Learned and Challenges

•   Parallel Shared Nothing does not always scale linearly
•   Although most data warehouse technologies did very well
    below 25TB, things started to change quickly at 100TB
•   At this data volume, do not expect any commercial solution to
    work without some growing pains
     – Expect to see bugs!
•   Avoiding proprietary solutions and staying open means
    multiple vendors may be involved
     – Working with multiple vendors/teams might be challenging
     – Select vendors with quality support and knowledge
        transfer
     – Dedication from the Oracle support and development
        teams helped make the POC successful



Backup and Restore Challenges

• Web logs/events (the fact tables) can be
  reloaded; no need to back up
• Aggregation/summary is backed up
  – Range-partitioned by date
  – Set read-only for historical partitions
  – Only back up new partitions; skip RO
    partitions
• Backup and Restore
  – Oracle RMAN: 6 Channels; level 0
  – NetVault with 6 Tapes
  – 300+ MB/s backup and 200+ MB/s restore
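A minimal RMAN sketch of that policy (the channel count comes from this
slide; everything else is generic RMAN syntax):

    RMAN> CONFIGURE DEVICE TYPE sbt PARALLELISM 6;
    RMAN> BACKUP INCREMENTAL LEVEL 0 DATABASE SKIP READONLY;

SKIP READONLY leaves the read-only tablespaces holding the historical
partitions out of the backup.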
Challenges for Oracle
• Degree of parallelism (DOP) is fixed at query
  startup
• AWR report has no aggregation for parallel executions
  yet
• ORA-12805: parallel query server died unexpectedly
   – Once that happens, all work is abandoned, and
     resubmit is the only solution so far
   – Hope to see “auto-recovery” feature in the future!
• No DOP information is available in the execution plan
   – Improved in 11g (AUTOTRACE can see the DOP!)
• Lacking detailed information on parallel server activity
  and progress
   – Improved in 11g (GV$SQL_MONITOR)
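For reference, the kind of query GV$SQL_MONITOR enables in 11g (a sketch):

    SELECT inst_id, sid, process_name, sql_id, status
    FROM   gv$sql_monitor
    WHERE  status = 'EXECUTING';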
Major Oracle Enhancements /
      Patches for Data Warehouse

• 6522622 – External tables need to read
  compressed files
• 6643259 – Intermittent hang for inter-
  instance parallel query using RDS over IB
• 6748058 – Transformed query does not
  parallelize
• 6402957 – Predicate pushdown not working
  with window functions for some cases
• 6808773 – Sub optimal hash distribution
  when join on highly skewed columns
• 6471770 – Parallel servers die unexpectedly
Future Plans
• Near future:
   – ETL Tool
   – Backup/Restore throughput enhancement
   – Resource plans for different users and workloads (see sketch below)
• Further collaboration/integration with Hadoop
• Oracle 11g evaluation and upgrade
• EMC CX4-960
   – Up to 2x IO and 2x capacity (vs CX3)
   – Upgrade without migrating data
• Intel 7400 series 6-core CPU “Dunnington”
   – Up to 50% more performance and 10% less power
     consumption vs 7300 series
• 10 GigE evaluation
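On the resource-plans item above, a minimal Database Resource Manager sketch
(plan and group names are illustrative):

    BEGIN
      DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA;
      DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP('ADHOC', 'ad hoc users');
      DBMS_RESOURCE_MANAGER.CREATE_PLAN('PIE_PLAN', 'Pie DB workloads');
      DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
        plan => 'PIE_PLAN', group_or_subplan => 'ADHOC',
        comment => 'cap ad hoc queries', cpu_p1 => 30);
      DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
        plan => 'PIE_PLAN', group_or_subplan => 'OTHER_GROUPS',
        comment => 'everything else', cpu_p1 => 70);
      DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA;
    END;
    /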
Next Stop

10 Petabytes!

Thank You!
