Whitepaper: Exadata Consolidation Success Story
EXADATA CONSOLIDATION SUCCESS STORY
Karl Arao, Enkitec
ABSTRACT
In today’s competitive business climate, companies are under constant pressure to reduce costs without sacrificing quality.
Many companies see database and server consolidation as the key to meeting this goal. Since its introduction, Exadata has
become the obvious choice for database and server consolidation projects. It is the next step in the evolutionary process. But
managing highly consolidated environments is difficult, especially for mixed workload environments. If it is not done properly,
the quality of service suffers. In this paper we touch on how to do accurate provisioning and capacity planning, and on the tools
you’ll need to ensure that your consolidation story has a happy ending.
TARGET AUDIENCE
The target audience is DBAs, Architects, Performance Engineers, and Capacity Planners.
Learners will be able to:
• Describe the provisioning process and implementation challenges
• Apply the tools and methodology to successfully migrate to an Exadata platform
• Develop a resource management model (CPU instance caging & IORM), detailed in the presentation
BACKGROUND
The whole consolidation workflow relies on the basic capacity planning formula:
Utilization = Requirements / Capacity
Capacity planning plays an important role in ensuring that adequate resources are available to handle both expected and
unexpected workloads. Exadata is not really a different animal when it comes to provisioning: although it has intelligent
storage, the database servers still have limited capacity. The primary principle is to ensure that the application workload
requirements fit into the available capacity of the database servers.
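As a minimal sketch of the formula above, the following Python snippet computes utilization and checks whether a requirement fits. The function names and the sample numbers (12 CPUs required on a 24-CPU server) are illustrative assumptions, not values from the worksheet.

```python
def utilization(requirements: float, capacity: float) -> float:
    """Return utilization as a fraction: Requirements / Capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return requirements / capacity


def fits(requirements: float, capacity: float) -> bool:
    """True when the workload requirement does not exceed the capacity."""
    return utilization(requirements, capacity) <= 1.0


# Hypothetical example: a workload needing 12 CPUs on a 24-CPU server.
example_util = utilization(12, 24)
print(f"{example_util:.0%}", fits(12, 24))
```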
A SIMPLE CONSOLIDATION SCENARIO
Let’s say we have a half rack Exadata. That consists of four (4) database servers and seven (7) storage cells. Each
database server has a total CPU capacity of 24 logical CPUs (cat /proc/cpuinfo). Multiply that by the number of
nodes (4) and you get the total CPU capacity for the whole cluster (96 CPUs).
When we migrate and consolidate databases on Exadata, each database has a CPU requirement. For now, treat the “CPU
requirement” as the number of CPUs the database needs to run on Exadata. We want a good balance between the CPU
requirements of the databases and the available CPU capacity across the database nodes, while making sure that we don’t
max out the CPU capacity on any of the nodes.
Let’s say we have the following databases to migrate to Exadata:
In the diagram above, each database (A to G) has a requirement of four (4) CPUs and will run on two nodes. The first row
reads as “Database A has a CPU requirement of 4, and will run on nodes 1 and 2”. Each of the databases is essentially a
two-node RAC, and the databases are spread out across four database servers. In a RAC environment the users are load
balanced between the available RAC instances, so if a database running as a two-node RAC has a CPU requirement of 4 and the
load is equally distributed across the nodes, each instance gets a 50% share of the CPU requirement (2 CPUs per node).
At the bottom is the grand total CPU requirement for all the databases, which is 28 CPUs. We can then say that out of
96 total CPUs across four nodes, we are using only 29%. That is correct, but we also want to know how each server is
utilized, which is depicted by the red circles across the same node numbers. We may have one node that is 80% utilized
while the rest of the nodes are in the 10% range, and an even balance of utilization is critical to capacity planning.
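The cluster-level and per-node arithmetic described above can be sketched in Python as follows. Only database A's placement (nodes 1 and 2) is stated in the text; the placements of databases B to G are assumptions made for illustration, and the even split across nodes follows the 50/50 load-balancing assumption.

```python
CPUS_PER_NODE = 24        # logical CPUs per database server (from the text)
NODES = [1, 2, 3, 4]      # half rack: four database servers

# database -> (CPU requirement, nodes it runs on).
# Placements for B..G are hypothetical; the original diagram is not reproduced.
layout = {
    "A": (4, [1, 2]),
    "B": (4, [3, 4]),
    "C": (4, [1, 2]),
    "D": (4, [3, 4]),
    "E": (4, [1, 2]),
    "F": (4, [3, 4]),
    "G": (4, [1, 3]),
}


def per_node_cpu(layout, nodes):
    """Spread each database's CPU requirement evenly across its nodes."""
    used = {n: 0.0 for n in nodes}
    for cpus, assigned in layout.values():
        share = cpus / len(assigned)
        for n in assigned:
            used[n] += share
    return used


used = per_node_cpu(layout, NODES)
cluster_util = sum(used.values()) / (CPUS_PER_NODE * len(NODES))
node_util = {n: used[n] / CPUS_PER_NODE for n in NODES}

print(f"cluster: {cluster_util:.1%}")
for n, u in node_util.items():
    print(f"node {n}: {u:.1%}")
```

With this assumed layout the cluster sits at about 29% and no single node stands out, which is the balance the scenario is after.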
Here’s another view of the node layout where we distribute the CPUs of the instances based on their node assignments.
Each block on the right side is one CPU. That’s a total of 24 logical CPUs per server, which counts the CPU threads on the
server and is based on the CPU_COUNT database parameter and /proc/cpuinfo. We also set or take the number of CPUs
from CPU_COUNT whenever we do instance caging, and we use this metric to stay consistent with the monitoring in
OEM and AWR.
In the image above, the cluster-level utilization is 29.2%, while each compute node is still below the 70%
threshold, which is the ideal utilization we want for the compute nodes.
What we don’t want is unbalanced utilization, which is what happens if we change the node layout and assign more
instances to node 2 while keeping the same CPU requirements across the databases.
At the cluster level the utilization is the same, but at the compute node level node 2 ends up at 83% utilization
while the rest are pretty much idle.
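To illustrate how the same 28-CPU requirement can hide a hot node, here is a hedged Python sketch. The skewed layout below is purely hypothetical: it is one placement that happens to put 20 of the 24 CPUs of node 2 in use (about 83%), and it is not claimed to be the layout in the original figure.

```python
CPUS_PER_NODE = 24
THRESHOLD = 0.70   # ideal per-node ceiling mentioned in the text

# database -> (CPU requirement, nodes it runs on); hypothetical skewed layout.
skewed_layout = {
    "A": (4, [2]), "B": (4, [2]), "C": (4, [2]), "D": (4, [2]), "E": (4, [2]),
    "F": (4, [1, 3]),
    "G": (4, [3, 4]),
}


def node_utilization(layout, nodes=(1, 2, 3, 4)):
    """Per-node CPU utilization, splitting each requirement evenly across nodes."""
    used = dict.fromkeys(nodes, 0.0)
    for cpus, assigned in layout.values():
        for n in assigned:
            used[n] += cpus / len(assigned)
    return {n: used[n] / CPUS_PER_NODE for n in nodes}


node_util = node_utilization(skewed_layout)
cluster_util = sum(node_util.values()) / len(node_util)
overloaded = [n for n, u in node_util.items() if u > THRESHOLD]

print(f"cluster: {cluster_util:.1%}, overloaded nodes: {overloaded}")
```

The cluster-level number is unchanged, so only the per-node view reveals the problem.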
THE PROVISIONING WORKSHEET
Enkitec has developed a tool called the Provisioning Worksheet that is mainly used for sizing and consolidation of databases.
The sections below describe the overall workflow of the tool.
General Tour and Workflow
The Provisioning Worksheet has four main tabs that also represent the workflow of the whole provisioning project.
Below you’ll see the mind map and the overview of the 4-step process:
• Data gathering
    o The “System & DB Detail” tab on the worksheet. It’s the section that we send to the customers to fill out.
      The sheet contains all the important information about the databases that will be migrated to the destination
      (target) server.
• Define the target server
    o The “Exadata Capacity” tab on the worksheet. The default capacity is a half rack Exadata, but depending
      on the customer’s capacity requirements the default values may not be enough, so a review of the end resource
      utilization is a must. At the end of the provisioning process there is a section that shows the minimum
      “Recommended Hardware” capacity, which the target server should match.
• Create a provisioning plan
    o The “Exadata Layout” tab on the worksheet. This is where we input the final list of databases that will be
      migrated to the target server. The instance mapping, where we spread out the instances depending on whether
      they run as two-node or four-node RAC, and the failure scenarios are also done here.
• Review resource utilization
    o The “Summary & Graphs” tab on the worksheet. Once everything is finalized, this section gives you the
      visualization and summary report of the end utilization of the target server. A red highlight on the utilization
      number of any resource means there’s not enough capacity, and the “Exadata Capacity” section
      should be revisited to add more resources.
Data gathering
The first step in the provisioning process is getting all the capacity requirements. This section of the worksheet is extracted
and saved as a separate Excel file, then sent to the customers to fill out. Once the customer sends it back, we can
start grouping the servers according to platform or workload type and get started on the migration planning with
the customer. Ultimately this sheet serves as a scratchpad, and you are free to add more columns to help with the
categorization of the environments that will be migrated.
Below is a sample output:
The data gathering sheet is divided into four parts:
Host details
• DB Name
• DB Version
• App Type/Front End
• Workload Type (OLTP/DW/Mix)
• Node count (single instance/RAC)
• Hostname
• Server Make & Model
• OS
CPU
• CPU Type
• CPU Speed
• # of Logical CPUs
• CPU Utilization Avg/Peak
Memory
• Physical Memory (GB)
• SGA Size (GB)
• PGA Size (GB)
Storage
• Storage Make & Model
• Disk RAID Level
• # of IO Channels
• Database Size (GB)
• Backup Size (GB)
• Peak R + W IOPS
• Peak R IOPS
• Peak W IOPS
• Peak R + W MB/s
• Peak R MB/s
• Peak W MB/s
• Peak R/W Ratio
The gathered raw data points are then fed into the provisioning plan and accounted against the available capacity.
Define the target server
This is the section of the provisioning worksheet where we input the capacity of the Exadata that the customer currently has
or the hardware they have in mind. What we input here is just the initial hardware capacity against which the requirements
are accounted. At the end of the provisioning process we can come back here and add more resources if there are any exceptions
(red highlights) on the utilization summary report.
• Node Count
    o For the node count, enter 2, 4, or 8, which is equivalent to a quarter, half, or full rack of Exadata
• Exadata Speed (SPEC)
    o Get the SPECint_rate equivalent of the Exadata processor so that we can compare the speed of
      the Exadata CPU against the source servers
    o See the section CPU -> The “Speed SPEC” or the “SPECint_rate2006/core” in Appendix B for more
      details on how to get the value for a particular server.
• Exadata CPUs/node
    o The number of logical CPUs, which is equivalent to the CPU_COUNT parameter or /proc/cpuinfo
• Exadata Mem/node (G)
    o Each node has 96GB of memory on an Exadata
• Exadata Storage (G)
    o Disk space depends on ASM redundancy and DATA/RECO allocation. Input the raw GB space
      capacity.
• Backup Factor
    o The backup factor is usually set to 1, which means we want space for at least 1 full backup of the
      database
• Table Compression Factor
    o The table compression factor lets us gain more disk space as we compress the big tables. I usually set this to
      zero for a conservative sizing.
• Offload Factor
    o The offload factor is the amount of CPU resources that will be offloaded to the storage cells.
      I usually set this to zero for a conservative sizing.
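The inputs listed above can be captured in a small data structure. This is a sketch only: the class and field names are assumptions, the defaults follow the half rack values given in the text (4 nodes, 24 logical CPUs and 96GB per node), and the worksheet's own formulas (Appendix B) are not reproduced here.

```python
from dataclasses import dataclass

# Node count to rack size, as stated in the text.
RACK_SIZE = {2: "quarter", 4: "half", 8: "full"}


@dataclass
class ExadataCapacity:
    """Hypothetical container for the 'Exadata Capacity' tab inputs."""
    node_count: int = 4
    cpus_per_node: int = 24          # CPU_COUNT or /proc/cpuinfo
    mem_per_node_gb: int = 96
    backup_factor: float = 1.0       # space for one full backup
    table_compression_factor: float = 0.0   # conservative sizing
    offload_factor: float = 0.0             # conservative sizing

    def __post_init__(self):
        if self.node_count not in RACK_SIZE:
            raise ValueError("node_count must be 2, 4, or 8")

    @property
    def rack(self) -> str:
        return RACK_SIZE[self.node_count]

    @property
    def total_cpus(self) -> int:
        return self.node_count * self.cpus_per_node

    @property
    def total_mem_gb(self) -> int:
        return self.node_count * self.mem_per_node_gb


capacity = ExadataCapacity()
print(capacity.rack, capacity.total_cpus, capacity.total_mem_gb)
```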
Create a provisioning plan
This is the section where we input the data points from Data Gathering (the “System & DB Detail” tab). We then experiment
with the node layout, spreading the instances across the compute nodes according to the customer’s preferences and
the outcome of the migration planning. Underlying capacity planning math processes the data points of
each database according to its node layout and accounts them against the overall and node-level capacity. (See
Appendix B for the formulas.)
The node layout
There are three values for the node layout:
• P = preferred node (green)
    o primary node
    o accepts client connections
• F = failover node (red)
    o secondary node
    o does not accept client connections
    o instance is down and does not consume resources
• A = available node (blue)
    o secondary node
    o client connections are only pre-connected; sessions fail over only when the preferred node
      fails or is shut down
    o instance is up and running and has provisioned resources
Here’s how to interpret the image above:
    o The DBFS database runs across 4 nodes
    o hcm2tst only runs on node1 and has a failover instance on node2
    o bi2tst runs on node3 and has a pre-connect instance on node4
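The three layout codes and the interpretation above can be sketched in Python. The dictionary encoding (one code per node, `None` where a database has no instance) is an assumption made for illustration; the three databases and their placements come from the text.

```python
# One entry per node (node1..node4): "P", "F", "A", or None when not placed.
node_layout = {
    "DBFS":    ["P", "P", "P", "P"],
    "hcm2tst": ["P", "F", None, None],
    "bi2tst":  [None, None, "P", "A"],
}

RUNNING = {"P", "A"}      # instance is up and has provisioned resources
ACCEPTS_CLIENTS = {"P"}   # only preferred nodes accept client connections


def running_instances(layout, node_index):
    """Databases with a running instance on the given node (F instances are down)."""
    return [db for db, codes in layout.items() if codes[node_index] in RUNNING]


def client_facing(layout, node_index):
    """Databases that accept client connections on the given node."""
    return [db for db, codes in layout.items() if codes[node_index] in ACCEPTS_CLIENTS]


instances_per_node = [len(running_instances(node_layout, i)) for i in range(4)]
print(instances_per_node)
```

Here node2 counts only the DBFS instance, because the hcm2tst failover instance is down until a failover happens.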
Above the node layout are node-level utilization reports, which contain the following:
• The number of instances running on that node
• CPU utilization
• Memory utilization
• Recommended minimum Huge Pages allocation
    o The value shown has a 10% allowance for SGA growth/resize
    o To convert the GB value to the actual Huge Pages setting, use the following formula:
        (HPages GB * 1024) / 2
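As a worked example of the Huge Pages conversion above, the sketch below turns a GB value into a page count. It assumes a 2 MB huge page size, which is what the division by 2 in the formula implies; the function name and the 44 GB input are illustrative only, and the input is taken to already include the 10% allowance.

```python
import math


def hugepages_from_gb(hpages_gb: float, page_size_mb: int = 2) -> int:
    """Convert a Huge Pages allocation in GB to a number of pages.

    Implements (HPages GB * 1024) / 2 for the assumed 2 MB page size,
    rounding up so the allocation is never smaller than requested.
    """
    return math.ceil(hpages_gb * 1024 / page_size_mb)


# Hypothetical node: the worksheet recommends 44 GB of Huge Pages.
pages = hugepages_from_gb(44)
print(pages)
```

On Linux the resulting page count is the kind of value typically set through the `vm.nr_hugepages` kernel parameter.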
Most of the columns come from the Data Gathering (“System & DB Detail” tab) sheet, but there’s also a Host Speed (SPEC)
column, which represents the “SPECint_rate2006/core” value of the source server. See the section CPU -> The “Speed
SPEC” or the “SPECint_rate2006/core” in Appendix B for more details on how to get the value for a particular
server.
Node failure scenario
The node layout also depends on the failure scenarios: if one node goes down, the end resource utilization of
the remaining nodes should still be in an acceptable range (below 80%).
Below you’ll see a scenario where node1 goes down (changed from P to F). The remaining preferred nodes catch all
the failed-over sessions, which increases their CPU and memory utilization. Here the CPU reaches a
critical level, 120% utilization, when this scenario happens.
The node failure scenario is essential to the availability planning of the whole cluster. This is the part where we use trial and
error until we find the sweet spot of the node layout: we have failed each node in turn, and the end utilization of
the remaining nodes stays in an acceptable range (below 80%).
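The trial-and-error loop described above (fail each node in turn, then check the survivors) can be sketched as follows. This is a simplified model under stated assumptions: the database names, CPU requirements, and layout are hypothetical, load is split evenly across surviving preferred nodes, and a database with no surviving preferred node moves to its failover or available node. The worksheet's actual formulas are in Appendix B and are not reproduced here.

```python
CPUS_PER_NODE = 24
LIMIT = 0.80   # acceptable utilization ceiling after a node failure

# database -> (CPU requirement, layout code per node); all values hypothetical.
databases = {
    "db1": (20, ["P", "P", None, None]),
    "db2": (16, [None, "P", "P", None]),
    "db3": (12, [None, None, "P", "P"]),
    "db4": (8,  ["F", None, None, "P"]),
}


def node_cpu(databases, failed=None, n_nodes=4):
    """CPUs used per node, optionally with one node (0-based index) failed."""
    used = [0.0] * n_nodes
    for cpus, layout in databases.values():
        alive = [i for i, code in enumerate(layout) if code and i != failed]
        preferred = [i for i in alive if layout[i] == "P"]
        targets = preferred or alive   # fall back to F/A nodes if no P survives
        for i in targets:
            used[i] += cpus / len(targets)
    return used


# Worst per-node utilization for each single-node failure.
worst = {
    failed: max(u / CPUS_PER_NODE for u in node_cpu(databases, failed))
    for failed in range(4)
}
over_limit = [failed for failed, u in worst.items() if u > LIMIT]

for failed, u in worst.items():
    print(f"node{failed + 1} down -> worst surviving node at {u:.0%}")
```

In this hypothetical layout every single-node failure pushes some survivor past 80%, which signals that the layout or the capacity needs another iteration.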
Review resource utilization
As we change the node layout, we can quickly check the “Summary & Graphs” tab to see the effects of the change.
The graphs render in a split second, so we can visually check the allocation of resources and quickly spot any
imbalance in the provisioning plan in terms of CPU, memory, and storage. The top section of the sheet shows the Overall
Utilization and the Recommended Hardware.
Overall Utilization
While the node layout section has the “per node” utilization, which quickly alerts us to any resource imbalance between nodes,
the “Overall Utilization” is very useful for monitoring the resource allocation at the cluster level. Here are some important
points about this summary section:
• It has conditional formatting that applies a “red highlight” to any resource component that reaches 75% or
above
• A “red highlight” means revisiting the provisioning plan (removing databases or reducing allocations) or adding more
resources to the capacity, which could be any of the following:
    o Additional compute nodes
    o Memory upgrade
    o Additional storage
    o All of the above
• The goal is to get rid of the “red highlight”
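The conditional formatting rule above amounts to a threshold check per resource. A minimal sketch, with made-up utilization figures and hypothetical names:

```python
RED_HIGHLIGHT = 0.75   # a resource at 75% or above gets the red highlight


def red_highlights(overall_utilization):
    """Return the resources whose cluster-level utilization is at or above 75%."""
    return [name for name, u in overall_utilization.items() if u >= RED_HIGHLIGHT]


# Hypothetical cluster-level numbers, for illustration only.
overall = {"cpu": 0.67, "memory": 0.81, "storage": 0.52}
flagged = red_highlights(overall)
print(flagged)
```

An empty list is the goal; anything returned here points to the resource that needs a plan change or more capacity.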
Recommended Hardware
All the data points gathered translate to resource requirements, and then to the amount or size of hardware needed to run
smoothly. This section is very helpful for validating whether the hardware that we currently have, or plan to buy, is enough to
run all the databases that will be migrated or consolidated. Here are some important points about this summary section:
• The Equivalent compute nodes figure has a 35% allowance on the CPUs. For example, a Total CPU used of 64.15
translates to 2.67 compute nodes (divide by 24 logical CPUs); with the 35% allowance it becomes 86.6 CPUs, which translates
to 3.6 compute nodes. This allowance covers workload growth or spikes and any unforeseen change in the CPU workload
profile that could be caused by the migration (plan changes, etc.).
• The Equivalent compute nodes figure only accounts for the CPU
• In the example below, the 3.6 nodes satisfy the minimum number of nodes required to run the 64.15 CPU
requirement and still stay below 75% overall CPU utilization. For memory that is not the case, so we either have
to add compute nodes or upgrade the memory from 96GB to 128GB. These resource capacity values can
be modified on the “Exadata Capacity” tab.
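The Equivalent compute nodes arithmetic above can be checked with a few lines of Python. The variable names are illustrative; the inputs (64.15 CPUs used, 24 logical CPUs per node, 35% allowance) come from the text.

```python
CPUS_PER_NODE = 24
ALLOWANCE = 0.35
total_cpu_used = 64.15

raw_nodes = total_cpu_used / CPUS_PER_NODE               # nodes with no headroom
cpus_with_allowance = total_cpu_used * (1 + ALLOWANCE)   # CPUs including the 35% allowance
equivalent_nodes = cpus_with_allowance / CPUS_PER_NODE   # recommended minimum node count
implied_utilization = total_cpu_used / cpus_with_allowance

print(f"{raw_nodes:.2f} nodes without allowance")
print(f"{cpus_with_allowance:.1f} CPUs, {equivalent_nodes:.1f} nodes with allowance")
print(f"{implied_utilization:.1%} CPU utilization at that size")
```

Sizing with a 35% allowance is the same as targeting roughly 74% CPU utilization (1 / 1.35), which is why the result lands just under the 75% red-highlight threshold.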