Exadata Cell metrics
Exadata CELLSRV periodically records important runtime properties,
called metrics, for cell components such as CPUs, cell disks, grid
disks, flash cache, and IORM statistics. These metrics are recorded
in memory. Based on its own metric collection schedule, the
Management Server (MS) gets the set of metric data accumulated by
CELLSRV.
Management Server (MS) provides Exadata cell management and
configuration functions. MS is responsible for sending alerts and
also collects some statistics in addition to those collected by
CELLSRV. Each cell is managed individually with the Exadata cell
command-line interface (CellCLI).
Locate the MS process
---------------------
$ ps -ef | grep ms.err
1000      3940  3723  0 01:42 pts/0    00:00:00 grep ms.err
root     24541 24540  0 Sep28 ?        00:01:32 /usr/java/jdk1.5.0_15/bin/java -Xms256m -Xmx512m -Djava.library.path=/opt/oracle/
Check the Alert History
-----------------------
MS triggers an alert when it discovers a:
- Cell hardware issue
- Cell software or configuration issue
- CELLSRV internal error
- Metric that has exceeded a threshold defined in the cell
CellCLI> list alerthistory
         1       2013-09-26T22:51:15-04:00       critical        "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
         2_1     2013-09-26T22:52:07-04:00       warning         "Hugepage allocation failure in service cellsrv. Number of Hugepages allocated is 0, failed to allocate 110"
         3       2013-09-26T22:54:08-04:00       critical        "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
         4       2013-09-28T13:05:21-04:00       critical        "RS-7445 [Serv RS_BACKUP is absent] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
         5       2013-09-28T22:05:38-04:00       critical        "RS-7445 [Serv CELLSRV is absent] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
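When the alert history is captured to a file or collected from several cells, it can help to summarize it by severity. The following is a minimal Python sketch, not an Oracle utility. It assumes each alert sits on one line as name, begin time, severity, and a quoted message separated by whitespace; the names `ALERT_LINE` and `parse_alerts` are hypothetical, and real output may wrap or align differently.

```python
import re
from collections import Counter

# Assumed layout: name, begin time, severity, then the quoted message,
# all on one line and separated by whitespace.
ALERT_LINE = re.compile(r'^\s*(\S+)\s+(\S+)\s+(\w+)\s+"(.*)"\s*$')


def parse_alerts(text):
    """Return one dict per line that matches the assumed layout."""
    alerts = []
    for line in text.splitlines():
        match = ALERT_LINE.match(line)
        if match:
            name, begin_time, severity, message = match.groups()
            alerts.append({"name": name, "beginTime": begin_time,
                           "severity": severity, "message": message})
    return alerts


# Three of the alerts listed above, one per line.
sample = "\n".join([
    '2_1 2013-09-26T22:52:07-04:00 warning "Hugepage allocation failure '
    'in service cellsrv. Number of Hugepages allocated is 0, failed to '
    'allocate 110"',
    '4 2013-09-28T13:05:21-04:00 critical "RS-7445 [Serv RS_BACKUP is '
    'absent] [It will be restarted] [] [] [] [] [] [] [] [] [] []"',
    '5 2013-09-28T22:05:38-04:00 critical "RS-7445 [Serv CELLSRV is '
    'absent] [It will be restarted] [] [] [] [] [] [] [] [] [] []"',
])

alerts = parse_alerts(sample)
by_severity = Counter(alert["severity"] for alert in alerts)
print(by_severity["critical"], by_severity["warning"])  # 2 1
```

Lines that do not match the assumed layout are skipped rather than guessed at, so wrapped messages would need to be joined before parsing.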
Create and check for disk I/O errors
------------------------------------
CellCLI> create threshold CD_IO_ERRS_MIN comparison='>', warning=0, occurrences=1, observation=1
Threshold CD_IO_ERRS_MIN successfully created
CellCLI> list threshold CD_IO_ERRS_MIN detail
         name:                   CD_IO_ERRS_MIN
         comparison:             >
         observation:            1
         occurrences:            1
         warning:                0.0
CellCLI> list alerthistory where severity='warning';
         2_1     2013-09-26T23:02:12-04:00       warning         "Hugepage allocation failure in service cellsrv. Number of Hugepages allocated is 0, failed to allocate 110"
CellCLI> list alerthistory where severity='critical';
         1       2013-09-26T23:01:18-04:00       critical        "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
         3       2013-09-26T23:04:11-04:00       critical        "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
         4       2013-10-01T06:42:39-04:00       critical        "RS-7445 [Serv CELLSRV is absent] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
CellCLI> list alerthistory where severity='clear';
CellCLI> list alerthistory where severity='info';
MetricType:
- cumulative: Cumulative statistics since the metric was created
- instantaneous: Value at the time that the metric is collected
- rate: Rates computed by averaging statistics over observation
periods
- transition: Collected at the time when the value of the metrics
has changed, and typically captures important transitions in
hardware status
CellCLI> list metriccurrent attributes name,metrictype,metricobjectname,metricvalue,collectionTime where metrictype='Rate'
Monitoring Exadata with Active Requests
---------------------------------------
CellCLI> LIST ACTIVEREQUEST WHERE IoType = 'predicate pushing' DETAIL
ioType identifies the type of active request. Possible values are
file initialization, read, write, predicate pushing, filtered backup
read, and predicate push read.
Check retention period for metric and alert history
---------------------------------------------------
CellCLI> list cell attributes metricHistoryDays
         7
CellCLI> alter cell metrichistorydays=5
Cell qr03cel02 successfully altered
CellCLI> list cell attributes metrichistorydays
         5
CellCLI> list cell attributes name,interconnectCount
         qr03cel02       2
Configure the cell to automatically send an email and/or SNMP message to a designated set of Exadata administrators
--------------------------------------------------------------------------------------------------------------------
CellCLI> alter cell smtpServer='my_mail.example.com', smtpFromAddr='monowar.mukul@example.com', smtpFrom='monowar mukul', smtpToAddr='jane.smith@example.com', notificationPolicy='critical,warning,clear', notificationMethod='mail'
Watching for Undelivered Alerts
-------------------------------
It is important to periodically check the storage servers to make
sure that raised alerts have actually been delivered (via email
and/or to Grid or Cloud Control).
CellCLI> LIST ALERTHISTORY where notificationState != 1 and examinedBy=''
$ dcli -g cell_group cellcli -e "LIST ALERTHISTORY where notificationState != 1 and examinedBy='' "
         1       2013-09-26T23:01:18-04:00       critical        "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
         2_1     2013-09-26T23:02:12-04:00       warning         "Hugepage allocation failure in service cellsrv. Number of Hugepages allocated is 0, failed to allocate 110"
         3       2013-09-26T23:04:11-04:00       critical        "ORA-00700: soft internal error, arguments: [main_6a], [3], [Invalid IP addresses in cellinit.ora file], [], [], [], [], [], [], [], [], []"
         4       2013-10-01T06:42:39-04:00       critical        "RS-7445 [Serv CELLSRV is absent] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
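The same filter can be applied to alert records after they have been collected, for example from `list alerthistory detail` output gathered across cells. The sketch below is a Python illustration only: the record layout, the helper name `undelivered`, and the sample values are hypothetical. It treats a notificationState of 1 as "delivered", which is what the CellCLI filter above implies.

```python
def undelivered(alerts):
    """Return alerts that were not delivered and that nobody has examined.

    Mirrors: notificationState != 1 and examinedBy=''
    """
    return [alert for alert in alerts
            if alert["notificationState"] != 1 and alert["examinedBy"] == ""]


# Hypothetical records, for illustration only.
sample = [
    {"name": "1", "notificationState": 0, "examinedBy": ""},
    {"name": "2_1", "notificationState": 1, "examinedBy": ""},
    {"name": "3", "notificationState": 0, "examinedBy": "investigator"},
]

pending = [alert["name"] for alert in undelivered(sample)]
print(pending)  # ['1']
```

Only alert 1 remains: alert 2_1 was delivered, and alert 3 has already been examined.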
Drop Alert History
------------------
CellCLI> drop alerthistory all
Alert 1 successfully dropped
Alert 2_1 successfully dropped
Alert 3 successfully dropped
Checking Threshold
------------------
CellCLI> list threshold
         cl_fsut./
         cl_fsut./u01
CellCLI> create threshold cl_fsut."/u01" comparison='>', warning=80
Threshold cl_fsut."/u01" successfully created
CellCLI> list threshold detail
         name:                   cl_fsut./
         comparison:             >
         warning:                70.0

         name:                   cl_fsut./u01
         comparison:             >
         warning:                80.0
CellCLI> alter threshold cl_fsut."/" comparison='>', warning=50
Threshold cl_fsut."/" successfully altered
CellCLI> list threshold detail
         name:                   cl_fsut./
         comparison:             >
         warning:                50.0

         name:                   cl_fsut./u01
         comparison:             >
         warning:                80.0
Execute the following command in the cell operating system. It
creates a 512 MB file on the root file system, which increases the
utilization metric. After the metric crosses the threshold, an alert
is generated.
$ dd if=/dev/zero of=/tmp/file.out bs=1024 count=500000
[celladmin@qr03cel02 ~]$ dd if=/dev/zero of=/tmp/file.out bs=1024 count=500000
500000+0 records in
500000+0 records out
512000000 bytes (512 MB) copied, 4.25551 seconds, 120 MB/s
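As a quick check of the figures in the dd output, the file size and transfer rate follow directly from the bs and count values. This is a small Python sketch of the arithmetic only; the variable names are illustrative.

```python
BLOCK_SIZE = 1024          # bs=1024, bytes per block
BLOCK_COUNT = 500_000      # count=500000
ELAPSED_SECONDS = 4.25551  # elapsed time reported by dd above

total_bytes = BLOCK_SIZE * BLOCK_COUNT
decimal_mb = total_bytes / 10**6            # dd reports decimal megabytes
rate_mb_per_s = decimal_mb / ELAPSED_SECONDS

print(total_bytes)           # 512000000
print(decimal_mb)            # 512.0
print(round(rate_mb_per_s))  # 120
```

The "512 MB" in the text is therefore a decimal figure (512,000,000 bytes), slightly less than 512 MiB.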
[celladmin@qr03cel02 ~]$ cellcli
CellCLI: Release 11.2.3.1.0 - Production on Mon Sep 30 01:36:45 EDT 2013
Copyright (c) 2007, 2011, Oracle.  All rights reserved.
Cell Efficiency Ratio: 26M
CellCLI> list alerthistory
         1_1     2013-09-30T01:32:46-04:00       warning         "The warning threshold for the following metric has been crossed. Metric Name : CL_FSUT Metric Description : Percentage of total space on this file system that is currently used Object Name : / Current Value : 56.0 % Threshold Value : 50.0 % "
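The alert fires because the current value (56.0) satisfies the threshold's comparison against the new warning level (50.0); against the earlier warning level of 70.0 it would not have. A minimal Python sketch of that check follows. The helper `crosses` and the operator table are hypothetical, and the set of operators is an assumption, since only '>' appears in this document.

```python
import operator

# Assumed set of comparison operators; only '>' is used in this document.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
}


def crosses(value, comparison, limit):
    """Return True when the metric value satisfies the threshold comparison."""
    return COMPARATORS[comparison](value, limit)


print(crosses(56.0, ">", 50.0))  # True: warning=50 after the alter
print(crosses(56.0, ">", 70.0))  # False: original warning=70
```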
CellCLI> alter alerthistory 1_1 examinedby='investigator'
Alert 1_1 successfully altered
CellCLI> list alerthistory detail
         name:                   1_1
         alertMessage:           "The warning threshold for the following metric has been crossed. Metric Name : CL_FSUT Metric Description : Percentage of total space on this file system that is currently used Object Name : / Current Value : 56.0 % Threshold Value : 50.0 % "
         alertSequenceID:        1
         alertShortName:         CL_FSUT
         alertType:              Stateful
         beginTime:              2013-09-30T01:32:46-04:00
         endTime:
         examinedBy:             investigator
         metricObjectName:       "/"
         metricValue:            56.0
         notificationState:      0
         sequenceBeginTime:      2013-09-30T01:32:46-04:00
         severity:               warning
         alertAction:            "Examine the metric value that is violating the specified threshold, and take appropriate actions if needed."
The value of the name attribute is a composite of abbreviations.
• CL_ (cell)
• CD_ (cell disk)
• GD_ (grid disk)
• FC_ (flash cache)
• DB_ (database)
• CG_ (consumer group)
• CT_ (category)
• N_ (interconnect network)
-- Monitoring IORM with cellcli command.
I/O-related metrics:
• IO_RQ (number of requests)
• IO_BY (number of MB)
• IO_TM (I/O latency)
• IO_WT (I/O wait time)
Suffixes that qualify the metric:
• _R (read)
• _W (write)
• _SM (small I/O)
• _LG (large I/O)
• _SEC (per second)
• _RQ (per request)
Examples:
• CD_IO_WT_R_SM is the cell disk (CD_) I/O wait time (IO_WT) to read
(_R) small blocks (_SM).
• GD_IO_RQ_W_LG_SEC is the grid disk (GD_) number of requests
(IO_RQ) to write (_W) large blocks (_LG) per second (_SEC).
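The naming convention can be sketched as a small decoder. This is a minimal Python illustration, not an Oracle tool: the function name `decode_metric_name` and the lookup tables are hypothetical and cover only the abbreviations listed in this section. Tokens it does not recognize are returned unchanged.

```python
# Lookup tables built only from the abbreviations listed in this section.
OBJECT_PREFIXES = {
    "CL": "cell", "CD": "cell disk", "GD": "grid disk", "FC": "flash cache",
    "DB": "database", "CG": "consumer group", "CT": "category",
    "N": "interconnect network",
}
IO_MEASURES = {
    "IO_RQ": "number of requests", "IO_BY": "number of MB",
    "IO_TM": "I/O latency", "IO_WT": "I/O wait time",
}
QUALIFIERS = {
    "R": "read", "W": "write", "SM": "small I/O", "LG": "large I/O",
    "SEC": "per second", "RQ": "per request",
}


def decode_metric_name(name):
    """Split a metric name such as CD_IO_WT_R_SM into readable parts."""
    tokens = name.upper().split("_")
    parts = [OBJECT_PREFIXES.get(tokens[0], tokens[0])]
    rest = tokens[1:]
    # An I/O measure spans two tokens, for example IO + WT.
    if len(rest) >= 2 and "_".join(rest[:2]) in IO_MEASURES:
        parts.append(IO_MEASURES["_".join(rest[:2])])
        rest = rest[2:]
    # Remaining tokens are qualifiers; unknown ones pass through as-is.
    parts.extend(QUALIFIERS.get(token, token) for token in rest)
    return parts


print(decode_metric_name("CD_IO_WT_R_SM"))
# ['cell disk', 'I/O wait time', 'read', 'small I/O']
print(decode_metric_name("GD_IO_RQ_W_LG_SEC"))
# ['grid disk', 'number of requests', 'write', 'large I/O', 'per second']
```

A name outside the I/O family, such as CL_FSUT, decodes only its object prefix and leaves FSUT untouched, which keeps the sketch from guessing at abbreviations this section does not define.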