Performance Tuning,
Monitoring, Management
Getting the Most out of SUSE® Linux Enterprise Server




Matthias G. Eckermann
Senior Product Manager
SUSE Linux Enterprise
mge@novell.com
Agenda


       Performance Analysis and Tuning

       Kernel Resource Management with Control Groups

       Built-in Monitoring Capabilities




2   © Novell, Inc. All rights reserved.
Part I:
Performance Analysis and Tuning
General Considerations
(Hardware, Configuration,...)
Hardware and Configuration

    Ultimately, hardware and its configuration set the
    upper limits for our tuning efforts.
    Are we starting with the best possible (minimum
    needed) hardware platform and components?
              –   CPU speed only critical for compute-intense tasks
              –   RAM (amount and speed) and interconnects do matter
              –   Bottleneck I/O: network bandwidth, disk,...

    Is the hardware configuration appropriate?


    The weakest link kills performance!
(Hardware) Configuration

    Optimize storage configuration
              –   Optimize distribution of data across controllers/disks
              –   Swap to extra disk
              –   Use RAID with striping
    Tune hardware setup (BIOS, EFI,...)
              –   Only enable/probe what you have.
              –   Tune for fast reboot vs. startup checks (if desired)
              –   Carefully review all settings.
    Disable unneeded services
    # rc<SERVICE> stop

Identifying Problems
Where Has My Memory Gone!?

    Slab Cache
              –   Structures of much less than one page in size
              –   Generic slabs of predefined sizes (32, 64) plus
                  slabs for specific data structures

    Page Cache
              –   Pages with actual contents of files (or block device)
                  usually the largest, by far

    Buffer Cache
              –   File system metadata
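
The three caches above can be read from /proc/meminfo. The following is a minimal sketch, assuming the usual `Buffers`, `Cached` and `Slab` field names; the function name `cache_summary` is made up for illustration and is not part of any SUSE tool.

```shell
#!/bin/sh
# cache_summary: print buffer cache, page cache and slab cache sizes in kB.
# Reads meminfo-formatted text from the file given as $1
# (defaults to /proc/meminfo). The function name is illustrative only.
cache_summary() {
    awk '$1 == "Buffers:" { printf "buffer cache: %d kB\n", $2 }
         $1 == "Cached:"  { printf "page cache: %d kB\n", $2 }
         $1 == "Slab:"    { printf "slab cache: %d kB\n", $2 }' "${1:-/proc/meminfo}"
}
```

Calling `cache_summary` without an argument reports the values of the running system; slabtop then breaks the slab figure down per cache.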



Identifying Problems

    Start by finding the bottleneck: I/O, disk, mem,...
    iostat to identify overloaded drives
         –   package sysstat
             # iostat -x 1

    vmstat for basic system usage
               # vmstat 1

        r  b   swpd   free   buff   cache  si  so  bi  bo    in    cs  us  sy  id  wa
        0  0  76804   8268  14996  167396   1   1  36  64   132   197   4   1  92   3
        0  0  76804   8268  14996  167396   0   0   0   0  1023   879   3   0  97   0
        0  0  76804   8300  14996  167396   0   0   0   0  1158  1134   2   0  98   0


    slabtop for slab cache use
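
When hunting for an I/O bottleneck, the `wa` (I/O wait) column of vmstat is a quick first indicator. The sketch below averages it over a series of samples; the function name `avg_iowait` is hypothetical, and the column is located from the header line rather than assumed to be in a fixed position.

```shell
#!/bin/sh
# avg_iowait: average the "wa" column over vmstat output read from stdin.
# The column position is taken from the header line that starts with "r".
# The function name is illustrative only.
avg_iowait() {
    awk '$1 == "r" { for (i = 1; i <= NF; i++) if ($i == "wa") col = i; next }
         col && $1 ~ /^[0-9]+$/ { sum += $col; n++ }
         END { if (n) printf "%.1f\n", sum / n }'
}
```

A possible use is `vmstat 1 10 | avg_iowait`. Keep in mind that the first vmstat sample reports averages since boot, so longer runs give a more meaningful figure.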
File Systems
Picking a File System

     Pick the right file system for the task
               –   Indexed metadata

               –   File sizes

               –   Number of files

               –   Workloads (database, mail server,...)

               –   Access paths

               –   Dump/Restore




SUSE® Linux Enterprise Filesystems

     Feature                     Ext3      ReiserFS     XFS       OCFS2     Btrfs
     Data/Metadata Journaling    •/•       ○/•          ○/•       ○/•       N/A [3]
     Journal internal/external   •/•       •/•          •/•       •/○       N/A
     Offline extend/shrink       •/•       •/•          ○/○       •/○       •/•
     Online extend/shrink        •/○       •/○          •/○       •/○       •/•
     Inode-Allocation-Map        table     u. B*-tree   B+-tree   table     B-tree
     Sparse Files                •         •            •         •         •
     Tail Packing                ○         •            ○         ○         •
     Defrag                      ○         ○            •         ○         •
     ExtAttr / ACLs              •/•       •/•          •/•       •/•       •/•
     Quotas                      •         •            •         •         •
     Dump/Restore                •         ○            •         ○         ○
     Blocksize default           4 KiB (all)
     max. Filesystemsize [1]     16 TiB    16 TiB       8 EiB     16 TiB    16 EiB
     max. Filesize [1]           2 TiB     1 EiB        8 EiB     1 EiB     16 EiB
     Support Status              SLES      SLES         SLES      SLE HA    Technology Preview

     SUSE® Linux Enterprise was the first enterprise Linux distribution to support journaling filesystems and logical volume managers, back in 2000. Today, we have customers running XFS and ReiserFS with more
     than 8 TiB in one filesystem, and the SUSE Linux Enterprise engineering team is using our 3 major Linux journaling filesystems for all their servers. We are excited to add the OCFS2 cluster filesystem to the range of
     supported filesystems in SUSE Linux Enterprise. For large-scale filesystems, for example for file serving (e.g., with Samba, NFS, etc.), we recommend using XFS. (In this table "•" means "available/supported"; "○"
     means "unsupported".)
     [1] The maximum file size above can be larger than the filesystem's actual size due to usage of sparse blocks. It should also be noted that unless a filesystem comes with large file support (LFS), the maximum file size on a 32-bit system is 2 GB
     (2^31 bytes). Currently all of our standard filesystems (including ext3 and ReiserFS) have LFS, which gives a maximum file size of 2^63 bytes in theory. The numbers given in the above table assume that the filesystems are using 4 KiB block size.
     When using different block sizes, the results are different, but 4 KiB reflects the most common standard.

     [2] In this document: 1024 Bytes = 1 KiB; 1024 KiB = 1 MiB; 1024 MiB = 1 GiB; 1024 GiB = 1 TiB; 1024 TiB = 1 PiB; 1024 PiB = 1 EiB (see also http://physics.nist.gov/cuu/Units/binary.html )
     [3] Btrfs is a copy-on-write logging-style file system, so rather than needing to journal changes before writing them in-place, it writes them in a new location, and then links it in. Until the last write,
     the new changes are not “committed.”


File Systems: ReiserFS

     Applications that use many small files

          –   Mail servers

          –   NFS servers

          –   Database servers


     or other applications that use synchronous I/O




File Systems: Ext3

     Default file system in SUSE® Linux Enterprise 11




     Best suited for
          –   Small (<100GiB) file systems
     When using Ext3 with many files in one directory,
     consider enabling btree support
     (enabled by default in SUSE Linux Enterprise
     Server 11 SP 1)
     # mkfs.ext3 -O dir_index
     When using Ext3 with multiple threads appending to files
     in the same directory, consider turning preallocation on
      # mount -o reservation

File Systems: XFS

     Best suited for
          –   Medium (>100GiB) to very large file systems (> 1 TiB)
          –   Large files/many files
          –   Streaming multimedia (low latencies)

     Special features and capabilities
          –   dump/restore
          –   online filesystem-check
          –   online-defragmentation




Cluster File System: OCFS2

     OCFS2 (Oracle Cluster File System)
     •   Shared access by multiple nodes
          –   Ensures data integrity in case of a node-failure
          –   Scale-out for data access
     •   Generic use
          –   POSIX-compliant
          –   Cluster-aware POSIX locking
     •   Higher throughput
          –   Parallel I/O
     •   Disaster Tolerance
          –   Integration with data replication for dual node

Filesystems: btrfs

     •   Integrated Volume Management
     •   Support for copy on write
     •   Powerful snapshot capabilities
     •   Scalability
     •   Data integrity (checksums)
     •   Full community support
     •   Technology preview in
         SUSE® Linux Enterprise Server 11 SP 1




Barriers

     SUSE® Linux Enterprise defaults to the maximum data integrity
     guarantee by enforcing barriers from the file system, so
     that journal writes cannot be reordered.
     This may cost some performance; tunable via mount option
     ReiserFS
          –   enable with “barrier=flush” (default)
          –   disable with “barrier=none”
     Ext3
          –   enable with “barrier=1” (default)
          –   disable with “barrier=0”
     XFS
          –   enable with “barrier”
          –   disable with “nobarrier”
Logging Modes

     Journaling file systems offer different modes to write
     the actual data
     For ReiserFS and Ext3, mount option data=<X>
               –   data=ordered: data is written before its metadata is committed;
                   no risk of exposing old data (default)
               –   data=writeback: no ordering between data and metadata writes;
                   fastest in many workloads
               –   data=journal: use the journal for data as well;
                   generally slow, but can improve mail server workloads


     By default, SUSE® Linux Enterprise Server ensures
     data integrity at the cost of some performance
Dedicated Logging Devices

     ReiserFS
              mkreiserfs -j /dev/xxx -s 8193 /dev/xxy
              reiserfstune --journal-new-device /dev/xxx -s 8193 /dev/xxy

     Ext3
              mke2fs -O journal_dev /dev/xxx
              mke2fs -j -J device=/dev/xxx,size=8193 /dev/xxy
              tune2fs -J device=/dev/xxx,size=8193 /dev/xxy

     XFS
              mkfs.xfs -l logdev=/dev/xxx,size=10000b /dev/xxy

File System Tuning

     Split file systems based on data access patterns

          –   Keep commit heavy data away from data that does not
              have to be synchronous

          –   Keep streaming writes and reads on different spindles than
              random I/O


     Consider disabling atime updates on files and directories

              # mount -o noatime,nodiratime


File System Tuning

     Optimize directory layout for the file system

          –   Keep data that will be accessed together in the
              same subdirectories


          –   Spread data out into different subdirectories to increase
              large file concurrency


          –   Different file systems order directories differently




Block Layer
I/O Scheduler

     Flexible, pluggable I/O scheduler
     Selectable via boot parameter elevator=<X>
          –   noop
          –   deadline
          –   as (default in mainline kernels)
          –   cfq (default in SUSE® Linux Enterprise)




     I/O Scheduler per device
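          –   Check (the active scheduler is shown in brackets)
              cat /sys/block/*DEV*/queue/scheduler
              (its tunables live in /sys/block/*DEV*/queue/iosched/)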
          –   Set
              echo SCHEDNAME > /sys/block/*DEV*/queue/scheduler
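
The sysfs `scheduler` file lists all available schedulers on one line and marks the active one with square brackets, for example `noop deadline [cfq]`. As a small sketch, the helper below extracts the active name from such a line; `active_scheduler` is a made-up name and `DEV` in the usage comment is a placeholder.

```shell
#!/bin/sh
# active_scheduler: print the scheduler marked with [brackets] in a line
# such as "noop deadline [cfq]" (the format of the sysfs scheduler file).
# The function name is illustrative only.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Usage on a real device (DEV is a placeholder, e.g. sda):
#   active_scheduler "$(cat /sys/block/DEV/queue/scheduler)"
```

Checking the result before and after writing a new name to the file is a simple way to confirm that a per-device change took effect.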

I/O Scheduler: Noop

     No reordering, just merging

     Best for storage with extensive caching and
     scheduling of its own, such as:

     MultiPathing



     Activated by boot parameter elevator=noop



I/O Scheduler: Deadline

     Per-request service deadline

          –   Caps maximum latency per request



          –   Maintains good disk throughput


     Best for disk-intensive database applications


     Activated by boot parameter elevator=deadline

I/O Scheduler: Anticipatory

     Similar to “deadline”, but anticipates reads by putting
     them in front of the queue and delays a few ms after
     every read
          –   Maximizes throughput
          –   At the cost of increasing latency
     Best for file servers and desktop workloads with
     single IDE/SATA disks.
     Default in mainline kernels

     Activated by boot parameter elevator=as


I/O Scheduler: CFQ

     Complete Fair Queuing
     Treat all competing processes equally by keeping
     a unique request queue for each and giving equal
     bandwidth to each queue
          –   Good compromise between throughput and latency
          –   Minimal worst case latency on all reads and writes
     Suitable for a wide variety of applications
     Default in SUSE® Linux Enterprise




     Activated by boot parameter elevator=cfq

Block Layer Tuning

     Spreading the load across controllers


               –   Per-target locking for SCSI



               –   Software RAID bandwidth



     Battery backed caching



Block Layer Tunables

     Block read ahead buffer
     /sys/block/<sdX/hdX>/queue/read_ahead_kb
     Default is 128. Increase to 512 for fast storage
     (SCSI disks or RAID)
     May speed up streaming reads a lot


     Number of requests
     /sys/block/<sdX/hdX>/queue/nr_requests
     Default is 128. Increase to 256 with CFQ
     scheduler for fast storage
     Increases throughput at minor latency expense
Memory Management (VM)
Buffer Flushing

     How to write dirty pages to disk
     This can be tuned by
               –   /proc/sys/vm/dirty_ratio (40%)
                   Generator of dirty data starts writeback.
               –   /proc/sys/vm/dirty_background_ratio (10%)
               –   /proc/sys/vm/dirty_expire_centisecs (3000)
                   How long may dirty pages remain dirty?
               –   /proc/sys/vm/dirty_writeback_centisecs (500)
                   How often does bdflush wake up?
     Defaults are pretty high which is good for databases
     (but may result in lots of unreclaimable pagecache)
     For other workloads (HPC) you may want to lower these
VM: Swappiness

     The threshold when processes should be swapped
     can be tuned via
               –   /proc/sys/vm/swappiness

     Default is 60, which works well if you want to swap out
     daemons or programs which have not done a lot lately



     Higher values will provide more buffer/page cache,
     lower values will wait longer to swap out idle processes


NUMA (1)

     NUMA = Non-uniform Memory Architecture
     SUSE® Linux Enterprise detects and uses NUMA
     topology and automatically
          –   Prefers memory that is local to a node;

          –   Evenly balances system data across nodes;

          –   Gracefully handles CPU-less nodes; etc.



     Also applications can (and should) optimize for
     NUMA topology!

NUMA (2)

     The NUMA system can be tuned via `numactl $CMD`;
     the settings then apply to $CMD and all of its children
          –   --preferred=255

          –   --membind=!0-1
          –   --cpunodebind=2-5

          –   --physcpubind=2-5

          –   --localalloc (always allocate from current node)



     Node 0 may be the most contended, so avoid it
Miscellaneous
(Scheduler, Network)
Binding Processes/Interrupts to CPUs

     Problem: context switching costs
     CPU affinity: binding a process to specific CPUs
     can improve performance
          –   taskset 0x3 [-p pid] [command]

              In this example, 0x3 is a bitmask selecting CPUs 0 and 1
              (the first and second CPU); 0x6 would select CPUs 1 and 2.
     Bind interrupts to CPUs
          –   cat /proc/interrupts

          –   echo 0x3 > /proc/irq/0/smp_affinity

          –   Example: distribute NICs among CPUs.
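
Affinity masks for taskset and smp_affinity are plain bitmaps: bit N stands for CPU N, with CPUs counted from 0. The sketch below decodes a mask into the CPU numbers it selects; the helper name `mask_to_cpus` is hypothetical and only meant to make the bitmap arithmetic concrete.

```shell
#!/bin/sh
# mask_to_cpus: list the CPU numbers (counted from 0) selected by an
# affinity bitmask. Bit N of the mask stands for CPU N.
# The function name is illustrative only.
mask_to_cpus() {
    mask=$(( $1 ))      # shell arithmetic accepts hex such as 0x3
    cpu=0
    list=""
    while [ "$mask" -gt 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            list="${list:+$list }$cpu"
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    printf '%s\n' "$list"
}
```

For example, `mask_to_cpus 0x3` lists CPUs 0 and 1, and `mask_to_cpus 0x6` lists CPUs 1 and 2. Decoding a mask this way before writing it to /proc/irq/<N>/smp_affinity helps avoid off-by-one mistakes.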

Network Improvements

     Gigabit Ethernet and 10g
               –   Significant interrupt overhead reduction

               –   Consider Jumbo Frames (larger than 1500 bytes)

               –      # ifconfig <DEV> mtu 9000

     NFS modes
               –   TCP (default) vs UDP

               –   NFSv3 (default) vs NFSv4

               –   rsize=<X>/wsize=<X>
                   - read/write in chunks of <X> bytes
                   - default is 1024, use 8192 for higher throughput
Application Interplay
Async I/O, O_DIRECT

     Asynchronous I/O
               –   Specific model for concurrency
               –   Heavily used by databases

     Direct I/O (O_DIRECT) on block devices or files
               –   Databases like to use raw disks. Historically /dev/raw
                   was used, but O_DIRECT is more performant.
               –   Files should be preallocated (no holes, no appending);
                   the system falls back to buffered I/O otherwise.
               –   In both cases: less page cache pollution
               –   Not specific to database workloads!

Part II:
Kernel Resource Management with
                 Control Groups
Control Groups

     •   Understanding control groups: An in-depth overview

          –   What Control Groups are designed to do

          –   How Control Groups work


     •   Using control groups in
         SUSE® Linux Enterprise Server 11




          –   Understanding the components



Understanding Control Groups
        An In-depth Overview
What Are Control Groups?

     Control Groups provide a mechanism for
     aggregating/partitioning sets of tasks, and all their
     future children, into hierarchical groups with
     specialized behavior.
          –   cgroup is another name for Control Groups
          –   Partition tasks (processes) into one or many groups of
              tree hierarchies
          –   Associate a set of tasks in a group with a set of subsystem
              parameters
          –   Subsystems provide the parameters that can be assigned
          –   Tasks are affected by the assigned parameters

Example of the Capabilities of a cgroup

     Consider a large university server with various users -
     students, professors, system tasks etc. The resource
     planning for this server could be along the following lines:


     CPUs
          Top cpuset (20%)
               CPUSet1 (Profs): 60%
               CPUSet2 (Students): 20%
     Memory
          Professors = 50%, Students = 30%, System = 20%
     Disk I/O
          Professors = 50%, Students = 30%, System = 20%
     Network I/O
          WWW browsing = 20%
               Prof (15%), Students (5%)
          Network File System (60%)
          Others (20%)

     Source: /usr/src/linux/Documentation/cgroups/cgroups.txt
Control Group Subsystems

     Two types of subsystems
     •   Isolation and special controls
          –   cpuset, namespace, freezer, device, checkpoint/restart
     •   Resource control
          –   cpu(scheduler), memory, disk i/o, network


     Each subsystem can be mounted independently
          –   mount -t cgroup -o cpu none /cpu
          –   mount -t cgroup -o cpuset none /cpuset
     or all at once
          –   mount -t cgroup none /cgroup
     Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
Cpuset Subsystem (Isolation)

     Cpuset ties processes to CPUs and memory nodes.

     [Figure: process groups A1, A2 and B assigned to memory nodes]



     Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
Namespace Subsystem (Isolation)

     The namespace subsystem shows processes in a cgroup a private
     view of the system. It is mainly used for OS-level
     virtualization. The subsystem itself has no special
     functions and just tracks changes in namespaces.




     Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
Freezer Subsystem (Control)

     Freezer cgroup is for freezing (stopping) all tasks in
     a group.

              mount -t cgroup none /freezer -o freezer

              ....put task into /freezer/tasks...

              echo FROZEN > /freezer/freezer.state

              echo THAWED > /freezer/freezer.state



     Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
Device Subsystem (Isolation)

     A system administrator can provide a list of devices
     that can be accessed by processes under cgroup
          –   Allow/Deny Rule

          –   Allow/Deny : READ/WRITE/MKNOD




     Limits access to device or file system on a device to
     only tasks in specified cgroup



     Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
Checkpoint/Restart Subsystem
     (Control)
     •   Save the state of all processes in a cgroup to a dump file
         and restart them later (or just save the state and continue)

     •   Allows a “saved container” to be moved between
         physical machines (as a VM can be)

     •   Dump the images of all processes to a file



     Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
CPU Subsystem (Resource Control)

     •   Share CPU bandwidth between groups by group
         scheduling function of CFS (the scheduler)
     •   Mechanically complicated




                 Share = 2000              Share = 1000   Share = 4000
Memory Subsystem
     (Resource Control)
     •   For limiting memory usage of user space processes.

     •   Limit LRU (Least Recently Used) pages

           –    Anonymous and file cache


     •   No limits for kernel memory

           –    Maybe in another subsystem if needed



     Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
Disk I/O Subsystem
     (Resource Control) (Draft)
     •   3 proposals are currently being discussed
          –   dm-ioband, io-throttle, io-controller
     •   Consensus has not been reached, but io-controller
         seems to be taking the lead
          –   Both dm-ioband and io-throttle suffer from a significant
              problem: they can defeat the policies (such as I/O priority)
              being implemented by the I/O scheduler.
          –   io-controller does bandwidth control at the I/O scheduler level
          –   Designed to work with the mainline I/O schedulers (CFQ, deadline,
              anticipatory, and noop), but requires significant changes
          –   Currently v4 as of June 8, 2009 and based on
              2.6.30-rc8 kernel
     Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
     Source: http://lwn.net/Articles/331857/
     Source: http://lwn.net/Articles/332839/
     Source: http://lkml.org/lkml/2009/6/8/580
Network I/O Subsystem
     (Resource Control) (Draft)
     •   Like the Disk I/O subsystem, it seems the jury is
         still out on the implementation of this subsystem

     •   Kernel developers are talking about traffic control

          –   cgroup_tc - This patch provides a simple resource
              controller which uses traffic control (tc) features already
              in the Linux kernel


          –   Not much discussion on this topic since late 2008


     Source: http://lkml.org/lkml/2008/7/22/361
     Source: https://lists.linux-foundation.org/pipermail/containers/2008-August/012419.html
     Source: https://lists.linux-foundation.org/pipermail/containers/2008-August/012512.html
Reading More on cgroups

     Remember to install kernel source!!
           –   /usr/src/linux/Documentation/cgroups/cgroups.txt
           –   /usr/src/linux/Documentation/cpusets.txt
           –   /usr/src/linux/Documentation/controllers/*
           –   /usr/src/linux/Documentation/scheduler/sched-design-CFS.txt
          –    /usr/src/linux/Documentation/kernel-parameters.txt

     Additional RPM packages
           –   libcgroup1 - /usr/share/doc/packages/libcgroup1/README*
          –    cpuset (Alex Tsariounov) -
               /usr/share/doc/packages/cpuset/cset*.txt

Reading More on cgroups (continued)

     Manpages
          –    man cpuset

          –    man cset

     On the web
           –   http://lkml.org/lkml/2009/2/9/372

           –   http://lkml.org/lkml/2009/2/10/140

           –   http://lkml.org/lkml/2008/1/29/60

           –   http://kerneltrap.org/mailarchive/linux-kernel/2008/6/18/2161114/thread

Using Control Groups in
SUSE® Linux Enterprise Server 11
Preparing
     SUSE® Linux Enterprise Server 11




     •   Start with patched SLES11 install

     •   Add the following packages
          –   libcgroup1

          –   cpuset

          –   libcpuset1

          –   kernel-source (Documentation purposes)

          –   gcc (Needed to compile the stress tool)


What Subsystems Are Available?

     A way to figure this out.
           –    mount -t cgroup none /cgroup
           –    cat /proc/mounts


     Current subsystems in SUSE Linux Enterprise
     Server 11
          –    rw,freezer,devices,cpuacct,cpu,ns,cpuset
          –    memory – Disabled by default
               >   Add a kernel parameter - cgroup_enable=memory

     Possible future subsystems in SLES 11
           –    Disk and Network subsystem controllers

Generating Load on
     SUSE® Linux Enterprise Server 11




     Search the Web for “linux load generator” - Results:
     •   http://devin.com/lookbusy/
     •   http://www.ibm.com/developerworks/linux/library/l-stress/index.html
          –   Good article
     •   http://ltp.sourceforge.net/
          –   Powerful toolkit for Linux developers
     •   http://hardware.slashdot.org/article.pl?sid=05/04/06/218233
          –   Simple scripting examples
     •   http://weather.ou.edu/~apw/projects/stress/
          –   Probably best choice
          –   Available (community driven) also at:

     http://software.opensuse.org/search?baseproject=SUSE:SLE-11&p=1&q=stress

Crash Course on CPUSETs
     The Hard Way

     •   Determine the number of CPUs and Memory Nodes
           –    Look at /proc/cpuinfo and /proc/zoneinfo
     •   Creating the CPUSET hierarchy
                 mkdir /dev/cpuset
                 mount -t cpuset cpuset /dev/cpuset
                 cd /dev/cpuset
                 mkdir Charlie
                 cd Charlie
                 /bin/echo 2-3 > cpus
                 /bin/echo 1 > mems
                 /bin/echo $$ > tasks
                 # The current shell is now running in cpuset Charlie
                 # The next line should display '/Charlie'
                 cat /proc/self/cpuset
     •   Removing the CPUSET
                 # List the remaining tasks; every one of them (including the
                 # shell moved in above) must be moved to another cpuset first
                 cat /dev/cpuset/Charlie/tasks
                 rmdir /dev/cpuset/Charlie
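
     The first and last steps above can be scripted. This is a minimal sketch
     under stated assumptions: the three function names are hypothetical, the
     CPU count relies on the x86-style "processor : N" lines in /proc/cpuinfo,
     and the node count relies on the "Node N, zone ..." lines in /proc/zoneinfo.

```shell
# Count logical CPUs from a /proc/cpuinfo-style file (x86 layout assumed).
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Count distinct memory nodes from a /proc/zoneinfo-style file.
count_mem_nodes() {
    awk '/^Node/ { sub(",", "", $2); seen[$2] = 1 }
         END { n = 0; for (k in seen) n++; print n }' "${1:-/proc/zoneinfo}"
}

# Move every task listed in SRC_DIR/tasks to the cpuset whose tasks file
# is DEST_TASKS. The kernel accepts one PID per write, hence the loop.
drain_cpuset() {
    src_dir=$1
    dest_tasks=$2
    while read -r pid; do
        # A task may exit between the read and the write; ignore that error.
        echo "$pid" 2>/dev/null > "$dest_tasks" || true
    done < "$src_dir/tasks"
}

# Usage on a live system (as root), with the layout from the slide:
#   drain_cpuset /dev/cpuset/Charlie /dev/cpuset/tasks
#   rmdir /dev/cpuset/Charlie
```

     Draining into /dev/cpuset/tasks returns the tasks to the root cpuset; once
     the tasks file of Charlie is empty, rmdir succeeds.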

Crash Course on CPUSETs
     The Easy Way – Thanks to Alex Tsariounov of Novell®




     •   Determine the number of CPUs and Memory Nodes
           –    cset set --list
     •   Creating the CPUSET hierarchy
           –    cset set --cpu=2-3 --mem=1 --set=Charlie
     •   Starting processes in a CPUSET
           –    cset proc --set Charlie --exec -- stress -c 1 &
     •   Moving existing processes to a CPUSET
           –    cset proc --move --pid PID --toset=Charlie
     •   Listing tasks in a CPUSET
           –    cset proc --list --set Charlie
     •   Removing a CPUSET
           –    cset set --destroy Charlie

Follow It Up with cgroups
     The Hard Way

     •   Creating the cgroup hierarchy
                 mkdir /dev/cgroup
                 mount -t cgroup cgroup /dev/cgroup
                 cd /dev/cgroup
                 mkdir priority
                 cd priority
                 cat cpu.shares

     •   Understanding cpu.shares (relative weights; more in sched-design-CFS.txt)
           –    Percentages assume one competing group at the default of 1024
                and fully loaded CPUs
           –    1024 (default) = 50% utilization
           –    1524 = 60% utilization
           –    2048 = 67% utilization
           –    512 = 33% utilization
     •   Changing cpu.shares
           –    /bin/echo 1024 > cpu.shares
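
     The percentages follow from treating cpu.shares as a relative weight. As a
     worked sketch, assume exactly two busy groups: the one being tuned and one
     other group left at the default of 1024. The expected CPU share is then
     mine / (mine + other). The helper name share_pct is made up for this
     illustration.

```shell
# Expected CPU percentage for a group with weight $1 that competes with one
# other busy group of weight $2 (default 1024), rounded to the nearest integer.
share_pct() {
    mine=$1
    other=${2:-1024}
    total=$((mine + other))
    echo $(( (mine * 100 + total / 2) / total ))
}

# Print the expected share for a few weights against the default group.
for s in 512 1024 1524 2048; do
    echo "cpu.shares=$s -> $(share_pct "$s")%"
done
```

     With more than two busy groups the denominator becomes the sum of all
     weights, so the same value of cpu.shares yields a smaller percentage.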


More cgroup Functionality to Learn

     The libcgroup1 package
     •   Basic user-space tools that simplify resource management
           –    uid, gid or exec rules for placement of a task
           –    /etc/init.d/cgconfig – sets up the cgroup file system based on
                /etc/cgconfig.conf
     •   UID/GID rules
           –    Managed in /etc/cgrules.conf by the root user
     •   EXEC rules
           –    Fully managed by each user in a config file in their home directory
     •   Methods used to place a task in the proper cgroup
           –    pam_cgroup (at login); cgexec (task start); cgclassify (task move)
           –    User-space daemon (cgred, see /etc/init.d and /etc/sysconfig)
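
     To give an idea of what /etc/cgconfig.conf holds, the sketch below prints a
     minimal configuration with one mount entry and one group. The mount point
     /dev/cgroup, the group name and the weight are illustrative assumptions,
     and render_cgconfig is a hypothetical helper, not a libcgroup tool.

```shell
# Print a minimal cgconfig.conf: mount the cpu controller and define one
# group with a CPU weight. All values are examples, not recommendations.
render_cgconfig() {
    group=$1
    shares=$2
    cat <<EOF
mount {
    cpu = /dev/cgroup;
}

group $group {
    cpu {
        cpu.shares = $shares;
    }
}
EOF
}

render_cgconfig priority 2048
```

     Once cgconfig has created such a group, a command could be started in it
     with cgexec, for example "cgexec -g cpu:priority stress -c 1", or an
     already running process moved there with cgclassify.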
Linux Containers
     LXC

     •   Builds upon cgroups and specific kernel settings;
         use “lxc-checkconfig” to check compliance
     •   Fully enabled in SUSE Linux Enterprise Server 11 SP1
     •   Basic functionality
              lxc-execute --name=NAME -- COMMAND
     •   Function overview
          –   lxc-start, lxc-execute / lxc-stop
          –   lxc-freeze / lxc-unfreeze
          –   Monitoring: lxc-ps, lxc-info, lxc-netstat, lxc-monitor
          –   Modifying cgroup parameters: lxc-cgroup


Part III:
Built-in Monitoring Capabilities
Monitoring Overview and Hands-on

     •   Low Level
          –   smartmontools - Monitor for S.M.A.R.T. Disks and Devices
          –   sensors - Hardware health monitoring for Linux
          –   iptraf - TCP/IP Network Monitor
          –   pcp - Performance Co-Pilot
              (system-level performance monitoring)
          –   sysstat - Sar and Iostat Commands for Linux
          –   perfmon
          –   blktrace, ltrace, strace
          –   systemtap - Instrumentation System
     •   High Level
          –   argus – network auditing tool
          –   nagios

Unpublished Work of Novell, Inc. All Rights Reserved.
This work is an unpublished work and contains confidential, proprietary, and trade secret information of Novell, Inc.
Access to this work is restricted to Novell employees who have a need to know to perform tasks within the scope
of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified,
translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of Novell, Inc.
Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.


General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a
product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in
making purchasing decisions. Novell, Inc. makes no representations or warranties with respect to the contents
of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any
particular purpose. The development, release, and timing of features or functionality described for Novell products
remains at the sole discretion of Novell. Further, Novell, Inc. reserves the right to revise this document and to
make changes to its content, at any time, without obligation to notify any person or entity of such revisions or
changes. All Novell marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc.
in the United States and other countries. All third-party trademarks are the property of their respective owners.

Perfomance Tuning, Monitoring, Management: Getting the Most from SUSE Linux Enterprise Server

  • 1. Performance Tuning, Monitoring, Management Getting the Most out of SUSE Linux Enterprise Server ® Matthias G. Eckermann Senior Product Manager SUSE Linux Enterprise mge@novell.com
  • 2. Agenda Performance Analysis and Tuning Kernel Resource Management with Control Groups Built-in Monitoring Capabilities 2 © Novell, Inc. All rights reserved.
  • 5. Hardware and Configuration Ultimately, hardware and its configuration set the upper limits for our tuning efforts. Are we starting with the best possible (minimum needed) hardware platform and components? – CPU speed only critical for compute-intense tasks – RAM (amount and speed) and interconnects do matter – Bottleneck I/O: network bandwidth, disk,... Is the hardware configuration appropriate? The weakest link kills performance! 5 © Novell, Inc. All rights reserved.
  • 6. (Hardware) Configuration Optimize storage configuration – Optimize distribution of data across controllers/disks – Swap to extra disk – Use RAID with striping Tune hardware setup (BIOS, EFI,...) – Only enable/proble what you have. – Tune for fast reboot vs. startup checks (if desired) – Carefully review all settings. Disable unneeded services # rc<SERVICE> stop 6 © Novell, Inc. All rights reserved.
  • 8. Where Has My Memory Gone!? Slab Cache – Structures of much less than one page in size – Generic slabs of predefined sizes (32, 64) plus slabs for specific data structures Page Cache – Pages with actual contents of files (or block device) usually the largest, by far Buffer Cache – File system metadata 8 © Novell, Inc. All rights reserved.
  • 9. Identifying Problems Start by finding the bottleneck: I/O, disk, mem,... iostat to identify overloaded drives – package syssat #iostat -x 1 vmstat for basic system usage # vmstat 1 r b swpd free buff cache si so bi bo in cs us sy id wa 0 0 76804 8268 14996 167396 1 1 36 64 132 197 4 1 92 3 0 0 76804 8268 14996 167396 0 0 0 0 1023 879 3 0 97 0 0 0 76804 8300 14996 167396 0 0 0 0 1158 1134 2 0 98 0 Slabtop for slab cache use 9 © Novell, Inc. All rights reserved.
  • 11. Picking a File System Pick the right file system for the task – Indexed metadata – File sizes – Number of files – Workloads (database, mail server,...) – AccessPaths – Dump/Restore 11 © Novell, Inc. All rights reserved.
  • 12. SUSE Linux Enterprise ® Filesystems Feature Ext 3 reiserfs XFS OCFS 2 btrfs Data/Metadata Journaling •/• ○/• ○/• ○/• N/A [3] Journal internal/external •/• •/• •/• •/○ N/A Offline extend/shrink •/• •/• ○/○ •/○ •/• Online extend/shrink •/○ •/○ •/○ •/○ •/• Inode-Allocation-Map table u. B*-tree B+-tree table B-tree Sparse Files • • • • • Tail Packing ○ • ○ ○ • Defrag ○ ○ • ○ • ExtAttr / ACLs •/• •/• •/• •/• •/• Quotas • • • • • Dump/Restore • ○ • ○ ○ Blocksize default 4KiB max. Filesystemsize [1] 16 TiB 16 TiB 8 EiB 16 TiB 16 EiB max. Filesize [1] 2 TiB 1 EiB 8 EiB 1 EiB 16 EiB Support Status SLES SLES SLES SLE HA Technology Preview SUSE® Linux Enterprise was the first enterprise Linux distribution to support journaling filesystems and logical volume managers back in 2000. Today, we have customers running XFS and ReiserFS with more than 8TiB in one filesystem, and the SUSE Linux Enterprise engineering team is using our 3 major Linux journaling filesystems for all their servers. We are excited to add the OCFS2 cluster filesystem to the range of supported filesystems in SUSE Linux Enterprise. For large-scale filesystems, for example for file serving (e.g., with with Samba, NFS, etc.), we recommend using XFS. (In this table "+" means "available/supported"; "-" is "unsupported") [1] The maximum file size above can be larger than the filesystem's actual size due to usage of sparse blocks. It should also be noted that unless a filesystem comes with large file support (LFS), the maximum file size on a 32-bit system is 2 GB (231 bytes). Currently all of our standard filesystems (including ext3 and ReiserFS) have LFS, which gives a maximum file size of 263 bytes in theory. The numbers given in the above tables assume that the filesystems are using 4 KiB block size. When using different block sizes, the results are different, but 4 KiB reflects the most common standard. 
[2] In this document: 1024 Bytes = 1 KiB; 1024 KiB = 1 MiB; 1024 MiB = 1 GiB; 1024 GiB = 1 TiB; 1024 TiB = 1 PiB; 1024 PiB = 1 EiB (see also http://physics.nist.gov/cuu/Units/binary.html ) [3] Btrf s is a copy -on-write logging-sty le f ile sy stem, so rather than needing to journal changes bef ore writing them in-place, it writes them in a new location, and then links it in. Until the last write, the new changes are not “committed.” 12 © Novell, Inc. All rights reserved.
  • 13. File Systems: ReiserFS Applications that use many small files – Mail servers – NFS servers – Database servers or other applications that use synchronous I/O 13 © Novell, Inc. All rights reserved.
  • 14. File Systems: Ext3 Default file system in SUSE Linux Enterprise 11 ® Best suited for – Small (<100GiB) file systems When using Ext3 with many files in one directory, consider enabling btree support (enabled by default in SUSE Linux Enterprise Server 11 SP 1) # mkfs.ext3 -O dir_index When using Ext3 with multiple threads appending to files in the same directory, consider turning preallocation on # mount -o reservation 14 © Novell, Inc. All rights reserved.
  • 15. File Systems: XFS Best suited for – Medium (>100GiB) to very large file systems (> 1 TiB) – Large files/many files – Streaming multimedia (low latencies) Special features and capabilities – dump/restore – online filesystem-check – online-defragmentation 15 © Novell, Inc. All rights reserved.
  • 16. Cluster File System: OCFS2 OCFS2 (Oracle Cluster File System) • Shared access by multiple nodes – Ensures data integrity in case of a node-failure – Scale-out for data access • Generic use – POSIX-compliant – Cluster-aware POSIX locking • Higher throughput – Parallel I/O • Disaster Tolerance – Integration with data replication for dual node 16 © Novell, Inc. All rights reserved.
  • 17. Filesystems: btrfs • Integrated Volume Management • Support for copy on write • Powerful snapshot capabilities • Scalability • Data integrity (checksums) • Full community support • Technology preview in SUSE Linux Enterprise Server 11 SP 1 ® 17 © Novell, Inc. All rights reserved.
  • 18. Barriers SUSE Linux Enterprise defaults to maximum data integrity ® guarantee by enforcing barriers from the file system so that reordering of journal writes cannot happen. This may cost some performance; tunable via mount option ReiserFS – enable with “barrier=flush” (default) – disable with “barrier=none” Ext3 – enable with “barrier=1” (default) – disable with “barrier=0” XFS – enable with “barrier” – disable with “nobarrier” 18 © Novell, Inc. All rights reserved.
  • 19. Logging Modes Journaling file systems offer different modes to write the actual data For ReiserFS and Ext3, mount option data=<X> – data=ordered: use barriers for data no risk exposing old data (default) – data=writeback: no barriers for data fastest in many workloads – journal: use journal for data generally slow, but can improve mail server workloads By default, SUSE Linux Enterprise Server ensures ® data integrity at the cost of some performance 19 © Novell, Inc. All rights reserved.
  • 20. Dedicated Logging Devices ReiserFS mkreiserfs -j /dev/xxx -s 8193 /dev/xxy reiserfstune –journal-new-device /dev/xxx -s 8193 Ext3 mke2fs -O journal_dev /dev/xxx mke2fs -j -J device=/dev/xxx,size=8193 /dev/xxy tune2fs -J device=/dev/xxx,size=8193 /dev/xxy XFS mkfs.xfs -l logdev=/dev/xxx,size=10000b /dev/xxy 20 © Novell, Inc. All rights reserved.
  • 21. File System Tuning Split file systems based on data access patterns – Keep commit heavy data away from data that does not have to be synchronous – Keep streaming writes and reads on different spindles than random I/O Consider disabling atime updates on files and directories # mount -o noatime,nodiratime 21 © Novell, Inc. All rights reserved.
  • 22. File System Tuning Optimize directory layout for the file system – Keep data that will be accessed together in the same subdirectories – Spread data out into different subdirectories to increase large file concurrency – Different file systems order directories differently 22 © Novell, Inc. All rights reserved.
  • 24. I/O Scheduler Flexible, pluggable I/O scheduler Selectable via boot parameter elevator=<X> – noop – deadline – as (default in mainline kernels) – cfq (default in SUSE Linux Enterprise) ® I/O Scheduler per device – Check /sys/block/*DEV*/queue/iosched – Set echo SCHEDNAME > /sys/block/*DEV*/queue/scheduler 24 © Novell, Inc. All rights reserved.
  • 25. I/O Scheduler: Noop No reordering, just merging Best for storage with extensive caching and scheduling of its own, such as: MultiPathing Activated by boot parameter elevator=noop 25 © Novell, Inc. All rights reserved.
  • 26. I/O Scheduler: Deadline Per-request service deadline – Caps maximum latency per request – Maintains good disk throughput Best for disk-intensive database applications Activated by boot parameter elevator=deadline 26 © Novell, Inc. All rights reserved.
  • 27. I/O Scheduler: Anticipatory Similar to “deadline”, but anticipates reads by putting them in front of the queue and delays a few ms after every read – Maximizes throughput – At the cost of increasing latency Best for file servers and desktop workloads with single IDE/SATA disks. Default in mainline kernels Activated by boot parameter elevator=as 27 © Novell, Inc. All rights reserved.
  • 28. I/O Scheduler: CFQ Complete Fair Queuing Treat all competing processes equally by keeping a unique request queue for each and giving equal bandwidth to each queue – Good compromise between throughput and latency – Minimal worst case latency on all reads and writes Suitable for a wide variety of applications Default in SUSE Linux Enterprise ® Activated by boot parameter elevator=cfq 28 © Novell, Inc. All rights reserved.
  • 29. Block Layer Tuning Spreading the load across controllers – Per-target locking for SCSI – Software RAID bandwidth Battery backed caching 29 © Novell, Inc. All rights reserved.
  • 30. Blocker Layer Tunables Block read ahead buffer /sys/block/<sdX/hdX>/queue/read_ahead_kb Default is 128. Increase to 512 for fast storage (SCSI disks or RAID) May speed up streaming reads a lot Number of requests /sys/block/<sdX/hdX>/queue/nr_requests Default is 128. Increase to 256 with CFQ scheduler for fast storage Increases throughput at minor latency expense 30 © Novell, Inc. All rights reserved.
  • 32. Buffer Flushing
    How dirty pages are written to disk can be tuned by
      – /proc/sys/vm/dirty_ratio (40%)
        Above this level, the process generating dirty data starts writeback itself.
      – /proc/sys/vm/dirty_background_ratio (10%)
        Above this level, background writeback starts.
      – /proc/sys/vm/dirty_expire_centisecs (3000)
        How long may dirty pages remain dirty?
      – /proc/sys/vm/dirty_writeback_centisecs (500)
        How often does the flusher thread (pdflush) wake up?
    Defaults are pretty high, which is good for databases (but may result in lots of unreclaimable page cache)
    For other workloads (HPC) you may want to lower these
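To see why the defaults count as high, it helps to turn the percentages into megabytes. The sketch below is a rough approximation only: the kernel applies the ratios to dirtyable memory (free plus reclaimable pages), not to total RAM, and the 8192 MB figure and the function name dirty_limit_mb are assumptions made for this example.

```shell
# Approximate a dirty-page threshold in MB from a memory size and a ratio.
# Usage: dirty_limit_mb MEMORY_MB RATIO_PERCENT (integer arithmetic, rounds down)
dirty_limit_mb() {
    echo $(( $1 * $2 / 100 ))
}

dirty_limit_mb 8192 40   # dirty_ratio default: prints 3276
dirty_limit_mb 8192 10   # dirty_background_ratio default: prints 819
```

On such a machine several gigabytes of unwritten data can accumulate before writers are throttled, which is the reason given above for lowering the values on some workloads.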
  • 33. VM: Swappiness
    The threshold at which processes are swapped out can be tuned via
      – /proc/sys/vm/swappiness
    Default is 60, which works well if you want to swap out daemons or programs that have not done much lately
    Higher values will provide more buffer/page cache; lower values will wait longer before swapping out idle processes
  • 34. NUMA (1)
    NUMA = Non-uniform Memory Architecture
    SUSE Linux Enterprise detects and uses NUMA topology and automatically
      – Prefers memory that is local to a node;
      – Evenly balances system data across nodes;
      – Gracefully handles CPU-less nodes; etc.
    Applications can (and should) also optimize for NUMA topology!
  • 35. NUMA (2)
    The NUMA system can be tuned via `numactl $CMD`; the settings then apply to $CMD and all of its children
      – --preferred=255
      – --membind=!0-1
      – --cpunodebind=2-5
      – --physcpubind=2-5
      – --localalloc (always allocate from current node)
    Node 0 may be the most contended, so avoid it
  • 37. Binding Processes/Interrupts to CPUs
    Problem: context switching costs
    CPU affinity: binding a process to specific CPUs can improve performance
      – taskset 0x3 command (start a command), or taskset -p 0x3 pid (change a running process)
        The mask is a bitmap with CPUs numbered from 0: in this example, 0x3 refers to CPUs 0 and 1; 0x6 would be CPUs 1 and 2.
    Bind interrupts to CPUs
      – cat /proc/interrupts
      – echo 0x3 > /proc/irq/0/smp_affinity
      – Example: distribute NICs among CPUs.
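Affinity masks are easy to misread, so here is a small sketch that expands a mask into the CPU numbers it selects. It counts CPUs from 0, as taskset does, with bit 0 standing for CPU 0. The function name mask_to_cpus is a hypothetical helper for illustration.

```shell
# Expand an affinity bitmask (decimal or 0x-prefixed hex) into CPU numbers.
# Bit 0 is CPU 0, bit 1 is CPU 1, and so on.
mask_to_cpus() {
    mask=$(( $1 ))
    cpu=0
    list=""
    while [ "$mask" -gt 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            list="${list:+$list }$cpu"
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    printf '%s\n' "$list"
}

mask_to_cpus 0x3   # prints: 0 1
mask_to_cpus 0x6   # prints: 1 2
```

The same bit layout applies to the value written to /proc/irq/*/smp_affinity.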
  • 38. Network Improvements
    Gigabit Ethernet and 10 Gigabit Ethernet
      – Significant interrupt overhead reduction
      – Consider Jumbo Frames (larger than 1500 bytes)
      – # ifconfig <DEV> mtu 9000
    NFS modes
      – TCP (default) vs UDP
      – NFSv3 (default) vs NFSv4
      – rsize=<X>/wsize=<X>: read/write in chunks of <X> bytes; default is 1024, use 8192 for higher throughput
  • 40. Async I/O, O_DIRECT
    Asynchronous I/O
      – Specific model for concurrency
      – Heavily used by databases
    Direct I/O (O_DIRECT) on block devices or files
      – Databases like to use raw disks. Historically /dev/raw was used, but O_DIRECT performs better.
      – Files should be preallocated (no holes, no appending); the system falls back to buffered I/O otherwise.
    In both cases: the page cache is not polluted
    Not specific to database workloads!
  • 41. Part II: Kernel Resource Management with Control Groups
  • 42. Control Groups
    • Understanding control groups: an in-depth overview
      – What Control Groups are designed to do
      – How Control Groups work
    • Using control groups in SUSE Linux Enterprise Server 11
      – Understanding the components
  • 43. Understanding Control Groups An In-depth Overview
  • 44. What Are Control Groups?
    Control Groups provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behavior.
      – cgroup is another name for Control Groups
      – Partition tasks (processes) into one or more groups arranged in tree hierarchies
      – Associate a set of tasks in a group with a set of subsystem parameters
      – Subsystems provide the parameters that can be assigned
      – Tasks are affected by the assigned parameters
  • 45. Example of the Capabilities of a cgroup
    Consider a large university server with various users: students, professors, system tasks, etc. The resource planning for this server could be along the following lines:
    CPUs
      – Top cpuset (20%)
      – CPUSet1 (Profs): 60%
      – CPUSet2 (Students): 20%
    Memory
      – Professors = 50%
      – Students = 30%
      – System = 20%
    Disk I/O
      – Professors = 50%
      – Students = 30%
      – System = 20%
    Network I/O
      – WWW browsing = 20%, split into Prof (15%) and Students (5%)
      – Network File System (60%)
      – Others (20%)
    Source: /usr/src/linux/Documentation/cgroups/cgroups.txt
  • 46. Control Group Subsystems
    Two types of subsystems
    • Isolation and special controls
      – cpuset, namespace, freezer, device, checkpoint/restart
    • Resource control
      – cpu (scheduler), memory, disk I/O, network
    Each subsystem can be mounted independently
      – mount -t cgroup -o cpu none /cpu
      – mount -t cgroup -o cpuset none /cpuset
    or all at once
      – mount -t cgroup none /cgroup
    Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
  • 47. Cpuset Subsystem (Isolation)
    Cpuset is for tying processes to CPUs and memory.
    [Figure: process groups A1, A2, and B mapped onto CPUs and memory]
    Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
  • 48. Namespace Subsystem (Isolation)
    Namespace is for showing a private view of the system to the processes in a cgroup. Mainly used for OS-level virtualization. The subsystem itself has no special functions and just tracks changes in namespaces.
    Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
  • 49. Freezer Subsystem (Control)
    Freezer cgroup is for freezing (stopping) all tasks in a group.
        mount -t cgroup none /freezer -o freezer
        ....put task into /freezer/tasks...
        echo FROZEN > /freezer/freezer.state
        echo RUNNING > /freezer/freezer.state
    Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
  • 50. Device Subsystem (Isolation)
    A system administrator can provide a list of devices that can be accessed by processes under a cgroup
      – Allow/Deny rule
      – Allow/Deny: READ/WRITE/MKNOD
    Limits access to a device, or to a file system on a device, to the tasks in the specified cgroup
    Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
  • 51. Checkpoint/Restart Subsystem (Control)
    • Save the status of all processes in a cgroup to a dump file and restart them later (or just save the state and continue)
    • Allows a “saved container” to be moved between physical machines (as a VM can be)
    • Dump the images of all processes to a file
    Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
  • 52. CPU Subsystem (Resource Control)
    • Share CPU bandwidth between groups using the group scheduling function of CFS (the scheduler)
    • Mechanically complicated
    [Figure: three groups with Share = 2000, Share = 1000, and Share = 4000]
  • 53. Memory Subsystem (Resource Control)
    • For limiting memory usage of user space processes
    • Limit LRU (Least Recently Used) pages
      – Anonymous and file cache
    • No limits for kernel memory
      – Maybe in another subsystem if needed
    Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
  • 54. Disk I/O Subsystem (Resource Control) (Draft)
    • 3 proposals are currently being discussed
      – dm-ioband, io-throttle, io-controller
    • Consensus has not been reached, but io-controller seems to be taking the lead
      – Both dm-ioband and io-throttle suffer from a significant problem: they can defeat the policies (such as I/O priority) being implemented by the I/O scheduler.
      – io-controller does bandwidth control at the I/O scheduler level
      – Designed to work with the mainline I/O schedulers (CFQ, deadline, anticipatory, and noop), but requires significant changes
      – Currently v4 as of June 8, 2009 and based on the 2.6.30-rc8 kernel
    Source: http://jp.linuxfoundation.org/jp_uploads/seminar20081119/CgroupMemcgMaster.pdf
    Source: http://lwn.net/Articles/331857/
    Source: http://lwn.net/Articles/332839/
    Source: http://lkml.org/lkml/2009/6/8/580
  • 55. Network I/O Subsystem (Resource Control) (Draft)
    • Like the Disk I/O subsystem, it seems the jury is still out on the implementation of this subsystem
    • Kernel developers are talking about traffic control
      – cgroup_tc: this patch provides a simple resource controller which uses traffic control (tc) features already in the Linux kernel
      – Not much discussion on this topic since late 2008
    Source: http://lkml.org/lkml/2008/7/22/361
    Source: https://lists.linux-foundation.org/pipermail/containers/2008-August/012419.html
    Source: https://lists.linux-foundation.org/pipermail/containers/2008-August/012512.html
  • 56. Reading More on cgroups
    Remember to install kernel source!!
      – /usr/src/linux/Documentation/cgroups/cgroups.txt
      – /usr/src/linux/Documentation/cpusets.txt
      – /usr/src/linux/Documentation/controllers/*
      – /usr/src/linux/Documentation/scheduler/sched-design-CFS.txt
      – /usr/src/linux/Documentation/kernel-parameters.txt
    Additional RPM packages
      – libcgroup1: /usr/share/doc/packages/libcgroup1/README*
      – cpuset (Alex Tsariounov): /usr/share/doc/packages/cpuset/cset*.txt
  • 57. Reading More on cgroups (continued)
    Manpages
      – man cpuset
      – man cset
    On the web
      – http://lkml.org/lkml/2009/2/9/372
      – http://lkml.org/lkml/2009/2/10/140
      – http://lkml.org/lkml/2008/1/29/60
      – http://kerneltrap.org/mailarchive/linux-kernel/2008/6/18/2161114/thread
  • 58. Using Control Groups in SUSE Linux Enterprise Server 11
  • 59. Preparing SUSE Linux Enterprise Server 11
    • Start with a patched SLES 11 install
    • Add the following packages
      – libcgroup1
      – cpuset
      – libcpuset1
      – kernel-source (for documentation purposes)
      – gcc (needed to compile the stress tool)
  • 60. What Subsystems Are Available?
    One way to find out:
      – mount -t cgroup none /cgroup
      – cat /proc/mounts
    Current subsystems in SUSE Linux Enterprise Server 11
      – rw,freezer,devices,cpuacct,cpu,ns,cpuset
      – memory: disabled by default; to enable it, add the kernel parameter cgroup_enable=memory
    Possible future subsystems in SLES 11
      – Disk and network subsystem controllers
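The mount options of a cgroup mount name the subsystems attached to it, so filtering /proc/mounts by file system type answers the question directly. The sketch below does that with awk; the function name cgroup_mount_options is a hypothetical helper, and the sample line is modelled on the subsystem list quoted on the slide.

```shell
# Print mount point and options for every cgroup mount.
# Reads the files given as arguments, or standard input if there are none.
cgroup_mount_options() {
    awk '$3 == "cgroup" { print $2, $4 }' "$@"
}

# Example with a literal line in /proc/mounts format:
printf '%s\n' 'none /cgroup cgroup rw,freezer,devices,cpuacct,cpu,ns,cpuset 0 0' |
    cgroup_mount_options

# On a live system:
#   cgroup_mount_options /proc/mounts
#   cat /proc/cgroups     # also lists the subsystems known to the kernel
```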
  • 61. Generating Load on SUSE Linux Enterprise Server 11
    Search the Web for “linux load generator”. Results:
    • http://devin.com/lookbusy/
    • http://www.ibm.com/developerworks/linux/library/l-stress/index.html
      – Good article
    • http://ltp.sourceforge.net/
      – Powerful toolkit for Linux developers
    • http://hardware.slashdot.org/article.pl?sid=05/04/06/218233
      – Simple scripting examples
    • http://weather.ou.edu/~apw/projects/stress/
      – Probably the best choice
      – Also available (community driven) at: http://software.opensuse.org/search?baseproject=SUSE:SLE-11&p=1&q=stress
  • 62. Crash Course on CPUSETs: The Hard Way
    • Determine the number of CPUs and memory nodes
      – Look at /proc/cpuinfo and /proc/zoneinfo
    • Creating the CPUSET hierarchy
        mkdir /dev/cpuset
        mount -t cpuset cpuset /dev/cpuset
        cd /dev/cpuset
        mkdir Charlie
        cd Charlie
        /bin/echo 2-3 > cpus
        /bin/echo 1 > mems
        /bin/echo $$ > tasks
        # The current shell is now running in cpuset Charlie
        # The next line should display '/Charlie'
        cat /proc/self/cpuset
    • Removing the CPUSET
        cat /dev/cpuset/Charlie/tasks    (move any remaining tasks!!)
        rmdir /dev/cpuset/Charlie
  • 63. Crash Course on CPUSETs: The Easy Way (thanks to Alex Tsariounov of Novell)
    • Determine the number of CPUs and memory nodes
      – cset set --list
    • Creating the CPUSET hierarchy
      – cset set --cpu=2-3 --mem=1 --set=Charlie
    • Starting processes in a CPUSET
      – cset proc --set Charlie --exec -- stress -c 1 &
    • Moving existing processes to a CPUSET
      – cset proc --move --pid PID --toset=Charlie
    • Listing tasks in a CPUSET
      – cset proc --list --set Charlie
    • Removing a CPUSET
      – cset set --destroy Charlie
  • 64. Follow It Up with cgroups: The Hard Way
    • Creating the cgroup hierarchy
        mkdir /dev/cgroup
        mount -t cgroup cgroup /dev/cgroup
        cd /dev/cgroup
        mkdir priority
        cd priority
        cat cpu.shares
    • Understanding cpu.shares (percentages assume one competing group left at the default of 1024, with both groups busy)
      – 1024 is the default (more in sched-design-CFS.txt) = 50% utilization
      – 1524 = 60% utilization
      – 2048 = 67% utilization
      – 512 = 33% utilization
    • Changing cpu.shares
      – /bin/echo 1024 > cpu.shares
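A cpu.shares value only has meaning relative to the shares of the groups it competes with: a busy group receives its own shares divided by the sum of all busy groups' shares. The sketch below works through that arithmetic, assuming a single competing group left at the default of 1024. The function name share_percent is a hypothetical helper, and results are rounded to whole percent.

```shell
# Expected CPU percentage for a group, given its shares and the shares of
# every competing busy group. Usage: share_percent MINE OTHER [OTHER...]
share_percent() {
    mine=$1
    shift
    total=$mine
    for other in "$@"; do
        total=$(( total + other ))
    done
    # Adding total/2 before dividing rounds to the nearest integer.
    echo $(( (mine * 100 + total / 2) / total ))
}

share_percent 1024 1024   # prints 50
share_percent 1524 1024   # prints 60
share_percent 2048 1024   # prints 67
share_percent 512 1024    # prints 33
```

With more than two busy groups the same value yields a smaller percentage, for example share_percent 1024 1024 1024 gives 33.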
  • 65. More cgroup Functionality to Learn: The libcgroup1 Package
    • Basic tools in user space to simplify resource management functionality
      – uid, gid or exec rules for placement of a task
      – /etc/init.d/cgconfig: sets up the cgroup filesystem based on /etc/cgconfig.conf
    • UID/GID rules
      – Managed in /etc/cgrules.conf by the root user
    • EXEC rules
      – Fully managed by a user in a config file in their home directory
    • Methods used to place a task in the proper cgroup
      – pam_cgroup (at login); cgexec (task start); cgclassify (task move)
      – User space daemon (cgred in /etc/init.d and /etc/sysconfig)
  • 66. Linux Containers (LXC)
    • Builds upon cgroups and specific kernel settings; use “lxc-checkconfig” to check compliance
    • Fully enabled in SUSE Linux Enterprise Server 11 SP1
    • Basic functionality
        lxc-execute --name=NAME -- COMMAND
    • Function overview
      – lxc-start, lxc-execute / lxc-stop
      – lxc-freeze / lxc-unfreeze
      – Monitoring: lxc-ps, lxc-info, lxc-netstat, lxc-monitor
      – Modifying cgroup parameters: lxc-cgroup
  • 68. Monitoring Overview and Hands-on
    • Low level
      – smartmontools: monitor for S.M.A.R.T. disks and devices
      – sensors: hardware health monitoring for Linux
      – iptraf: TCP/IP network monitor
      – pcp: Performance Co-Pilot (system-level performance monitoring)
      – sysstat: sar and iostat commands for Linux
      – perfmon
      – blktrace, ltrace, strace
      – systemtap: instrumentation system
    • High level
      – argus: network auditing tool
      – nagios
  • 70. Unpublished Work of Novell, Inc. All Rights Reserved. This work is an unpublished work and contains confidential, proprietary, and trade secret information of Novell, Inc. Access to this work is restricted to Novell employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of Novell, Inc. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability. General Disclaimer This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. Novell, Inc. makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for Novell products remains at the sole discretion of Novell. Further, Novell, Inc. reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All Novell marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.