Advanced Performance Tuning for HP-UX


HP-UX Advanced Performance Tuning Class


Module 1 Hardware

1) The limitations of the hardware
   a) CPU speed and quantity
   b) Amount of physical memory (RAM)
   c) Amount of virtual memory and configuration (swap)
   d) Disk type and configuration
   e) Type of bus architecture and configuration

Module 2 CPU
a) What tools are there to measure CPU load?
b) CPU and process management
c) Scalability

Module 3 Process Management


a) Process creation
b) Process execution
c) Process Termination
d) Kernel Threads
e) Process Resource Manager

Module 4 Memory

a) Memory ranges for HP systems
b) Configurable memory parameters
c) The system memory map
d) The McKusick & Karels memory allocator
e) The Arena Allocator
f) Performance Optimized Page Sizing

Module 5 Disk I/O


a) Tools to measure disk I/O
b) Factors that affect disk I/O
c) Data configurations and their effects

Module 6 Network Performance


a) NFS performance
b) Fibre Channel performance
c) Fibre Channel Mass Storage performance

Module 7 General Tools


a) ADB - the absolute debugger
b) SAR - the System Activity Reporter
c) iostat, vmstat
d) time, timex, ipcs
e) Glance

Module 8 WTEC Tools


a) kmeminfo
b) shminfo
c) vmtrace
d) tusc

Module 1 HARDWARE
It is essential to determine the hardware configuration and limitations to set reasonable expectations of overall performance.

Information about current HP-UX server configurations is available at: http://www.hp.com/products1/servers/operating/hpux.html

CPU
The current range of HP servers extends from the rp2405, with one or two 650 MHz PA-8700 CPUs and 256 MB to 8 GB of RAM, through the Superdome, with up to 64 875 MHz PA-8700 CPUs and 256 GB of RAM. The current range of workstations extends from the b2600, with a single 500 MHz PA-8600 CPU and up to 4 GB of RAM, through the j6750, with dual 875 MHz PA-8700 CPUs and up to 16 GB of RAM. There are many legacy systems that run on CPUs as slow as 96 MHz and with as little as 128 MB of RAM. The last of the 32-bit servers ran at a maximum processor speed of 240 MHz. 10.X systems are OS-limited to 3.75 GB of RAM.

SAM can be used to determine system properties: look under Performance Monitors -> System Properties. The available categories are Processor, Memory, Operating System, Network, and Dynamic. System properties can also be accessed via the adb command from the command line:

To determine processor speed:

   # echo itick_per_usec/D | adb -k /stand/vmunix /dev/mem
   itick_per_usec:
   itick_per_usec:                 650

The itick_per_usec value equals the processor speed in MHz.

RAM
Current HP servers range from 256 MB to 256 GB of RAM. For the 32-bit architecture, the maximum amount of usable RAM is 4 GB, and any 32-bit process is limited to a 4 GB memory map. A 64-bit system has a 4 TB limit on addressable space. To determine physical memory:

   # echo phys_mem_pages/D | adb64 /stand/vmunix /dev/mem
   phys_mem_pages:
   phys_mem_pages:                 262144

The result is expressed in 4 KB memory pages. To determine the size in megabytes: phys_mem_pages x 4096 / 1024 / 1024. NOTE: If Performance Optimized Page Sizing is implemented, be sure to adjust your calculations accordingly; more about this topic will be discussed later.
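As a convenience, the page count can be converted to megabytes in one step. The following is a minimal sketch only; it assumes the adb output format shown above (two lines, with the value in the second field of the second line) and the default 4 KB page size:

   #!/usr/bin/sh
   # Sketch: report physical memory in MB from phys_mem_pages (assumes 4 KB pages).
   echo phys_mem_pages/D | adb64 /stand/vmunix /dev/mem |
   awk '/^phys_mem_pages:[ \t]+[0-9]+/ {
           pages = $2
           printf "Physical memory: %d MB\n", pages * 4096 / 1024 / 1024
        }'

For the example output above, 262144 pages x 4096 bytes = 1024 MB.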


Disk Type and configuration.


The type and configuration of the system's disks is a key factor in determining the I/O speed the system is capable of. Standard disks will require more RAM for buffering file system writes; most modern disk arrays have onboard memory for this. The number of controllers and the amount of read and write transactions on the disks are also major factors in overall performance. To determine specific information on the type of disks on the system run:

   ioscan -fnC disk

then

   diskinfo -v /dev/rdsk/cXtXdX

These commands will let you determine the size, type and hardware path of the system disks. The type of bus is also a determining factor. The maximum data transfer rate depends on the adapter type. Currently the fastest is the A6829A PCI Dual-Channel Ultra160 SCSI adapter. The firmware-suggested default for the A6829A adapter's maximum data transfer rate is the adapter's maximum speed (160 MB/s). The A6829A can communicate with all LVD or SE devices that have speeds up to 160 MB/s. This includes the following speeds (synchronous communication over a Wide bus):

   Fast     (20 MB/s)
   Ultra    (40 MB/s)
   Ultra2   (80 MB/s)
   Ultra160 (160 MB/s)

Note that the actual transfer rate between the adapter and a SCSI device depends on the transfer rate that was negotiated between them; the maximum speed will be that of the slowest device. As of 11i the SCSI interface drivers for HP-PB, HSC, and EISA SCSI cards have been obsoleted.
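To survey every disk at once, the two commands can be combined. This is a minimal sketch, assuming the ioscan -fnC disk output lists raw device files under /dev/rdsk on the indented device-file lines (naming can differ with disk arrays or alternate links):

   #!/usr/bin/sh
   # Sketch: run diskinfo against every raw disk device reported by ioscan.
   for rdev in $(ioscan -fnC disk |
                 awk '{ for (i = 1; i <= NF; i++)
                            if ($i ~ /^\/dev\/rdsk\//) print $i }')
   do
       echo "=== $rdev ==="
       diskinfo -v "$rdev"
   done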

Buses on HP-UX Servers:


PCI - This bus is spec'd at a peak of 120 MB/s. Because of the excellent performance of this bus, it is possible to have multiple high-speed SCSI and/or network cards installed on a single bus.

HSC - This bus is spec'd at 132 MB/s. Again, because of the speed of this bus, it is possible to have multiple high-speed SCSI and/or network cards installed on a single bus.

HP-PB (NIO) - The HP-PB (NIO) system bus, found on many HP servers including the T500 and H Class servers, is spec'd at 32 MB/s. Realistic performance numbers for this bus are ~10 MB/s.


Module 2 CPU

What tools can be used to determine CPU metrics? GlancePlus, top, ps -elf, sar, vmstat, SAM.

To get an idea of which processes are most CPU intensive, you can use SAM's Performance Monitors, which invokes top; top can also be run directly from the command line. Alternately, use GlancePlus or ps -elf to see which processes have the highest cumulative CPU time.

SAR CPU data

The system activity reporter contains a number of useful CPU statistics. Example:

   sar -Mu 5 100

This will produce 100 data points 5 seconds apart. The output will look similar to:

   11:20:05     cpu    %usr    %sys    %wio    %idle
   11:20:10       0       1       1       0       99
                  1      17      83       0        0
             system       9      42       0       49

After all samples are taken, an average is printed. This will return data on the CPU load for each processor:

cpu   - CPU number (only on a multi-processor system and used with the -M option)
%usr  - The percentage of time spent executing code in user mode, as opposed to code within the kernel
%sys  - The percentage of time running in system or kernel mode
%wio  - Idle with some process waiting for I/O (only block I/O, raw I/O, or virtual memory pageins/swapins indicated)


%idle - Otherwise idle

To find out what the run queue load is, run:

   sar -q 5 100

This will produce 100 data points 5 seconds apart. The output will look similar to:

               runq-sz  %runocc  swpq-sz  %swpocc
   10:06:36        0.0        0      0.0        0
   10:06:41        1.5       40      0.0        0
   10:06:46        3.0       20      0.0        0
   10:06:51        1.0       20      0.0        0
   Average         1.8       16      0.0        0

runq-sz - Average length of the run queue(s) of processes (in memory and runnable)
%runocc - The percentage of time the run queue(s) were occupied by processes (in memory and runnable)
swpq-sz - Average length of the swap queue of runnable processes (processes swapped out but ready to run)

These CPU reports can be combined using sar -Muq.

Typically the %usr value will be higher than %sys. If the system is making many read/write transactions this may not be true, since reads and writes are system calls. Out-of-memory errors can occur when excessive CPU time is given to system processes versus user processes; these can also be caused when maxdsiz is undersized. As a rule, we should expect to see %usr at 80% or less, and %sys at 50% or less. Values higher than these can indicate a CPU bottleneck. The %wio should ideally be 0%; values less than 15% are acceptable. A low %idle over short periods of time is not a major concern. This is the percentage of time that the CPU is not running processes. However, low %idle over a sustained period could be an indication of a CPU bottleneck. If the %wio is greater than 15% and %idle is low, consider the size of the run queue (runq-sz). Ideally we would like to see values less than 4. If the runq-sz is high and the %wio is 0, then there is no bottleneck; this is usually a case of many small processes running that do not overload the processors. If the system is a single-processor system under heavy load, the CPU bottleneck may be unavoidable.
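As a quick screening aid, the rule-of-thumb thresholds above can be checked automatically. The following is a minimal sketch, assuming the sar -u Average line has the layout shown earlier (%usr, %sys, %wio, %idle in fields 2-5); the thresholds are the guideline values from this section, not hard limits:

   #!/usr/bin/sh
   # Sketch: flag possible CPU bottlenecks from the sar -u averages.
   sar -u 5 12 | awk '
       $1 == "Average" {
           usr = $2; sys = $3; wio = $4
           if (usr > 80) print "WARNING: %usr above 80% (" usr ")"
           if (sys > 50) print "WARNING: %sys above 50% (" sys ")"
           if (wio > 15) print "WARNING: %wio above 15% (" wio ")"
       }'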

Other metrics to consider are:

Nice Utilization - This is the percentage of CPU time spent running user processes with nice values of 21-39. This is typically included in user CPU utilization, but some tools, such as Glance, track this separately to determine how much CPU time is being spent on lower-priority processes.

System Call Rate - The system call rate is the rate at which system calls are being generated by user processes. Every system call results in a switch between user and system (kernel) mode. A high system call rate typically correlates to high system CPU utilization.

Context Switch Rate - This is the number of times a CPU switches processes on average per second. This is typically included in the system CPU rate, but tools such as Glance can track it separately. Context switches occur based on the priority of the processes in the run queue and the time set in the kernel by the parameter timeslice, which by default is 100 milliseconds (timeslice = 10).

Using Glance for CPU metrics

Glance allows for a more in-depth look at CPU statistics. The Glance CPU report consists of two pages:

Page 1

   CPU REPORT                                                  Users= 5
   State           Current   Average    High     Time    Cum Time
   ------------------------------------------------------------------
   User                1.5       3.2     4.3     0.08        0.51
   Nice                0.0       0.1     0.2     0.00        0.02
   Negative Nice       1.1       1.9    16.0     0.06        0.30
   RealTime            1.1       0.5     1.1     0.06        0.08
   System              2.3       2.8     4.0     0.12        0.44
   Interrupt           0.8       0.7     0.8     0.04        0.11
   ContextSwitch       0.6       0.6     1.3     0.03        0.10
   Traps               0.0       0.0     0.0     0.00        0.00
   Vfaults             0.0       0.1     1.3     0.00        0.01
   Idle               92.6      90.0    92.6     4.87       14.18

   Top CPU user: PID 2206, scopeux                    1.0% cpu util

Page 2

   CPU REPORT                                                  Users= 5
   State           Current   Cumulative    High
   ------------------------------------------------------------------
   Load Average        4.4          4.5     4.5
   Syscall Rate     1209.5       1320.0  1942.8
   Intrpt Rate       412.8        380.1   412.8
   CSwitch Rate      359.2        355.1   360.7

   Top CPU user: PID 5916, glance                     2.4% cpu util

The CPU Report screen shows global processor time allocation for different activities such as User, Nice, Real-time, System, Interrupt, Context Switch and Idle. Several values are given for each activity. On multi-processor systems, the values represent the average over all CPUs. Thus the percentage columns never exceed 100. For individual processor detail, use the 'a' (CPU By Processor) screen. For each of the activities, the Current column displays the percentage of CPU time devoted to this activity during the last interval. The Average column shows the average percentage of time spent in this activity since data collection was started or the statistics were reset using the 'z' (zero) command. The High column shows the highest percentage ("high water mark") for this activity over all intervals. The Time column displays the amount of time spent in this activity during the last interval (displayed as a percentage in the Current column). The Cum Time column stores the cumulative total CPU time allocated to this activity since the start of data collection. The final entry indicates the current highest CPU consumer process on the system.


The CPU report's second screen, accessed by hitting the + key, shows the Load Average (run queue) and event rate statistics for System Calls, Interrupts and Context Switches. For each event, current, cumulative and high rates are shown. The final entry indicates the current highest CPU consumer process on the system.

Glance Metrics for CPU

Common metric parameters: The cumulative collection times are defined from the point in time when either a) the process or kernel thread was first started, b) the performance tool was first started, or c) the cumulative counters were reset (relevant only to GlancePlus), whichever occurred last. On systems with multiple CPUs, these metrics are normalized. That is, the CPU used over all processors is divided by the number of processors online; this represents the usage of the total processing capacity available.

* GBL_CPU_NORMAL_UTIL The percentage of time that the CPU was in user mode at normal priority during the interval. Normal priority user mode CPU excludes CPU used at real-time and nice priorities.

* GBL_CPU_NORMAL_UTIL_CUM The percentage of time that the CPU was in user mode at normal priority over the cumulative collection time. Normal priority user mode CPU excludes CPU used at real-time and nice priorities.

* GBL_CPU_NORMAL_UTIL_HIGH The highest percentage of time that the CPU was in user mode at normal priority during any one interval over the cumulative collection time. Normal priority user mode CPU excludes CPU used at real-time and nice priorities. * GBL_CPU_NORMAL_TIME The time, in seconds, that the CPU was in user mode at normal priority during the interval. Normal priority user mode CPU excludes CPU used at real-time and nice priorities.

* GBL_CPU_NORMAL_TIME_CUM The time, in seconds, that the CPU was in user mode at normal priority over the cumulative collection time. Normal priority user mode CPU excludes CPU used at real-time and nice priorities.

Nice common metric parameters: The NICE metrics include positive nice value CPU time only. Negative nice value CPU is broken out into NNICE (negative nice) metrics. Positive nice values range from 20 to 39. Negative nice values range from 0 to 19.

* GBL_CPU_NICE_UTIL


The percentage of time that the CPU was in user mode at a nice priority during the interval. * GBL_CPU_NICE_UTIL_CUM The percentage of time that the CPU was in user mode at a nice priority over the cumulative collection time. * GBL_CPU_NICE_UTIL_HIGH The highest percentage of time during any one interval that the CPU was in user mode at a nice priority over the cumulative collection time. * GBL_CPU_NICE_TIME The time, in seconds, that the CPU was in user mode at a nice priority during the interval. * GBL_CPU_NICE_TIME_CUM The time, in seconds, that the CPU was in user mode at a nice priority over the cumulative collection time. * GBL_CPU_NNICE_UTIL The percentage of time that the CPU was in user mode at a nice priority computed from processes with negative nice values during the interval.

* GBL_CPU_NNICE_UTIL_CUM The percentage of time that the CPU was in user mode at a nice priority computed from processes with negative nice values over the cumulative collection time. * GBL_CPU_NNICE_UTIL_HIGH The highest percentage of time during any one interval that the CPU was in user mode at a nice priority computed from processes with negative nice values over the cumulative collection time.

* GBL_CPU_NNICE_TIME The time, in seconds, that the CPU was in user mode at a nice priority computed from processes with negative nice values during the interval.

* GBL_CPU_NNICE_TIME_CUM The time, in seconds, that the CPU was in user mode at a nice priority computed from processes with negative nice values over the cumulative collection time.


Real Time

* GBL_CPU_REALTIME_UTIL The percentage of time that the CPU was in user mode at a realtime priority during the interval. Running at a realtime priority means that the process or kernel thread was run using the rtprio command or the rtprio system call to alter its priority. Realtime priorities range from zero to 127 and are absolute priorities, meaning the realtime process with the lowest priority value (the strongest priority) runs as long as it wants to. Since this can have a huge impact on the system, realtime CPU is tracked separately to make visible the effect of using realtime priorities.

* GBL_CPU_REALTIME_UTIL_CUM The percentage of time that the CPU was in user mode at a realtime priority over the cumulative collection time.

* GBL_CPU_REALTIME_UTIL_HIGH The highest percentage of time that the CPU was in user mode at a realtime priority during any one interval over the cumulative collection time. * GBL_CPU_REALTIME_TIME The time, in seconds, that the CPU was in user mode at a realtime priority during the interval.

* GBL_CPU_REALTIME_TIME_CUM The time, in seconds, that the CPU was in user mode at a realtime priority over the cumulative collection time.

System * GBL_CPU_SYSCALL_UTIL The percentage of time that the CPU was in system mode (excluding interrupt, context switch, trap, or vfault CPU) during the interval.

* GBL_CPU_SYSCALL_UTIL_CUM The percentage of time that the CPU was in system mode (excluding interrupt, context switch, trap, or vfault CPU) over the cumulative collection time. * GBL_CPU_SYSCALL_UTIL_HIGH The highest percentage of time during any one interval that the CPU was in system mode (excluding interrupt, context switch, trap, or vfault CPU) over the cumulative collection time.

* GBL_CPU_SYSCALL_TIME The time, in seconds, that the CPU was in system mode (excluding interrupt, context switch, trap, or vfault CPU) during the interval.

* GBL_CPU_SYSCALL_TIME_CUM


The time, in seconds, that the CPU was in system mode (excluding interrupt, context switch, trap, or vfault CPU) over the cumulative collection time.

Interrupt * GBL_CPU_INTERRUPT_UTIL The percentage of time that the CPU spent processing interrupts during the interval.

* GBL_CPU_INTERRUPT_UTIL_CUM The percentage of time that the CPU spent processing interrupts over the cumulative collection time.

* GBL_CPU_INTERRUPT_UTIL_HIGH The highest percentage of time that the CPU spent processing interrupts during any one interval over the cumulative collection time.

* GBL_CPU_INTERRUPT_TIME The time, in seconds, that the CPU spent processing interrupts during the interval.

* GBL_CPU_INTERRUPT_TIME_CUM The time, in seconds, that the CPU spent processing interrupts over the cumulative collection time.

ContextSwitch * GBL_CPU_CSWITCH_UTIL The percentage of time that the CPU spends context switching during the interval. This includes context switches that result in the execution of a different process and those caused by a process stopping, then resuming, with no other process running in the meantime.

* GBL_CPU_CSWITCH_UTIL_CUM The percentage of time that the CPU spent context switching over the cumulative collection time.

* GBL_CPU_CSWITCH_UTIL_HIGH The highest percentage of time during any one interval that the CPU spent context switching over the cumulative collection time. * GBL_CPU_CSWITCH_TIME The time, in seconds, that the CPU spent context switching during the interval.

* GBL_CPU_CSWITCH_TIME_CUM The time, in seconds, that the CPU spent context switching over the cumulative collection time.


Traps * GBL_CPU_TRAP_UTIL The percentage of time the CPU was executing trap handler code during the interval.

* GBL_CPU_TRAP_UTIL_CUM The percentage of time the CPU was in trap handler code over the cumulative collection time.

* GBL_CPU_TRAP_UTIL_HIGH The highest percentage of time during any one interval the CPU was in trap handler code over the cumulative collection time.

* GBL_CPU_TRAP_TIME The time the CPU was in trap handler code during the interval.

* GBL_CPU_TRAP_TIME_CUM The time, in seconds, the CPU was in trap handler code over the cumulative collection time.

Vfaults * GBL_CPU_VFAULT_UTIL The percentage of time the CPU was handling page faults during the interval. * GBL_CPU_VFAULT_UTIL_CUM The percentage of time the CPU was handling page faults over the cumulative collection time. * GBL_CPU_VFAULT_UTIL_HIGH The highest percentage of time during any one interval the CPU was handling page faults over the cumulative collection time.

* GBL_CPU_VFAULT_TIME The time, in seconds, the CPU was handling page faults during the interval. * GBL_CPU_VFAULT_TIME_CUM The time, in seconds, the CPU was handling page faults over the cumulative collection time.

Idle * GBL_CPU_IDLE_UTIL The percentage of time that the CPU was idle during the interval.

* GBL_CPU_IDLE_UTIL_CUM The percentage of time that the CPU was idle over the cumulative collection time.


* GBL_CPU_IDLE_UTIL_HIGH The highest percentage of time that the CPU was idle during any one interval over the cumulative collection time.

* GBL_CPU_IDLE_TIME The time, in seconds, that the CPU was idle during the interval. * GBL_CPU_IDLE_TIME_CUM The time, in seconds, that the CPU was idle over the cumulative collection time.

CPU process priority

Processes are assigned priority by the system in 3 categories:

   Real Time     0-127
   System Mode   128-177
   User Mode     178-255

While on the processor, by default a process's priority will age (its value goes up) within its range. The nice value determines how fast a process regains priority while waiting on the CPU. This aging can be defeated by implementing the sched_noage policy, which prevents processes from losing priority while on the CPU; caution should be used when implementing this.
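To see where processes currently fall within these ranges, the PRI and NI columns of ps -el can be inspected, and a CPU-heavy job can be started at a weaker timeshare priority with nice. A minimal sketch; the command name my_batch_job is a hypothetical placeholder:

   # Start a hypothetical batch job with its nice value raised by 10
   # (weaker timeshare priority).
   nice -10 my_batch_job &

   # Inspect the PRI and NI columns for the job.
   ps -el | grep my_batch_job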

Process Scheduling
To understand how the threads of a process run, we have to understand how they are scheduled. Although processes appear to the user to run simultaneously, in fact a single processor is executing only one thread of execution at any given moment. Several factors contribute to process scheduling:

- Kind of scheduling policy required -- timeshare or real-time. Scheduling policy governs how the process (or thread of execution) interacts with other processes (or threads of execution) at the same priority.
- Choice of scheduler. Four schedulers are available: the HP-UX timeshare scheduler (SCHED_HPUX), HP Process Resource Manager (a timeshare scheduler), the HP-UX real-time scheduler (HPUX_RTPRIO), and the POSIX-compliant real-time scheduler.
- Priority of the process. Priority denotes the relative importance of the process or thread of execution.
- Run queues from which the process is scheduled.
- Kernel routines that schedule the process.

Scheduling Policies
HP-UX scheduling is governed by policy that connotes the urgency for which the CPU is needed, as either timeshare or real-time. The following table compares the two policies in very general terms.

Comparison of Timeshare vs Real-time scheduling

Timeshare:
- Typically implemented round-robin.
- Kernel lowers priority when the process is running; that is, timeshare priorities decay. As you use CPU, your priority becomes weaker. As you become starved for CPU, your priority becomes stronger. The scheduler tends to regress toward the mean.
- Runs in timeslices that can be preempted by a process running at higher priority.

Real-Time:
- Implemented as either round-robin or first-in-first-out (FIFO), depending on the scheduler.
- Priority is not adjusted by the kernel; that is, real-time priorities are non-decaying. If one real-time priority is set at 50 and another is set at 40 (where 40 is stronger than 50), the process or thread of priority 40 will always be more important than the process or thread of priority 50.
- Runs until it exits or is blocked.
- Always runs at higher priority than timeshare.

The principle behind the distribution of CPU time is called a timeslice. A timeslice is the amount of time a process can run before the kernel checks to see if there is an equal or stronger priority process ready to run. If a timeshare policy is implemented, a process might begin to run and then relinquish the CPU to a process with a stronger priority. Real-time processes running round-robin typically run until they are blocked or relinquish the CPU after a certain timeslice has occurred. Real-time processes running FIFO run until completion, without being preempted. Scheduling policies act upon sets of thread lists, one thread list for each priority. Any runnable thread may be in any thread list. Multiple scheduling policies are provided. Each nonempty list is ordered, and contains a head (th_link) as one end of its order and a tail (th_rlink) as the other. The purpose of a scheduling policy is to define the allowable operations on this set of lists (for example, moving threads between and within lists). Each thread is controlled by an associated scheduling policy and priority. Applications can specify these parameters by explicitly executing the sched_setscheduler() or sched_setparam() functions.

Hierarchy of Priorities
All POSIX real-time priority threads have greater scheduling importance than threads with HP-UX real-time or HP-UX timeshare priority. By comparison, all HP-UX real-time priority threads are of greater scheduling importance than HP-UX timeshare priority threads, but are of lesser importance than POSIX real-time threads. Neither POSIX nor HP-UX real-time threads are subject to degradation.

Schedulers
As of release 10.0, HP-UX implements four schedulers, two timeshare and two real-time. To choose a scheduler, you can use the user command rtsched(1), which executes processes with your choice of scheduler and lets you change the real-time priority of a currently executing process ID:

   rtsched -s scheduler -p priority command [arguments]
   rtsched [ -s scheduler ] -p priority -P pid

Likewise, the system call rtsched(2) provides programmatic access to POSIX real-time scheduling operations.
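As an illustration of the two forms above (the PID 4321, the choice of policies, and the priority values are illustrative assumptions only; changing scheduling classes requires appropriate privileges):

   # Run a command under the POSIX round-robin policy at priority 31.
   rtsched -s SCHED_RR -p 31 ls -l /var/adm

   # Move an already-running (hypothetical) process, PID 4321,
   # to SCHED_FIFO at priority 20.
   rtsched -s SCHED_FIFO -p 20 -P 4321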

RTSCHED (POSIX) Scheduler


The RTSCHED POSIX-compliant real-time deterministic scheduler provides three scheduling policies, whose characteristics are compared in the following table.

RTSCHED policies

SCHED_FIFO
   Strict first in-first out (FIFO) scheduling policy. This policy contains a range of at least 32 priorities. Threads scheduled under this policy are chosen from a thread list ordered according to the time its threads have been in the list without being executed. The head of the list is the thread that has been in the list the longest time; the tail is the thread that has been in the list the shortest time.

SCHED_RR
   Round-robin scheduling policy with a per-system time slice (time quantum). This policy contains a range of at least 32 priorities and is identical to the SCHED_FIFO policy with an additional condition: when the implementation detects that a running process has been executing as a running thread for a time period of the length returned by the function sched_rr_get_interval(), or longer, the thread becomes the tail of its thread list, and the head of that thread list is removed and made a running thread.

SCHED_RR2
   Round-robin scheduling policy with a per-priority time slice (time quantum). The priority range for this policy contains at least 32 priorities. This policy is identical to the SCHED_RR policy except that the round-robin time slice interval returned by sched_rr_get_interval() depends upon the priority of the specified thread.

SCHED_RTPRIO Scheduler
Realtime scheduling policy with nondecaying priorities (like SCHED_FIFO and SCHED_RR) with a priority range between the POSIX real-time policies and the HP-UX policies. For threads executing under this policy, the implementation must use only priorities within the range returned by the functions sched_get_priority_max() and sched_get_priority_min() when SCHED_RTPRIO is provided as the parameter.

NOTE : In the SCHED_RTPRIO scheduling policy, smaller numbers represent higher (stronger) priorities, which is the opposite of the POSIX scheduling policies. This is done to provide continuing support for existing applications that depend on this priority ordering. The strongest priority in the priority range for SCHED_RTPRIO is weaker than the weakest priority in the priority ranges for any of the POSIX policies, SCHED_FIFO, SCHED_RR, and SCHED_RR2.


SCHED_HPUX Scheduler
The SCHED_OTHER policy, also known as SCHED_HPUX and SCHED_TIMESHARE, provides a way for applications to indicate, in a portable way, that they no longer need a real-time scheduling policy. For threads executing under this policy, the implementation can use only priorities within the range returned by the functions sched_get_priority_max() and sched_get_priority_min() when SCHED_OTHER is provided as the parameter. Note that for the SCHED_OTHER scheduling policy, like SCHED_RTPRIO, smaller numbers represent higher (stronger) priorities, which is the opposite of the POSIX scheduling policies. This is done to provide continuing support for existing applications that depend on this priority ordering. However, it is guaranteed that the priority range for the SCHED_OTHER scheduling policy is properly disjoint from the priority ranges of all of the other scheduling policies described and the strongest priority in the priority range for SCHED_OTHER is weaker than the weakest priority in the priority ranges for any of the other policies, SCHED_FIFO, SCHED_RR, and SCHED_RR2.

Scheduling Priorities
All processes have a priority, set when the process is invoked and based on factors such as whether the process is running on behalf of a user or the system and whether the process is created in a time-share or real-time environment. Associated with each policy is a priority range. The priority ranges for each policy can (but need not) overlap the priority ranges of other policies. Two separate ranges of priorities exist: a range of POSIX standard priorities and a range of other HP-UX priorities. The POSIX standard priorities are always higher than all other HP-UX priorities. Processes are chosen by the scheduler to execute a time-slice based on priority. Priorities range from highest priority to lowest priority and are classified by need. The thread selected to run is at the head of the highest priority nonempty thread list.

Internal vs. External Priority Values


With the implementation of the POSIX rtsched, HP-UX priorities are enumerated from two perspectives -- internal and external priority values. The internal value represents the kernel's view of the priority. The external value represents the user's view of the priority, as is visible using the ps(1) command. In addition, legacy HP-UX priority values are ranked in opposite sequence from POSIX priority values: In the POSIX standard, the higher the priority number, the stronger the priority. In legacy HP-UX implementation, the lower the priority number, the stronger the priority.

The following macros are defined in pm_rtsched.h to enable a program to convert between POSIX and HP-UX priorities and between internal and external values:

PRI_ExtPOSIXPri_To_IntHpuxPri
   To derive the HP-UX kernel (internal) value from the value passed by a user invoking the rtsched command (that is, using the POSIX priority value).
PRI_IntHpuxPri_To_ExtPOSIXPri()
   To convert an HP-UX (kernel) internal priority value to a POSIX priority value.
PRI_IntHpuxPri_To_ExtHpuxPri
   To convert HP-UX internal to HP-UX external priority values.

rtsched_numpri Parameter

A configurable parameter, rtsched_numpri, controls the number of scheduling priorities supported by the POSIX rtsched scheduler. The range of valid values is 32 to 512 (32 is the default). Increasing rtsched_numpri provides more scheduling priorities at the cost of increased context switch time and, to a minor degree, increased memory consumption.
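The current value of the tunable can be read from the running kernel with the same adb technique used earlier for itick_per_usec and phys_mem_pages; this sketch assumes the kernel symbol is named rtsched_numpri, as in the text:

   # Query the running kernel's rtsched_numpri value (default 32).
   echo rtsched_numpri/D | adb -k /stand/vmunix /dev/mem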

Schedulers and Priority Values


There are now four sets of thread priorities: (Internal to External View)

Scheduler priority values

Type of Scheduler      Internal Values    External Values
POSIX Standard         0 to 31            512 to 480
Real-time              0 to 127           512 to 640
System, timeshare      128 to 177         640 to 689
User, timeshare        178 to 255         690 to 767


NOTE: For the POSIX standard scheduler, the higher the number, the stronger the priority. For the RTPRIO scheduler, the lower the number, the stronger the priority.

The following lists categories of priority, from highest to lowest:

- RTSCHED (POSIX standard) ranks as the highest priority range and is separate from other HP-UX priorities. RTSCHED priorities range between 32 and 512 (default 32) and can be set by the tunable parameter rtsched_numpri.
- SCHED_RTPRIO (real-time priority) ranges from 0-127 and is reserved for processes started with rtprio() system calls.
- Two priorities are used in a timeshare environment: user priority (178-255), assigned to user processes in a time-share environment, and system priority (128-177), used by system processes in a time-share environment.

The kernel can alter the priority of time-share priorities (128-255) but not real-time priorities (0-127). The following priority values, internal to the kernel, are defined in param.h:

PRTSCHEDBASE       Smallest (strongest) RTSCHED priority
MAX_RTSCHED_PRI    Maximum number of RTSCHED priorities
PRTBASE            Smallest (strongest) RTPRIO priority. Defined as PRTSCHED + MAX_RTSCHED_PRI.
PTIMESHARE         Smallest (strongest) timeshare priority. Defined as PRTBASE + 128.
PMAX_TIMESHARE     Largest (weakest) timeshare priority. Defined as 127 + PTIMESHARE.

Priorities stronger (smaller number) than or equal to PZERO cannot be signaled. Priorities weaker (bigger number) than PZERO can be signaled.

RTSCHED Priorities
The following discussion illustrates the HP-UX internal view, based on how the user specifies a priority to the rtsched command. Each available real-time scheduler policy has a range of priorities (default values shown below).

Scheduler Policy    Highest priority    Lowest priority
SCHED_FIFO          31                  0
SCHED_RR            31                  0
SCHED_RR2           31                  0
SCHED_RTPRIO        0                   127

The user may invoke the rtsched(1) command to assign a scheduler policy and priority. For example:
   rtsched -s SCHED_RR -p 31 ls

Within kernel mode, sched_setparam() is called to set the scheduling parameters of a process. It (along with sched_setscheduler()) is the mechanism by which a process changes its (or another process') scheduling parameters. Presently the only scheduling parameter is priority, sched_priority. The sched_setparam() and sched_setscheduler() system calls look up the process associated with the user argument pid, and call the internal routine sched_setcommon() to complete the execution. sched_setcommon() is the common code for sched_setparam() and sched_setscheduler(). It modifies the thread's scheduling priority and policy. The scheduler information for a thread is kept in its thread structure. It is used by the scheduling code, particularly setrq(), to decide when the thread runs with respect to the other threads in the system. sched_setcommon() is called with the sched_lock held. sched_setcommon() calls the macro PRI_ExtPOSIXPri_To_IntHpuxPri, defined in pm_rtsched.h. The priority requested is then converted. Since priorities in HP-UX are stronger for smaller values, and the POSIX specification requires the opposite behavior, we merge the two by running the rtsched priorities from ((MAX_RTSCHED_PRI-1) - rtsched_info.rts_numpri) (strongest) to (MAX_RTSCHED_PRI-1) (weakest). Based on the macro definition, using the value passed by the user, the internal value seen by the kernel is calculated as follows:

   (MAX_RTSCHED_PRI - 1) - ExtP_pri  =  512 - 1 - 31  =  480


The kernel priority of the user's process is 480. The value of 480 is the strongest priority available to the user.

Run Queues
A process must be on a queue of runnable processes before the scheduler can choose it to run. Processes get linked into the run queue based on the process's priority, set in the process table. Run queues are link-listed in decreasing priority. The scheduler chooses the process with the highest priority to run for a given time-slice. Each process is represented by its header on the list of run queue headers; each entry in the list of run queue headers points to the process table entry for its respective process. The kernel maintains separate queues for system-mode and user-mode execution. System-mode execution takes precedence for CPU time. User-mode priorities can be preempted -- stopped and swapped out to secondary storage; kernel-mode execution cannot. Processes run until they have to wait for a resource (such as data from a disk, for example), until the kernel preempts them when their run time exceeds a time-slice limit, until an interrupt occurs, or until they exit. The scheduler then chooses a new eligible highest-priority process to run; eventually, the original process will run again when it has the highest priority of any runnable process. When a timeshare process is not running, the kernel improves the process's priority (lowers its number). When a process is running, its priority worsens. The kernel does not alter priorities on real-time processes. Timeshared processes (both system and user) lose priority as they execute and regain priority when they do not execute

Run Queue Initialization


Run queues are initialized by the routine rqinit(), which is called from init_main.c after the system monarch processor is established and before final kernel initialization. rqinit() examines all potential entries in the system global per-processor information structure (struct mpinfo) and gets the run queue information and pointers to the linked list of running threads. It then clears the run queue data in bestq (an index into the array of run queue pointers which points to the highest priority non-empty queue), newavg_on_rq (the run queue average for the processor), and nready_locked and nready_free (sums giving the total number of threads in the processor's run queues). rqinit() then sets the current itimer value for all run queues, links the queue header as the sole element, and sets up the queue. Next, the RTSCHED-related global run data structures are initialized with the global structure rtsched_info (defined in pm_rtsched.h), which describes the RTSCHED run queues.

Entries in rtsched_info

Entry               Purpose
rts_nready          Total number of threads on queues
rts_bestq           Hint of which queue to find threads
rts_numpri          Number of RTSCHED priorities
rts_rr_timeslice    Global timeslice for SCHED_RR threads
*rts_timeslicep     Round-robin timeslices for each priority (used by SCHED_RR2 threads)
*rts_qp             Pointer to run queues
*rts_lock           Spinlock for the run queues

The tunable parameter rtsched_numpri determines how many run queues exist. The minimum value allowed is 32, imposed by the POSIX.4 specification and defined as RTSCHED_NUMPRI_FLOOR. The maximum supported value of 512 is a constant of the implementation, defined as RTSCHED_NUMPRI_CEILING and set equal to MAX_RTSCHED_PRI. If a higher maximum is required, the latter definition must be changed. malloc() is called to allocate space for the RTSCHED run queues; (rtsched_numpri * sizeof (struct mp_threadhd)) bytes are required. The resulting pointer is stored in rtsched_info.rts_qp. Timeslice is checked to ensure that it is set to a valid value, which may be either -1 (meaning no timeslicing) or a positive integer. If it is invalid, it is set to the default, HZ/10. rtsched_info.rts_rr_timeslice is set to timeslice, which round-robins with that many clock ticks. For each of the rtsched_numpri run queues, the struct mp_threadhd header block is linked circularly to itself. Finally, a spinlock is allocated to lock the run queue.

Note: There is one RTSCHED run queue systemwide, though it is tracked separately for each processor. The queue for a given thread is based on how the scheduling policy is defined. One global set of run queues is maintained for RTSCHED (SCHED_FIFO, SCHED_RR, SCHED_RR2) threads. Run queues are maintained for each SPU for SCHED_TIMESHARE and SCHED_RTPRIO threads.

RTSCHED Run Queue


The following figure shows threads set to run at various RTSCHED priorities. The global RTSCHED run queues are searched for the strongest (most deserving) thread to run; the best candidate is returned as a kthread_t. Each priority has one thread list. Any runnable thread may be in any thread list. Multiple scheduling policies are provided. Each nonempty list is ordered, and contains a head (th_link) at one end of its order and a tail (th_rlink) at the other. rtsched_info.rts_qp points to the strongest RTSCHED queue. rtsched_info.rts_bestq points to the queue to begin the search. The search (by the routine find_thread_rtsched()) proceeds from rts_bestq downwards looking for non-empty run queues. When the first non-empty queue is found, its index is noted in the local first_busyq. All threads in that queue are checked to determine if they are truly runnable or blocked on a semaphore. If there is a runnable thread, the rts_bestq value is updated to the present queue and a pointer to the thread found is returned to the caller. If no truly runnable thread is found, threads blocked on semaphores are considered. If first_busyq is set, the rts_bestq value is updated to it and the thread at the head of that queue is returned to the caller. If first_busyq did not get set in the loop, the routine panics, because it should be called only if rtsched_info.rts_nready is non-zero. Although the threads scheduler is set to a default value of 32 (RTSCHED_NUMPRI_FLOOR), it can be expanded to a system limit of PRTSCHEDBASE (a value of 0).

The Combined SCHED_RTPRIO and SCHED_TIMESHARE Run Queue


The SCHED_RTPRIO and SCHED_TIMESHARE priorities use the same queue. The SCHED_RTPRIO and SCHED_TIMESHARE queue is searched with the same technique as the RTSCHED queue. The most deserving thread is found to run on the current processor. The search starts at bestq, which is an index into the table of run queues. There is one thread list for each priority. Any runnable thread may be in any thread list. Multiple scheduling policies are provided. Each nonempty list is ordered, and contains a head (th_link) as one end of its order and a tail (th_rlink) as the other. The mp_rq structure constructs the run queues by linking threads together. The structure qs is an array of pointer pairs that act as a doubly linked list of threads. Each entry in qs[] represents a different priority queue; the array is sized by NQS, which is 160. The qs[].th_link pointer points to the first thread in the queue and the qs[].th_rlink pointer points to the tail.

SCHED_RTPRIO (HP-UX REAL TIME) run queue

Priorities 0 (highest realtime priority) through 127 (lowest realtime priority) are reserved for real-time threads. A real-time priority thread will run until it sleeps, exits, or is preempted by a higher-priority real-time thread. Equal-priority threads are run in a round-robin fashion. The rtprio(1) command may be used to give a thread a real-time priority (an example follows this list). To use the rtprio(1) command a user must belong to the PRIV_RTPRIO privilege group or be superuser (root). The priorities of real-time threads are never modified by the system unless explicitly requested by a user (via a command or system call). Also, a real-time thread will always run before a time-share thread. The following are a few key points regarding a real-time thread:

- Priorities are not adjusted by the kernel
- Priorities may be adjusted by a system call
- Real-time priority is set in kt_pri
- The p_nice value has no effect
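A minimal illustration of rtprio(1) usage; the program path is a hypothetical placeholder and the priority value 100 is arbitrary (0 is strongest, 127 weakest):

   # Start a hypothetical job at HP-UX real-time priority 100.
   # Requires PRIV_RTPRIO group membership or root.
   rtprio 100 /opt/acme/bin/collector &

   # The PRI column of ps -el should now show the real-time priority.
   ps -el | grep collector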

SCHED_TIMESHARE run queue


Timeshare threads are grouped into system priorities (128 through 177) and user priorities (178 through 255). The queues are four priorities wide. The system picks the highest priority timeshare thread and lets it run for a specific period of time (timeslice). As the thread is running its priority decreases. At the end of the time slice, a new highest priority is chosen. Waiting threads gain priority and running threads lose priority in order to favor threads that perform I/O and give lesser attention to compute-bound threads. SCHED_TIMESHARE priorities are grouped as follows:

   Real-time priority thread:      range 0-127
   Time-share priority thread:     range 128-255
   System-level priority thread:   range 128-177
   User-level priority thread:     range 178-255

RTSCHED priority queues are one priority wide; timeshare priority queues are four priorities wide.

Thread Scheduling
The thread of a parent process forks a child process. The child process inherits the scheduling policy and priority of the parent process. As with the parent thread, it is the child thread whose scheduling policy and priority will be used. Each thread in a process is independently scheduled: each thread contains its own scheduling policy and priority. Thread scheduling policies and priorities may be assigned before a thread is created (in the thread's attributes object) or set dynamically while a thread is running. Each thread may be bound directly to a CPU. Each thread may be suspended (and later resumed) by any thread within the process. The following scheduling attributes may be set in the thread's attributes object; the newly created thread will contain these scheduling attributes:

contentionscope

PTHREAD_SCOPE_SYSTEM specifies a bound (1 x 1, kernel-space) thread. When a bound thread is created, both a user thread and a kernel-scheduled entity are created. PTHREAD_SCOPE_PROCESS specifies an unbound (M x N, combination user- and kernel-space) thread. (Note: HP-UX release 10.30 does not support unbound threads.)

inheritsched

PTHREAD_INHERIT_SCHED specifies that the created thread will inherit its scheduling values from the creating thread, instead of from the threads attribute object. PTHREAD_EXPLICIT_SCHED specifies that the created thread will get its scheduling values from the threads attribute object.

schedpolicy

The scheduling policy of the newly created thread.

schedparam

The scheduling parameter (priority) of the newly created thread.

Timeline


A process and its threads change with the passage of time. A thread's priority is adjusted at four key times:

Thread priority adjustments

10 milliseconds
   The clock-interrupt handling routine clock_int() adjusts a time interval on the monarch every clock tick. The monarch processor calls hardclock() to handle clock ticks on the monarch for general maintenance (such as disk and LAN states). hardclock() calls per_spu_hardclock() to charge the running thread with the CPU time accumulated (kt_cpu).

40 milliseconds
   per_spu_hardclock() determines that the running thread has accumulated 40 ms of time and calls setpri(). setpri() calls calcusrpri() to adjust the running thread's user priority (kt_usrpri).

100 milliseconds
   By default, 10 clock ticks represents the value of timeslice, the configurable kernel parameter that defines the amount of time one thread is allowed to run before the CPU is given to the next thread (see the note following this table). Once a timeslice interval has expired, a call to swtch() is made to enact a context switch.

one second
   statdaemon() loops on the thread list and once every second calls schedcpu() to update all thread priorities. The kt_usrpri priority is given to the thread on the next context switch; if in user mode, kt_usrpri is given immediately.
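The timeslice tunable referenced in the 100 millisecond entry can be read from the running kernel with the adb technique used earlier; this sketch assumes the kernel symbol is named timeslice and that the value is in clock ticks (10 ticks = 100 ms by default):

   # Display the current timeslice value (in 10 ms clock ticks).
   echo timeslice/D | adb -k /stand/vmunix /dev/mem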

Thread scheduling routines

hardclock()
   Runs on the monarch processor to handle clock ticks.

per_spu_hardclock()
   Handles per-processor hardclock activities.

setpri()
   Called with a thread as its argument and returns a user priority for that thread. Calls calcusrpri() to get the new user priority. If the new priority is stronger than that of the currently running thread, setpri() generates an MPSCHED interrupt on the processor executing that thread, stores the new user priority in kt_usrpri and returns it to its caller.

calcusrpri()
   The user priority (kt_usrpri) portion of setpri(). calcusrpri() uses the kt_cpu and p_nice (proc) fields of the thread, tt, to determine tt's user priority and return that value without changing any fields in *tt. If tt is a RTPRIO or RTSCHED thread, kt_usrpri is the current value of kt_pri.

swtch()
   Finds the most deserving runnable thread, takes it off the run queue, and sets it to run.

statdaemon()
   A general-purpose kernel process run once per second to check and update process and virtual memory artifacts, such as signal queueing and free protection IDs. Calls schedcpu() to recompute thread priorities and statistics.

schedcpu()
   Once a second, schedcpu() loops through the thread list to update thread scheduling priorities. If the system has more than one SPU, it balances SPU loads. schedcpu() updates thread usage information (kt_prevrecentcycles and kt_fractioncpu), calculates a new kt_cpu for the current thread (information used by setpri()), updates the statistics of runnable threads on run queues and those swapped out, and awakens the swapper. Calls setpri().

setrq()
   Routine used to put threads onto the run queues. Sets the appropriate protection (spl7 in the UP case, thread lock in the MP case). Asserts a valid HP-UX priority and scheduling policy and performs policy-specific setup.

remrq()
   Routine used to remove a thread from its run queue. With a valid kt_link, sets the appropriate protection (spl7 in the UP case or thread lock in the MP case). Finds the processor on which the thread is running. Decrements the thread count on run queues. Updates the mpinfo structure. Restores the old spl level and updates RTSCHED counts if necessary. Adjusts kt_pri and returns to schedcpu().

Adjusting a Thread Priority


Every 10 msecs, the routine hardclock() is called with SPL5 set to disable I/O module and software interrupts. hardclock() calls the per-processor routine per_spu_hardclock(), which looks for threads whose priority is high enough to run (how the processor run queues are searched depends on the scheduling policy). If a thread is found, the MPSCHED_INT_BIT in the processor EIRR (External Interrupt Request Register) is set. When the system receives an MPSCHED_INT interrupt while running a thread in user mode, the trap handler puts the thread on a run queue and switches context to bring in the higher-priority thread. If the currently executing thread is the thread with the highest priority, it is given 100 ms (one timeslice) to run. hardclock() calls setpri() every 40 ms to review the thread's working priority (kt_pri). setpri() adjusts the user priority (kt_usrpri) of a time-share thread based on CPU usage and nice values. While a time-share thread is running, its kt_cpu time increases and its priority (kt_pri) worsens. RTSCHED or RTPRIO thread priorities do not change. Every 1 second, schedcpu() decrements the kt_cpu value for each thread on the run queue. setpri() is called to calculate a new priority for the current thread being examined in the schedcpu() loop. remrq() is called to remove that thread from the run queue, and then setrq() places the thread back into the run queue according to its new priority. If a process is sleeping or on a swap device (that is, not on the run queue), the user priority (kt_usrpri) is adjusted in setpri() and kt_pri is set in schedcpu().

Context Switching
In a thread-based kernel, the kernel manages context switches between kernel threads, rather than processes. Context switching occurs when the kernel switches from executing one thread to executing another. The kernel saves the context of the currently running thread and resumes the context of the next thread that is scheduled to run. When the kernel preempts a thread, its context is saved. Once the preempted thread is scheduled to run again, its context is restored and it continues as if it had never stopped. The kernel allows a context switch to occur under the following circumstances:

- The thread exits
- The thread's time slice has expired and a trap is generated
- The thread puts itself to sleep while awaiting a resource
- The thread puts itself into a debug or stop state
- The thread returns to user mode from a system call or trap
- A higher-priority thread becomes ready to run

If a kernel thread has a higher priority than the running thread, it can preempt the current running thread. This occurs if the thread is awakened by a resource it has requested. Only user threads can be preempted. HP-UX does not allow preemption in the kernel except when a kernel thread is returning to user mode. In the case where a single process can schedule multiple kernel threads (1 x 1 and M x N), the kernel will preempt the running thread when it is executing in user space, but not when it is executing in kernel space (for example, during a system call).

The swtch() Routine


The swtch() routine finds the most deserving runnable thread, takes it off the run queue, and starts running it.

swtch() routines

swidle() (asm_utl.c)
   Performs an idle loop while waiting to take action.

remrq()
   Checks for a valid kt_link. On a uniprocessor machine without a thread lock, goes to spl7. Finds the thread's SPU. Decrements the count of threads on run queues. Updates ndeactivated, nready_free and nready_locked in the mpinfo structure. Removes the thread from its run queue. Restores the old spl level. Updates RTSCHED counts.

save() (resume.s)
   Routine called to save state. Saves the thread's process control block (pcb) marker.

find_thread_my_spu() (pm_policy.c)
   For the current CPU, finds the most deserving thread to run and removes the old one. The search starts at bestq, an index into the table of run queues. When found, sets up the new thread to run. Marks the interval timer in the SPU's mpinfo. Sets the processor state to MPSYS. Removes the thread from its run queue. Verifies that it is runnable (kt_stat == TSRUN). Sets the EIRR to MPSCHED_INT_ENABLE. Sets the thread context bit to TSRUNPROC to indicate the thread is running.

resume() (resume.s)
   Restores the register context from the pcb and transfers control to enable the thread to resume execution.

Process and Processor Interval Timing


Timing intervals are used to measure user, system, and interrupt times for threads and idle time for processors. These measurements are taken and recorded in machine cycles for maximum precision and accountability. The algorithm for interval timing is described in pm_cycles.h. Each processor maintains its own timing state by criteria defined in struct mpinfo, found in mp.h.


Processor timing states (per-processor fields in struct mpinfo)

curstate       The current state of the processor (spustate_t)
starttime      Start time (CR16) of the current interval
prevthreadp    Thread to which the current interval is attributed
idlecycles     Total cycles the SPU has spent idling since boot (cycles_t)

Processor states (SPU states)

SPUSTATE_NONE      Processor is booting and has not yet entered another state
SPUSTATE_IDLE      Processor is idle
SPUSTATE_USER      Processor is in user mode
SPUSTATE_SYSTEM    Processor is in syscall() or trap

Time spent processing interrupts is attributed to the running process as user or system time, depending on the state of the process when the interrupt occurred. Each time the kernel calls wakeup() while on the interrupt stack, a new interval starts and the time of the previous interval is attributed to the running process. If the processor is idle, the interrupt time is added to the processor's idle time.

State Transitions
A thread leaves resume(), either from another thread or the idle loop. Protected by a lock, the routine resume_cleanup() notes the time, attributes the interval to the previous thread if there was one or the processor's idle time if not, marks the new interval's start time, and changes the current state to SPUSTATE_SYSTEM. When the processor idles, the routine swtch(), protected by a currently held lock, notes the time, attributes the interval to the previous thread, marks the new interval as starting at the noted time, and changes the current state to SPUSTATE_IDLE.

A user process makes a system call.

A user process running in user-mode at (a) makes a system call at (b). It returns from the system call at (e) to run again in user-mode. Between (b) and (e) it is running in system-mode. Toward the beginning of syscall() at (c), a new system-mode interval starts; the previous interval is attributed to the thread as user time. Toward the end of syscall() at (d), a new user-mode interval starts; the previous interval is attributed to the thread as system time. For timing purposes, traps are handled identically, with the following exceptions: (c) and (d) are located in trap(), not syscall(), and whether (d) starts a user- or system-mode interval depends on the state of the thread at the time of the trap.
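
At the command level, the user/system split described above can be seen with timex (covered with the other general tools in Module 7). The command below is only an illustration; any command can be substituted:

# timex prints the real, user and sys times consumed by the command,
# reflecting the user-mode and system-mode intervals charged to its threads.
timex ls -R /etc > /dev/null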

An interrupt occurs


Interrupts are handled much like traps, but any wakeup that occurs while on the interrupt stack (such as w1 and w2 in the figure above) starts a new interval and its time is attributed to the thread being awakened rather than the previous thread. Interrupt time attributed to processes is stored in the kt_interrupttime field of the thread structure. Concurrent writes to this field are prevented because wakeup is the only routine (other than allocproc()) that writes to the field, and it only does so under the protection of a spinlock. Reads are performed (by pstat() and others) without locking, by using timecopy() instead. Conceptually, the work being done is on behalf of the thread being awakened instead of the previously running thread.

CPU Bottlenecks
To determine which processes are taking up the majority of CPU time, run: # ps -ef | sort -rnk 8 | more The TIME column is the 8th field of the ps -ef output.

CPU bottlenecks show up as a high %wio (wait on I/O) from sar -u; for multiprocessor systems add the -M option (for example, sar -Mu) to report per-processor statistics. High wait on I/O can be caused by a number of factors:

Run queue length. The total number of jobs in the CPU run queue can be checked with sar -q by looking at the runq-sz column; typically this value is about 1.0, and as the number of jobs increases the effect is logarithmic.

Priority. The highest-priority processes in the CPU run queue receive processor time; as processes run, their priority ages to allow other processes access. If processes with significantly stronger priorities are running, low-priority processes may get very little or no CPU time. As a result more jobs accumulate in the run queue, which increases the wait on I/O.
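
The following commands illustrate how the indicators above are gathered; the 5-second/5-sample intervals are just examples:

sar -u 5 5     # %usr, %sys, %wio and %idle, system wide
sar -Mu 5 5    # the same statistics broken out per processor
sar -q 5 5     # runq-sz and %runocc: run queue length and occupancy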

Kernel parameters that affect CPU


Timeslice
This kernel parameter defaults to 10 clock ticks, which equals 100 milliseconds. Some tuned parameter sets or user-implemented changes can alter this, with dramatic effects. Shorter values can increase CPU overhead by causing excessive context switching. Every context switch requires the system to re-prioritize the jobs in the run queue based on their relative importance. Processes context switch out either at the end of their allotted timeslice or when a more important process enters the CPU run queue.
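
To confirm what the running kernel is actually using, the value can be queried as sketched below (illustrative only; kmtune is the tunable query tool on 11.x, and the adb form follows the same pattern used elsewhere in this course):

kmtune -q timeslice
echo timeslice/D | adb -k /stand/vmunix /dev/mem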

System Tables
The system tables include the process, inode and file tables. These are the most frequently misconfigured kernel parameters. They are represented in the kernel as nproc, ninode, vx_ninode and nfile. As they are by default controlled by maxusers, they are frequently oversized. Most vendor recommendations do not specify values for the individual tables; instead they recommend setting maxusers. This may be an appropriate starting point, but since the vast majority of systems do not use HFS file systems beyond the requirement for /stand, it creates a problem: the HFS inode table, controlled by ninode, is frequently oversized. System tables should be sized based on system load. Oversizing the tables causes excessive system overhead reading the tables, and excessive kernel memory use.

Inode Table
On 10.20 the inode table and dnlc (directory name lookup cache) are combined. As most systems run only the /stand file system as HFS, this parameter does not need to be any larger than the number of HFS inodes the system requires plus enough space for an adequate HFS dnlc; 1024 is an adequate value to address this. On 11.00 the dnlc is configurable using the ncsize and vx_ncsize kernel parameters.


By default ncsize = (ninode + vx_ncsize) + (8 * dnlc_hash_locks). The parameter vx_ncsize defines the memory space reserved for the VxFS directory path-name cache (in bytes). The default value for vx_ncsize is 1024; dnlc_hash_locks defaults to 512. A VxFS file system obtains the value of vx_ninode from the system configuration file used for making the kernel (/stand/system for example). This value is used to determine the number of entries in the VxFS inode table. By default, vx_ninode initializes at zero; the file system then computes a value based on the system memory size (see Table 1, Inode Table Size). To change the computed value of vx_ninode, you can add an entry to the system configuration file. For example, an entry such as vx_ninode 1000000 sets the inode table size to 1,000,000 inodes after making a new kernel using mk_kernel and then rebooting. The number of inodes in the inode table is calculated according to the following table. The first column is the amount of system memory, the second is the number of inodes. If the available memory is a value between two entries, the value of vx_ninode is interpolated.

Table 1 Inode Table Size

Total Memory in Mbytes    Maximum Number of Inodes
8                         400
16                        1000
32                        2500
64                        6000
128                       8000
256                       16,000
512                       32,000
1024                      64,000
2048                      128,000
8192                      256,000
32,768                    512,000
131,072                   1,024,000
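
The current settings behind the sizes above can be checked from the command line. This is an illustrative sketch only; a tunable that has not been explicitly set may simply report its default or formula:

kmtune -q ninode
kmtune -q ncsize
kmtune -q vx_ninode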

Inode Tables

The HFS Inode Cache
The HFS inode cache contains information about the file type, size, timestamp, permissions and block map. This information is stored in the on-disk inode. The in-memory inode contains the on-disk inode information plus linked-list and other pointers, the inode number and lock primitives. One inode entry for every open file must exist in memory; closed-file inodes are kept on the free list. The HFS inode table is controlled by the kernel parameter ninode.

Memory cost for the HFS inode cache, in bytes per inode/vnode/hash entry:
10.20          424
11.0 32-bit    444
11.0 64-bit    680
11i 32-bit     475
11i 64-bit     688

On 10.20 the inode table and dnlc (directory name lookup cache) are combined. The tunable parameter for dnlc ncsize was introduced in patch PHKL_18335.

On 11.00 the dnlc is now configurable using the ncsize and vx_ncsize kernel parameters. By default ncsize =(ninode+vx_ncsize) +(8*dnlc_hash_locks) . The parameter vx_ncsize defines the memory space reserved for VxFS directory path-name cache (in bytes) The default value for vx_ncsize is 1024, dnlc_hash_ locks defaults to 512. As of JFS 3.5 vx_ncsize became obsolete. The JFS Inode Cache A VxFS file system obtains the value of vx_ninode from the system configuration file used for making the kernel (/stand/system for example). This value is used to determine the number of entries in the VxFS inode table. By default, vx_ninode initializes at zero; the file system then computes a value based on the system memory size (see Inode Table Size). To change the computed value of vx_ninode, you can hard code the value in SAM . For example: Set vx_ninode=16,000. The number of inodes in the inode table is calculated according to the following table. The first column is the amount of system memory, the second is the number of inodes. If the available memory is a value between two entries, the value of vx_ninode is interpolated.


The memory requirements for JFS are dependent on the revision of JFS and system memory.

Maximum VxFS inodes in the cache based on system memory

System Memory in MB    JFS 3.1    JFS 3.3-3.5
256                    18666      16000
512                    37333      32000
1024                   74666      64000
2048                   149333     128000
8192                   149333     256000
32768                  149333     512000
131072                 149333     1024000

To check how many inodes are in the JFS inode cache for JFS 3.1 or 3.3:
# echo vxfs_ninode/D | adb -k /stand/vmunix /dev/mem
For JFS 3.5 use the vxfsstat command:
# vxfsstat -v / | grep maxino
vxi_icache_maxino 128000 vxi_icache_peakino 128002
The JFS daemon (vxfsd) scans the free list; if inodes sit on the free list for a given length of time, they are freed back to the kernel memory allocator. The amount of time this takes, and the amount freed, varies by revision.

JFS free-list behavior by revision:
JFS 3.1 - maximum time before being freed: 300 seconds; maximum inodes freed per second: 1/300th of current
JFS 3.3 - maximum time before being freed: 500 seconds; maximum inodes freed per second: 50
JFS 3.5 - maximum time before being freed: 1800 seconds; maximum inodes freed per second: 1-25

Memory cost in bytes per JFS inode (inode/vnode/locks) by revision:
JFS 3.1 on 11.0:    32-bit 1220, 64-bit 2244
JFS 3.3 on 11.0:    32-bit 1494, 64-bit 1632
JFS 3.3 on 11.11:   32-bit 1352, 64-bit 1902
JFS 3.5 on 11.11:   64-bit 1850

Tuning the maximum size of the JFS Inode Cache
Remember that each environment is different.
There must be one inode entry for each file opened at any given time.
Most systems will run fine with 2% or less of memory used for the JFS inode cache.
Large file servers (e.g. Web servers and NFS servers) which randomly access a large set of inodes benefit from a large cache.
The inode cache typically appears full after accessing many files sequentially (e.g. find, ll, backups).
The HFS ninode parameter has no impact on the JFS inode cache.

While a static cache (setting a non-zero value for vx_ninode) may save memory, there are factors to keep in mind:
Inodes freed to the kernel memory allocator may not be available for immediate use by other objects.
Static inode caches keep inodes in the cache longer.

Process Table
The process table has two levels of control: nproc for the system-wide limit and maxuprc for the per-user limit. When configuring these parameters it is important to take into account the amount of configured memory. Ideally, all running processes will use no more than the amount of device swap configured. When configuring maxuprc it is prudent to consider user environments that require large numbers of user processes; the most common of these are databases and environments with a large number of printers and print jobs. As databases typically fall in the user domain (i.e. Oracle is considered a user), a value of 60% of nproc is a good starting point. As remote and network print jobs require 4 processes per job, a value of 4 times the number of printers is suggested. "Can't fork" errors will result if the limits of table size or virtual memory are reached. If possible, sar -v should be run to check process table use. If there is not an overflow, the total number of system processes can be determined. If there is insufficient virtual memory to satisfy the fork call, the system will indicate a "can't fork" error because it is out of virtual memory. This is not a process table problem; increasing nproc or maxuprc will only make matters worse. Increasing device swap is appropriate if this error is encountered.

File Table
The file table imposes the lightest impact on performance. High values may be needed to satisfy the system's need for many concurrent file opens. By using Glance or sar, a system high-water mark can be determined for both process and file tables. Setting the process table 25% above the peak usage value provides a sufficient worst-case load buffer. Setting nfile 50% above the peak usage will provide a sufficient buffer.
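
Peak table usage is easy to sample before resizing; the interval and count below are only an example:

# sar -v reports the current/maximum entries for the process, inode and file
# tables (proc-sz, inod-sz, file-sz) plus an overflow ("ov") count.
sar -v 5 5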

Module 3 Process Management

Process creation


Process 0 is created and initialized at system boot time but all other processes are created by a fork() or vfork() system call. The fork() system call causes the creation of a new process. The new (child) process is an exact copy of the calling (parent) process. vfork() differs from fork() only in that the child process can share code and data with the calling process (parent process). This speeds cloning activity significantly at a risk to the integrity of the parent process if vfork() is misused.

The use of vfork() for any purpose except as a prelude to an immediate exec() or exit() is not supported. Any program that relies upon the differences between fork() and vfork() is not portable across HP-UX systems.


Comparison of fork() and vfork()

fork()
Sets context to point to parent. Child process is an exact copy of the parent process. (See fork(2) manpage for inherited attributes.) Copy on access. Must reserve swap.

vfork()
Can share parent's data and code. vfork() returns 0 in the child's context and (later) the pid of the child in the parent's context. Child borrows the parent's memory and thread of control until a call to exec() or exit(). Parent must sleep while the child is using its resources, since child shares stack and uarea. No reservation of swap.

At user (application) level, processes or threads can create new processes via fork() or vfork(). At kernel level, only threads can fork new processes.

When fork'd, the child process inherits the following attributes from the parent process:
Real, effective, and saved user IDs.
Real, effective, and saved group IDs.
List of supplementary group IDs (see getgroups(2)).
Process group ID.
File descriptors.
Close-on-exec flags (see exec(2)).
Signal handling settings (SIG_DFL, SIG_IGN, address).
Signal mask (see sigvector(2)).
Profiling on/off status (see profil(2)).
Command name in the accounting record (see acct(4)).
Nice value (see nice(2)).
All attached shared memory segments (see shmop(2)).
Current working directory.
Root directory (see chroot(2)).
File mode creation mask (see umask(2)).
File size limit (see ulimit(2)).
Real-time priority (see rtprio(2)).
Each child file descriptor shares a common open file description with the corresponding parent file descriptor. Thus, changes to the file offset, file access mode, and file status flags of file descriptors in the parent also affect those in the child, and vice-versa.
The child process differs from the parent process in the following ways:
The child process has a unique process ID.
The child process has a different parent process ID (which is the process ID of the parent process).
The set of signals pending for the child process is initialized to the empty set.
The trace flag (see the ptrace(2) PT_SETTRC request) is cleared in the child process.
The AFORK flag in the ac_flags component of the accounting record is set in the child process.
Process locks, text locks, and data locks are not inherited by the child (see plock(2)).
All semadj values are cleared (see semop(2)).
The child process's values for tms_utime, tms_stime, tms_cutime, and tms_cstime are set to zero.
The time left until an alarm clock signal is reset to 0 (clearing any pending alarm), and all interval timers are set to 0 (disabled).

Process Execution
Once a process is created with fork() or vfork(), the process calls exec() (found in kern_exec.c) to begin executing program code. For example, a user might run the command /usr/bin/ll from the shell; to execute the command, a call is made to exec(). exec(), in all its forms, loads a program from an ordinary, executable file onto the current process, replacing the existing process's text with a new copy of an executable file.


An executable object file consists of a header (see a.out(4)), text segment, and data segment. The data segment contains an initialized portion and an uninitialized portion (bss). The path or file argument refers to either an executable object file or a script file of data for an interpreter. The entire user context (text, data, bss, heap, and user stack) is replaced. Only the arguments passed to exec() are passed from the old address space to the new address space. A successful call to exec() does not return because the new program overwrites the calling program.
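
A quick way to see the "overlay" behavior of exec() from a shell (illustrative only; any program can be substituted for /usr/bin/ll):

# The shell's exec builtin replaces the current shell image with the named
# program: the PID stays the same, but the program text is overwritten, and
# control never returns to the shell.
echo $$            # note the shell's PID
exec /usr/bin/ll   # when ll exits, this shell session is gone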

Process states
Through the course of its lifetime, a process transits through several states. Queues in main memory keep track of the process by its process ID. A process resides on a queue according to its state; process states are defined in the proc.h header file. Events such as receipt of a signal cause the process to transit from one state to another.

Process states

State             What Takes Place
idle (SIDL)       Process is created by a call to fork, vfork, or exec; can be scheduled to run.
run (SRUN)        Process is on a run queue, available to execute in either kernel or user mode.
stopped (SSTOP)   Executing process is stopped by a signal or parent process.
sleep (SSLEEP)    Process is not executing; may be waiting for resources.
zombie (SZOMB)    Having exited, the process no longer exists, but leaves behind for the parent process some record of its execution.

When a program starts up a process, the kernel allocates a structure for it from the process table. The process is now in idle state, waiting for system resources. Once it acquires the resource, the process is linked onto a run queue and made runnable. When the process acquires a time-slice, it runs, switching as necessary between kernel mode and user mode. If a running process receives a SIGSTOP signal (as with control-Z in vi) or is being traced, it enters a stop state. On receiving a SIGCONT signal, the process returns to a run queue (in-core, runnable). If a running process must wait for a resource (such as a semaphore or completion of I/O), the process goes on a sleep queue (sleep state) until getting the resource, at which time the process wakes up and is put on a run queue (in-core, runnable). A sleeping process might also be swapped out, in which case, when it receives its resource (or wakeup signal) the process might be made runnable, but remain swapped out. The process is swapped in and is put on a run queue. Once a process ends, it exits into a zombie state.
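
The state of every process on a running system can be sampled with ps. This is only an illustration; the single-letter codes in the S column map onto the states above (for example R for running/runnable, S for sleeping, T for stopped, Z for zombie; see ps(1) for the exact set of letters):

# Count processes by state; the second column of a long listing is the state.
ps -el | awk 'NR > 1 {print $2}' | sort | uniq -c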

The sleep*() Routines


Unless a thread is running with real-time priority, it will exhaust its time slice and be put to sleep. sleep() causes the calling thread (not the process) to suspend execution for the required time period. A sleeping thread gives up the processor until a wakeup() occurs on the channel on which the thread is placed. During sleep() the thread enters the scheduling queue at priority (pri). When pri <= PZERO, a signal cannot disturb the sleep; if pri > PZERO, the signal request will be processed. In the case of RTPRIO scheduling, the sleep can be disturbed by a signal only if SSIGABL is set. Setting SSIGABL is dependent on the value of pri.

Note: The sleep.h header file has parameter and sleep hash queue definitions for use by the sleep routines. The ksleep.h header file has structure definitions for the channel queues to which the kernel thread is linked when asleep. sleep() is passed the following parameters: Address of the channel on which to sleep.
Priority at which to sleep and sleep flags.


Address of the thread that called sleep().

The priority of the sleeping thread is determined. If the thread is scheduled real-time, sleep() makes its priority the stronger of the requested value and kt_pri; otherwise, sleep() uses the requested priority. The thread is placed on the appropriate sleep queue and the sleep-queue lock is unlocked. If sleeping at an interruptible priority, the thread is marked SSIGABL and will handle any signals received. If sleeping at an uninterruptible priority, the thread is marked !TSSIGABL and will not handle any signals. The thread's count of voluntary context switches is increased and swtch() is called to block the thread. Once time passes and the thread awakens, it checks whether a signal was received and, if so, handles it. Semaphores previously set aside are now called again.

wakeup()
The wakeup() routine is the counterpart to the sleep() routine. If a thread is put to sleep with a call to sleep(), it must be awakened by calling wakeup(). When wakeup() is called, all threads sleeping on the wakeup channel are awakened. The actual work of awakening a thread is accomplished by the real_wakeup() routine, called by wakeup() with the type set to ST_WAKEUP_ALL. When real_wakeup() is passed the channel being aroused, it takes the following actions:
Determines the appropriate sleep queue (slpque) data structure, based on the type of wakeup passed in.
Acquires the sleep queue lock if needed in the multiprocessing (MP) case; goes to spl6 in the uniprocessing (UP) case.
Acquires the thread lock for all threads on the appropriate sleep queue. If the kt_wchan matches the argument chan, removes them from the sleep queue and updates the sleep tail array, if needed. Clears kt_wchan and its sleeping time.
If threads were TSSLEEP and not waiting for a beta semaphore, real_wakeup() assumes they were not on a run queue and calls force_run() to force the thread into a TSRUN state. Otherwise, if threads were swapped out (TSRUN && !SLOAD), real_wakeup() takes steps to get them swapped in.
If the thread is on the ICS, attributes this time to the thread being awakened.
Starts a new timing interval, attributing the previous one to the thread being awakened.
Restores the spl level in the UP case; releases the sleep queue lock as needed in the MP case.

force_run()
The force_run subroutine marks a thread TSRUN, asserts that the thread is in memory (SLOAD), and puts the thread on a run queue with setrq(). If its priority is stronger than that of the running thread, it forces a context switch: it sets the processor's wakeup flag and notifies the thread's processor (kt_spu) with the mpsched_set() routine. Otherwise, force_run() improves the swapper's priority if needed, sets wantin, and wakes up the swapper.

Process Termination
When a process finishes executing, HP-UX terminates it using the exit system call. Circumstances might require a process to synchronize its execution with a child process. This is done with the wait system call, which has several related routines. During the exit system call, a process enters the zombie state and must dispose of child processes. Releasing process and thread structures no longer needed by the exiting process or thread is handled by three routines -- freeproc(), freethread(), and kissofdeath(). This section will describe each process-termination routine in turn.

The exit System Call


exit() may be called by a process upon completion, or the kernel may have made the call on behalf of the process due to a problem. If the parent process of the calling process is executing a wait(), wait3(), or waitpid(), it is notified of the calling process's termination. If the parent of the calling process is not executing a wait(), wait3(), or waitpid(), and does not have the SIGCLD (death of a child) signal set to SIG_IGN (ignore signal), the calling process is transformed into a zombie process. The parent process ID is set to 1 for all of the calling process's existing child processes and zombie processes. This means that process 1 (init) inherits each of the child processes.
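
Inherited (orphaned) processes are easy to spot because their parent PID becomes 1. An illustrative command line:

# List a few processes whose PPID is 1; column 3 of ps -ef output is the PPID.
ps -ef | awk '$3 == 1 {print $2, $NF}' | head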

Process Management Structures


The process management system contains the kernel's scheduling subsystem and interprocess communication (IPC) subsystem. The process management system interacts with the memory management system to make use of virtual memory space. The process control system interacts with the file system when reading files into memory before executing them. Processes communicate with other processes via shared memory or system calls. Communication between processes (IPC) includes asynchronous signaling of events and synchronous transmission of messages between processes. System calls are requests by a process for some service from the kernel, such as I/O, process coordination, system status, and data exchange. The effort of coordinating the aspects of a process in and out of execution is handled by a complex of process management structures in the kernel. Every process has an entry in a kernel process table and a uarea structure, which contains private data such as control and status information. The context of a process is defined by all the unique elements identifying it -- the contents of its user and kernel stacks, values of its registers, data structures, and variables -- and is tracked in the process management structures. Process management code is divided into external interface and internal implementation parts: proc_iface.h defines the interface and contains the utility and access functions, external interface types, and utility macros; proc_private.h defines the implementation and contains internal functions, types, and macros. Kernel threads code is similarly organized into kthread_iface.h and kthread_private.h.

Process structure, virtual layout overview

Principal structures of process management

Structure           Purpose
proc table          Allocated at boot time; remains resident in memory (non-swappable). For every process, contains an entry with the process's status, signal, and size information, as well as per-process data that is shared by the kernel threads.
kthread structure   One of two structures representing the kernel thread (the other is the user structure). Contains the scheduling, priority, state, and CPU usage information of a kernel thread. Remains resident in memory.
vas                 Contains all the information about a process's virtual space. It is dynamically allocated as needed and is memory resident.
pregion             Contains process and thread information about use of virtual address space for text, data, stack, and shared memory, including page count, protections, and starting addresses of each.
uarea               User structure; contains the per-thread data that is swappable.


proc Table
The proc table comprises identifying and functional information about every individual process. Each active process has a proc table entry, which includes information on process identification, process threads, process state, process priority and process signal handling. The table resides in memory and may not be swapped, as it must be accessible by the kernel at all times. Definitions for the proc table are found in the proc_private.h header file.

Principal fields in the proc structure

Type of Field            Name and Purpose
Process identification   Process ID (p_pid). Parent process ID (p_ppid). Real user ID (p_uid), used to direct tty signals. Process group ID (p_pgrp). Pointer to the pgroup structure (*p_pgrp_p). Maximum number of open files allowed (p_max). Pointer to the region containing the uarea (p_upreg).
threads                  Values for first and subsequent threads (p_created_threads). Pointer to first and last thread in the process (p_firstthreadp, p_lastthreadp). Number of live threads in the process, excluding zombies (p_livethreads). List of cached threads (*p_cached_threads).
process state            Current process state (p_stat). Priority (p_pri). Per-process flags (p_flag).
process signaling        Signals pending on the process (p_sig). Active list of pending signals (*p_ksiactive). Signals being ignored (p_sigignore). Signals being caught by user (p_sigcatch). Number of signals recognized by process (p_nsig).
Locking information      Thread lock for all threads (*thread_lock). Per-process lock (*p_lock).

What are Kernel Threads?


A process is a representation of an entire running program. By comparison, a kernel thread is a fraction of that program. Like a process, a thread is a sequence of instructions being executed in a program. Kernel threads exist within the context of a process and provide the operating system the means to address and execute smaller segments of the process. They also enable programs to take advantage of capabilities provided by the hardware for concurrent and parallel processing. The concept of threads is interpreted numerous ways, but to quote a definitive source on the HP-UX implementation (S.J. Norton and M.D. DiPasquale, ThreadTime: Multithreaded Programming Guide, Upper Saddle River, NJ: Prentice Hall PTR, Hewlett-Packard Professional Books, 1997, p. 2):

A thread is "an independent flow of control within the process", composed of a [process's register] context, program counter, and a sequence of instructions to execute. An independent flow of control is an execution path through the program code in a process. The register context and program counter contain values that indicate the current state of program execution. The sequence of instructions to execute is the actual program code.
Further, threads are:
A programming paradigm and associated set of interfaces allowing applications to be broken up into logically distinct tasks that, when supported by hardware, can be run in parallel.
Multiple, independent, executable entities within a process, all sharing the process' address space, yet owning unique resources within the process. Each thread can be scheduled, synchronized, and prioritized, and can send and receive signals.
Threads share many of the resources of a process, eliminating much of the overhead involved during creation, termination, and synchronization. A thread's "management facilities" (register context et al) are used to maintain the thread's "state" information throughout its lifetime. State information monitors the condition of an entity (like a thread or process); it provides a snap-shot of an entity's current condition. For example, when a thread context switch takes place, the newly scheduled thread's register information tells the processor where the thread left off in its execution. More specifically, a thread's program counter would contain the current instruction to be executed upon start up.
As of release 10.30, HP-UX has kernel threads, which change the role of processes. A process is now just a logical container used to group related threads of an application. Each process contains at least one thread. This single (initial) thread is created automatically by the system when the process starts up. An application must explicitly create the additional threads. A process with only one thread is a "single-threaded process." A process with more than one thread is a "multi-threaded process." Currently, the HP-UX kernel manages single-threaded processes as executable entities that can be scheduled to run on a processor (that is, each process contains only one thread). Development of HP-UX is moving toward an operating system that supports multi-threaded processes.

Comparison of Threads and Processes


The following process resources are shared by all threads within a process:
File descriptors, file creation mask
User and group IDs, tty
Root working directory, current working directory
Semaphores, memory, program global variables
Signal actions, message queues, timers

The following thread resources are private to each thread within a process:
User registers
Error number (errno)
Scheduling policy and priority
Processor affinity
Signal mask
Stack
Thread-specific data
Kernel uarea

Like the context of a process, the context of a thread consists of instructions, attributes, user structure with register context, private storage, thread structure, and thread stack. Two kernel data structures -- proc and user -- represent every process in a process-based kernel. (The proc structure is non-swappable and user is swappable.) In addition, each process has a kernel stack allocated with the user structure in the uarea. A threads-based kernel also uses a proc and a user structure. Like the proc structure of the process-based kernel, the threads-based proc structure remains memory resident and contains per-process data shared by all the kernel threads within the process. Each thread shares its host process' address space for access to resources owned or used by the process (such as a process' pointers into the file descriptor table). Head and tail pointers to a process' thread list are included in the proc structure. Each thread manages its own kernel resources with private data structures to maintain state information and a unique counter. A thread is represented by a kthread structure (always memory resident), a user structure (swappable), and a separate kernel stack for each kernel thread. Every kthread structure contains a pointer to its associated proc structure and a pointer to the next thread within the same process. All the active threads in the system are linked together on the active threads list. Like a process, a thread has a kind of life cycle based on the execution of a program or script. Through the course of time, threads like processes are created, run, sleep, and are terminated.

User and Kernel Mode


A kernel thread, like a process, operates in user and kernel modes, and through the course of its lifetime switches between the stacks maintained in each mode. The stacks for each mode accumulate information such as variables, addresses, and buffer counts, and it is through these stacks that the thread executes instructions and switches modes. Certain kinds of instructions trigger mode changes. For example, when a program invokes a system call, the system call stub code passes the system call number through a gateway page that adjusts privilege bits to switch to kernel mode. When a thread switches mode to the kernel, it executes kernel code and uses the kernel stack.

Thread's Life Cycle


Like the process, the thread can be understood in terms of its "life cycle":
1. A process is created via a call to fork() or vfork(); the fork1() routine sets up the process's pid (process id) and tid (thread id). The process and its thread are linked to the active list. The thread is given a creation state flag of TSIDL.
2. fork1() calls newproc() to create the thread and process, and to set up the pointers to the parent. newproc() calls procdup() to create a duplicate copy of the parent and allocate the uarea for the new child process. The new child thread is flagged runnable and given a flag of TSRUN. Once the thread has this flag, it is placed in the run queue.
3. The kernel schedules the thread to run; its state becomes TSRUNPROC (running). While in this state, the thread is given the resources it requests. This continues until a clock interrupt occurs, or the thread relinquishes its time to wait for a requested resource, or the thread is preempted by another (higher priority) thread. If this occurs, the thread's context is switched out.


4. A thread is switched out if it must wait for a requested resource. This causes the thread to go into a state of TSLEEP. The thread sleeps until its requested resource returns and makes it eligible to run again. During the thread's TSLEEP state, the kernel calls hardclock() every clock tick (10 ms) to charge the currently running thread with CPU usage. After 4 clock ticks (40 ms), hardclock() calls setpri() to adjust the thread's user priority. The thread is given this value on the next context switch. After 10 clock ticks (100 ms), a context switch occurs. The next thread to run will be the thread with the highest priority in a state of TSRUN. For the remaining threads in TSRUN state, schedcpu() is called after 100 clock ticks (1 second). schedcpu() adjusts all thread priorities at this time.
5. Once a thread acquires the requested resource, it calls the wakeup() routine and again changes states from TSLEEP to TSRUN. This makes the thread eligible to run again.
6. On the next context switch the thread is allowed to run, provided it is the next eligible candidate. When allowed to run, the thread state changes again to TSRUNPROC.
7. Once the thread completes its task it calls exit(). It releases all resources and transfers to the TSZOMB state. Once all resources are released, the thread and the process entries are released.
8. If the thread is being traced, it enters the TSSTOP state.
9. Once the thread is resumed, it transfers from TSSTOP to TSRUN.

Multi-Threading
When a task has two or more semi-independent subtasks, multiple threading can increase throughput, give better response time, speed operations, improve program structure, use fewer system resources, and make more efficient use of multiprocessors. With multi-threading, a process has many threads of control. Note, order of execution is still important! The following terminology will be useful to understand multi-threading:

User threads -- Handled in user space and controlled using the threads APIs provided in the threads library. Also referred to as user-level or application-level threads.
Kernel threads -- Handled in kernel space and created by the thread functions in the threads library. Kernel threads are kernel-schedulable entities visible to the operating system.
Lightweight processes (LWPs) -- Threads in the kernel that execute kernel code and system calls.
Bound threads -- Threads that are permanently bound to LWPs. A bound thread is a user thread bound directly to a kernel thread. Both a user thread and a kernel-scheduled entity are created when a bound thread is created.
Unbound threads -- Threads that attach and detach from among the LWP pool. An unbound thread is a user thread that can execute on top of any available LWP. Both bound and unbound threads have their advantages and disadvantages, depending entirely on the application that uses them.
Concurrency -- At least two threads are in progress at the same time.
Parallelism -- At least two threads are executing simultaneously.

Kernel Thread Structure


Each process has an entry in the proc table; this information is shared by all kernel threads within the process. One kernel thread structure (kthread) is allocated per active thread. The kthread structure is not swappable. It contains all thread-specific data needed while the thread is swapped out, including process ID, pointer to the process address space, file descriptors, current directory, UID, and GID. Other per-thread data (in user.h) is swapped with the thread. Information shared by all threads within a process is stored in the proc structure, rather than the kthread structure. The kthread structure contains a pointer to its associated proc structure. (In a multi-threaded environment the kthread would point to the other threads that make up the process, controlled by a threads listing maintained in the proc table.) In a threads-based kernel, the run and sleep queues consist of kthreads instead of processes. Each kthread contains forward and backward pointers for traversing these queues. All schedule-related attributes, such as priority and states, are kept at the threads level. Definitions for the kernel threads structure are found in the kthread_private.h header file and include general information, scheduling information, CPU affinity information, state and flag information, and signal information.

Principal entries in kernel thread structure (struct kthread)

*kt_link, *kt_rlink          Pointers to forward run/sleep queue link and backward run queue link
*kt_procp                    Pointer to proc structure
kt_fandx, kt_pandx           Free active and kthread structure indices
kt_nextp, kt_prevp           Other threads in the same process
kt_flag, kt_flag2            Per-thread flags
kt_cntxt_flags               Thread context flags
kt_fractioncpu               Fraction of cpu during recent p_deactime
kt_wchan                     Event thread is sleeping on
*kt_upreg                    Pointer to the pregion containing the uarea
kt_deactime                  Seconds since last deact or react
kt_sleeptime                 Seconds since last sleep or wakeup
kt_usrpri                    User priority (based on kt_cpu and p_nice)
kt_pri                       Priority (lower numbers are stronger)
kt_cpu                       Decaying cpu usage for scheduling
kt_stat                      Current thread state
kt_cursig                    Number of current pending signal, if any
kt_spu                       SPU number to which thread is assigned
kt_spu_wanted                Preference to desired SPU
kt_spu_group                 SPU group to which thread is associated
kt_spu_mandatory; kt_sleep_type   Assignment as to whether SPU is mandatory or advisory; directive to wake up all or one SPU
kt_sync_flag                 Reader synchronization flags
kt_interruptible             Is the thread interruptible?
kt_wake_suspend              Is a resource waiting for the thread to suspend?
kt_active                    Is the thread alive?
kt_halted                    Is the thread halted cleanly?
kt_tid                       Unique thread ID
kt_user_suspcnt, kt_user_stopcnt  User-initiated suspend and job-control stop counts
kt_suspendcnt                Suspend count
*kt_krusagep                 Pointer to kernel resource usages
kt_usertime; kt_systemtime; kt_interrupttime   Machine cycles spent in user mode, system mode and handling interrupts
kt_sig                       Signals pending to the thread
kt_sigmask                   Current signal mask
kt_schedpolicy               Scheduling policy for the thread
kt_ticksleft                 Round-robin clock ticks left
*kt_timers                   Pointer to thread's timer structures
*kt_slink                    Pointer to linked list of sleeping threads
*kt_sema                     Head of per-thread alpha semaphore list
*kt_msem_info                Pointer to msemaphore info structure
*kt_chanq_infop              Pointer to channel queue info structure
kt_dil_signal                Signal to use for DIL interrupts
*kt_cred                     Pointer to user credentials (1)
*kt_cdir, *kt_rdir           Current and root directories of current thread, as shown in struct vnode
*kt_fp                       Current file pointer to struct file

Role of the vas structure


Every process has a proc entry containing a pointer (p_vas) to the process's virtual address space. The vas maintains a doubly linked list of pregions that belong to a given process and thread. The vas is always memory resident and provides information based on the process's virtual address space.

Note: Do not confuse the vas structure with virtual address space (VAS) in memory. The vas structure is a few bytes; the VAS is 4 gigabytes. The following table (derived from vas.h) shows the principal entries in struct vas.

Entries in vas structure (struct vas)

va_ll                        Doubly linked list of pregions
va_refcnt                    Number of pointers to the vas
va_rss, va_prss, va_dprss    Cached approximation of shared and private resident set size, and private RSS in memory and on swap
*va_proc                     Pointer to existing process in struct proc
va_flags                     Various flags (itemized after this table)
va_wcount                    Number of writable memory-mapped files sharing pseudo-vas
va_vaslock                   Field in struct rw_lock that controls access to vas
*va_cred                     Pointer to process credentials in struct ucred
va_hdl                       vas hardware-dependent information
va_ki_vss                    Total virtual memory
va_ki_flag                   Indication of whether vss has changed
va_ucount                    Total virtual memory of user space

The following definitions correspond to va_flags:

VA_HOLES          vas might have holes within pregions
VA_IOMAP          IOMAP pregion within the vas
VA_WRTEXT         Writable text
VA_PSEUDO         Pseudo vas, not a process vas
VA_MULTITHREADED  vas connected to a multithreaded process
VA_MCL_FUTURE     New pages that must be mlocked
VA_Q2SHARED       Quadrant 2 used for shared data

Pregion Structure
The pregion represents an active part of the process's Virtual Address Space (VAS). This may consist of the text, data, stack, and shared memory. A pregion is memory resident and dynamically allocated as needed. Each process has a number of pregions that describe the regions attached to the process. In this module we will only discuss to the pregion level. The HP-UX Memory Management white paper provides more information about regions.

pregion types

Type          Definition
PT_UNUSED     Unused pregion
PT_UAREA      User area
PT_TEXT       Text region
PT_DATA       Data region
PT_STACK      Stack region
PT_SHMEM      Shared memory region
PT_NULLDREF   Null pointer dereference
PT_SIGSTACK   Signal stack
PT_IO         I/O region

These pregion types are defined based on the value of p_type within the pregion structure and can be useful to determine characteristics of a given process. This may be accessed via the kt_upreg pointer in the thread table. A process has a minimum of four defined pregions, under normal conditions. The total number of pregion types defined may be identified with the definition PT_NTYPES.

Entries comprising a pregion

Type                    Purpose
Structure information   Pointers to next and previous pregions. Pointer and offset into the region. Virtual space and offset for region. Number of pages mapped by the pregion. Pointer to the VAS.
Flags and type          Referenced by p_flags and p_type.
Scheduling information  Remaining pages to age (p_ageremain). Indices of next scans for vhand's age and steal hands (p_agescan, p_steadscan). Best nice value for all processes sharing the region used by the pregion (p_bestnice). Sleep address for deactivation (p_deactsleep).
Thread information      Value to identify thread, for uarea pregion (p_tid).

Traversing pregion Skip List


Pregion linked lists can get quite large if a process is using many discrete memory-mapped pregions. When this happens, the kernel spends a lot of time walking the pregion list. To avoid the list being walked linearly, we use skip lists,2 which enable HP-UX to use four forward links instead of one. These are found in the beginning of the vas and pregion structures, in the p_ll element.

User Structures (uarea)


The user area is a per-process structure containing data not needed in core when a process is swapped out.


The threads of a process point to the pregion containing the process's user structure, which consists of the uarea and kernel stack. The user structure contains the information necessary to the execution of a system call by a thread. The kernel thread's uarea is special in that it resides in the same address space as the process data, heap, private MMFs, and user stack. In a multi-threaded environment, each kernel thread is given a separate space for its uarea. Each thread has a separate kernel stack. Addressing the uarea is analogous to the prior process-based kernel structure. A kernel thread references its own uarea through struct user. However, you cannot index directly into the user structure as is possible into the proc table. The only way into the uarea is through the kt_upreg pointer in the thread table.

Principal entries in the uarea (struct user)

Type                      Purpose
user structure pointers   Pointers to proc and thread structures (u_procp, u_kthreadp). Pointers to saved state and most recent savestate (u_sstatep, u_pfaultssp).
system call fields        Arguments to current system call (u_arg[]). Pointer to the arglist (u_ap). Return error code (u_error). System call return values (r_val(n)).
signal management         Signals to take on sigstack (u_sigonstack). Saved mask from before sigpause (u_oldmask). Code to trap (u_code).

The user credentials pointer (for uid, gid, etc) has been moved from the uarea and is now accessed through the p_cred() accessor for the proc structure and the kt_cred()accessor for the kthread structure. See comments under the kt_cred() field in kthread.h for details governing usage.

Process Control Block (pcb)

Note: HP-UX now handles context switching on a per-thread basis.
A process control block (pcb) is maintained in the user structure of each kernel thread as a repository for thread scheduling information. The pcb contains all the register states of a kernel thread that are saved or restored during a context switch from one threads environment to another. The context of a current running thread is saved in its associated uarea pcb when a call to swtch() is made. The save() routine saves the current thread state in the pcb on the switch out. The resume() routine maps the user-area of the newly selected thread and restores the process registers from the pcb. When we return from resume(), the selected thread becomes the currently running thread and its uarea is automatically mapped into the virtual memory address of the system's global uarea.

The register context includes:
General-purpose registers
Space registers
Control registers
Instruction Address Queues (Program Counter)
Processor Status Word (PSW)
Floating point registers

Contents of the Process Control Block (pcb)

Context element: General registers, pcb_r1 --> pcb_r31 [GR0 - GR31]
Purpose: Thirty-two general registers that provide the central resource for all computation. These are available for programs at all privilege levels.

Context element: Space registers, pcb_sr0 --> pcb_sr7 [SR0 - SR7]
Purpose: Eight space ID registers for virtual addressing.

Context element: Control registers, pcb_cr0 --> pcb_cr31 [CR0, CR8 - CR31]
Purpose: Twenty-five control registers that contain system state information.

Context element: Program counters (pcb_pc)
Purpose: Two registers that hold the virtual address of the current and next instruction to be executed. The Instruction Address Offset Queue (IAOQ) is 32 bits long; the upper 30 bits contain the word offset of the instruction and the lower 2 bits maintain the privilege level of the corresponding instruction. The Instruction Address Space Queue (IASQ) is 32 bits long in a PA-RISC 2.0 (64-bit) system or 16 bits in a PA-RISC 1.x (32-bit) system and contains the space ID for instructions.

Context element: Processor Status Word (pcb_psw)
Purpose: Contains the machine-level status that relates to a process as it does operations and computations.

Context element: Floating point registers, pcb_fr1 --> pcb_fr32
Purpose: Maintains the floating point status for the process.

Footnotes
1. UID, GID, and other credentials are pointed to as a snapshot of the process-wide cred structures when the thread enters the kernel. These are only valid when a thread operates in kernel mode. Permanent changes to the cred structure (e.g., setuid()) should be made to the cred structure pointed to by the proc structure element p_cred.
2. Skip lists were developed by William Pugh of the University of Maryland. An article he wrote for CACM can be found at ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.ps.Z.

PRM - Process Resource Manager


Process Resource Manager (PRM) is a resource management tool used to control the amount of resources that processes use during peak system load (at 100% CPU, 100% memory, or 100% disk bandwidth utilization). PRM can guarantee a minimum allocation of system resources available to a group of processes through the use of PRM groups. A PRM group is a collection of users and applications that are joined together and assigned certain amounts of CPU, memory, and disk bandwidth. The two types of PRM groups are FSS PRM groups and PSET PRM groups. An FSS PRM group is the traditional PRM group, whose CPU entitlement is specified in shares. This group uses the Fair Share Scheduler (FSS) in the HP-UX kernel within the system's default processor set (PSET). A PSET PRM group is a PRM group whose CPU entitlement is specified by assigning it a subset of the system's processors (PSET). Processes in a PSET have equal access to CPU cycles on their assigned CPUs through the HP-UX standard scheduler.

PRM has four managers:

CPU
Ensures that each PRM group is granted at least its allocation of CPU. Optionally for FSS PRM groups, this resource manager ensures no more than its capped amount of CPU. For PSET PRM groups, processes are capped on CPU usage by the number of processors assigned to the group.

Memory
Ensures that each PRM group is granted at least its share, but (optionally) no more than its capped amount of memory. Additionally, under prm2d memory management, you can specify that memory shares be isolated so that a group's assigned memory shares cannot be loaned out to, or borrowed from, other groups.

Disk
Ensures that each FSS PRM group is granted at least its share of disk bandwidth. PRM disk bandwidth management can only control disks that are managed by HP's Logical Volume Manager (LVM) or by VERITAS Volume Manager (VxVM). PSET PRM groups are treated as part of PRM_SYS (PRMID 0) for disk bandwidth purposes.

Application
Ensures that specified applications and their child processes run in the appropriate PRM groups.

The managers control resources, user processes, and applications based on records in the configuration. Each manager has its own record type. The most important records are PRM group/CPU records, because all other records must reference these defined PRM groups. The various records are described below.

Group / CPU
Specifies a PRM group's name and its CPU allocation. The two types of PRM group records are FSS PRM group records and PSET PRM group records. An FSS PRM group is the traditional PRM group, whose CPU entitlement is specified in shares. This group uses the Fair Share Scheduler (FSS) in the HP-UX kernel within the system's default processor set (PSET). A PSET PRM group is a PRM group whose CPU entitlement is specified by assigning it a subset of the system's processors (PSET). Processes in a PSET have equal access to CPU cycles on their assigned CPUs through the HP-UX standard scheduler.

Memory
Specifies a PRM group's memory shares, and its optional cap. In addition, the prm2d memory manager allows you to specify memory isolation field values. This allows you to isolate a group's memory shares so that memory is not loaned out to or borrowed from other groups.

Disk Bandwidth
Specifies an FSS PRM group's disk bandwidth shares for a given logical volume group (LVM) or disk group (VxVM). You cannot specify disk bandwidth records for PSET PRM groups. PSET PRM groups are treated as part of PRM_SYS (PRMID 0) for disk bandwidth purposes.

Application
Specifies an application (either explicitly or by regular expression) and the PRM group in which the application should run. Optionally, it specifies alternate names the application can take at execution. (Alternate names are most common for complex programs such as database programs that launch many processes and rename them.)

User
Specifies a user or a collection of users (through a netgroup) and assigns the user or netgroup to an initial PRM group. Optionally, it specifies alternate PRM groups. A user or netgroup member then has permissions to use these PRM groups with the prmmove and prmrun commands.

PRM Resource Management


PRM places limits on resource use based on values specified in a configuration file. These values always indicate a minimum amount and in some cases can indicate a maximum amount of a resource.

Note: Do not use PRM with gang scheduling, which is the concurrent scheduling of multiple threads from a single process as a group (gang).

PRM groups

PRM groups are integral to how PRM works. These groups are assigned per process and are independent of any other groups, such as user groups that are defined in /etc/group. You assign applications and users to PRM groups. PRM then manages each group's CPU, disk bandwidth, and real memory resources according to the current configuration. If multiple users or applications within a PRM group are competing for resources, standard HP-UX resource management determines the resource allocation. There are two types of PRM groups:

* FSS PRM groups are the traditional and most commonly used type. These groups have CPU, memory, and disk bandwidth resources allocated to them using the shares model. FSS PRM groups use the Fair Share Scheduler in the HP-UX kernel within the system's default processor set (PSET).
* PSET PRM groups are the second type. In PSET PRM groups, the CPU entitlement is specified by assigning them a subset of the system's processors--instead of using the shares model. The memory allocation is still specified in shares; however, PSET PRM groups are treated as part of PRM_SYS (PRMID 0) for disk bandwidth purposes. Processes in a PSET PRM group have equal access to CPU cycles through the HP-UX time-share scheduler.

Because resource management is performed on a group level, individual users or applications may not get the resources required in a group consisting of many users or applications. In such cases, reduce the size of the group or create a group specifically for the resource-intensive user or application.

Resource allocation

Resources are allocated to PRM groups differently depending on the resource and the type of PRM group. You allocate CPU resources to
PSET PRM groups using processor sets. All resources for FSS PRM groups and real memory resources for PSET PRM groups are allocated in shares. You cannot allocate disk bandwidth resources to PSET PRM groups.

What are processor sets?

Processor sets allow CPUs on your system to be grouped together in a set by the system administrator and assigned to a PSET PRM group. Once these processors are assigned to a PSET PRM group, they are reserved for use by the applications and users assigned to that group. Using processor sets allows the system administrator to isolate applications and users that are CPU-intensive, or that need dedicated on-demand CPU resources.

How processor sets work

Processor sets are a way of allocating dedicated CPU resources to designated applications and users. At system initialization time, a default PSET is created. This default PSET initially consists of all of your system's processors. All FSS PRM group CPU allocation can only occur in the default PSET. The system administrator can create additional PSET PRM groups and assign processors, applications, and users to those groups. Once processors are assigned to a PSET PRM group, they cannot be used by another group until a new configuration is loaded. Applications and users that are assigned to a PSET PRM group have dedicated CPU cycles from the CPUs assigned to the group. Competition for CPU cycles within the processor set is handled using the HP-UX time-share scheduler.

The processor sets example below shows a 16-processor system that has four FSS PRM groups defined within the default PSET, and two additional system-administrator-defined PSET PRM groups. The default PSET contains eight processors, one of which is Processor 0. This is the only processor that is required to be in the default PSET. The remaining processors in the default PSET are used by the Dev, Appl, and OTHERS FSS PRM groups. There are two databases on this system that each have four processors assigned to them. Unlike the processors in the default PSET, the processors in the database PSET PRM groups are dedicated CPUs using the HP-UX time-share scheduler. This creates an isolated area for the databases.

Processor sets example

  PRM Group Type            Group Name                   CPU ID                     Use
  Default, FSS PRM groups   PRM_SYS, OTHERS, Dev, Appl   0, 1, 4, 5, 8, 9, 12, 13   System processes, general users and developers
  PSET PRM group            SalesDB                      2, 3, 6, 7                 Sales database
  PSET PRM group            FinanceDB                    10, 11, 14, 15             Financial database

What are shares?

Resource shares are the minimum amounts of a resource assigned to each PRM group in a PRM configuration file (default name /etc/prmconf). For FSS PRM groups, you can assign CPU, disk bandwidth, and real memory shares, although only CPU share assignments are required. For PSET PRM groups, you can only assign real memory in shares. In addition to minimum amounts, you can specify maximum amounts of some resources that PRM groups can use. For FSS PRM groups, you can specify maximum amounts of CPU and memory. For PSET PRM groups, you can only assign maximum amounts of memory. These maximum amounts, known as caps, are not available for disk bandwidth for either type of PRM group.

How shares work

A share is a guaranteed minimum when the system is at peak load. When the system is not at peak load, PRM shares are not enforced, unless CPU capping is enabled, in which case CPU shares are always enforced. Valid values for shares are integers from one to MAXINT (the maximum integer value allowed for the system). PRM calculates the sum of the shares, then allocates a percentage of the system resource to each PRM group based on its shares relative to the sum. The table Converting shares to percentages shows how shares determine CPU percentage. The total number of shares assigned is four. Divide each group's number of shares by four to find that group's CPU percentage. This CPU percentage applies only to those CPUs available to FSS PRM groups. If PSET PRM groups are configured, the processors assigned to them are no longer available to the FSS PRM groups. In this case, the CPU percentage would be based on a reduced number of CPUs.


Converting shares to percentages

  PRM group   CPU shares   CPU %
  GroupA      1            1/4 = 25.00%
  GroupB      2            2/4 = 50.00%
  OTHERS      1            1/4 = 25.00%

Shares allow you to add a PRM group to, or remove one from, a configuration, or alter the distribution of resources in an existing configuration, concentrating only on the relative proportion of resources and not the total sum. For example, assume we add another group to the configuration shown in Converting shares to percentages, giving us the new configuration in Altered configuration. To give the new group 50% of the available CPU, we assign it four shares, the total number of shares in the old configuration, thereby doubling the total number of shares in the new configuration.

Altered configuration

  PRM group   CPU shares   CPU percentage determined by PRM
  GroupA      1            12.50%
  GroupB      2            25.00%
  GroupC      4            50.00%
  OTHERS      1            12.50%
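The share-to-percentage arithmetic in these tables is simple enough to script. The following is a minimal sketch using awk; the group names and share counts are just the example values from the Altered configuration table, not the output of any PRM command:

# Feed "group shares" pairs to awk and print each group's percentage of the total
echo "GroupA 1
GroupB 2
GroupC 4
OTHERS 1" | awk '{ name[NR] = $1; share[NR] = $2; total += $2 }
END { for (i = 1; i <= NR; i++)
          printf "%-8s %3d shares  %6.2f%%\n", name[i], share[i], 100 * share[i] / total }'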

Hierarchical PRM groups

In addition to the flat divisions of resources presented so far, you can nest FSS PRM groups inside one another--forming a hierarchy of groups similar to a directory structure. Hierarchies allow you to divide groups and allocate resources more intuitively than you can with flat allocations. Note that PSET PRM groups cannot be part of a hierarchy.

When forming a hierarchy, any group that contains other groups is known as a parent group. Naturally, the groups it contains are known as child groups. All the child groups of the same parent group are called sibling groups. Any group that does not have child groups is called a leaf group. There is also an implied parent group of all groups, where the implied parent has 100% of the resource to distribute. The figure Parent, child, sibling, and leaf PRM groups illustrates a configuration with hierarchical groups, indicating the parent, child, sibling, and leaf PRM groups.


Figure: Parent, child, sibling, and leaf PRM groups

In Parent, child, sibling, and leaf PRM groups, the parent groups are the Development and Development/Compilers groups. There is also an implied parent group to the Finance, Development, and OTHERS groups. The Development group has the children Development/Compilers, Development/Debuggers, and Development/Profilers. The Compilers group is broken down further with two children of its own: Development/Compilers/C and Development/Compilers/Fortran. These two groups are also known as sibling groups. Leaf groups are groups that have no children. In the illustration above, leaf groups include the Finance, Development/Debuggers, and OTHERS groups, among others.

You specify resource shares for each group in a hierarchy. If a group has child groups, the parent group's resource shares are distributed to the children based on the shares they are assigned. If a group has no children, it uses the shares itself. More explicitly, the percentage that a group's shares equate to is determined as follows:

1. Start at the top level in the hierarchy. Consider these groups as sibling groups with an implied parent. This implied parent has 100% of the CPU to distribute. (Shares work the same way for CPU, memory and disk bandwidth.)
2. Add all the CPU shares of the first level of sibling groups together into a variable, TOTAL.
3. Each sibling group receives a percentage of CPU equal to its number of shares divided by TOTAL.
4. If the sibling group has no child groups, it uses the CPU itself.
5. If the sibling group does have child groups, the CPU is distributed further based on the shares assigned to the child groups. Calculate the percentages of the resource they receive by repeating items 2 through 5.

Consider the example in Hierarchical PRM groups--top level, which shows the PRM groups at the top level.


Hierarchical PRM groups--top level

  Group         CPU shares   Percent of system's available CPU
  Finance       3            30.00%
  Development   5            50.00%
  OTHERS        2            20.00%

Hierarchical PRM groups--Development's child groups shows how the CPU percentages for the child groups of the Development group are determined from their shares. It also shows how the child groups of the Development/Compilers group further divide the CPU.

Hierarchical PRM groups--Development's child groups

  Group                           CPU shares   Percent of system's available CPU
  Development                     5            5/10 = 50.00%, passed to child groups
  Development/Debuggers           1            1/4 of its parent's CPU (50.00%) = 12.50% of system CPU
  Development/Profilers           1            1/4 of its parent's CPU (50.00%) = 12.50% of system CPU
  Development/Compilers           2            2/4 of its parent's CPU (50.00%) = 25.00%, passed to child groups
  Development/Compilers/C         4            4/8 of its parent's CPU (25.00%) = 12.50% of system CPU
  Development/Compilers/Fortran   4            4/8 of its parent's CPU (25.00%) = 12.50% of system CPU
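A nested entitlement can be checked by hand by multiplying the share ratios down the hierarchy. A sketch for Development/Compilers/C, using the share values from the two tables above:

# Development is 5 of 10 top-level shares, Compilers is 2 of 4 of Development's
# children, and C is 4 of 8 of Compilers' children:
echo "scale=2; 100 * 5 * 2 * 4 / (10 * 4 * 8)" | bc
# prints 12.50, matching the 12.50% of system CPU shown in the table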

There is no requirement that the sum of the shares for a set of sibling groups be less than their parent's shares. For example, Hierarchical PRM groups--Development's child groups shows the Development/Compilers group has 2 shares, while the sum of the shares for its child groups is 8. You can assign any group any number of shares between one and MAXINT (the system's maximum integer value), setting the proportions between groups as you consider appropriate. The maximum number of leaf nodes is 64, which is the maximum number of PRM groups you can have.

NOTE : Application records must assign applications only to leaf groups--not parent groups. Similarly, user records must assign users only to leaf groups. In group/CPU records, each PRM group--regardless of where it is in the hierarchy--must be assigned resource shares.

Hierarchies offer a number of advantages, as explained below:

* Facilitates less intrusive changes--Similar to how shares in a flat configuration allow you to alter one record while leaving all the others alone, hierarchies enable you to alter the hierarchy in one area, leaving the rest unchanged.
* Enables you to use a configuration template--Create a configuration file that provides each department access to the system, then distribute the configuration and assign resources giving preference to certain departments on different machines.
* Allows continued use of percentages--If you prefer using percentages instead of shares, you can assign each level in the hierarchy only 100 resource shares.
* Facilitates giving equal access--If you want each PRM group to have equal access to a resource, simply assign each group the same number of shares. When you add a group, you do not have to recalculate resources and divide by the new number of groups; just assign the new group the same number of shares as the other groups. Similarly, removing a group does not require a recalculation of resources; just remove the group.
* Allows for more intuitive groups--Hierarchies enable you to place similar items together, such as all databases or a business entity/goal, and assign them resources as a single item.


* Enables making higher-level policy decisions--By placing groups in a hierarchy, you can implement changes in policy or funding at a higher level in a configuration without affecting all elements of the configuration.
* Facilitates system upgrades, capacity planning, and partitioning--If you are moving from a two-CPU system to a four-CPU system, you can reserve the two additional CPUs by adding a place-holder group at the top level in the hierarchy, assigning it shares equal to 50% of the CPU, and enabling capping. This place-holder prevents users from getting a boost in performance from the new CPUs, then being frustrated by poor performance when more applications are added to the system.

The syntax for hierarchical groups is explained in Group/CPU record syntax. By default, the PRM utilities (prmconfig, prmlist, prmmonitor) include only leaf groups in their output. Use the -h option to display information for parent groups as well.
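As a quick check of how a loaded hierarchy is being interpreted, the reporting utilities named above can be run with and without -h. This is only a sketch; apart from -h (which is described in the text), no other options are assumed, and output formats vary by PRM release:

prmlist          # configured groups, users and applications (leaf groups only)
prmlist -h       # the same information, including parent groups in the hierarchy
prmmonitor -h    # current per-group resource usage, including parent groups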

Precision of shares

PRM's calculation of groups' resources is most accurate when the maximum number of shares assigned divided by the minimum number of shares assigned is less than or equal to 100, as shown in the figure When resource percentages are most precise.

Figure: When resource percentages are most precise

For example, Example with large difference in assigned max/min shares shows a situation in which the expected percentage is not achieved due to a large difference in the maximum and minimum shares.

Example with large difference in assigned max/min shares

  PRM group   Shares   Expected percentage   Actual percentage
  GroupA      1        1/425 = 0.24%         0.48%
  GroupB      200      200/425 = 47.06%      46.89%
  GroupC      199      199/425 = 46.82%      46.89%
  OTHERS      25       25/425 = 5.88%        5.74%

How PRM manages CPU


To understand PRM's CPU management, it is useful to know how the standard HP-UX scheduler works. The HP-UX scheduler chooses which process to run based on priority. Except for real-time processes, the system dynamically adjusts the priority of a process based on resource requirements and resources used. In general, when processes are not running, the HP-UX scheduler raises their priorities; and while they are running, their priorities are lowered. The rate at which priority declines during execution is linear. The rate at which priority increases while waiting is exponential, with the rate of increase fastest when the CPU load is low and slowest when the CPU load is high. When a process other than the current process attains a higher priority, the scheduler suspends the current process and starts running the higher priority process. Because the rate at which the priority increases is slowest when CPU load is high, the result is that a process with a heavy demand for
CPU time is penalized by the standard HP-UX scheduler as its CPU use increases.

With PRM you can reverse the effects of the standard scheduler. By placing users with greater demands for CPU in an FSS PRM group with a higher relative number of CPU shares than other groups, you give them a higher priority for CPU time. In a similar manner, you can assign an application to an FSS PRM group with a higher relative number of shares. The application will run in its assigned FSS PRM group, regardless of which user invokes it. This way you can ensure that critical applications have enough CPU resources. You can also isolate applications and users with greater demands for CPU by placing them in a PSET PRM group and assigning the desired number of processors to the group. The applications and users will have dedicated access to the processors in the PSET PRM group, ensuring CPU cycles when needed. This method of isolating applications and users effectively creates a partition on your system.

PRM manages CPU by using the fair share scheduler (FSS) for FSS PRM groups. When the PRM CPU manager is enabled, FSS runs for FSS PRM groups instead of the HP-UX standard scheduler. When PSET PRM groups are configured, FSS still runs for FSS PRM groups, but the standard HP-UX scheduler is used within PSET PRM groups. PRM gives higher-priority FSS PRM groups more opportunities to use CPU time. Free CPU time is available for use by any FSS PRM group and is divided up between FSS PRM groups based on their relative number of CPU shares. As a result, tasks are given CPU time when needed, in proportion to their stated importance, relative to others with a demand. PRM itself has low system overhead.

Example: PRM CPU management

The figure PRM CPU management illustrates PRM's CPU management for two FSS PRM groups. In this example, Group1 has 33 CPU shares, and Group2 has 66 CPU shares. Note that the percentage of CPU referred to may not be total system CPU if PSET PRM groups are configured. The percentage is of CPU available on the processors assigned to the default PSET. If PSET PRM groups are not configured, then the available CPU is the same as the system CPU.

Figure: PRM CPU management

At Time A: Group1 is using 40% of the available CPU, which is more than its share. Group2 is using 15% of the available CPU, which is less than its share. 45% of the available CPU is not used. PRM scheduling is not in effect.

At Time B:
Group1's processes are now using 80% of available CPU time, which consists of all of Group1's shares and an unused portion of Group2's share. Group2 processes continue at a steady 15%. PRM scheduling is not in effect.

Between Time B and Time C: Group2's demands start to increase. With available CPU use approaching 100%, PRM starts to have an effect on CPU allocation. Both groups' CPU use begins moving toward their assigned number of shares. In this case, the increasing demand of Group2 causes Group1 to be pulled toward the 33% mark despite its desire for more CPU. At Time C: CPU use for Group1 and Group2 is limited to the assigned shares.

After Time C: PRM holds each group to its assigned available CPU percentage until total available CPU demand is less than 100%. This gives Group2 a priority for CPU over Group1. In contrast, in the standard HP-UX scheduler, processor time is allocated based upon the assumption that all processes are of equal importance. Assuming there is one process associated with each PRM group, the standard HP-UX scheduler would allocate each process 50% of the available CPU after Time C.

CPU allocation and number of shares assigned

PRM favors processes in FSS PRM groups with a larger number of CPU shares over processes in FSS PRM groups with fewer CPU shares. Processes in FSS PRM groups with a larger number of CPU shares are scheduled to run more often and are given more opportunities to consume CPU time than processes in other FSS PRM groups. This preference implies that a process in an FSS PRM group with a larger number of shares may have better response times with PRM than with the standard HP-UX scheduler. PRM does not prevent processes from using more than their CPU share when the system is at nonpeak load, unless a CPU maximum has been assigned.

Capping CPU use

PRM gives you the option of capping CPU use. When enabled, CPU capping is in effect for all user-configured FSS PRM groups on a system--regardless of CPU load. CPU use can be capped for either all FSS PRM groups or no FSS PRM groups. When CPU usage is capped, each FSS PRM group takes its entire CPU allocation. Thus, no group can obtain more CPU. The FSS PRM group's minimum allocation becomes its maximum allocation. The PRM_SYS group is exempt from capping, however. If it gets CPU time and has no work, the PRM scheduler immediately goes to the next FSS PRM group.

For PSET PRM groups, capping is a result of the number of CPUs assigned to the group. Capping CPU usage can be a good idea when migrating users and applications to a new system. When the system is first introduced, the few users on the system may become accustomed to having all of the machine's resources. However, by setting CPU caps early after the system's introduction, you can simulate the performance of the system under heavier use. Consequently, when the system becomes more heavily used, performance is not noticeably less. For information on capping CPU use, see Specifying PRM groups/controlling CPU use.

How PRM manages CPU for real-time processes

Although PRM is designed to treat processes fairly based upon their assigned shares, PRM does not restrict real-time processes. Real-time processes using either the POSIX.4 real-time scheduler (rtsched) or the HP-UX real-time scheduler (rtprio) keep their assigned priorities because timely scheduling is crucial to their operation. Hence, they are permitted to exceed their group's CPU share and cap. The CPU they use is charged to their groups. Thus, they can prevent other processes in their groups from running.

Multiprocessors and PRM

PRM takes into account architectural differences between multiprocessor (MP) and single-processor systems. In the case of memory management, Hewlett-Packard multiprocessor systems share the same physical address space. Therefore PRM
memory management is the same as on a single-processor system. However, in the case of CPU management, PRM makes accommodations for MP systems. The normal HP-UX scheduling scheme for MP systems keeps the CPU load average at a uniform level across the processors. PRM tries to even the mix of FSS PRM groups on each available processor (those not assigned to PSET PRM groups). This is done by assigning each process in an FSS PRM group to a different processor, stepping round-robin through the available processors. Only processes that can be run or processes that are likely to run soon are actually assigned in this manner. For example, on a two-way MP system, FSS PRM Group1 has two active processes A and B, and FSS PRM Group2 has two active processes C and D. In this example, PSET PRM groups are not configured. PRM assigns process A to the first processor, process B to second processor, process C to the first processor, and finally process D to the second processor--as shown in PRM's process scheduling on MP systems.

If a process is locked down on a particular processor, PRM does not reassign it, but does take it into account when distributing other processes across the processors. PRM manages the CPU only for the processors on a single system; it cannot distribute processes across processors on different systems. As implied above, PRM provides a PRM group its entitlement on a symmetric multiprocessing (SMP) system by granting the group its entitlement on each CPU. If the group does not have at least one process for each CPU, PRM increases the entitlements for the processes to compensate. For example, a PRM group with a 10% entitlement on a 4-CPU system gets 10% of each CPU. If the group is running on only one CPU because it has only one process, the 10% entitlements from the three unused CPUs are given to the group on the CPU where it has the process running. Thus, it gets 40% on that one CPU.

Figure: PRM's process scheduling on MP systems

NOTE: A PRM group may not be able to get its entitlement because it has too few processes. For example, if the PRM group above--with only one single-threaded process--were to have a 50% entitlement for the 4-CPU system, it would never get its entitlement. PRM would give the group its 50% of the CPU where the process is running and its 50% from one other CPU. However, the group cannot get the 50% entitlements from the two remaining CPUs. As a result, the PRM group only gets a 25% entitlement (one CPU out of four).

How PRM manages real memory

Memory management refers to the rules that govern real and virtual memory and allow for sharing system resources by user and system processes. In order to understand how PRM manages real memory, it is useful to understand how PRM interacts with standard HP-UX memory management.

How HP-UX manages memory

The data and instructions of any process (a program in execution) must be available to the CPU by residing in real memory at the time of execution. Real memory is shared by all processes and the kernel. To execute a process, the kernel executes through a per-process virtual address space that has been mapped into real memory. Memory management allows the total size of user processes to exceed real memory by using an approach termed demand-paged virtual memory. Virtual memory enables you to execute a process by bringing into real memory parts of the process only as needed and pushing out parts of a process that have not been recently used.


The system uses a combination of paging and swapping to manage virtual memory. Paging involves writing unreferenced pages from real memory to disk periodically. Swapping takes place if the system is unable to maintain a large enough free pool of memory. In such a case, entire processes are swapped. The pages associated with these processes can be written out by the pager to secondary storage over a period of time. The more real memory a system has available, the more data it can access and the more (or larger) processes it can execute without having to page or cause swapping.

Available memory
A portion of real memory is always reserved for the kernel (/stand/vmunix) and its data structures, which are dynamically allocated. The amount of real memory not reserved for the kernel and its data structures is termed available memory. Available memory is consumed by user processes and also by nonkernel system processes such as network daemons. Because the size of the kernel varies depending on the number of interface cards, users, and values of the tunable parameters, available memory varies from system to system.

For example, Example of available memory on a 1024-Mbyte system shows a system with 1024 Mbytes of physical memory. Approximately 112 Mbytes of that memory is used by the kernel and its data structures, leaving 912 Mbytes of memory available for all processes, including system processes. In this example, 62 Mbytes are used by system processes, leaving 850 Mbytes of memory available for user processes. PRM reserves 11% of the remaining memory to ensure processes in PRM_SYS have immediate access to needed memory. Although you cannot initially allocate this reserve to your PRM groups, it is still available for your PRM groups to borrow from when needed. So, in this example, the prmavail command would show 850 Mbytes of available memory before PRM is configured, and 756 Mbytes of available memory after PRM is configured.

Example of available memory on a 1024-Mbyte system

  Mbyte   Memory type
  1024    Physical memory available on the system
  912     Memory available for all processes
  850     Memory available for user processes
  756     Memory available after PRM is configured
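A quick way to see these numbers on a live system is to compare what PRM reports as available with what the kernel reported at boot. This is a sketch; the exact wording of the dmesg line varies by release:

# Memory PRM considers available for PRM groups (before and after a configuration is loaded)
prmavail
# The kernel's boot-time view of physical, lockable and available memory
dmesg | grep -i physical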

How PRM controls memory usage


PRM memory management allows you to prioritize how available memory is allocated to user and application processes. This control enables you to ensure that critical users and applications have enough real memory to make full use of their CPU time. Processes in the PRM_SYS group (PRMID 0) and the kernel get as much memory as they need. They are not subject to PRM constraints. PRM provides two memory managers:

* /opt/prm/bin/prm0d
* /opt/prm/bin/prm2d

The prm0d manager is the original memory manager. It is the default on HP-UX versions prior to 11i. The prm2d manager is the default as of HP-UX 11i V1.0 (B.11.11). prm2d is the recommended memory manager. When the prm0d memory manager is enabled, available memory continues to be distributed to active processes using the standard HP-UX method. However, when system memory use is at a peak and a PRM group is exceeding its share of memory, the prm0d memory manager suppresses processes in that group. These suppressed processes give memory back to the pool, and therefore more memory is available for use by other PRM groups, which may not be getting their fair share. PRM suppresses a process by stopping it. Once the PRM group's memory use is below its share or memory pressure ceases, the process is re-activated. The prm0d memory manager selects processes for suppression based on the method specified in the memory records in the PRM configuration file. The selection methods are:
* ALL--Suppress all processes in the group.
* LARGEST--Suppress the largest processes in the group, then continue suppressing smaller and smaller processes until the goal is met.

Typically, you might assign the ALL parameter to a PRM group with low priority so that PRM will be more aggressive in suppressing processes within the group. Groups with higher priority would typically be assigned the LARGEST parameter, causing PRM to be more selective in suppressing processes. prm0d stops processes by attaching to the processes like a debugger process would. You can restrict the processes that prm0d can stop as explained in the section Exempting processes from memory control.

NOTE : PRM does not suppress processes using locked memory. For more information, see How PRM manages locked memory.

The prm2d memory manager uses the in-kernel memory feature to partition memory (when a configuration is loaded), with each PRM group getting a partition. A partition includes x Mbytes of memory, where x Mbytes is equivalent to the group's entitled percent of the available memory. Each partition pages separately. When system memory use is not at 100%, a PRM group that does not have its memory use capped or isolated can freely borrow excess memory pages from other PRM groups. If a process requires memory and its memory use is capped, processes in the same PRM group as the original process are forced to page to free up memory. When system memory use is at a peak, any borrowed memory pages are returned to the owning PRM groups. The time involved for the borrowed memory pages to be returned is dependent on the swap rate and the order in which old pages are paged out. If a group is exceeding its memory shares on a system that is under stress, prm2d uses proportional overachievement logic to determine which groups need their import shares reduced. Overachievement for a group is the ratio of memory used to memory entitlement. This value is then compared to the average overachievement of all groups. If a PRM group is overachieving compared to the average, then the import shares for that group are lowered. This allows other groups to start importing the newly available memory. Groups are not allowed to exceed their memory caps with the prm2d memory manager.

Reducing shares under prm2d

If a PRM group's memory share is reduced while the group is using most of its memory pages, the reduction is not immediately visible. The memory must be paged out to the swap device. The time involved for the reduction to take effect is determined by the memory transfer rate (for example, 2 Mbytes/second), and the order in which the old pages are paged out. When changing shares, give them time to take effect before implementing new shares.

Exempting processes from memory control

You can prevent the prm0d memory manager from suppressing (stopping) certain processes. Specify the processes that the PRM memory manager should not suppress by adding their path names (one per line) to the file /opt/prm/exempt. The prm0d memory manager consults the files /opt/prm/shells and /etc/shells to properly identify shell scripts. These interactive shells are not stopped and do not need to be added to the /opt/prm/exempt file. The following processes are exempt:

* Login shells
* PRM commands
* Applications listed in /opt/prm/exempt
* Processes with locked memory
* The kernel
* Processes in the PRM_SYS group (PRMID 0)
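Adding an application to the exemption list is just a one-line append to the file named above. The path used here is hypothetical:

# Keep prm0d from suppressing this (hypothetical) database writer process
echo /opt/mydb/bin/db_writer >> /opt/prm/exempt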

Capping memory use

You can optionally specify a memory cap for a PRM group. With the prm0d memory manager, a memory cap is a soft upper bound. With prm2d, a PRM group cannot exceed its memory cap. Typically, you might choose to assign a memory cap to a PRM group of relatively low priority, so that it does not place excessive memory demands on the system. For information on setting a memory cap, see Controlling memory use.

Implementation of shares and caps

In addition to specifying memory shares (a lower bound), you can optionally specify a memory cap (upper bound) for a PRM group. It is important to note the difference between memory shares and a memory cap. Shares guarantee the minimum amount of real memory that a group is allowed to consume at times of peak system load. The memory cap is an upper bound. The prm0d memory manager has different criteria for suppressing processes when group memory use exceeds these boundaries. With the prm0d memory manager, memory caps are not really upper bounds: processes are allowed to exceed the caps. By placing a memory cap on a group, you instruct PRM to suppress the processes in that group before suppressing the processes in groups that do not have a cap. If memory is still being requested by a group below its share, prm0d continues to suppress processes until no PRM group is exceeding its memory share. The prm2d memory manager handles caps more strictly than prm0d: prm2d does not allow the memory use of processes in a PRM group to exceed the memory cap of that PRM group.

Isolating a group's memory resources

In addition to specifying memory shares, the prm2d memory manager allows you to optionally specify that a group's memory resources be restricted from use by other groups and processes on the system. This type of restriction is called memory isolation. When a group's memory shares are isolated, those memory shares cannot be loaned out to other groups. Memory isolation also means that memory cannot be borrowed from other groups. PRM allows groups that do not have memory isolation turned on to freely borrow memory from other groups as needed. The lending groups are restricted in their giving by their physical entitlement size. A group cannot lend its memory resources if memory isolation is turned on. Memory isolation can be useful for applications that need dedicated memory resources, or that tune their own memory needs based on their allocated resources.

How PRM manages locked memory

Real memory that can be locked (that is, its pages kept in memory for the lifetime of a process) by the kernel, by the plock() system call, or by the mlock() system call, is known as lockable memory. Locked memory cannot be paged or swapped out. Typically, locked real memory holds frequently accessed programs or data structures, such as critical sections of application code. Keeping them memory-resident improves system performance. Lockable memory is extensively used in real-time environments, like hospitals, where some processes require immediate response and must be constantly available. prm0d does not suppress a process that uses locked memory once the process has the memory, because suppressing the process will not cause it to give back memory pages. However, the memory resources that such a process consumes are still charged against its PRM group. If processes using locked memory consume as much or more memory than their group is entitled to, other processes in that group may be suppressed until the demands of the processes with locked memory are lower than the group's share. With the prm2d memory manager, locked memory is distributed based on the assigned memory shares. For example, assume a system has 200 Mbytes of available memory, 170 Mbytes of which is lockable. Lockable memory divided by available memory is 85%. If GroupA has a 50% memory share, it gets 100 Mbytes of real memory. Of that amount, 85% (or 85 Mbytes) is lockable. Notice that 85 Mbytes/170 Mbytes is 50%, which is the group's memory share.
Locked memory distribution by prm2d memory manager illustrates this idea.

Figure: Locked memory distribution by prm2d memory manager

How PRM manages shared memory

With the prm0d memory manager, PRM charges a PRM group for its use of shared memory based on the number of the group's processes that are attached to the shared memory segment, relative to the total number of attached processes. For example, assume a system has a 100-Mbyte shared memory segment and two PRM groups, Group1 and Group2. If Group1 has three processes attached to the segment and Group2 has one attached, Group1 is charged with 75 Mbytes, while Group2 is charged with 25 Mbytes.

With the prm2d memory manager, if a group is exceeding its memory shares as system memory utilization approaches 100%, prm2d determines which groups are importing the most memory above their entitlement, as compared with the average overachievement of all groups. If a PRM group is overachieving compared to the average, then the import shares for that group are lowered. This allows other groups to start importing the newly available memory.

Example: prm2d memory management

This example shows how prm2d manages the competing memory demands of three PRM groups as system memory utilization approaches 100%.

Figure: prm2d memory management

At Time A: There is plenty of memory available on the system for the processes that are running. Group1 is using its share, and Group2 is using slightly more than its share, borrowing excess from Group3. Group3 is using much less than its share.

At Time B: System memory use approaches 100%. Group1 is borrowing excess memory from Group3. Group2 processes reach the group's 30% memory cap. Unlike prm0d, prm2d does not allow a group to exceed its memory cap. Consequently, Group2's processes are forced to page, causing a performance hit.

Between Time B and Time C, Group3's demands continue to increase.

At Time C: System memory use is near 100%. Group3 is not getting sufficient memory and needs its loaned-out memory back. PRM then determines which groups are
overachieving with respect to their memory entitlement. In this case, the increasing demand of Group3 causes Group1 and Group2 to be pulled toward their shares of 30% and 10% respectively, despite their desire for more memory. Group3 is allowed to freely consume up to 60% of available memory, which it reaches at Time D.

After Time D: PRM now holds each group to its entitled memory percentage. If a group requests more memory, the request is filled with pages already allocated to the group.

How resource allocations interact


You can assign different numbers of shares for CPU (for FSS PRM groups), memory, and disk bandwidth to a PRM group depending on the group's requirements for each type of resource. To optimize resource use, it is important to understand the typical demands for resources within a PRM group. For example, suppose the DesignTool application is assigned to PRM group DTgroup, and it is the only application running in that group. Suppose also that the DesignTool application uses CPU and memory in an approximate ratio of two to three. For optimal results, you should assign the resource shares for DTgroup in the same ratio. For example, assign 10 CPU shares and 15 memory shares or 20 CPU shares and 30 memory shares. If the percentages assigned do not reflect actual usage, then a PRM group may not be able to fully utilize a resource to which it is entitled. For instance, assume you assign 50 CPU shares and 30 memory shares to DTgroup. At times of peak system load, DTgroup is able to use only approximately 20 CPU shares (although it is assigned 50 shares) because it is limited to 30 memory shares. (Recall that DesignTool uses CPU and memory at a ratio of two to three.) Conversely, if DTgroup is assigned 10 CPU shares and 30 memory shares, then at times of peak system load, DTgroup is only able to utilize 15 memory shares (not its 30 shares), because it is restricted to 10 CPU shares. To use system resources in the most efficient way, monitor typical resource use in PRM groups and adjust shares accordingly. You can monitor resource use with the prmanalyze command, the prmmonitor command, or the optional HP product GlancePlus.

How PRM manages disk bandwidth


PRM manages disk bandwidth at the logical volume group/disk group level. As such, your disks must be mounted and under the control of either HP's Logical Volume Manager (LVM) or VERITAS Volume Manager (VxVM) to take advantage of PRM disk bandwidth management. PRM controls disk bandwidth by re-ordering the I/O requests of volume groups and disk groups. This has the effect of delaying the I/O requests of low-priority processes and accelerating those of higher-priority processes.

NOTE: Disk bandwidth management works only when there is contention for disk bandwidth, and it works only for actual I/O to the disk. (Commonly, I/O on HP-UX is staged through the buffer cache to minimize or eliminate as much disk I/O as possible.) Also, note that you cannot allocate disk bandwidth shares for PSET PRM groups. PSET PRM groups are treated as part of PRM_SYS (PRMID 0) for disk bandwidth purposes.

Disk bandwidth management works on disk devices, stripes, and disk arrays. It does not work on tape or network devices. When you change share allocations on a busy disk device, it typically takes 30 seconds for the actual bandwidth to conform to the new allocations. Multiple users accessing raw devices (raw logical volumes) will tend to spend most of their time seeking. The overall throughput on this group will tend to be very low. This degradation is not due to PRM's disk bandwidth management. When performing file system accesses, you need approximately six disk bandwidth consumers in each PRM group before I/O scheduling becomes noticeable. With two users, you just take turns. With four, you still spend a lot of your time in system call overhead relative to the peak device bandwidth. At six, PRM disk bandwidth management begins to take effect. The more demand you put on the system, the closer the disk bandwidth manager approaches the specified values for the shares.

How PRM manages applications


When an application is started, it runs in the initial PRM group of the user that invoked it. If the application is assigned to a PRM group by a record in the configuration file, the application manager soon moves the application to its assigned group. A user who does not have access to an application's assigned PRM group can still launch the application as long as the user has execute permission to the application. An application can be assigned to only one PRM group at a time. Child processes inherit their parent's PRM group. Therefore, all the application's child processes run in the same PRM group as the parent application by default. You can explicitly place an application in a PRM group of your choosing with two commands. Use the prmmove command to move an
existing application to another group. Use the prmrun command to start an application in a specified group. These rules may not apply to processes that bypass login.

How application processes are assigned to PRM groups at start-up

PRM's group assignments at process start-up describes what PRM group an application process is started in, based on how the application is started.
PRM's group assignments at process start-up

* Initiated by a user, by at, by cron, or upon login: the process runs in the user's initial group. If the user does not have an initial group, the process runs in the user default group, OTHERS. (If the process has an application record, it still starts in the invoking user's initial group. However, the application manager will soon move the process to its assigned group.)
* Initiated by prmrun {-g targetgrp | -i}: the process runs in the PRM group specified by targetgrp or in the user's initial group. The PRM application manager cannot move a process started in this manner to another group.
* Initiated by prmrun application (-g targetgrp is not specified): the process runs in the application's assigned PRM group. If the application does not have a group, an error is returned.
* Moved with prmmove {targetgrp | -i}: the process runs in the PRM group specified by targetgrp or in the user's initial group. The PRM application manager cannot move a process started in this manner to another group.
* Started by another process: the process runs in the parent process's group.
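The two placement commands from the list above can be exercised as follows. The group name, application path, and PID are illustrative only, and the -p option to prmmove is an assumption; check prmrun(1) and prmmove(1) on your release:

# Start a (hypothetical) report job directly in the Finance PRM group
prmrun -g Finance /opt/reports/bin/month_end
# Move an already-running process (PID 2345) into the Finance group
prmmove Finance -p 2345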

How PRM handles child processes


When they first start, child processes inherit the PRM groups of their parent processes. At configurable polling intervals, the application manager checks the PRM configuration file against all processes currently running. If any processes should be assigned to different PRM groups, the application manager moves those applications to the correct PRM groups. If you move a parent process to another PRM group (with the prmmove command), all of its child processes remain in the original PRM group. If the parent and child processes should be kept together, move them as a process group or by user login name.

Pattern matching for filenames


Application filenames in application records can contain pattern matching notation as described in the regexp(5) man page. This feature allows you to assign all appropriate applications that reside in a single directory to a PRM group--without creating an application record for each individual application.

The wildcard characters ([, ], *, and ?) can be used to specify application filenames. However, these characters cannot be used in directory names. For example, the following record is valid:

/opt/prm/bin/x[opq]rm::::PRM_SYS

However, the next record uses a wildcard in the directory name and is not valid:

/opt/pr?/bin/xprm::::PRM_SYS    # INVALID

To assign all the applications in a directory to a PRM group, create an application record similar to the following, with the filename specified only by an asterisk (*):

/opt/special_apps/bin/*::::GroupS

Filenames are expanded to their complete names when a PRM configuration is loaded. Explicit application records take precedence over application records that use wildcards. If an application is matched by several records that use pattern matching, the application is assigned to the PRM group specified in the "first" matching record. The "first" matching record is determined by sorting--in ASCII dictionary order--the matching patterns.

NOTE: If you use wildcards in an application record to specify the application filename, you cannot use alternate names for that application record.


Pattern matching for renamed application processes


Alternate names specified in application records can also contain pattern matching notation as described in the regexp(5) man page.

NOTE: Use pattern matching only when it is not practical to list all possible alternate names.

Many complex applications, such as database applications, may assign unique names to new processes or rename themselves while running. For example, some database applications rename processes based on the database instance, as shown in this list of processes associated with a payroll database instance:

db02_payroll
db03_payroll
db04_payroll
dbsmon_payroll
dbwr_payroll
dbreco_payroll

To make sure all payroll processes are put in the same PRM group, use pattern matching in the alternate names field of the application record, as shown below:

/usr/bin/database::::business_apps,db*payroll

For alternate names and pattern matching to work, the processes must share the same file ID. (The file ID is based on the file system device and the file's inode number.) PRM performs this check to make sure that only processes associated with the application named in the application record are put in a configured PRM group. The only case where alternate names might not share the file ID is if you have specified a symbolic link as the fully qualified executable in the application record. For this reason, avoid using (or referencing) symbolic links in application records. If there are multiple application records that match an application name due to redundant pattern matching resolutions, the "first" record to match the application name takes precedence. For example, the application abb matches both of the following application records:

/opt/foo/bin/bar::::GroupA,a*
/opt/foo/bin/bar::::GroupB,*b

Because the *b record is first (based on ASCII dictionary order), the application abb would be assigned to the PRM group GroupB. Knowing the names of all the processes spawned and renamed by the applications can help in creating pattern matching that is only as general as it needs to be. Eliminate redundant name resolutions whenever possible, and make sure pattern matching does not cause unwarranted moves.

The PRM application manager checks that applications are running in the correct PRM groups every interval seconds. The default interval is 30 seconds.


Module 4 Memory
Memory use and configuration is one of the most complex and misunderstood areas of performance tuning. Factors to consider when configuring memory on a system:

* What are the hardware limitations for physical memory (RAM)?
* What are the disk limitations for device swap?
* What architecture are the OS and the application?
* What are the memory requirements for the OS and applications?
* What are the cost limitations?

Current HP Servers memory ranges


  Model          Physical memory range
  superdome 64   16-256 GB
  superdome 32   2-128 GB
  superdome 16   2-64 GB
  rp 8400        2-64 GB
  rp 7410        2-32 GB
  rp 7400        1-32 GB
  rp5470         1-16 GB
  rp5430         1-8 GB
  rp2470         1-2 GB

Recent HP servers

  V Class   1-128 GB
  N Class   512 MB-16 GB
  L Class   256 MB-16 GB
  K Class   128 MB-8 GB
  D Class   64 MB-3 GB
  R Class   128 MB-3 GB
  A Class   128 MB-2 GB

Current HP Workstations

  Model            Physical memory range
  hp b2600         512 MB-4 GB


  hp c3700         512 MB-4 GB
  hp c3750         512 MB-8 GB
  hp j6700/j6750   1 GB-16 GB

For complete specs on current hp servers see:


http://welcome.hp.com/country/us/eng/prodserv/servers.html

For workstations see: http://www.hp.com/workstations/risc/index.html

Commonly tuned memory paging parameters:

bufpages - Pages of static buffer cache
    Minimum: 0 or 6 (nbuf*2 or 64 pages)   Maximum: Memory limited   Default: 0

nbuf - Number of static buffer headers
    Minimum: 0 or 16   Maximum: Memory limited   Default: 0

These parameters are used when the dynamic buffer cache is disabled.

dbc_min_pct - Minimum dynamic buffer cache
    Minimum: 2   Maximum: 90   Default: 5

dbc_max_pct - Maximum dynamic buffer cache
    Minimum: 2   Maximum: 90   Default: 50

As these represent a percentage of RAM, care should be taken when selecting a value. Buffer cache sizes in excess of 400 MB are counterproductive from a performance perspective. For systems with more than 16 GB of RAM it may be advisable to disable the dynamic buffer cache and hard-code the values. The buffer cache, whether static or dynamic, is used to facilitate faster disk read/write transactions. If the maximum disk I/O volume is known, the amount of buffer cache can be sized accordingly.
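If you want to hold the dynamic buffer cache near the 400 MB ceiling suggested above, dbc_max_pct can be derived from the amount of RAM. A sketch of the arithmetic for a 4 GB (4096 MB) system:

# Percentage of RAM that corresponds to a ~400 MB buffer cache on a 4096 MB system
echo "scale=2; 400 * 100 / 4096" | bc
# prints 9.76, so a dbc_max_pct of about 10 caps the cache near 400 MB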

maxswapchunks - maximum swap space (number of swap chunks) available to the system
    Minimum: 1   Maximum: 16384   Default: 256

swchunk - client swap-chunk size
    Minimum: 2048

To determine the maximum amount of swap configurable on a system, multiply maxswapchunks by swchunk. By default the maximum configurable swap area is 32 GB. Typically swchunk is not altered from its default. If greater than 32 GB of device swap is configured, set swchunk to 4096.

nswapdev - number of available swap devices
    Minimum: 1   Maximum: 25   Default: 10

nswapfs - number of file systems available for swap
    Minimum: 1   Maximum: 25   Default: 10
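The maxswapchunks calculation above can be checked against the running kernel. The kmtune -q option used to query a single tunable is an assumption here, and swchunk is taken to be in 1 KB blocks:

kmtune -q maxswapchunks      # default 256, maximum 16384
kmtune -q swchunk            # default 2048 (1 KB blocks, i.e. 2 MB per chunk)
# Maximum configurable device swap with the default swchunk:
echo "256 * 2048 / 1024" | bc            # prints 512  (MB, with maxswapchunks=256)
echo "16384 * 2048 / 1024 / 1024" | bc   # prints 32   (GB, with maxswapchunks=16384)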


page_text_to_local - enable/disable text swap on client
    Minimum: 0 (stand-alone or client uses file-system server)   Maximum: 1 (use client local swap)   Default: 1 (use client local swap)

remote_nfs_swap - enable/disable swap to remote NFS
    Minimum: 0   Maximum: 1   Default: 0

swapmem_on - enable/disable pseudo-swap reservation
    Minimum: 0 (disable pseudo-swap reservation)   Maximum: 1 (enable pseudo-swap reservation)   Default: 1

The swapmem_on parameter should be left at the default of 1 unless the total lockable memory exceeds 25% of RAM. Typically the total lockable memory on a system is between 15-20% of RAM.
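To see how much of each swap type is configured and how much is reserved, swapinfo is the usual tool. A sketch; the -t, -a and -m options are assumed to be present on your release:

# Totals for all swap areas, reported in MB; the "memory" line only
# appears when pseudo-swap is enabled (swapmem_on=1)
swapinfo -tam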

Configurable IPC Shared Memory Parameters


shmem - Enable/disable shared memory (Series 700 only)
    Minimum: 0 (exclude System V IPC shared memory code from kernel)   Maximum: 1 (include System V IPC shared memory code in kernel)   Default: 1

shmmax - Maximum shared memory segment size
    Minimum: 2 Kbytes   Maximum: memory limited   Default: 0x04000000 (64 Mbytes)

The shmmax parameter should be set to no greater than 1 memory quadrant, i.e. one quarter of total system memory. For 32-bit systems this has a maximum value of 1 GB. For 64-bit systems it should be set to 1 quadrant.

shmmni - Maximum shared memory segments (identifiers) on the system
    Minimum: 3   Maximum: (memory limited)   Default: 200

shmseg - Maximum segments per process
    Minimum: 1   Maximum: shmmni   Default: 120
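To compare these limits with what applications are actually creating, the System V IPC status command can be used. A sketch; the -m and -b options are assumed:

# List current shared memory segments with their sizes (SEGSZ column);
# no single segment can exceed shmmax, and no more than shmmni can exist
ipcs -mb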

Process Management Subsystem


maxdsiz - maximum process data segment size (32-bit)
    Minimum: 0x400000 (4 Mbytes)   Maximum: 0x7B03A000 (1.92 GB)   Default: 0x4000000 (64 Mbytes)

The practical limit for maxdsiz is the free space in quadrants 1 & 2, i.e. 2 GB minus maxtsiz, minus maxssiz, minus the 64 MB uarea.


maxdsiz_64bit - maximum process data segment size (64-bit)
    Minimum: 0x400000 (4 Mbytes)   Maximum: 4396972769279   Default: 0x40000000 (1 GB)

The same sizing rule applies to the 64-bit equivalent. The practical limit for maxdsiz_64bit is the free space in quadrants 2 & 3. If maxdsiz is exceeded, the process will be terminated, usually with a SIGSEGV (segmentation violation), and you will probably see the following message:

Memory fault(coredump)

maxssiz          maximum process stack segment size (32-bit)
                 Minimum: 0x4000 (16 Kbytes)
                 Maximum: 0x17F00000 (383 Mbytes)
                 Default: 0x800000 (8 Mbytes)
maxssiz_64bit    maximum process stack segment size (64-bit)
                 Minimum: 0x4000 (16 Kbytes)
                 Maximum: 1073741824 (1 Gbyte)
                 Default: 0x800000 (8 Mbytes)

Stack is assigned memory before data in its quadrant. Unless a SIGSEGV stack-growth error is received, or a vendor specifically recommends a size, keep this parameter at its default.

maxtsiz          maximum process text segment size (32-bit)
                 Minimum: 0x40000 (256 Kbytes)
                 Maximum: 0x7B033000 (approx 2 Gbytes)
                 Default: 0x4000000 (64 Mbytes)
maxtsiz_64bit    maximum process text segment size (64-bit)
                 Minimum: 0x40000 (256 Kbytes)
                 Maximum: 4398046511103 (approx 4 Tbytes)
                 Default: 0x4000000 (64 Mbytes)

Text rarely requires more space than the default value. Unless the error "/usr/lib/dld.sl: Call to mmap() failed - TEXT" is received, or a vendor specifically recommends a size, keep the memory for text at the default value.

Other memory configurables:

max_mem_window    Enables/configures number of memory windows in system
                  Minimum: 0    Maximum: memory limited    Default: 0
unlockable_mem    Memory size reserved for system use
                  Minimum: 0
                  Maximum: Available memory indicated at power-up
                  Default: 0 (system sets to appropriate value)

Typically unlockable memory is best left to the system to set.

Memory Bottlenecks:
Process Deactivations

When a process is deactivated
-----------------------------
Once a process and its pregions are marked for deactivation, sched():
* removes the process from the run queue.
* adds its uarea to the active pregion list so that vhand can page it out.
* moves all the pregions associated with the target process in front of the steal hand, so that vhand can steal from them immediately.
* enables vhand to scan and steal pages from the entire pregion, instead of 1/16.
Eventually, vhand pushes the deactivated process's pages to secondary storage.

When a process is reactivated
-----------------------------
Processes stay deactivated until the system has freed up enough memory and the paging rate has slowed sufficiently to return processes to the run queue. The process with the highest reactivation priority is then returned to the run queue. Once a process and its pregions are marked for reactivation, sched():
* removes the process's uarea from the active pregion list.
* clears all deactivation flags.
* brings in the vfd/dbd pairs.
* faults in the uarea.
* adds the process to the run queue.

Swap

There are three types of swap: device swap, file system swap and pseudo swap.

Device swap is divided into two different types. The first is primary swap. This swap device is usually /dev/vg00/lvol2 and is created when the operating system is installed. Primary swap can only be configured on the boot drive. The second type of device swap is called secondary swap. Secondary swap areas can be configured in any volume group, or on any disk, on the system. Ideally, device swap should be configured in equal-sized, equal-priority partitions to promote interleaving. Device swap makes swap requests to disk in 256 KB chunks.

For details on swap configuration refer to the following documents:
Configuring Device Swap                      DocId: KBAN00000218
How to configure device swap in VxVM         DocId: VXVMKBRC00005232

File system swap
File system swap allows a system administrator to add more swap to the system even when all of the disk space has been allocated to other logical volumes, and there is no space left to create a device swap area. With file system swap, you can configure available space within a file system to be used for swap. This type of swap has the poorest I/O performance and should only be used as a last resort. File system swap requests are made in only 8 KB chunks. File system swap can cause corruption with VxFS file systems; this issue has been addressed by the patches:
PHKL_23940 (Critical, Reboot) s700_800 11.00 JFS 3.3 File system swap corruption
PHKL_24026 (Critical, Reboot) s700_800 11.11 JFS File system swap corruption

Device swap
In modern HP-UX systems device swap is typically used only to hold reserve memory for open processes. Most systems satisfy active process memory calls from RAM; this is the default behavior with the kernel parameter swapmem_on set to 1. We

refer to this feature as pseudo-swap. Normally this area is not used for reserve; however, if the system is under-configured for device swap, RAM will be used. It is not recommended to disable pseudo-swap unless the system's lockable memory exceeds 25% of RAM, which is unusual.

The amount of device swap that should be configured on a system depends on the requirement for memory reservation by open processes. For 64-bit systems, there should be a sufficient amount of device swap to allow all processes that will run concurrently to reserve their memory. To allow for a full 32-bit memory map, the system should have a total of 4 GB of device swap and RAM. A 1:1 ratio of RAM to device swap is a good starting point.

For a system that has local swap and also serves other systems with swap space, make a second estimation in addition to the one above:
1. Include the local swap space requirements for the server machine, based on the estimation from above.
2. Add up the total swap space you estimate each client requires. At a minimum, this number should equal the sum of physical memory for each client.

The parameter maxswapchunks limits the number of swap space chunks. The default is 256; the maximum is 16384. The default size of each chunk of swap space is 2 MB (swchunk = 2048). This can be increased to 4096 if greater than 32 GB of swap is required. The OS limit for swap is 64 GB.

RAM used to satisfy processor memory calls is called pseudo-swap space. It allows users to execute processes in memory without allocating physical swap. Pseudo-swap is controlled by an operating-system parameter; by default, swapmem_on is set to 1, enabling pseudo-swap.

Pseudo-swap space allows for the use of system memory (RAM) as a third type of swap space. It can range from 75% to 87.5% of RAM depending on how much lockable memory and buffer cache is used. Typically systems use between 15-20% of RAM for lockable memory. This region is used for kernel memory and by applications that want to ensure RAM is always available.

To determine lockable memory:

    echo total_lockable_mem/D | adb /stand/vmunix /dev/mem
    total_lockable_mem:
    total_lockable_mem:     185280

This returns the amount, in Kbytes, of lockable memory in use. Divide this by 1024 to get the size in megabytes, then divide by the amount of RAM in megabytes to determine the percentage (a worked sketch of this arithmetic follows below).

Typically, when the system executes a process, swap space is reserved for the entire process, in case it must be paged out. According to this model, to run one gigabyte of processes, the system would have to have one gigabyte of configured swap space. With pseudo-swap, system memory serves two functions: as process-execution space and as swap space. Because memory calls are able to access RAM directly instead of paging to disk, disk I/O is reduced and the calls to memory are satisfied in a shorter time. As before, if a process attempts to grow or be created beyond this extended threshold, it will fail.
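Following the recipe above for lockable memory, the percentage can be worked out with shell arithmetic. This is a sketch: LOCKABLE_KB is the value returned by the adb command, and RAM_MB (physical memory in megabytes) is an assumed figure you must supply for your own system.

    # Sketch: what fraction of RAM is currently in use as lockable memory?
    LOCKABLE_KB=185280       # value returned by the adb command above
    RAM_MB=1024              # assumed physical memory in MB -- substitute your own
    echo "Lockable memory: $(( LOCKABLE_KB / 1024 )) MB"
    echo "Percent of RAM:  $(( LOCKABLE_KB * 100 / 1024 / RAM_MB )) %"

With the sample values this reports roughly 180 MB, about 17% of a 1 GB system, consistent with the 15-20% range mentioned above.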

For systems which perform best when the entire application is resident in memory, pseudo-swap space can be used to enhance performance: you can either lock the application in memory or make sure the total number of processes created does not exceed the allocated space.

The unused portion of physical memory allows a buffer between the system and the swapper to give the system computational flexibility. When the number of processes created approaches capacity, the system might exhibit thrashing and a decrease in system response time. If necessary, you can disable pseudo-swap space by setting the tunable parameter swapmem_on in /usr/conf/master.d/core-hpux to zero.

Estimating Your Swap Space Needs Your swap space must be large enough to hold all the processes that could be running at your system's peak usage times.

As a result of the larger physical memory limits of the 64-bit hardware platforms introduced at 11.0, you may need to significantly increase the amount of swap space for certain applications on these systems. For optimum performance, all reserve memory should be allocated from the available device swap, and all active process calls to memory should be satisfied within RAM. When a system is forced to page to disk, performance will be impacted. Device swap requests are 256 KB in size; file system swap requests are 8 KB in size. If a system is forced to page large quantities of memory to disk, the resulting increase in disk I/O will slow the paging transactions themselves and also affect the speed of other read/write transactions on that disk due to the increased I/O demand. Swap space usage increases with system load. If you are adding (or removing) a large number of additional users or applications, you will need to re-evaluate your swap space needs.

To determine the virtual memory configuration, run the swapinfo command. Example:

    # swapinfo -tam

              Mb      Mb      Mb   PCT  START/      Mb
TYPE       AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev         1024       0    1024     0                    1  /dev/vg00/lvol1
reserve              184    -184
memory       372      96     276    26
total       1396     280    1116    20

To determine the processes that are using the most memory run :

    # ps -elf | sort -rnk 10 | more

The 10th column, SZ, refers to the size in memory pages. Take into account whether standard page sizes are implemented.

The McKusick & Karels memory allocator


We first look at a high-level picture of how the operating system manages memory.

In HP-UX all memory is divided into 4 KB pages. Since we have a virtual memory system, pages that are logically contiguous do not need to be adjacent in physical memory. The virtual memory system maintains a mapping between the virtual and physical memory pages. As a result the operating system can satisfy requests for larger memory allocations by setting up a contiguous virtual memory range that has mappings to physical pages that are not necessarily adjacent in physical memory.

The smallest amount of memory we can allocate this way is a full page, 4 KB. But the kernel often creates objects that are much smaller than 4 KB. If we always allocated a full page for these small objects we would waste a lot of memory. This is where the McKusick & Karels memory allocator comes into play. The goal of the kernel memory allocator is to allow quick allocation and release of memory in an efficient way. As the picture shows, the kernel memory allocator makes use of the page level allocator.

As pointed out already, the main problem we try to solve with the kernel memory allocator is to be able to satisfy requests for memory allocations of less than 4 KB. Therefore the kernel memory allocator requests a page from the page allocator, which is then broken down into smaller chunks of equal size. This is done on a power-of-two basis, starting with the smallest implemented size of 32 bytes and going up in powers of two to a whole page. Requests of 1 page up to 8 pages are satisfied through the kernel allocator too, but with a slightly different mechanism (calling kmalloc()).

The chunks of memory generated from that one 4 KB page are put on a freelist; for each chunk size we have our own freelist. A page of memory has to be broken down into chunks of the same size; we cannot use the same page for different chunk sizes. As an example, if we try to allocate a 128-byte chunk and no entry is available on the free list, we will allocate a new page and break it down into 32 chunks of 128 bytes. The remaining 31 chunks are then put onto the 128-byte freelist. A sample picture to illustrate the matter could look like this:

We determine the right freelist simply by taking the next larger power of two into which our request fits.
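That selection rule can be illustrated with a few lines of shell. This is only an illustration of the power-of-two rounding described above, not kernel code.

    # Sketch: round a request up to its power-of-two bucket (smallest bucket 32 bytes),
    # then show how many chunks of that size fit in one 4 KB page.
    bucket_size() {
        SIZE=32
        while [ $SIZE -lt $1 ]
        do
            SIZE=$(( SIZE * 2 ))
        done
        echo $SIZE
    }
    REQ=100
    BUCKET=$(bucket_size $REQ)
    echo "a request of $REQ bytes uses the $BUCKET-byte freelist"      # 128
    echo "one 4 KB page yields $(( 4096 / BUCKET )) such chunks"       # 32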

The general workings of the allocator did not change much over the course of time; what changed are the list header names and the number of lists we maintain on a multiprocessor system. Originally there was only one spinlock guarding the whole bucket free list, so on a multiprocessor system we might end up spending considerable time in memory allocation due to lock contention. Therefore it was decided, beginning with 10.20, to have free lists per processor. When a list runs out of free entries of a certain size it first tries to "steal" an entry from a different processor before allocating a new full page and splitting it up.

Another change was the introduction of 32- and 64-bit kernel versions with 11.00. The 32-bit kernel uses an array called bucket_32bit[] and the 64-bit kernel an array called bucket_64bit[]. For memory allocations larger than a page we make use of page_buckets_32bit[] and page_buckets_64bit[].

Besides allocating new pages to the bucket pool when more memory is needed, we might be in a situation where the system runs low on physical memory. Imagine a subsystem that required thousands of small 32-byte memory chunks due to a certain load situation. Then the load goes away and the subsystem returns all the many chunks to the bucket free list. A lot of unused memory is now allocated to the bucket allocator and is no longer available for user applications or general use. In 10.20, vhand would check the bucket pool when under physical memory pressure, trying to coalesce the single chunks belonging together into a full page. If that could be managed, the full page was returned to the page free pool or to the superpage pool, depending on which pool the page came from originally. In 11.00 this mechanism was switched off for performance reasons. Unfortunately, because we could now pile up lots of entries per CPU, we could end up exhausting the virtual address space on 32-bit systems. Therefore the algorithm was changed again to allow reclaiming of unused pages.

This is a rough overview of the workings and the background of the McKusick and Karels memory allocator.

The Arena Allocator for 11.11


The arena allocator is a new kernel memory allocator introduced in 11.11 that replaces the older MALLOC()/FREE() interface, which was implemented using the McKusick and Karels allocator (also known as the bucket allocator). The new arena allocator is very similar to the slab and zone allocators used in SUN and Mach OS respectively. The important features of the arena allocator are object caching, improved fault isolation, reduced memory fragmentation and better system balance. The following is only a brief description of arena allocation for the sake of first-pass dump analysis. A full discussion of arena internals is available at 11i Internals.

Arenas and objects
All memory allocations in the kernel can be categorized as fixed size objects and variable sized objects. A fixed sized object is an object whose size remains the same at every memory allocation request. A variable sized object may vary in size at different allocation requests.

Each type of object will have its own arena. Before any allocation is requested, the user needs to create an arena by calling kmem_arena_create(). Once created, each arena is assigned a unique handle and all allocations are made using the handle. Allocations are made by calling kmem_arena_alloc() for fixed size objects and kmem_arena_varalloc() for variable sized objects. Objects are de-allocated by calling kmem_arena_free().

Each arena will have its own free lists to manage the allocation and deallocation from its private memory pool. This helps improve the performance of allocation/deallocation requests. When the free list is empty, VM will refill the list by allocating more pages from the superpage pool or from the page allocator. When there is memory pressure, vhand() will initiate garbage collection among the arenas' free lists to reclaim memory from the arenas.

For a smoother transition, the arena allocator provides source compatibility for the previous MALLOC()/FREE() interface. During kernel bootup, arenas are implicitly created for each type of memory that will be allocated using the old interface. These arenas are given names starting with "M_" (e.g. M_DYNAMIC, M_MBUF). A full list of names can be found in /usr/include/sys/malloc.h. Currently these arenas are implemented as variable sized arenas.

The following diagram gives a pictorial description of the basic operations of the arena allocator.

As shown in the above diagram, each arena in the kernel is represented by a kmem_arena_t structure, which describes the arena and its attributes.

Implementation of Arena's Free Lists

To facilitate the management of each memory chunk in an arena's free list, each chunk is associated with an object header; the header is also used to implement object caching.

The free list of fixed sized objects is kept in a one-dimensional array. Each element in the array corresponds to each spu in the system as

shown in the figure below. The head of the free list is kfh_head and this points to the linked list of all the free fixed size object headers.

The free list management for variable sized objects is similar to fixed size objects. However the free list for variable sized objects is kept in a two-dimensional array. Each element in the array corresponds to a spu and bucket index, similar to the Bucket allocator. The major difference from the Bucket allocator is that the size of each bucket is not necessarily 2 to the power of its corresponding index. A bucket map is used instead to map bucket indices to bucket sizes. Two additional fields are defined in the arena structure, ka_bkt_size_map and ka_bkt_idx_map, to allow the mapping between sizes and bucket indices.

Performance Optimized Page Sizing


In 11.0 a new performance-related memory feature became available: variable page sizing. To determine the size of memory pages currently in use, run 'getconf PAGE_SIZE'. The following white paper discusses this new option in detail: KBAN00000849, "Performance Optimized Page Sizing in HP-UX 11.0" White Paper.


The document answers the questions:

Who benefits from variable sized pages and why?
What are the drawbacks of using variable sized pages?
Where can variable sized pages be used in an application?
How are page sizes selected for an application?
What configured kernel parameters influence page size selection?
How can I select page sizes for an application?
Why am I not getting my variable sized pages?
What statistics/counters are available to assist me?

1. Objectives
HP-UX 11.0 will be the first release of the operating system to have general support for variable sized pages, also known as POPS (Performance Optimized Page Sizing) or Large Pages. Partial support for variable sized pages has existed since HP-UX 10.20 for kernel text, kernel data, kernel dynamic data, locked user shared memory and text. HP-UX 11.0 allows a customer to configure an executable to use specific variable page sizes and/or configure the system to transparently select variable page sizes based upon program heuristics and size.

1.1 Requirements for Variable Pages.


To get variable pages on an executable, the hardware platform must support it. The only machines supporting variable pages are the PA-8000 and any follow-on processors based on the PA 2.0 architecture. PA 1.1 machines such as the PA-7200 do not support variable pages and will not use any page size other than 4K.

2.0 Who benefits from large pages?


When a memory address is referenced, the hardware must know where to locate that memory. The translation lookaside buffer, or TLB, is the mechanism the hardware uses to accomplish this task. If the TLB doesn't contain information for the request then the hardware generates a TLB miss. The software takes over and performs whatever tasks are necessary to enter the required information into the TLB so a subsequent request will succeed. This miss handling has been part of all processors to date. However, the newer processors, specifically the PA-8000, have fewer TLB entries (96 vs. 120), no hardware TLB walker, and a higher cycle time to handle the miss. All these combined mean applications with large data sets can end up spending more time in TLB misses on newer hardware than on old. This has been measured in several real-world applications under HP-UX 10.20 on the PA-8000.

With variable sized pages a larger portion of the virtual address space can be mapped using a single TLB entry. Consequently, applications with large reference sets can be mapped using fewer TLB entries. Fewer entries means fewer TLB misses and increased performance. Of course, all of this assumes that the application is experiencing TLB miss performance problems to begin with. Applications with large reference sets *NOT* experiencing TLB miss problems see no measurable gain in performance with variable sized pages. For example, an application spending 20% of its time handling TLB misses can hope to gain 20% in performance using variable sized pages, while an application spending 1% of its time can only gain 1%.

3.0 What are the drawbacks of using variable sized pages?


Prior to variable sized pages every page fault resulted in the use of a 4K page. When an application consciously (or unconsciously) uses a larger page it is using more physical space per fault than it had previously.

Depending on the reference pattern the application may end up using more physical space than before. For example, if only 4K out of a 16K page are referenced, 12K of the space allocated is not used. The increased physical consumption would result in fewer available pages to other applications in the system, possibly resulting in increased paging activity and possible performance degradation. To avoid this the system takes into consideration the availability of memory when selecting page sizes.

4.0 Where can variable sized pages be used in an application?


The HP-UX 11.0 release will support the use of variable sized pages on the following user objects:

    Program Text
    Program Initialized data
    Program private data and BSS
    Program private dynamic data (as allocated via sbrk())
    Program Stack
    Shared Memory
    Anonymous memory mapped files (MAP_ANONYMOUS)
    Shared libraries

The following will not employ the use of variable sized pages in the 11.0 release:

    Memory Mapped Files (MAP_FILE) that are not shared libraries.

The MMF restriction is thought to be temporary, but there is no current date to support the use of variable sized pages on MMF's.

5.0 How are page sizes selected for an application?


Page size selection is driven by two mechanisms: the user and the kernel. The user has control over page size selection by setting the configured kernel parameters (discussed in the next section) or by selecting a specific page size (chatr(1)) for a specific application. The kernel honors page size specification in the following order:

1. The user specified a text/data page size via the chatr(1) interface. chatr(1) is described in more detail later.

2. If no chatr() hint is specified then the kernel selects what it decides is a suitable page size based upon system configuration and the object size. This is what we call transparent selection. That size is then compared to the boot-time configured page size (vps_pagesize). If that value is smaller than vps_pagesize then vps_pagesize is used.

One of the more important uses of transparent selection comes into play with dynamic objects like stack and data. How can we determine a suitable size for them if we don't know how big they get? If the user specified a data size via chatr(1), the kernel uses that page size hint when the caller increases their data size (stack or data). If chatr(1) is not specified the kernel tracks the user's dynamic growth over time. As the size of the object increases, the kernel increases the page size used by the object. The net effect is a scaling of page size based upon the increase in size of the segment in question. An application with a large data set of, say, 140 megabytes ends up selecting larger pages than an application whose size is, say, 64K. As the data/stack is increased the maximum page size is restricted by vps_ceiling (described later).

Even though the object desires a 64K or 256K page size, there are other restrictions that may result in the denial of such a desired size. To use a variable sized page of 64K both the virtual address and physical page must be aligned on that size. The denial for a physical page of 64K would only result if there are no 64K pages available or if there are no pages larger than 64K (say 256K) that could be broken into 64K pieces. In that case the most the application can hope for is the use of a smaller page like 16K, but there is a good chance it will resort to 4K. The virtual restriction comes into play for starting and ending boundaries. If the starting base address of the fault is not a multiple of the page size desired then we can't accommodate that size. This would be the case for objects such as anonymous MMF's where the starting address is randomly chosen. Suppose our anonymous mapping started at address 0xc0005000 for a length of 1 Megabyte and desires a page size of 64K. Because 0xc0005000 is not aligned to a 64K boundary the first 'x' pages are split up using a mixture of 4K and 16K pages until the 64K boundary at 0xc0010000 is encountered.
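The boundary arithmetic from the example above can be reproduced in a POSIX shell (assuming its arithmetic accepts hexadecimal constants); this minimal sketch uses only the low-order bits of the example address 0xc0005000, since only those matter for 64 KB alignment.

    # Sketch: how far is a start address of 0xc0005000 from the next 64 KB boundary?
    OFFSET=$(( 0x5000 ))     # low 16 bits of 0xc0005000
    PGSZ=$(( 0x10000 ))      # 64 KB
    if [ $OFFSET -eq 0 ]
    then
        echo "start address is 64 KB aligned"
    else
        echo "smaller pages must cover $(( PGSZ - OFFSET )) bytes, up to 0xc0010000"
    fi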

6.0 What configured kernel parameters influence page size selection?

The kernel currently supports 3 configured kernel parameters that influence the use of variable sized pages. They are:

1. vps_pagesize
2. vps_ceiling
3. vps_chatr_ceiling

vps_pagesize represents the default or minimum page size the kernel should use if the user has not chatr(1)'d a specific value. vps_pagesize is specified in units of kilobytes and should equate to one of the supported values. In the event vps_pagesize does not correspond to a supported page size, the closest page size smaller than the user's specification is used. For example, specifying 20K would result in the kernel using 16K. vps_pagesize is essentially a boot-time configured page size for all user objects created. The actual effectiveness of that size for the system is unknown. As described earlier, the actual page size used is dependent on virtual alignment as well. Even though vps_pagesize is configured to 16K, if the virtual alignment is not suitable for a 16K page then 4K pages are used instead. The current default value is 4 kilobytes (vps_pagesize = 4).

vps_ceiling represents the maximum size page the kernel uses when selecting page size "transparently". vps_ceiling is specified in units of kilobytes. Like vps_pagesize, vps_ceiling should be a valid page size. If not, the value is rounded down to the closest valid page size. vps_ceiling places a limit on the size used for process data/stack and the size of pages the kernel selects transparently for non-dynamic objects (text, shared memory, etc.). The default value is 16K (vps_ceiling = 16).

vps_chatr_ceiling places a restriction on the largest value a user is able to chatr(1). The command itself is not limited, but the kernel checks the chatr'd value against the maximum and only values below vps_chatr_ceiling are actually used. In the event the value exceeds vps_chatr_ceiling, the actual value used is the value of vps_chatr_ceiling. Like the others, vps_chatr_ceiling is specified in units of kilobytes and will be rounded down to the closest page size if an invalid size is specified. chatr(1) does not require any sort of user privilege and can therefore be used by any user. This configured kernel parameter allows the system administrator to restrict the use of large pages if there are "bad citizens" who abuse the facility. The default value is 64 Megabytes, the largest possible page size (vps_chatr_ceiling = 16384).

7.0 How can I select page sizes for an application?

chatr (1)
chatr(1) is the user command to change the default page size for a process's text and data segments. chatr(1) has the ability to specify a page size selection for text (+pi option) and for data (+pd option). Valid page sizes are from 4K to 64 megabytes. When an executable is built its page size value is set to "default". The kernel performs transparent selection on default settings. When a user chatr's a specific page size, that size is used for the object's existence. Note there is a difference between 4K and default; in fact there is a specific value for setting "default", i.e. the transparent selection of pages (see chatr(1)). If an executable is chatr'd to 4K, only 4K pages are used.

The +pi option is used for text. The kernel uses the chatr'd value as the desired page size for the user's text segment (PT_TEXT). No other objects within the process are affected by the setting of +pi. To set an executable to 16K you would specify:

    chatr +pi 16K my_executable

The +pd option is used to specify the page size for data. Data is a bit different from text in that it affects more objects than simply the user's initialized data. The value specified in +pd of an executable is used when creating any of the following: program private data, anonymous MMF's, shared memory segments and user stack. In order for the page size to be passed to shared memory segments, the chatr'd process must be the one to create the segment. Processes simply attaching to an existing segment have no effect on the desired page size. To set data to 64K specify (a combined example is sketched at the end of this section):

    chatr +pd 64K my_executable

What about those pesky shared libraries? Shared libraries do not inherit page size; shared libraries themselves must be chatr'd. If the user wants to change the text/data page size of a shared library then the caller must chatr the shared library. Because chatr(1) must write to the executable header, the shared library may not be active at the time of chatr.
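Putting the two options together (the executable name here is hypothetical), a combined request and a follow-up listing to confirm the hints might look like this:

    # Sketch: request 16K text pages and 64K data pages, then list the attributes.
    chatr +pi 16K +pd 64K ./my_app
    chatr ./my_app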

8.0 Why am I not getting my variable sized pages?


You've set the configured kernel parameters (vps_pagesize/vps_ceiling) and/or chatr'd your executable, but you don't see any gain in performance. Are you getting large pages? If not, why? First of all, the next section details the statistics kept at the system as well as the process level. Verifying system/process counters can certainly help.

Let us suppose you verify your executable has been chatr'd to 256K for data, but it's not receiving 256K pages. What could be happening? "Do I have enough physical memory?" If the system is paging because of an over-commitment of memory, there may be no physical pages of the desired size available. Variable sized pages are allocated based not only on the page size desired, but also on the availability of that physical page size. Using pstat_getvminfo(), determine just how much memory is available. If it is extremely low relative to your page size, then the system does not allocate a page of that size. One way you can tell this might be happening is by examining the psv_select_success[]/psv_select_failure[] statistics. You'd expect to see failures for your page size. You should also look at psv_pgalloc_failure[] to see what physical allocation failures have occurred as well.

One reason memory could be low, resulting in variable page size request failures, could be the dynamic buffer cache. If the system has been configured with a dynamic buffer cache and the maximum percentage is reasonably large compared to system size, it is possible the buffer cache is expanding and using available free memory.

Perhaps you're getting large pages, but not for the entire object. What's the starting virtual alignment? Using pstat_getprocvm(), determine the starting virtual address for the object. pstat_getprocvm() will also return the page sizes being used by the object. The alignment of that address dictates what page size can be used starting from that location. If it is not a multiple of the specified page size then smaller pages are used until such a point where the virtual address becomes aligned with the page size. You would expect to see some number of 4K's, then some number of 16K's, then some number of 64K's, etc.

You may not get any benefit if you chatr(1) a program you have run recently to a new larger size and then run that program with the new size. Note, this only happens if you have run the program after the last mount of the file system on which the program resides and before the chatr(). The reason for this is that on the first running of the program, pages would have been brought into the page cache with the old page size. When you rerun after chatr'ing, the pages are found in the page cache at a different size than what is required. To get around this problem you need to unmount the file system on which the program resides and remount it.

So all the counters look good, but the statistics show you don't have any large pages within your object. There is the possibility your pages are being demoted. There are operations, both user generated and kernel generated, that result in the need for a page to be specifically 4K. For example, the mprotect() system call can operate on a per-page basis. Having 3 out of 4 pages be read/write and 1 page being read-only won't work for a large page. So prior to the mprotect() operation the underlying large page is converted from its size 'X' to 4K. Operations resulting in demotion are:

    mprotect()
    partial munmap()
    negative sbrk()
    mlock() of 4K pieces in a large page

Note, you can determine if demotions are occurring by examining psv_demotions[] returned from pstat(). Along the lines of "memory depletion", I want to point out the possible side effect of physical size inflation. This occurs when a large page is used and not all the 4K pieces are accessed. By using too large a page, the object itself uses more physical memory than before and can create paging activity where there was none before. Performance may actually decrease because the process spends time "waiting" for page faults to complete.

9.0 What statistics/counters are available to assist me?

9.1 vps_stats.32 & vps_stats.64
vps_stats.{32/64} is an unsupported command provided by HP to report large page size statistics. The .32 version is for 32 bit systems and the .64 is for 64 bit systems. To get a copy of the tool, please contact your HP representative and ask him/her to extract the unsupported tool shar file from: ftp://hpchs.cup.hp.com/tools/11.X/vps_stats.shar What vps_stats.{32/64} reports can be accessed through existing interfaces, each of which is described below.

9.2 Large page statistics


For Large Pages we maintain several kernel statistics of system activity to track performance. These statistics are accessible to user space via pstat (2). The statistics and the pstat() calls that access them are given below (note that only the 32-bit versions are shown).

9.3 Supported page sizes


struct pst_static {
    ...
    int32_t pst_supported_pgsize[PST_N_PG_SIZES];
    ...
};

This system-static value is accessible via pstat_getstatic(). It returns an array of valid page sizes, each given as a number of 4K pages. If there are fewer than PST_N_PG_SIZES page sizes, the array is padded with zeroes.

9.4 User-supplied hints of running processes


struct pst_status {
    ...
    int32_t pst_text_size;   /* Page size used for text objects. */
    int32_t pst_data_size;   /* Page size used for data objects. */
    ...
};

These per-process values are accessible via pstat_getproc(). They reflect the executable's desired text and/or data page size supplied via chatr(1). The page size value is given as a number of 4K pages. If no chatr has been performed on the executable, so that the default page size selection heuristic is being used,

the field value is PST_SZ_DEFAULT.

9.5 Per-region statistics


struct pst_vm_status {
    ...
    int32_t pst_pagesize_hint;
    int32_t pst_vps_pgsizes[PST_N_PG_SIZES];
    ...
};

These values are accessible via pstat_getprocvm(). They are per-region values, i.e. there are separate values for text, data, stack, each memory-mapped and shared memory region, etc. pst_pagesize_hint is a usually-static value that indicates the preferred page size for the region. It is set at region creation time, either from the default page size selection heuristic or from explicit user page size information supplied via chatr. The hint remains the same throughout the life of the region, except for data and stack regions, whose hints can be adjusted upwards as they grow in size. pst_vps_pgsizes[] gives the total number of pages, by page size, currently in use by the region. The array index is the base-2 log of the number of 4K pages in a particular page size, e.g. 0=4K, 2=16K, 4=64K, etc. Note that only translated pages are accounted.

9.6 Global statistics


struct pst_vminfo {
    ...
    int32_t psv_select_success[PST_N_PG_SIZES];
    int32_t psv_select_failure[PST_N_PG_SIZES];
    int32_t psv_pgalloc_success[PST_N_PG_SIZES];
    int32_t psv_pgalloc_failure[PST_N_PG_SIZES];
    int32_t psv_demotions[PST_N_PG_SIZES];
    ...
};

These global values are accessible via pstat_getvminfo(). They tally the success/failure of different stages of large page creation. First, we select a page size to attempt to create. We start with pst_pagesize_hint and adjust due to conditions such as:

    virtual address misalignment
    neighboring pages already in memory or with different Copy-On-Write status
    neighboring pages backed by different sources (e.g. some from file system and some from swap space)

After selecting a size, we increment the psv_select_success counter corresponding to the size. If the size is less than pst_pagesize_hint, we increment psv_select_failure for all the page sizes up to and including pst_pagesize_hint. In this fashion we can determine which page sizes are asked for but are failing, and which are actually being used. Note that the counters may be inflated in some cases. Under certain conditions, we may select a size to try, encounter an exceptional event (e.g. a wait for memory or I/O alignment), and go back and redo the selection stage. Thus, we may tally several times for the same large page creation, possibly on different size counters. We expect this situation to be rare.

After settling on a page size to try, we allocate physical space with the page allocator. psv_pgalloc_success and psv_pgalloc_failure count success/failure of the allocator. The counts are broken down by page size. We tally a success if we ask for a page of a particular size and successfully allocate it, failure otherwise. In some cases, we specify both a desired and minimum acceptable page size. If we succeed at a page size smaller than desired, we increment failure for all the page sizes up to the desired one (similar to the above). Thus, failure counts may appear larger than expected. Note that a psv_select_failure doesn't necessarily generate a psv_pgalloc_failure. The allocator doesn't know if we've adjusted downward before asking for physical space; it only knows if it handed us the page size we requested.

Finally, we may incur trouble even after a large page is allocated and in use. Certain system operations work only on 4K pages; if they encounter a large page, they must demote it to a series of 4K pages. For example, we might need to temporarily re-map or copy an existing large page, and cannot get the resources for the temporary large page. In order to do the re-map/copy, we demote the original page and retry for (less restrictive) resources. Demotions are tallied by page size in psv_demotions. Almost all demotions result in 4K pages, though in rare cases we demote to the next smaller page size.

The kernel currently supports 3 configured kernel parameters that influence the use of variable sized pages:

Variable-Page-Size Parameters:

vps_ceiling          Maximum system-selected page size (Kbytes)
                     Minimum: 4    Maximum: 65536    Default: 16
vps_chatr_ceiling    Maximum chatr-selected page size (Kbytes)
                     Minimum: 4    Maximum: 65536    Default: 65536
vps_pagesize         Default user page size (Kbytes)
                     Minimum: 4    Maximum: 65536    Default: 4

These three parameters are described in detail in section 6.0 above. Statistics on variable page size usage are reported by the unsupported tools vps_stats.32 (32-bit systems) and vps_stats.64 (64-bit systems), available at ftp://hpchs.cup.hp.com/tools/11.X/vps_stats.shar (see section 9.1).

The 32 bit memory map

The current 32-bit address space layout can be depicted by comparing how that virtual address space is used in kernel mode and user mode.
Figure 1 32-bit address space layout on PA1.x


Shared memory
Shared object space, otherwise known as shared memory, occupies 2 quadrants in the system memory map. The size of the memory map is determined by the total amount of memory on the system, i.e. all RAM + swap. The memory map is divided into 4 quadrants. For 32-bit systems shared object space is allocated in quadrants 3 & 4. Within quadrant 3 there is a maximum address space of 1 GB; for quadrant 4 there is a maximum address space of 768 MB, as the last 256 MB of quadrant 4 is reserved for kernel I/O.

If additional shared object space is required for 32-bit operation, there are 2 alternatives available. Regardless of MAGIC type, or whether memory windows are implemented, there is a 1 GB 32-bit architectural limit for a single segment. There is further information on shared memory in the WTEC tools section under SHMEMINFO.

Shared memory use can be monitored by using the ipcs utility:

    ipcs -mob

You will see an output similar to this:

IPC status from /dev/kmem as of Tue Apr 17 09:29:33 2001
T      ID  KEY        MODE         OWNER   GROUP   NATTCH    SEGSZ
Shared Memory:
m       0  0x411c0359 --rw-rw-rw-   root    root        0      348
m       1  0x4e0c0002 --rw-rw-rw-   root    root        1    61760
m       2  0x412006c9 --rw-rw-rw-   root    root        1     8192
m       3  0x301c3445 --rw-rw-rw-   root    root        3  1048576
m    4004  0x0c6629c9 --rw-r-----   root    root        2  7235252
m       5  0x06347849 --rw-rw-rw-   root    root        1    77384
m     206  0x4918190d --rw-r--rw-   root    root        0    22908
m    6607  0x431c52bc --rw-rw-rw- daemon  daemon        1  5767168

The two fields of most interest are NATTCH and SEGSZ.

NATTCH - The number of processes attached to the associated shared memory segment. Look for entries that are 0; they indicate processes that have not released their shared memory segment. If there are multiple segments showing an NATTCH of zero, especially if they are owned by a database, this can be an indication that the segments are not being efficiently released. This is due to the program not calling detachreg. These segments can be removed using ipcrm -m shmid.
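A quick way to spot such segments is to filter the ipcs output for an NATTCH of zero. The field positions below assume the -mob output format shown earlier and may need adjusting on other releases.

    # Sketch: list shared memory segments with no attached processes (NATTCH == 0).
    # With "ipcs -mob", field 7 is NATTCH and field 8 is SEGSZ.
    ipcs -mob | awk '$1 == "m" && $7 == 0 { print "shmid", $2, "key", $3, "size", $8 }'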

Note: Even though there is no process attached to the segment, the data structure is still intact. The shared memory segment and the data structure associated with it are destroyed by executing this command.

SEGSZ - The size of the associated shared memory segment in bytes. The total of SEGSZ for a 32-bit system using EXEC_MAGIC cannot exceed 1879048192 bytes (1.75 GB), or 2952790016 bytes (2.75 GB) for SHMEM_MAGIC.

By default systems operate under EXEC_MAGIC; it is possible to utilize quadrant 2 for additional shared object space by converting, via chatr -m, to SHMEM_MAGIC. An existing application may be relinked as the new executable type SHMEM_MAGIC, or the application can be linked as type EXEC_MAGIC and then chatr'd to be the new executable type SHMEM_MAGIC. It is important to remember that if this choice is made, the available memory for data, kernel stack, text and uarea will be confined to the 1 GB maximum in quadrant 1.

The second alternative is to implement memory windows. This alternative allows for discrete shared object space, called by the getmemwindow command. The ability to create a unique memory window removes the current system-wide 1.75 gigabyte limitation (2.75 gigabytes if compiled as SHMEM_MAGIC). A 32-bit process can create a unique memory window for shared objects like shared memory; other processes can then use this window for shared objects as well. To enable the use of memory windows, the kernel tunable max_mem_window must be set to the desired number of memory windows; the disabled value is 0. The number of memory windows is limited by the total system memory. The theoretical limit is 8192 1 GB windows; at this time the OS and available hardware prevent this.

Magic number review

There are 3 magic numbers that can be used for a 32-bit executable at 11.00: SHARE_MAGIC (DEMAND_MAGIC), EXEC_MAGIC, and SHMEM_MAGIC. For 64-bit 11.00 executables there is currently no need to have different memory maps available, as the standard one allows up to 4 TB for the program text, another 4 TB for its private data and a total of 8 TB for shared areas.

SHARE_MAGIC is the default at 11.0. SHARE_MAGIC is also called DEMAND_MAGIC. With SHARE_MAGIC, quadrant 1 is used for program text, quadrant 2 is used for program data, kernel stack and uarea, and quadrants 3 and 4 are for shared objects. EXEC_MAGIC allows a greater process data space by allowing text and data to share quadrant 1.

Note: Even with SHMEM_MAGIC executables, a single shared memory segment must be contained completely in one quadrant, so 1 GB is still the maximum size of a single shared memory segment.

Memory windows will run on HP-UX 11.0, either 32- or 64-bit installation. To implement memory windows on 11.0 the following patches must be installed:

PHKL_18543 (Critical, Reboot) s700_800 11.00 PM/VM/UFS/async/scsi/io/DMAPI/JFS/perf patch
PHCO_23705 s700_800 11.00 memory windows cumulative patch
PHCO_27375 s700_800 11.00 cumulative SAM/ObAM patch
PHKL_28766 (Critical, Reboot) s700_800 11.00 Probe,IDDS,PM,VM,PA-8700,AIO,T600,FS,PDC,CLK

These have dependencies , for the latest revisions check http://itrc.hp.com/

To configure memory windows, the file /etc/services.window needs to be set up with the correct information. A line should be added in the /etc/services.window file to associate an application with a memory window id. Here is a sample /etc/services.window for 3 Oracle instances. In the example that follows, Oracledb1 uses memory window id 20, Oracledb2 has id 30 and Oracledb3 has id 40.

Oracledb1    20
Oracledb2    30
Oracledb3    40

Two new commands have been added to support memory windows. The getmemwindow command is used to extract window ids of user processes from the /etc/services.window file. The setmemwindow command changes the window id of a running process or starts a specified program in a particular memory window.

The following is a simple script to start a process in a memory window:

    # more startDB1.sh
    WinId=$(getmemwindow Oracledb1)
    setmemwindow -i $WinId /home/dave/memwinbb/startDB1 "DB1 says hello!"

Run the script and see the output from the binary. The setmemwindow command didn't produce any of the output:

    # ./startDB1.sh
    writing to segment: "DB1 says hello!"
    Key is 1377042728

Shared memory is allocated in segments. The size of the segments is limited by the free space within the memory quadrants allotted for shared object space. A shared memory segment must have all of its memory addresses allocated in sequence, with no gaps between addresses; memory allocated in this manner is known as contiguous memory. No process or shared memory segment may cross a quadrant boundary. This is true regardless of architecture or OS. There is an issue with fragmentation of shared memory; the most common cause is excessively sized inode tables. This is true of HFS as well as VxFS. As mentioned in the section on system tables, setting the inode tables to appropriate sizes will minimize this issue.

VMSTAT

The vmstat tool reports virtual memory statistics; it can be a key tool to track suspected memory leaks. The vmstat command reports certain statistics kept about process, virtual memory, trap, and CPU activity. It can also clear the accumulators in the kernel sum structure.

Options
vmstat recognizes the following options:

-d        Report disk transfer information as a separate section, in the form of transfers per second.
-n        Provide an output format that is more easily viewed on an 80-column display device.

          This format separates the default output into two groups: virtual memory information and CPU data. Each group is displayed as a separate line of output. On multiprocessor systems, this display format also provides CPU utilization on a per-CPU basis.
-S        Report the number of processes swapped in and out (si and so) instead of page reclaims and address translation faults (re and at).
interval  Display successive lines, which are summaries over the last interval seconds. If interval is zero, the output is displayed once only. If the -d option is specified, the column headers are repeated; if -d is omitted, the column headers are not repeated. The command "vmstat 5" prints what the system is doing every five seconds. This is a good choice of printing interval since this is how often some of the statistics are sampled in the system; others vary every second.
count     Repeat the summary statistics count times. If count is omitted or zero, the output is repeated until an interrupt or quit signal is received. From the terminal, these are commonly ^C and ^\, respectively (see stty(1)).
-f        Report on the number of forks and the number of pages of virtual memory involved since boot-up.
-s        Print the total number of several kinds of paging-related events from the kernel sum structure that have occurred since boot-up or since vmstat was last executed with the -z option.
-z        Clear all accumulators in the kernel sum structure. This option is restricted to the super-user.

If none of these options is given, vmstat displays a one-line summary of the virtual memory activity since boot-up or since the -z option was last executed.

Column Descriptions
The column headings and the meaning of each column are:

procs    Information about numbers of processes in various states.
         r      in run queue
         b      blocked for resources (I/O, paging, etc.)
         w      runnable or short sleeper (< 20 secs) but swapped
memory   Information about the usage of virtual and real memory. Virtual pages are considered active if they belong to processes that are running or have run in the last 20 seconds.
         avm    active virtual pages
         free   size of the free list
page     Information about page faults and paging activity. These are averaged each five seconds, and given in units per second.
         re     page reclaims (without -S)
         at     address translation faults (without -S)
         si     processes swapped in (with -S)
         so     processes swapped out (with -S)
         pi     pages paged in
         po     pages paged out
         fr     pages freed per second
         de     anticipated short-term memory shortfall
         sr     pages scanned by clock algorithm, per second
faults   Trap/interrupt rate averages per second over last 5 seconds.
         in     device interrupts per second (nonclock)
         sy     system calls per second
         cs     CPU context switch rate (switches/sec)
cpu      Breakdown of percentage usage of CPU time.
         us     user time for normal and low priority processes
         sy     system time
         id     CPU idle

EXAMPLES
The following examples show the output for various command options. For formatting purposes, some leading blanks have been deleted.

1. Display the default output.

    vmstat

    procs        memory                  page                         faults        cpu
    r  b  w    avm    free  re  at  pi  po  fr  de  sr    in   sy   cs  us sy  id
    0  0  0   1158     511   0   0   0   0   0   0   0   111   18    7   0  0 100

2. Add the disk transfer information to the default output.

    vmstat -d

    procs        memory                  page                         faults        cpu
    r  b  w    avm    free  re  at  pi  po  fr  de  sr    in   sy   cs  us sy  id
    0  0  0   1158     511   0   0   0   0   0   0   0   111   18    7   0  0 100

    Disk Transfers
    device     xfer/sec
    c0t6d0            0
    c0t1d0            0
    c0t3d0            0
    c0t5d0            0

3. Display the default output in 80-column format.

    vmstat -n

    VM
           memory                  page                         faults
         avm    free  re  at  pi  po  fr  de  sr    in   sy   cs
        1158     430   0   0   0   0   0   0   0   111   18    7
    CPU
          cpu        procs
      us sy  id     r  b  w
       0  0 100     0  0  0

4. Replace the page reclaims and address translation faults with process swapping in the default output.

   vmstat -S

   procs        memory               page                          faults      cpu
    r  b  w   avm   free  si so pi po fr de sr     in   sy  cs   us sy id
    0  0  0  1158    430   0  0  0  0  0  0  0    111   18   7    0  0 100

5. Display the default output twice at five-second intervals. Note that the headers are not repeated.

   vmstat 5 2

   procs        memory               page                          faults      cpu
    r  b  w   avm   free  re at pi po fr de sr     in   sy  cs   us sy id
    0  0  0  1158    456   0  0  0  0  0  0  0    111   18   7    0  0 100
    0  0  0  1221    436   5  0  5  0  0  0  0    108   65  18    0  1  99

6. Display the default output twice in 80-column format at five-second intervals. Note that the headers are not repeated.

   vmstat -n 5 2

   VM
   memory           page                          faults
     avm   free   re at pi po fr de sr     in   sy  cs
    1221    436    0  0  0  0  0  0  0    111   18   7
   CPU
   cpu       procs
   us sy id   r  b  w
    0  0 100  0  0  0
    1221    435    2  0  2  0  0  0  0    109   35  17
    0  1  99  0  0  0

7. Display the default output and disk transfers twice, in 80-column format, at five-second intervals. Note that the headers are repeated.

   vmstat -dn 5 2

   VM
   memory           page                          faults
     avm   free   re at pi po fr de sr     in   sy  cs
    1221    435    0  0  0  0  0  0  0    111   18   7
   CPU
   cpu       procs
   us sy id   r  b  w
    0  0 100  0  0  0

   Disk Transfers
     device   xfer/sec
     c0t6d0          0
     c0t1d0          0
     c0t3d0          0
     c0t5d0          0

   VM
   memory           page                          faults
     avm   free   re at pi po fr de sr     in   sy  cs
    1219    425    0  0  0  0  0  0  0    111   54  15
   CPU
   cpu       procs
   us sy id   r  b  w
    1  8  92  0  0  0

   Disk Transfers
     device   xfer/sec
     c0t6d0          0
     c0t1d0          0
     c0t3d0          0
     c0t5d0          0

8. Display the number of forks and pages of virtual memory since boot-up.

   vmstat -f

   24558 forks, 1471595 pages, average= 59.92

9. Display the counts of paging-related events.

   vmstat -s

            0 swap ins
            0 swap outs
            0 pages swapped in
            0 pages swapped out
      1344563 total address trans. faults taken
       542093 page ins
         2185 page outs
       602573 pages paged in
         4346 pages paged out
       482343 reclaims from free list
       504621 total page reclaims
          124 intransit blocking page faults
      1460755 zero fill pages created
       404137 zero fill page faults
       366022 executable fill pages created
        71578 executable fill page faults
            0 swap text pages found in free list
       162043 inode text pages found in free list
          196 revolutions of the clock hand
        45732 pages scanned for page out
         4859 pages freed by the clock daemon
     36680636 cpu context switches
   1497746186 device interrupts
      1835626 traps
     87434493 system calls

WARNINGS
Users of vmstat must not rely on the exact field widths and spacing of its output, as these will vary depending on the system, the release of HP-UX, and the data to be displayed.

Module 5 DISK I/O

Disk Bottlenecks:
   - high disk activity
   - high idle CPU time spent waiting for I/O requests to finish
   - long disk queues

Efforts to optimize disk performance will be wasted if the server has insufficient memory.

Running iostat with a five-second interval (for example, iostat 5) will report activity every five seconds. Look at the bps and sps columns for the disks (device) that hold exported file systems. bps shows the number of kilobytes transferred per second during the period; sps shows the number of seeks per second (ignore msps).

To optimize disk I/O consider the following layout guidelines:

Put your most frequently accessed information on your fastest disks, and distribute the workload evenly among identical, mounted disks so as to prevent overload on one disk while another is under-utilized.

Whenever possible, if you plan to have a file system span disks, have the logical volume span identical disk interface types. Best performance results from a striped logical volume that spans similar disks. The more closely you match the striped disks in terms of speed, capacity, and interface type, the better the performance you can expect. For example, when striping across several disks of varying speeds, performance will be no faster than that of the slowest disk.

Increasing the number of disks may not necessarily improve performance, because the maximum efficiency that can be achieved by combining disks in a striped logical volume is limited by the maximum throughput of the file system itself and of the buses to which the disks are attached.

If you have more than one interface card or bus to which you can connect disks, distribute the disks as evenly as possible among them. That is, each interface card or bus should have roughly the same number of disks attached to it. You will achieve the best I/O performance when you use more than one bus and interleave the stripes of the logical volume.

A logical volume's stripe size identifies the size of each of the blocks of data that make up the stripe. You can set the stripe size to 4, 8, 16, 32, or 64 kilobytes (KB); the default is 8 KB.
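To illustrate these guidelines, the following is a minimal sketch of creating a striped logical volume; the volume group name (vg01), disk layout, sizes, and stripe size are hypothetical and must be adapted to your hardware.

   # Assume volume group vg01 already contains four similar disks, ideally
   # spread across two interface cards.  Create a 4 GB logical volume striped
   # across all four disks with a 64 KB stripe size, then put VxFS on it:
   lvcreate -i 4 -I 64 -L 4096 -n lvol_stripe /dev/vg01
   newfs -F vxfs /dev/vg01/rlvol_stripe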


The stripe size of a logical volume is not related to the physical sector size of a disk, which is typically 512 bytes.

Other factors to consider when optimizing file system performance for VxFS are the block size, intent log size, and mount options.

Block size
The unit of allocation in VxFS is a block. There are no fragments because storage is allocated in extents that consist of one or more blocks. The smallest block size available is 1K, which is also the default block size for VxFS file systems created on file systems of less than 8 gigabytes.

Choose a block size based on the type of application being run. For example, if there are many small files, a 1K block size may save space. For large file systems with relatively few files, a larger block size is more appropriate. The trade-off of specifying larger block sizes is a decrease in the amount of space used to hold the free extent bitmaps for each allocation unit, an increase in the maximum extent size, and a decrease in the number of extents used per file, versus an increase in the amount of space wasted at the end of files that are not a multiple of the block size. Larger block sizes use less disk space in file system overhead, but consume more space for files that are not a multiple of the block size. The easiest way to judge which block sizes provide the greatest system efficiency is to try representative system loads against various sizes and pick the fastest.
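The block size (and, as described below, the intent log size) is chosen when the file system is created. A minimal sketch, assuming a hypothetical logical volume and that newfs_vxfs(1M) on your release accepts the bsize and logsize options:

   # Create a VxFS file system with an 8K block size and a 2048-block intent log
   newfs -F vxfs -o bsize=8192,logsize=2048 /dev/vg01/rlvol_data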

To determine the current block size for a VxFS file system:

   fstyp -v /dev/vg00/lvol#

For example:

   # fstyp -v /dev/vg00/lvol1
   f_bsize: 8192

Note: The f_bsize parameter reports the block size for the VxFS file system.

Intent Log size
The intent log size is chosen when a file system is created and cannot be subsequently changed. The mkfs utility uses a default intent log size of 1024 blocks. The default size is sufficient for most workloads. If the system is used as an NFS server or for intensive synchronous write workloads, performance may be improved using a larger log size. With larger intent log sizes, recovery time is proportionately longer and the file system may consume more system resources (such as memory) during normal operation. There are several system performance benchmark suites for which VxFS performs better with larger log sizes. As with block sizes, the best way to pick the log size is to try representative system loads against various sizes and pick the fastest.

Mount options
Standard mount options:

Intent Log Options
These options control how transactions are logged to disk:

1. Full Logging (log) - File system structural changes are logged to disk before the system call returns to the application. If the system crashes, fsck(1M) will complete logged operations that have not completed.

2. Delayed Logging (delaylog) - Some system calls return before the intent log is written. This improves the performance of the system, but some changes are not guaranteed until a short time later when the intent log is written. This mode approximates traditional UNIX system guarantees for correctness in case of system failure.

3. Temporary Logging (tmplog) - The intent log is almost always delayed. This improves performance, but recent changes may disappear if the system crashes. This mode is only recommended for temporary file systems.

4. No Logging (nolog) - The intent log is disabled. The other three logging modes provide for fast file system recovery; nolog does not provide fast file system recovery. With nolog mode, a full structural check must be performed after a crash; this may result in loss of substantial portions of the file system, depending upon activity at the time of the crash. Usually, a nolog file system should be rebuilt with mkfs(1M) after a crash. The nolog mode should only be used for memory resident or very temporary file systems.

Write Options
These options control how user data is written to disk:

1. Direct Writes (direct) - The direct value causes any writes without the O_SYNC flag and all reads to be handled as if the VX_DIRECT caching advisory was set instead.

2. Data Synchronous Writes (dsync) - A write operation returns to the caller after the data has been transferred to external media, but the inode is not updated synchronously if only the times in the inode need to be updated.

3. Sync on Close Writes (closesync) - Sync-on-close I/O mode causes writes to be delayed rather than to take effect immediately, and causes the equivalent of an fsync(2) to be run when a file is closed.

4. Delayed Writes (delay) - This causes writes to be delayed rather than to take effect immediately. No special action is performed when closing a file.

5. Temporary Caching (tmpcache) - The tmpcache value disables delayed extended writes, trading off integrity for performance. When this option is chosen, JFS does not zero out new extents allocated as files are sequentially written. Uninitialized data may appear in files being written at the time of a system crash.

The system administrator can independently control the way writes with and without O_SYNC are handled. The mincache mount option determines how ordinary writes are treated; the convosync option determines how synchronous writes are treated:

   mincache=direct|dsync|closesync|tmpcache
   convosync=direct|dsync|closesync|delay


In addition to the standard mount mode (log mode), VxFS provides blkclear, delaylog, tmplog, nolog, and nodatainlog modes of operation. Caching behavior can be altered with the mincache option, and the behavior of O_SYNC and D_SYNC (see fcntl(2)) writes can be altered with the convosync option. The delaylog and tmplog modes are capable of significantly improving performance. The improvement over log mode is typically about 15 to 20 percent with delaylog; with tmplog, the improvement is even higher. Performance improvement varies, depending on the operations being performed and the workload. Read/write intensive loads should show less improvement, while file system structure intensive loads (such as mkdir, create, and rename) may show over 100 percent improvement. The best way to select a mode is to test representative system loads against the logging modes and compare the performance results.

Most of the modes can be used in combination. For example, a desktop machine might use both the blkclear and mincache=closesync modes.
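As an illustration of how the modes are combined in practice (the device and mount point are hypothetical; the right combination depends on your workload and data integrity requirements):

   # Mount a VxFS file system with delayed logging and sync-on-close caching
   mount -F vxfs -o delaylog,mincache=closesync /dev/vg01/lvol_data /data

   # Equivalent /etc/fstab entry
   /dev/vg01/lvol_data  /data  vxfs  delaylog,mincache=closesync  0  2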

If you plan to use the striped logical volume as a raw data partition (for example, for a database application that uses the device directly), the stripe size should be the same as the primary I/O size for the application.

The VxFS kernel tunables are covered separately; see the System Tables section regarding vx_ninode.

Monitoring free space
In general, VxFS works best if the percentage of free space in the file system does not get below 10 percent. This is because file systems with 10 percent or more free space have less fragmentation and better extent allocation. Regular use of the df_vxfs(1M) command to monitor free space is desirable. Full file systems may have an adverse effect on file system performance. Full file systems should therefore have some files removed, or should be expanded (see fsadm_vxfs(1M) for a description of online file system expansion). The reorganization and resize features of fsadm_vxfs(1M) are available only with the optional HP OnLineJFS product.

If Advanced JFS (OnLineJFS) is installed, defragmentation may yield performance gains. Fragmentation means that files are scattered haphazardly across a disk or disks, the result of growth over time. Multiple disk-head movements are needed to read and update such files, theoretically slowing response time. Defragmentation can be done either from the command line or via SAM. Ideally a server's JFS file systems should be defragmented regularly. The frequency should be based on the volatility of reads/writes/deletes. The easiest way to ensure that fragmentation does not become a problem is to schedule regular defragmentation runs from cron (see the sketch below). Defragmentation scheduling should range from weekly (for frequently used file systems) to monthly (for infrequently used file systems). Extent fragmentation should be monitored with fsadm_vxfs(1M) or the -o s option of df_vxfs(1M).

There are three factors which can be used to determine the degree of fragmentation:
1) percentage of free space in extents of less than eight blocks in length
2) percentage of free space in extents of less than 64 blocks in length
3) percentage of free space in extents of length 64 blocks or greater

An unfragmented file system will have the following characteristics:
   - less than 1 percent of free space in extents of less than eight blocks in length
   - less than 5 percent of free space in extents of less than 64 blocks in length
   - more than 5 percent of the total file system size available as free extents in lengths of 64 or more blocks

A badly fragmented file system will have one or more of the following characteristics:
   - greater than 5 percent of free space in extents of less than eight blocks in length
   - more than 50 percent of free space in extents of less than 64 blocks in length
   - less than 5 percent of the total file system size available as free extents in lengths of 64 or more blocks
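A minimal sketch of such a cron schedule, using the fsadm commands described below against a hypothetical /data file system (the log file location and the weekly timing are arbitrary choices):

   # root crontab entry: extent and directory reorganization every Sunday at 02:00
   # minute hour monthday month weekday  command
   0 2 * * 0  /usr/sbin/fsadm -F vxfs -e -d /data > /var/adm/fsadm_data.log 2>&1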


Note: Defragmentation can be done at the directory level, depending on the volatility of the directory structure, this can be less efficient and not always provide significant increases in performance.

For an extent fragmentation report run:       fsadm -E /mount_point
To execute an extent defragmentation run:     fsadm -e /mount_point
For a directory fragmentation report run:     fsadm -D /mount_point
To execute a directory defragmentation run:   fsadm -d /mount_point

Defragmentation can also be done through SAM:
   Execute sam.
   Select the Disks and File Systems functional area.
   Select the File Systems application.
   Select the JFS (VxFS) file system.
   Select Actions.
   Select the VxFS Maintenance menu item.
   You can choose to view reports:
      Select View Extent Fragmentation Report
      Select View Directory Fragmentation Report
   or perform the defragmentation:
      Select Reorganize Extents
      Select Reorganize Directories

Performance of a file system can be enhanced by a suitable choice of I/O sizes and proper alignment of the I/O requests based on the requirements of the underlying special device. VxFS provides tools to tune the file systems.

Tuning VxFS I/O Parameters (Online JFS 3.3 or higher)

The VxFS file system provides a set of tunable I/O parameters that control some of its behavior. If the default parameters are not acceptable, then the /etc/vx/tunefstab file can be used to set values for I/O parameters. The mount_vxfs (1M) command invokes the vxtunefs (1M) command to process the contents of the /etc/vx/tunefstab file. Please note that the mount command will continue even if the call to vxtunefs fails or if vxtunefs detects invalid parameters. While the file system is mounted, any I/O parameters can be changed using the vxtunefs command which can have tunables specified on the command line or can read them from the /etc/vx/tunefstab file. For more details, see vxtunefs (1M) and tunefstab (4). The vxtunefs command can be used to print the current values of the I/O parameters.
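A brief, hedged example of both methods follows; the mount point, device name, and values are hypothetical, and the exact tunefstab syntax should be confirmed against tunefstab(4) on your release:

   # Print the current I/O parameters of a mounted VxFS file system
   vxtunefs -p /data

   # Change a parameter on the live file system
   vxtunefs -o read_pref_io=65536 /data

   # /etc/vx/tunefstab entry so the values are applied when the device is mounted
   /dev/vg01/lvol_data  read_pref_io=65536,read_nstream=4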

Tunable VxFS I/O Parameters

read_pref_io
   The preferred read request size. The file system uses this in conjunction with the read_nstream value to determine how much data to read ahead. The default value is 64K.

write_pref_io
   The preferred write request size. The file system uses this in conjunction with the write_nstream value to determine how to do flush behind on writes. The default value is 64K.


read_nstream
   The number of parallel read requests of size read_pref_io to have outstanding at one time. The file system uses the product of read_nstream multiplied by read_pref_io to determine its read ahead size. The default value for read_nstream is 1.

write_nstream
   The number of parallel write requests of size write_pref_io to have outstanding at one time. The file system uses the product of write_nstream multiplied by write_pref_io to determine when to do flush behind on writes. The default value for write_nstream is 1.

default_indir_size
   On VxFS, files can have up to ten direct extents of variable size stored in the inode. Once these extents are used up, the file must use indirect extents, which are a fixed size that is set when the file first uses indirect extents. These indirect extents are 8K by default. The file system does not use larger indirect extents because it must fail a write and return ENOSPC if there are no extents available that are the indirect extent size. For file systems with a lot of large files, the 8K indirect extent size is too small.

The files that get into indirect extents use a lot of smaller extents instead of a few larger ones. By using this parameter, the default indirect extent size can be increased so that large files in indirects use fewer, larger extents. NOTE: The tunable default_indir_size should be used carefully. If it is set too large, then writes will fail when they are unable to allocate extents of the indirect extent size to a file. In general, the fewer and the larger the files on a file system, the larger the default_indir_size can be set. This parameter should generally be set to some multiple of the read_pref_io parameter. default_indir_size is not applicable on Version 4 disk layouts.

discovered_direct_iosz
   Any file I/O requests larger than the discovered_direct_iosz are handled as discovered direct I/O. A discovered direct I/O is unbuffered similar to direct I/O, but it does not require a synchronous commit of the inode when the file is extended or blocks are allocated. For larger I/O requests, the CPU time for copying the data into the page cache and the cost of using memory to buffer the I/O data becomes more expensive than the cost of doing the disk I/O. For these I/O requests, using discovered direct I/O is more efficient than regular I/O. The default value of this parameter is 256K.

initial_extent_size
   Changes the default initial extent size. VxFS determines, based on the first write to a new file, the size of the first extent to be allocated to the file. Normally the first extent is the smallest power of 2 that is larger than the size of the first write. If that power of 2 is less than 8K, the first extent allocated is 8K. After the initial extent, the file system increases the size of subsequent extents (see max_seqio_extent_size) with each allocation. Since most applications write to files using a buffer size of 8K or less, the increasing extents start doubling from a small initial extent. initial_extent_size can change the default initial extent size to be larger, so the doubling policy will start from a much larger initial size and the file system will not allocate a set of small extents at the start of the file. Use this parameter only on file systems that will have a very large average file size. On these file systems it will result in fewer extents per file and less fragmentation. initial_extent_size is measured in file system blocks.

max_buf_data_size
   The maximum buffer size allocated for file data; either 8K bytes or 64K bytes. Use the larger value for workloads where large reads/writes are performed sequentially. Use the smaller value on workloads where the I/O is random or is done in small chunks. 8K bytes is the default value.

max_direct_iosz
   The maximum size of a direct I/O request that will be issued by the file system. If a larger I/O request comes in, then it is broken up into max_direct_iosz chunks. This parameter defines how much memory an I/O request can lock at once, so it should not be set to more than 20 percent of memory.

max_diskq
   Limits the maximum disk queue generated by a single file. When the file system is flushing data for a file and the number of pages being flushed exceeds max_diskq, processes will block until the amount of data being flushed decreases. Although this doesn't limit the actual disk queue, it prevents flushing processes from making the system unresponsive. The default value is 1 MB.


max_seqio_extent_size
   Increases or decreases the maximum size of an extent. When the file system is following its default allocation policy for sequential writes to a file, it allocates an initial extent, which is large enough for the first write to the file. When additional extents are allocated, they are progressively larger (the algorithm tries to double the size of the file with each new extent) so each extent can hold several writes' worth of data. This is done to reduce the total number of extents in anticipation of continued sequential writes. When the file stops being written, any unused space is freed for other files to use. Normally this allocation stops increasing the size of extents at 2048 blocks, which prevents one file from holding too much unused space. max_seqio_extent_size is measured in file system blocks.

Tips
Try to align the parameters to match the geometry of the logical disk. With striping or RAID-5, it is common to set read_pref_io to the stripe unit size and read_nstream to the number of columns in the stripe. For striping arrays, use the same values for write_pref_io and write_nstream, but for RAID-5 arrays, set write_pref_io to the full stripe size and write_nstream to 1.

For an application to do efficient disk I/O, it should issue read requests that are equal to the product of read_nstream multiplied by read_pref_io. Generally, any multiple or factor of read_nstream multiplied by read_pref_io should be a good size for performance. For writing, the same rule of thumb applies to the write_pref_io and write_nstream parameters. When tuning a file system, the best thing to do is try out the tuning parameters under a real life workload.

If an application is doing sequential I/O to large files, it should try to issue requests larger than the discovered_direct_iosz. This causes the I/O requests to be performed as discovered direct I/O requests, which are unbuffered like direct I/O but do not require synchronous inode updates when extending the file. If the file is larger than can fit in the cache, then using unbuffered I/O avoids throwing useful data out of the cache and it avoids a lot of CPU overhead.
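As a worked illustration of this rule of thumb, assume a hypothetical striped volume with a 64 KB stripe unit across four columns; the corresponding /etc/vx/tunefstab entry would be:

   # 64 KB stripe unit x 4 columns:
   #   read_pref_io = 65536 (stripe unit size), read_nstream = 4 (number of columns)
   # so the preferred application I/O size is 4 x 64 KB = 256 KB.
   /dev/vg01/lvol_stripe  read_pref_io=65536,read_nstream=4,write_pref_io=65536,write_nstream=4
   # For a RAID-5 array with one parity column, write_pref_io would instead be the
   # full data stripe (3 x 64 KB = 196608) and write_nstream would be 1.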

Cache Advisories

The VxFS file system allows an application to set cache advisories for use when accessing files. These advisories are in memory only and they do not persist across reboots. Some advisories are currently maintained on a per-file, not a per-file-descriptor, basis. This means that only one set of advisories can be in effect for all accesses to the file. If two conflicting applications set different advisories, both use the last advisories that were set. All advisories are set using the VX_SETCACHE ioctl. The current set of advisories can be obtained with the VX_GETCACHE ioctl. For details on the use of these ioctls, see vxfsio(7). The VX_SETCACHE ioctl is available only with the HP OnLineJFS product.

Direct I/O
Direct I/O is an unbuffered form of I/O. If the VX_DIRECT advisory is set, the user is requesting direct data transfer between the disk and the user-supplied buffer for reads and writes. This bypasses the kernel buffering of data, and reduces the CPU overhead associated with I/O by eliminating the data copy between the kernel buffer and the user's buffer. This also avoids taking up space in the buffer cache that might be better used for something else. The direct I/O feature can provide significant performance gains for some applications. For an I/O operation to be performed as direct I/O, it must meet certain alignment criteria. The alignment constraints are usually determined by the disk driver, the disk controller, and the system memory management hardware and software. The file offset must be aligned on a 4-byte boundary. If a request fails to meet the alignment constraints for direct I/O, the request is performed as data synchronous I/O. If the file is currently
being accessed by using memory mapped I/O, any direct I/O accesses are done as data synchronous I/O. Since direct I/O maintains the same data integrity as synchronous I/O, it can be used in many applications that currently use synchronous I/O. If a direct I/O request does not allocate storage or extend the file, the inode is not immediately written. The CPU cost of direct I/O is about the same as a raw disk transfer. For sequential I/O to very large files, using direct I/O with large transfer sizes can provide the same speed as buffered I/O with much less CPU overhead. If the file is being extended or storage is being allocated, direct I/O must write the inode change before returning to the application. This eliminates some of the performance advantages of direct I/O. The direct I/O and VX_DIRECT advisories are maintained on a per-file-descriptor basis.

Unbuffered I/O
If the VX_UNBUFFERED advisory is set, I/O behavior is the same as direct I/O with the VX_DIRECT advisory set, so the alignment constraints that apply to direct I/O also apply to unbuffered. For I/O with unbuffered I/O, however, if the file is being extended, or storage is being allocated to the file, inode changes are not updated synchronously before the write returns to the user. The VX_UNBUFFERED advisory is maintained on a per-file-descriptor basis.

Discovered Direct I/O

Discovered Direct I/O is not a cache advisory that the user can set using the VX_SETCACHE ioctl. When the file system gets an I/O request larger than the default size of 128K, it tries to use direct I/O on the request. For large I/O sizes, Discovered Direct I/O can perform much better than buffered I/O. Discovered Direct I/O behavior is similar to direct I/O and has the same alignment constraints, except writes that allocate storage or extend the file size do not require writing the inode changes before returning to the application.

Data Synchronous I/O

If the VX_DSYNC advisory is set, the user is requesting data synchronous I/O. In synchronous I/O, the data is written, and the inode is written with updated times and (if necessary) an increased file size. In data synchronous I/O, the data is transferred to disk synchronously before the write returns to the user. If the file is not extended by the write, the times are updated in memory, and the call returns to the user. If the file is extended by the operation, the inode is written before the write returns. Like direct I/O, the data synchronous I/O feature can provide significant application performance gains. Since data synchronous I/O maintains the same data integrity as synchronous I/O, it can be used in many applications that currently use synchronous I/O. If the data synchronous I/O does not allocate storage or extend the file, the inode is not immediately written. The data synchronous I/O does not have any alignment constraints, so applications that find it difficult to meet the alignment constraints of direct I/O should use data synchronous I/O. If the file is being extended or storage is allocated, data synchronous I/O must write the inode change before returning to the application. This case eliminates the performance advantage of data synchronous I/O. The direct I/O and VX_DSYNC advisories are maintained on a per-file-descriptor basis.

Other Advisories
The VX_SEQ advisory indicates that the file is being accessed sequentially. When the file is being read, the maximum read-ahead is always performed. When the file is written, instead of trying to determine whether the I/O is sequential or random by examining the write offset, sequential I/O is assumed. The pages for the write are not immediately flushed. Instead, pages are flushed some distance behind the current write point. The VX_RANDOM advisory indicates that the file is being accessed randomly. For reads, this disables read-ahead. For writes, this disables the flush-behind. The data is flushed by the pager, at a rate based on memory contention. The VX_NOREUSE advisory is used as a modifier. If both VX_RANDOM and VX_NOREUSE are set, pages are immediately freed and put on the quick reuse free list as soon as the data has been used. If VX_NOREUSE is set when doing sequential I/O, pages are also put on the quick reuse free list when they are flushed. The VX_NOREUSE may slow down access to the file, but it can reduce the cached data held by the system. This can allow more data to be cached for other files and may speed up those accesses. VxFS provides the VX_GET_IOPARAMETERS ioctl to get the recommended I/O sizes to use on a file system. This ioctl can be used by the application to make decisions about the I/O sizes issued to VxFS for a file or file device. For more details on this ioctl, refer to vxfsio (7)


Raw asynchronous logical volumes

Some database vendors recommend using raw logical volumes for faster I/O. This is best implemented with asynchronous I/O. The difference between async I/O and synchronous I/O is that async does not wait for confirmation of the write before moving on to the next task. This increases the speed of disk performance at the expense of robustness. Synchronous I/O waits for acknowledgement of the write (or failure) before continuing on. The write can have physically taken place or could be in the buffer cache, but in either case acknowledgement has been sent. In the case of async, there is no waiting.

To implement asynchronous I/O on HP-UX for raw logical volumes:

* Set the async_disk driver (Asynchronous Disk Pseudo Driver) to IN in the HP-UX kernel; this will require generating a new kernel and rebooting.

* Create the device file:

   # mknod /dev/async c 101 0x00000#

  where # (the last digit of the minor number) can be one of the following values:
   0x000000 default
   0x000001 enable immediate reporting
   0x000002 flush the CPU cache after reads
   0x000004 allow disks to timeout
   0x000005 is a combination of 1 and 4
   0x000007 is a combination of 1, 2 and 4

Note: Contact the database vendor or product vendor to determine the correct minor number for your application.

Change the ownership to the appropriate group and owner:

   chown oracle:dba /dev/async

Change the permissions:

   chmod 660 /dev/async

Edit /etc/privgroup (vi /etc/privgroup) and add one line:

   dba MLOCK

Give the group the MLOCK privilege:

   /usr/sbin/setprivgrp dba MLOCK

To verify whether a group has the MLOCK privilege, execute:

   /usr/bin/getprivgrp
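Putting the steps together, a minimal sketch of the whole setup follows; it assumes an Oracle-style dba group and uses the default minor number, which should be confirmed with your database vendor:

   # Assumes the async_disk driver is already configured into the kernel
   mknod /dev/async c 101 0x000000      # default minor number
   chown oracle:dba /dev/async
   chmod 660 /dev/async
   echo "dba MLOCK" >> /etc/privgroup   # grant MLOCK at boot time
   /usr/sbin/setprivgrp dba MLOCK       # grant MLOCK now
   /usr/bin/getprivgrp dba              # verify; the output should list MLOCK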

Disk I/O monitoring tools

The two standard utilities to measure disk I/O are sar -d and iostat. In order to get a statistically significant sample, run them over a sufficient time to detect load variances. sar -d 5 100 is a good starting point (an 8.3 minute sample). The output will look similar to:

   device   %busy   avque   r+w/s   blks/s   avwait   avserv
   c1t6d0    0.80    0.50       1        4     0.27    13.07
   c4t0d0    0.60    0.50       1        4     0.26     8.60

There will be an average printed at the end of the report.

   %busy    Portion of time the device was busy servicing a request
   avque    Average number of requests outstanding for the device
   r+w/s    Number of data transfers per second (reads and writes) from and to the device
   blks/s   Number of bytes transferred (in 512-byte units) from and to the device
   avwait   Average time (in milliseconds) that transfer requests waited idly on queue for the device
   avserv   Average time (in milliseconds) to service each transfer request (includes seek, rotational latency, and data transfer times) for the device

When average wait (avwait) is greater than average service time (avserv), it indicates the disk can't keep up with the load during that sample. When the average queue length (avque) exceeds the norm of 0.50 it is an indication of jobs stacking up. These conditions are considered to be a bottleneck. It is prudent to keep in mind how long these conditions last. If the queue flushes, or the avwait clears in a reasonable time (for example 5 seconds), it is not a cause for concern. Keep in mind that the more jobs in a queue, the greater the effect on wait on I/O, even if they are small. Large jobs, those greater than 1000 blks/s, will also affect throughput. Also consider the type of disks being used. Modern disk arrays are capable of handling very large amounts of data in very short processing times, handling loads of 5000 blks/s or greater in under 10 ms. Older standard disks may show far less capability. The avwait is similar to the %wio value returned by sar -u for the CPU.
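As a small helper sketch (it assumes avserv is the last field and avwait the next-to-last on each device line, as in the layout above; adjust the field positions if your sar output differs):

   # Print only the sar -d samples where avwait exceeds avserv
   sar -d 5 100 | awk 'NF >= 7 && ($(NF-1) + 0) > ($NF + 0)'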

IOSTAT
Another way to sample disk activity is to run iostat with a time interval, for example: iostat 5

iostat iteratively reports I/O statistics for each active disk on the system. Disk data is arranged in a four-column format:

   Column Heading   Interpretation
   device           Device name
   bps              Kilobytes transferred per second
   sps              Number of seeks per second
   msps             Milliseconds per average seek

If two or more disks are present, data is presented on successive lines for each disk. To compute this information, seeks, data transfer completions, and the number of words transferred are counted for each disk. Also, the state of each disk is examined HZ times per second (as defined in <sys/param.h>) and a tally is made if the disk is active. These numbers can be combined with the transfer rates of each device to determine average seek times for each device. With the advent of new disk technologies, such as data striping, where a single data transfer is spread across several disks, the number of milliseconds per average seek becomes impossible to compute accurately. At best it is only an approximation, varying greatly, based on several dynamic system conditions. For this reason and to maintain backward compatibility, the milliseconds per average seek (msps) field is set to the value 1.0.

Options
iostat recognizes the following options and command-line arguments:

-t        Report terminal statistics as well as disk statistics. Terminal statistics include:
   tin    Number of characters read from terminals
   tout   Number of characters written to terminals
   us     Percentage of time system has spent in user mode
   ni     Percentage of time system has spent in user mode running low-priority (nice) processes
   sy     Percentage of time system has spent in system mode
   id     Percentage of time system has spent idling

interval  Display successive lines, which are summaries of the last interval seconds. The first line reported is for the time since a reboot and each subsequent line is for the last interval only.

count     Repeat the statistics count times.

EXAMPLES 1. Show current I/O statistics for all disks: iostat

2. Display I/O statistics for all disks every 10 seconds until INTERRUPT or QUIT is pressed: iostat 10

3. Display I/O statistics for all disks every 10 seconds and terminate after 5 successive readings: iostat 10 5

4. Display I/O statistics for all disks every 10 seconds, also show terminal and processor statistics, and terminate after 5 successive readings: iostat -t 10 5

WARNINGS
Users of iostat must not rely on the exact field widths and spacing of its output, as these will vary depending on the system, the release of HP-UX, and the data to be displayed.


Module 6 Network Performance

- Excessive demand on an NFS server
- LAN bandwidth limitations

Guidelines
- Keep NFS servers and their clients on the same LAN segment or subnet. If this is not practical, and you have control over the network hardware, use switches, rather than hubs, bridges and routers, to connect the workgroup.
- As far as possible, dedicate a given server to one type of task. For example, in our sample network (see A Sample Workgroup / Network) flserver acts as a file server, exporting directories to the workstations, whereas appserver is running applications. If the workgroup needed a web server, it would be wise to configure it on a third, high-powered system that was not doing other heavy work.
- Make sure servers have ample memory.
- On file servers, use your fastest disks for the exported file systems, and for swap. Distribute the workload evenly across these disks. For example, if two teams are doing I/O intensive work, put their files on different disks or volume groups. Distribute the disks evenly among the system's I/O controllers.
- For exported HFS file systems, make sure the NFS read and write buffer size on the client match the block size on the server. You can set these values when you import the file system onto the NFS client; see the Advanced Options pop-up menu on SAM's Mounted Remote File Systems screen. See Checking NFS Server/Client Block Size for directions for checking and changing the values.
- Enable asynchronous writes on exported file systems. For HFS, set fsasync to 1 in the kernel.
- Make sure enough nfsd daemons are running on the servers. As a rule, the number of nfsds running should be twice the number of disk spindles available to NFS clients. For example, if a server is exporting one file system, and it resides on a volume group comprising three disks, you should probably be running six nfsds on the server.

TIPS
- Monitor server memory frequently.
- Keep exported files and directories as small as possible. Large files require more NFS operations than small ones, and large directories take longer to search. Encourage your users to weed out large, unnecessary files regularly (see Finding Large Files).
- Monitor server and client performance regularly.
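To apply the nfsd rule of thumb, first count the daemons that are running; the startup count is normally set in /etc/rc.config.d/nfsconf (the NUM_NFSD variable name should be verified against your release):

   # Count the nfsd daemons currently running
   ps -ef | grep -c '[n]fsd'

   # Example from the guideline above: three spindles exported, so run six nfsds.
   # Set this in /etc/rc.config.d/nfsconf and restart NFS (or reboot):
   NUM_NFSD=6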

In practice, though, a server is dealing with many I/O requests at a time, and intelligence is designed into the drivers to take account of the current head location and direction when deciding on the next seek. This means that defragmenting an HFS file system on HP-UX may never be necessary; JFS file systems, however, do need to be defragmented regularly.

Tuning Fibre Channel Network Performance


Two TCP extensions have been designed to improve performance over paths with large bandwidth and delay products, and to provide reliable operation over very high-speed paths. The first extension increases the number of TCP packets that can be sent before the first packet sent is acknowledged. This is called window scaling. The second extension, time stamping, provides a more reliable data delivery mechanism. The following two options turn these extensions on or off in the kernel:

   tcp_dont_winscale   Setting this option to 0 means do window scaling; any non-zero value means don't do window scaling.
   tcp_dont_tsecho     Setting this option to 0 means enable the time stamp option; any non-zero value disables the time stamp option.

Two TCP variables, which determine the amount of memory used for socket buffers, affect the window scaling and time stamp options. The following are the default settings for HSC Fibre Channel recommended by Hewlett-Packard:

   tcp_sendspace = 0x30000
   tcp_recvspace = 0x30000

To change these defaults, copy, modify, and execute the following script. Since your performance improvement depends on a number of factors, including available memory and network load, you may want to experiment with these settings. Hewlett-Packard recommends that you enable both window scaling and time stamping.

/**************** Begin Script ***************************/
#!/bin/ksh
adb -w /stand/vmunix /dev/kmem << EOF
#This section changes tcp_sendspace and tcp_recvspace
#in the live kernel (/dev/kmem).
tcp_sendspace/W 20000
tcp_recvspace/W 20000
#For HSC Fibre Channel, use tcp_sendspace/W 30000
#For HSC Fibre Channel, use tcp_recvspace/W 30000
#This section changes tcp_sendspace and tcp_recvspace
#in the on-disk kernel file (/stand/vmunix).
tcp_sendspace?W 20000
tcp_recvspace?W 20000
#For HSC Fibre Channel, use tcp_sendspace?W 30000
#For HSC Fibre Channel, use tcp_recvspace?W 30000
#window scaling enabled in live kernel
tcp_dont_winscale/W 0
#window scaling enabled in the kernel file
tcp_dont_winscale?W 0
#Timestamp option enabled in live kernel
tcp_dont_tsecho/W 0
#Timestamp option enabled in the kernel file
tcp_dont_tsecho?W 0
EOF
/****************** End Script ********************************/

Tuning Fibre Channel Mass Storage Performance

Two parameters are available for configuring HP Fibre Channel/9000 for maximum mass storage performance. The first parameter controls the type of memory that is allocated by the system. This is dependent upon the number of FC adapters in the system. The second parameter may be used to override the default number of concurrent FCP requests allowed on the adapter. The optimal number of concurrent requests is dependent upon a number of factors, including device characteristics, I/O load, and host memory. The following two options set these parameters in the kernel:

   num_tachyon_adapters   Set this parameter to the number of HSC Fibre Channel adapters in the system.
   max_fcp_reqs           Set this parameter to the number of concurrent FCP requests allowed on the adapter. The default value is 512.
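A hedged sketch of setting them follows; whether these names appear as configurable parameters, and the correct values, depend on your HP-UX release and adapter configuration, so verify with SAM's Kernel Configuration area before rebuilding:

   # Hypothetical values for a system with two HSC Fibre Channel adapters
   kmtune -s num_tachyon_adapters=2
   kmtune -s max_fcp_reqs=512
   mk_kernel          # build the new kernel
   kmupdate           # install it at the next reboot, then shutdown -r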

Module 7 General Tools

It is important to keep in mind that no one utility is completely accurate. The shortest sample period of any utility is 1 second. Considering that every CPU will have no fewer than 10 processes running per second, at best we are only getting an approximation of the true activity. When analyzing performance data, it is wise to look at all the data, as problems often have causes that are related to resource issues in other regions of the system from what appears the obvious cause.

ADB
adb is a general-purpose debugging program that is sensitive to the underlying architecture of the processor and operating system on which it runs. It can be used to examine files and provide a controlled environment for executing HP-UX programs. For 64 bit, adb64 is invoked. It operates on assembly language programs. It allows you to look at object files and "core" files that result from aborted programs, to print output files in a variety of formats, to patch files, and to run programs with embedded breakpoints.

Information on adb can be found at http://docs.hp.com/ in the following manuals:
   HP-UX 11 Reference Manual volume 1 section 1
   ADB Tutorial
   Streams/UX for the HP 9000 Reference Manual

It can also be used to gather system information useful in performance tuning.

To determine the physical memory (RAM):

for HP-UX 10.x:
   echo physmem/D | adb /stand/vmunix /dev/kmem
   physmem:
   physmem:        24576

for HP-UX 11.x systems running on 32 bit architecture:
   echo phys_mem_pages/D | adb /stand/vmunix /dev/kmem
   phys_mem_pages:
   phys_mem_pages: 24576

for HP-UX 11.x systems running on 64 bit architecture:
   echo phys_mem_pages/D | adb64 /stand/vmunix /dev/mem
   phys_mem_pages:
   phys_mem_pages: 262144

The results of these commands are in memory pages; multiply by 4096 to obtain the size in bytes.

To determine the amount of lockable memory:
   echo total_lockable_mem/D | adb /stand/vmunix /dev/mem
   total_lockable_mem:
   total_lockable_mem:     185280

for HP-UX 11.x systems running on 64 bit architecture:
   echo total_lockable_mem/D | adb64 /stand/vmunix /dev/mem

To determine the number of free swap pages:
   echo swapspc_cnt/D | adb /stand/vmunix /dev/kmem
   swapspc_cnt:
   swapspc_cnt:    216447

This will display the number of free swap pages. Multiply the number returned by 4096 for the number of free swap bytes.

To determine the processor speed:
   echo itick_per_usec/D | adb /stand/vmunix /dev/mem
   itick_per_usec:
   itick_per_usec: 360

To determine the number of processors in use:
   echo "runningprocs/D" | adb /stand/vmunix /dev/mem
   runningprocs:
   runningprocs:   2

To determine the number of pages of buffer cache (4 KB in size):
   echo bufpages/D | adb /stand/vmunix /dev/mem
   bufpages:
   bufpages:       18848

To display kernel parameters using adb, use the parameter name:
   echo nproc/D | adb /stand/vmunix /dev/mem
   nproc:
   nproc:          276

To determine the number of VxFS inodes in use:
   echo vxfs_ninode/D | adb /stand/vmunix /dev/mem
   vxfs_ninode:
   vxfs_ninode:    64000

To determine the kernel you are booted from:

10.x example:
   echo 'boot_string/S' | adb /stand/vmunix /dev/mem
   boot_string:
   boot_string:    disc(52.6.0;0)/stand/vmunix

11.x example:
   echo 'boot_string/S' | adb /stand/vmunix /dev/mem
   boot_string:
   boot_string:    disk(0/0/2/0.6.0.0.0.0.0;0)/stand/vmunix

IPCS - report status of interprocess communication facilities

ipcs displays certain information about active interprocess communication facilities. With no options, ipcs displays information in short format for the message queues, shared memory segments, and semaphores that are currently active in the system.

Options
The following options restrict the display to the corresponding facilities.
   (none)  This is equivalent to -mqs.
   -m      Display information about active shared memory segments.
   -q      Display information about active message queues.
   -s      Display information about active semaphores.

The following options add columns of data to the display. See "Column Descriptions" below.
   (none)  Display default columns: for all facilities: T, ID, KEY, MODE, OWNER, GROUP.
   -a      Display all columns, as appropriate. This is equivalent to -bcopt.
   -b      Display largest-allowable-size information: for message queues: QBYTES; for shared memory segments: SEGSZ; for semaphores: NSEMS.
   -c      Display creator's login name and group name: for all facilities: CREATOR, CGROUP.
   -o      Display information on outstanding usage: for message queues: CBYTES, QNUM; for shared memory segments: NATTCH.
   -p      Display process number information: for message queues: LSPID, LRPID; for shared memory segments: CPID, LPID.
   -t      Display time information: for all facilities: CTIME; for message queues: STIME, RTIME; for shared memory segments: ATIME, DTIME; for semaphores: OTIME.

The following options redefine the sources of information.
   -C core      Use core in place of /dev/kmem. core can be a core file or a directory created by savecrash or savecore.
   -N namelist  Use file namelist or the namelist within core in place of /stand/vmunix.

Column Descriptions
The column headings and the meaning of the columns in an ipcs listing are given below. The columns are printed from left to right in the order shown below.

T  Facility type:
   m  Shared memory segment
   q  Message queue
   s  Semaphore


ID    The identifier for the facility entry.
KEY   The key used as an argument to msgget(), semget(), or shmget() to create the facility entry. (Note: The key of a shared memory segment is changed to IPC_PRIVATE when the segment has been removed until all processes attached to the segment detach it.)
MODE  The facility access modes and flags. The mode consists of 11 characters that are interpreted as follows.
      The first two characters can be:
         R  A process is waiting on a msgrcv().
         S  A process is waiting on a msgsnd().
         D  The associated shared memory segment has been removed. It will disappear when the last process attached to the segment detaches it.
         C  The associated shared memory segment is to be cleared when the first attach is executed.
         -  The corresponding special flag is not set.

      The next 9 characters are interpreted as three sets of three characters each. The first set refers to the owner's permissions, the next to permissions of others in the group of the facility entry, and the last to all others. Within each set, the first character indicates permission to read, the second character indicates permission to write or alter the facility entry, and the last character is currently unused.
         r  Read permission is granted.
         w  Write permission is granted.
         a  Alter permission is granted.
         -  The indicated permission is not granted.

OWNER    The login name of the owner of the facility entry.
GROUP    The group name of the group of the owner of the facility entry.
CREATOR  The login name of the creator of the facility entry.
CGROUP   The group name of the group of the creator of the facility entry.
CBYTES   The number of bytes in messages currently outstanding on the associated message queue.
QNUM     The number of messages currently outstanding on the associated message queue.
QBYTES   The maximum number of bytes allowed in messages outstanding on the associated message queue.
LSPID    The process ID of the last process to send a message to the associated message queue.
LRPID    The process ID of the last process to receive a message from the associated message queue.
STIME    The time the last msgsnd() message was sent to the associated message queue.
RTIME    The time the last msgrcv() message was received from the associated message queue.
CTIME    The time when the associated facility entry was created or changed.
NATTCH   The number of processes attached to the associated shared memory segment.
SEGSZ    The size of the associated shared memory segment.
CPID     The process ID of the creating process of the shared memory segment.
LPID     The process ID of the last process to attach or detach the shared memory segment.
ATIME    The time the last shmat() attach was completed to the associated shared memory segment.
DTIME    The time the last shmdt() detach was completed on the associated shared memory segment.
NSEMS    The number of semaphores in the set associated with the semaphore entry.
OTIME    The time the last semop() semaphore operation was completed on the set associated with the semaphore entry.
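For performance work, a common use is checking shared memory segments, for example those created by a database. A brief example (the exact columns shown depend on the options and the HP-UX release):

   # Shared memory segments with their sizes (-b) and attach counts (-o)
   ipcs -mbo
   # Add creator, process and time information as well
   ipcs -mbcopt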

WARNINGS
ipcs produces only an approximate indication of actual system status because system processes are continually changing while ipcs is acquiring the requested information. Do not rely on the exact field widths and spacing of the output, as these will vary depending on the system, the release of HP-UX, and the data to be displayed.

SAR - The System Activity Reporter

The sar utility is available on all systems and can provide valuable data to assist in identifying problems and making changes to optimize the system's efficiency. It is one of the least intrusive tools to gather performance related statistics. Interpretation of the data takes time and experience. It is important when gathering any data to get a statistically significant sample. For analysis purposes, a sample every 5 seconds for at least 100 iterations is the smallest amount of data to consider. For example, to look at the disk I/O on a system run sar -d 5 100. Only disks with activity will report; you may see some samples during the report in which not all disks are present. This only indicates there was no disk activity during that sample period.

   device   %busy   avque   r+w/s   blks/s   avwait   avserv
   c1t6d0    0.80    0.50       1        4     0.27    13.07
   c4t0d0    0.60    0.50       1        4     0.26     8.60

Keep in mind that read and write transactions are system calls. When an application is producing a heavy load on disk, the % system in the CPU reports may appear higher than expected.

Data can be obtained on the following areas:
   -d      Block Device (disk or tape)
   -b      Buffer Cache
   -u, -q  CPU use and run queue
   -a      File system access routines
   -m      Message and Semaphore activity
   -c      System Calls
   -s      System swapping and context switching
   -v      System tables - process, inode, file
   -y      TTY device

On multiprocessor systems you must use the -M switch for a detailed report of each CPU.
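For example (a brief sketch using the same conservative interval and count suggested above):

   # Per-CPU utilization, plus a system-wide summary
   sar -Mu 5 100
   # Per-CPU run queue statistics
   sar -Mq 5 100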

Time
/usr/bin/time is a UNIX command that can be used to run a program and determine what percentage of time is being spent in user code and what percentage is being spent in the system. Upon completion, time prints the elapsed time during the command, the time spent in the system, and the time spent executing the command. Times are reported in seconds. Execution time can depend on the performance of the memory in which the program is running. The times are printed on standard error. For example:

   $ /bin/time bdf
   real   15.2
   user   11.4
   sys     0.4

Timex
When run without switches this is the equivalent of the time command. The timex command can be useful for determining the impact a command has on the system.

   -o  Report the total number of blocks read or written and total characters transferred by command and all its children. This option works only if the process accounting software is installed.

   -p [fhkmrt]  List process accounting records for command and all its children. The suboptions f, h, k, m, r, and t modify the data items reported. They behave as defined in acctcom(1M). The number of blocks read or written and the number of characters transferred are always reported. This option works only if the process accounting software is installed and /usr/lib/acct/turnacct has been invoked to create /var/adm/pacct.

   -s  Report total system activity (not just that due to command) that occurred during the execution interval of command. All the data items listed in sar(1M) are reported.

Tools available for purchase

GlancePlus
While Glance is not available on all systems, it is a key diagnostic tool to use when available. It is resource intensive, so running it while the system is under a severe load may impose an unacceptable additional burden. The OpenView GlancePlus Concepts Guide is available at: http://ovweb.external.hp.com/ovnsmdps/pdf/concepts_lx.pdf

The following white paper is available internally at : http://rpm-www.rose.hp.com/rpm/papers/glanceff.htm

Using Glance Effectively
Doug Grumann
Hewlett-Packard Company

Introduction
Many people have used Glance, which is a powerful tool for system performance diagnosis. Although Glance is very popular, many users do not take full advantage of the capabilities of the product, or do not understand how its many metrics can be used to optimize their systems' performance. In this article, I will present some background information on general system performance principles, cover some tips and techniques for getting the most from Glance, and list some common performance problems and how Glance can be used to characterize them. I'll also discuss how to customize your use of Glance to best suit your environment. This article is intended primarily for those who have a basic knowledge of the product and it's not intended as a tutorial for new users.

Performance Analysis
Many articles have been written on the art of system performance analysis. In an ideal situation, performance tools would not be necessary at all. Your computer system would optimize its resources automatically, and continually adjust its behavior based on the workload. In reality, it is up to system administrators to optimize system performance manually. I believe that tuning performance will always remain somewhat of an art. There are too many variables and dependencies in constant flux for a self-diagnostic to handle. For example, even engineers that write the HP-UX operating system cannot always determine the performance impact of every change and feature they code into the kernel. This is one reason why we have user-configurable kernel parameters and options such as disk mirroring, the logical volume manager, and commands to adjust process scheduling priorities. These facilities allow you to manage your configuration to best optimize the performance of your particular system. Different features affect performance in different ways. To optimize performance in your environment, you need to understand your workload and understand the major resources of the system which may be under stress. Let's briefly review some of the guiding principles of performance analysis:

Know your system. Your task of solving a performance problem will be much harder if you don't know what the system looks like when it is performing well. If you're lucky and proactive, you can get an understanding of the normal everyday workloads, throughput, and response times of the systems you manage before a performance crisis occurs. Then when you later take steps to tune a system, you'll have baseline knowledge to compare against.

Define the symptom. Users like to say things like "the system's too slow," but subjective complaints are hard to address. Before you start changing things, define exactly what's wrong and try to set goals so that you'll know if you were successful. Many administrators use response time or throughput metrics to define their goals. Try to find something quantifiable, and write the goals down along with your measurements.

Characterize the bottleneck. People who do performance analysis consulting use the term bottleneck a lot. A bottleneck is a resource which is at maximum capacity, and cannot keep up with the demands being placed on it. In other words, the bottlenecked resource is the part of the computer responsible for the system not running faster. A more powerful CPU will do you no good if your performance bottleneck is in the disk I/O subsystem. Measuring performance with a tool like Glance allows you to characterize which resources are constrained so you can determine how to alleviate the bottleneck.

Change one thing at a time. Once you've isolated a performance problem and you decide how to address it, change only one thing at a time. If you change more than one thing at once, then you will not know which change helped performance. It's also possible that one change will improve performance while another makes it worse, but you won't know that unless you implement them separately and measure performance in-between.

A complete discussion of performance analysis might include information on topics such as benchmarking, system sizing, workgroup computing, and capacity planning. Other HP products such as MeasureWare and PerfView address more long-term performance data collection and analysis needs. These topics are beyond the scope of this article. I will concentrate on the area of performance analysis that Glance is made to address: single-system on-line performance diagnosis.

Glance Overview
The Glance product is available on several platforms including HP-UX, Solaris, and AIX. I will focus this material for HP-UX Glance. Note that the implementations of Glance differ in minor ways on the different platforms. In all cases, the purpose of the product is to address the "what's going on right now" type of question for system administrators. There are two user interfaces for Glance. The original interface is a curses-based character mode interface named simply glance. Two years ago, a second user interface was added to the product. This Motif-based interface is named gpm. You may use either or both programs to display performance data. The gpm interface imposes more memory and CPU overhead on your system, however you may find it more intuitive, and some of its features go beyond what the character-mode interface provides. For the remainder of this article I will refer to gpm exclusively, however most of the examples apply equally well to either interface. People often ask me why the data shown in Glance sometimes differs from the data shown by tools such as sar, vmstat, iostat, and top. Most often, the root cause of discrepancies is due to the underlying collection methodology. Glance uses special tracing features in the HP-UX kernel which are translated into performance metrics via the midaemon process. The "kmem" tools like top get their data from counters in the kernel which are maintained by sampling instrumentation. Because a tracing methodology can capture all system state information, it is more accurate than data which is obtained via periodic sampling. I strongly encourage new Glance users to get into gpm's award-winning on-line help subsystem, and view its Guided Tour topics. The Guided Tour introduces you to the product and its concepts. Experienced Glance users (like myself!) also find the on-line help invaluable, with its topics such as Adviser Syntax Reference and Performance Metric Definitions.
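For reference, a hedged sketch of starting each interface (the -j update-interval option is per glance(1); verify the exact options on your release, and note that gpm needs a valid X display):

$ glance -j 30                # character-mode interface, 30-second update interval
$ export DISPLAY=mydesk:0.0   # hypothetical display name
$ gpm &                       # Motif-based interface, run under X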

Top-Down Approach
There are over 1000 performance metrics accessible from Glance. You do not need to understand even a small percentage of them in order to get your work done. This is because the tool organizes its data in two different ways so that you only need to look at the metrics important to your situation. First of all, metrics are organized according to resource: there is a group of reports and graphs oriented around CPU, Memory, I/O, Networking, Swap, and System Tables. If your system is having no problems with I/O, then you need never investigate the reports in that area. Secondly, metrics are organized from a Global level down to an Application level and finally down to a Process level. As a whole, global metrics show you overall summarization of what is going on with the entire system. Application metrics allow you to group your system's workload into sets of processes representing different tasks. Then you can compare the relative impact of different applications on overall performance. Process metrics let you zoom in on specific processes and their individual attributes. Use Glance in a top-down manner to be most effective. When you first start gpm, the main graphical display will show you four potential bottleneck areas of performance: CPU, Memory, Disk, and Network. Each of these areas is represented by a graph and a button. The graphs show metrics for these resources over time, while the buttons give you status on adviser symptoms. The gpm adviser is a complex but powerful feature of Glance that I will delve more into later. For now, just note that the color of the buttons can be your first clue to a performance bottleneck. If the CPU button turns yellow or red, then this means you should investigate the utilization of the CPU resource. Use the main window to determine which area might be impacting performance. Drill down into report screens for that resource to characterize the problem further. Then use the Application or Process list reports to pinpoint the cause of the problem. Once you are down to the process level, you can determine which actions to take to correct the situation. It sounds easy, huh? Before we go into some examples, let's discuss a few important techniques.

Applications
It's useful to view Application data in glance as an intermediate step between Global and Process data. If you manage hundreds of diverse systems, you may not have the time to group processes into applications on each system.
Likewise, if you are managing systems that only have basically one application on them, then tuning your parm file application groupings may not be a good use of your time. On the other hand, if you are doing frequent performance analysis on multi-user systems, application groupings can be very useful. Frequently on my systems I'll have separate applications defined for backups and builds. Without looking at individual processes, I can quickly tell if my backups have been running too long or if a software build is interfering with my NFS server's other activities. Just keep in mind that if you don't want to use application groupings, you don't have to. Neither global nor process data is affected by your application parm file definitions.

Sorting, Arranging, and Filtering


Too often I see gpm users scrolling through hundreds of lines of Process List detail, looking for an item of interest. It would save them time to just set up some filtering or sorting for the Process List report to bring the data they want into one window. For example, I usually set up a sort by Current CPU so that the processes that are most active will be at the top of the list. The default column arrangement can also be changed. For example, you can bring the memory fields RSS and VSS into view and sort by those fields if you are looking for memory hogs. Filtering allows you to set up intricate thresholds based on what type of data you'd like to view or highlight. For example, you can filter on the Physical I/O field so the Process List will only report processes doing I/O, and you can highlight processes that exceed a threshold you define. Remember that your customizations of gpm such as report field column arrangements, sort fields, and filter values (as well as colors, fonts, and measurement intervals) are saved in a ".gpmhp" file in your home directory. This file saves your customizations so that they stay in effect between invocations of gpm. Normally, I like to run gpm under my own non-root user login so that other people who share the root login with me won't change my .gpm settings. If you have several users who share root access on a system, you can also create separate su accounts for them with different home directories so they keep separate gpm configurations.

Overhead
Any performance tool will impose a certain amount of additional overhead on the system. In the case of Glance, this overhead is significant only for CPU and memory. There is a tradeoff here: the more data you want to gather, the more overhead required to get the data. If you're concerned about CPU overhead, you can reduce the impact by running Glance with longer update intervals. One trick I've used in the past is to set the update interval way up to, say, 5 minutes, and then I use the Update Now menu selection (just a carriage return in character-mode glance) to update intermittently when I want to see fresh data. You'll notice that gpm's memory usage is higher than character-mode glance's because it loads the Motif libraries. With systems getting faster and bigger all the time, you rarely need to be concerned about Glance overhead.

New in Glance
New features are always being added to Glance. The gpm interface now has a continuous on-item help feature which lets you get instant information about any metric. The "choose metric" functionality of gpm allows you to select exactly which metrics go into the report windows. The glance character-mode interface now uses the Adviser technology (you'll also find the adviser alarm syntax used in HP MeasureWare). There have been reports added for Distributed Computing Environment (DCE) support and Transaction Tracker metrics. Transaction Tracker is a user-defined transaction library which is bundled with HP MeasureWare. Glance for HP-UX 10.0 has several features including new Disk I/O detail information, Global and Process-level System Calls reports, and reporting and manipulation of Process Resource Manager variables. Details about these new features are found in the product ReleaseNotes and on-line help.

Examples
We'll now go through a few examples of using Glance to address specific performance problems. Hopefully these will provide insight as to how to drill down into the data to characterize problems, but remember that every system is different and it's impossible to cover even a small percentage of all possible performance problem scenarios. Note that gpm's on-line help contains a few short case studies under its Performance Sleuthing topic that might also be useful to you.

CPU Bottlenecks
Let's say the main window shows the CPU to be 100% utilized. This might mean that the CPU is bottlenecked, but then again it may not. Realize that it is good to have a resource fully utilized: it means that your system is fully taking advantage of its capabilities. On a single
user workstation, the CPU might always be 100% busy because the user has their x11fish backdrop program running to entertain and distract visitors. If overall performance is fine, there is no problem. On another system, however, the CPU might be 100% busy and users might be complaining because their response time has fallen off dramatically. Then it's time to delve deeper into glance. The CPU graph and report will tell you whether there is contention for the CPU. If so, then you may want to go straight to the Process List and sort on the top CPU consumers. The simplest common source of CPU contention is a runaway process. Often shell programmers will get scripts stuck in a loop, and sometimes they'll leave the loops active. When you see a process using as much of the CPU as it can get, spending all its time in User mode, and doing no I/O, then it might be looping. Check with the owner of the process. Compiles are also a major culprit in CPU bottlenecks. In software development environments, I've seen cases where a whole project team was slowed down because they all were doing their build on the same system. By mounting the source on a NFS server, we separated the compiles onto different systems and alleviated the bottleneck. On a Multiprocessor system, CPU bottlenecks can be very interesting to diagnose. For example, on a 2-way MP system, a looping process can consume 100% of one processor while leaving the other processor idle. Also, it might "bounce" between processors, keeping each about 50% busy. Note that Glance normalizes CPU at the global level, so 100% busy in the main window means that all processors are 100% busy. The CPU By Processor report will show you how this breaks down into individual processor loads. The Process List report, since it is oriented on a process level, does not normalize CPU utilization. In our example of a looping process on a 2-way MP system, the global CPU utilization might show 50% but the looping pid will show 100% utilization in the Process List. This makes sense because if you had a 8-way MP system you want to see that process stand out: a single process in that environment could only use 12.5% of the overall system CPU resource because it could only be active on one processor at any one time.

In gpm, double-clicking on a process in the Process List gets you into the Process Resource report. This report is very useful because it shows you a lot of detail about what the process is doing. For example, some processes that use timers often have a very high proportion of System CPU, and you'll see a lot of context switching and perhaps Signals Received. I've sometimes surprised developers by showing them gpm screens of their programs in action, doing outrageous things like opening and closing 50 files a second.

Memory Bottlenecks
Often in today's Unix environments, it is normal to see physical memory fully utilized. This is a sign of a well-tuned system, because access to memory is so much faster than access to disk. Keeping text and data pages in memory speeds up subsequent references, and only becomes a problem when processes try to allocate more memory and the system is forced to flush buffers, page out data, or (worst case) start swapping. Sometimes, a memory bottleneck will disguise itself as a disk bottleneck, because memory management activities cause disk I/O. Normally, a good rule of thumb is to avoid swapping at all costs. A certain amount of paging is very normal (especially page-ins), but swapping only occurs when there is excessive demand for physical memory. Note: in HP-UX 10.0, swapping is called deactivation, but the basic concept remains the same.

Too often, I've seen the attitude that the simplest way to solve any performance problem is to buy more memory. Although this solution frequently works because memory bottlenecks are common, there are many instances where a cheaper alternative existed. In HP-UX 9.0, some systems experienced memory bottlenecks caused by the dynamic buffer cache growing too large. The buffer cache speeds up filesystem I/O, but on some systems it can grow too large and start causing excessive paging. Many administrators are familiar with the dynamic buffer cache patch to HP-UX which puts a limit on the size the cache can grow. In 10.0, there are dbc_min and dbc_max kernel parameters that allow you to fine-tune the cache to meet your needs. In most instances, I've found the 10.0 default values for these variables to be appropriate.

Glance's Memory report shows you the relative amount of physical memory allocated to the system, user pages, and the buffer cache. The System Tables report will help you decide if you should reconfigure the kernel with larger or smaller table sizes. Normally, you want to allocate enough space in tables so that you never have to worry about running out (the same goes for swap areas). If your system is tight on memory, you may consider reducing the size of some tables in order to make the kernel smaller, leaving more room for user data. I've seen systems running with unnecessarily huge values for maxusers or nproc, which increases the size of the kernel and can impact performance. A single user workstation does not normally need nproc values over 1000!

Isolating memory hogs in gpm's process list is easy: sort or filter the report on the Resident Memory Set Size or Virtual Memory Set Size. Processes that allocate a lot of memory may have huge VSS values, most of which is paged or swapped out most of the time. I've often spotted programs with memory leaks by just watching their VSS size grow over time. Sometimes a developer will not find a memory leak during testing because a program is never executed for very long periods, but then when the program moves into production, the process will slowly consume memory over a matter of days until it aborts. The Process Memory Regions report available for individual processes in Glance is extremely useful for debugging memory problems.
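A quick way to review the kernel tunables mentioned above from the command line (a hedged sketch; dbc_min_pct/dbc_max_pct are the usual 10.x/11.x names for the buffer cache limits):

$ kmtune | grep dbc     # dynamic buffer cache limits
$ kmtune -q nproc       # process table size
$ kmtune -q maxusers    # sizing macro that drives several table sizes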

Disk Bottlenecks
Disk bottlenecks are very common on multi-user systems and servers. Glance's I/O By Filesystem and I/O By Disk reports are extremely useful in isolating these problems. Look for filesystems and disks with consistently high activity. Look at the Disk Queues to see if there are a lot of I/Os waiting for service. I've often seen cases where all the most frequently-used files are on the root filesystem, which gets bottlenecked while other disks sit idle. Load balancing across disks can be an easy way to improve performance. Common techniques include moving swap areas and heavily accessed filesystems off the root disk, or using disk striping, LVM and/or mirroring to spread I/Os out across multiple disks. You can use Glance to verify the effectiveness of these methods. For example, LVM mirroring can improve read performance but degrade write performance. Using Glance, you can look at a volume that you're considering mirroring in order to verify that many more read than write I/Os are occurring.

The filesystem buffer cache, mentioned above, is very important in understanding disk bottlenecks. If your workload is filesystem disk I/O intensive, a large buffer cache can be useful in reducing the number of disk I/Os. However, certain environments such as database servers using raw disk partitions don't make use of the buffer cache, and so a large buffer cache could hurt performance by wasting memory. Ideally, what you'd like to see on your system are processes doing lots of logical I/Os and very few physical I/Os. Because of the effects of read-ahead and buffering, it isn't always easy to determine why an application is doing more or fewer physical I/Os, but Glance's Process Resource report and Process Open File report can be useful.

Network Bottlenecks
As systems rely more and more on the network, we've begun to see more instances of bottlenecks relating to network activity. Unfortunately, at a system level, there are not as many good metrics for isolating network performance problems as there are for other bottleneck areas. For network servers such as NFS servers, you can sometimes use the process of elimination to isolate a network bottleneck: if your server has ample CPU, memory, and disk resources but is still slow, it may be due to LAN bandwidth limitations. You can use glance to look at the client side and server side simultaneously. Glance has several NFS reports which can be useful, especially NFS By System which will tell you which clients are pounding your server the hardest. One example I've seen is where a user was repeatedly executing the find command on NFS clients looking for old core files, but each find hit the same NFS-mounted disk over and over, causing needless overhead on the server. In large environments, tools such as OpenView and PerfView are useful for monitoring the overall network and then they can turn to Glance to zoom in on specific systems with the greatest activity.

The Adviser
Although you can use gpm effectively without ever even knowing about its Adviser feature, taking the time to understand it can be very profitable. I don't have space to cover this topic thoroughly, but it is covered extensively in gpm's on-line help. Basically, think of the Adviser as a set of rules based on performance metrics which can be used to take different actions. If you look at the default Adviser syntax, you'll see some rules that control the colors of the four primary bottleneck indicators, and some supplementary rules that generate alarms for things like high system table utilization. These default rules control the color of the buttons on gpm's main window, and they are also visible in the Adviser Status and History reports. You should feel free to edit the rules to have more meaning for your own unique environment, because every system is different. You can always return to the default rules. Your customized rules are another aspect of the configuration of gpm that's stored in the .gpm file As an example of when you might want to change the default Adviser symptoms, let's say you have a large server system that's always fully CPU utilized, and frequently also has a high run queue. Although on many systems a large run queue is a CPU bottleneck indicator, this isn't always the case, especially on large servers. In our example, the high CPU utilization and the high run queue always makes the CPU bottleneck symptom in the default Adviser syntax go red. It isn't helpful if a button is always red. You would want to edit the Adviser syntax to bump up the criteria so that a CPU red alert only goes off when the run queue exceeds, say, 10 instead of 3. The full potential value of the Adviser is in adding syntax for your own particular environment. For example, let's say that you know from past experience that the physical disk I/O rate exceeds a certain value, user response time degrades. If you are willing to let gpm stay
running for longer periods, you can put in some adviser syntax which will alert you to the problem via email:

if gbl_disk_phys_io_rate > 2000 then exec "echo 'disk i/o rate is high' | mail root"

You can get really fancy with these rules. You can combine metrics, define variables and use looping constructs. You can generate Alerts, execute Unix commands, and print information to gpm's stdout. What follows is a more complex example to illustrate Adviser "programming". More examples are in gpm's on-line help.

# check for high system-mode cpu utilization, and when it is high
# print the highest sys cpu consuming process.
if gbl_cpu_sys_mode_util > 50 then
{
   highestsys = 0
   process loop
      if proc_cpu_sys_mode_util > highestsys then
      {
         highestpid  = proc_proc_id
         highestname = proc_proc_name
         highestsys  = proc_cpu_sys_mode_util
      }
   print "--- High system mode cpu rate = ", gbl_cpu_sys_mode_util, " at ", gbl_stattime, " ---"
   print "  Process with highest system cpu was pid ", highestpid|5|0, ", name: ", highestname
   print "  which had", highestsys, " percent system mode cpu utilization"
}
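Building on the run-queue discussion above, a hedged sketch of a tightened CPU rule in the same if/then style (gbl_cpu_total_util and gbl_run_queue are assumed metric names, and the compound condition assumes the adviser accepts "and"; check the Performance Metric Definitions and Adviser Syntax Reference help topics before using it):

# hedged sketch: only complain when CPU is pegged AND the run queue is deep
if gbl_cpu_total_util > 95 and gbl_run_queue > 10 then
   exec "echo 'sustained CPU saturation with a deep run queue' | mail root"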

Summary
In order to manage the performance of your systems effectively, you need to understand a little about the art of performance analysis, and you need a good tool like Glance. I encourage you to spend some time getting to know your system's performance characteristics before a problem occurs. When you are involved in a performance crisis, objectively define the symptoms of the problem, and then use them to guide you through analysis. Use Glance to characterize the bottlenecked resource. Follow the tools' top-down methodology to go from a high-level bottleneck down to the process responsible if possible. When you know what's wrong, make a change but change only one variable in the environment at a time so you can gauge its success.

MeasureWare/PerfView
The PerfView product is an HP-UX Motif-based tool designed for analysis, monitoring, and forecasting of system performance and resource utilization data. Data collection and threshold alarming are provided by MeasureWare, which is also known as the HP OpenView VantagePoint Performance Agent for HP-UX.

Information on this product can be found in the
HP OpenView Performance Agent for HP-UX Installation & Configuration Guide for HP-UX 10.20 and 11.x, available at http://docs.hp.com/

Module 8 WTEC Tools


There are many useful tools available from the Worldwide Technical Expert Center (WTEC). Unless you have been instructed by WTEC, or are completely familiar with these tools, distributing them is ill advised. The following are the tools most appropriate for use by the Response Center.

kmeminfo
http://oceanie.grenoble.hp.com/georges/kmeminfo.html
A tool to troubleshoot kernel and user memory (VM) problems - by Georges Aureau, WTEC HPUX

Description
Usage:   kmeminfo [options ...] [coredir | kernel core]
Default: coredir="." if "INDEX" file present
         else kernel="/stand/vmunix" core="/dev/kmem"

Options
  -summary
  -dynamic
  -bucket ...
  -static
  -user ...
  -pid ... [-prot] [-parse]
  -bufcache
  -eqalloc
  -physical ...
  -physmem
  -pfdat ...
  -sysmap
  -kas
  -vmtrace ...
  -alias
  -virtual ...
  -pdk_malloc ...
  -help

When invoked with no option, kmeminfo tries to open the current directory as a dump directory; if the current directory does not contain a crash dump, it opens /stand/vmunix and /dev/kmem. It then prints statistics about physical memory utilization, with a focus on memory allocated by the kernel. kmeminfo supports the McKusick bucket kernel allocator (HP-UX 10.x and 11.0).
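For example (the dump directory path is a placeholder):

$ cd /var/adm/crash/crash.0 ; kmeminfo        # analyze a saved crash dump directory
$ kmeminfo                                    # live system: defaults to /stand/vmunix and /dev/kmem
$ kmeminfo -summary /stand/vmunix /dev/kmem   # same, naming the kernel and core explicitly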

Example: One error that is seen occasionally is "equivalent mapped reserve pool exhausted". By running kmeminfo -eqalloc you can determine the current eqalloc usage, then set eqmemsize accordingly to resolve the issue.
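A hedged sketch of that workflow (kmtune is the standard 11.x tunable interface; the new value is a placeholder to be taken from the kmeminfo output, and the kernel must be rebuilt and the system rebooted for the change to take effect):

$ kmeminfo -eqalloc             # report current eqalloc usage (pages)
$ kmtune -q eqmemsize           # show the currently configured reserve
$ kmtune -s eqmemsize=<pages>   # raise it, then rebuild the kernel (mk_kernel) and reboot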

The example below shows a memory leak: $ kmeminfo /dumps/LABS/Dumps/MemLeak.1100 Pfdat processing: Scanning 900603 pfdat entries (be patient) ... Physical memory usage summary (in page/byte/percent): Physmem = 917248 3.5g 100% Physical memory Freemem = 24059 94.0m 3% Free physical memory Used = 893189 3.4g 97% Used physical memory System = 537002 2.0g 59% By kernel: Static = 31028 121.2m 3% for text/static data Dynamic = 229014 894.6m 25% for dynamic data Bufcache = 275174 1.0g 30% for buffer cache Eqmem = 26 104.0k 0% for equiv. mapped memory SCmem = 1760 6.9m 0% for critical memory User = 356183 1.4g 39% By user processes: Uarea = 3704 14.5m 0% for thread uareas Disowned = 4 16.0k 0% Disowned pages Kernel dynamic memory usage (in page/byte/percent): Dynamic MALLOC bucket[5] bucket[6] bucket[7] bucket[8] bucket[9] = 229014 894.6m 25% Kernel dynamic memory = 205447 802.5m 22% Memory buckets = 1606 6.3m 0% size 32 bytes = 150 600.0k 0% size 64 bytes = 4472 17.5m 0% size 128 bytes = 1586 6.2m 0% size 256 bytes = 169755 663.1m 19% size 512 bytes
bucket[10] = 20396 79.7m 2% size 1024 bytes bucket[11] = 1863 7.3m 0% size 2048 bytes bucket[12] = 318 1.2m 0% size 4096 bytes bucket[13] = 234 936.0k 0% size 2 pages bucket[14] = 102 408.0k 0% size 3 pages bucket[15] = 8 32.0k 0% size 4 pages bucket[16] = 70 280.0k 0% size 5 pages bucket[17] = 180 720.0k 0% size 6 pages bucket[18] = 490 1.9m 0% size 7 pages bucket[19] = 120 480.0k 0% size 8 pages bucket[20] = 4097 16.0m 0% size > 8 pages Reserved = 13 52.0k 0% Reserved pools Kalloc = 20304 79.3m 2% kalloc() SuperPagePool = 0 0.0k 0% Kernel superpage cache BufcacheBufs = 15353 60.0m 2% Buffer cache bufs BufcacheHash = 1280 5.0m 0% Buffer cache hash heads Other = 3671 14.3m 0% Other... Eqalloc = 3250 12.7m 0% eqalloc() Checking bucket free list heads: No corruption detected...

And it supports the arena kernel allocator (HP-UX 11i and up). The example below shows the corruption of an arena free list head. $ kmeminfo /dumps/pa/arena.11i Pfdat processing: Scanning 2044581 pfdat entries (be patient) ... Physical memory usage summary (in page/byte/percent):

Physmem = 2093056 8.0g 100% Physical memory Freemem = 1630660 6.2g 78% Free physical memory Used = 462396 1.8g 22% Used physical memory System = 335982 1.3g 16% By kernel: Static = 96457 376.8m 5% for text/static data Dynamic = 100043 390.8m 5% for dynamic data Bufcache = 135290 528.5m 6% for buffer cache Eqmem = 46 184.0k 0% for equiv. mapped memory SCmem = 4146 16.2m 0% for critical memory User = 126406 493.8m 6% By user processes: Uarea = 2984 11.7m 0% for thread uareas Disowned = 8 32.0k 0% Disowned pages Kernel dynamic memory usage (in page/byte/percent): Dynamic = 100043 390.8m 5% Kernel dynamic memory Arenas = 64553 252.2m 3% Kernel arenas M_TEMP = 19504 76.2m 1% M_SWAP = 12468 48.7m 1% M_LVM = 4761 18.6m 0% KMEM_ALLOC = 4647 18.2m 0% ALLOCB_MBLK_LM = 4052 15.8m 0% M_IOSYS = 3368 13.2m 0% ALLOCB_MBLK_DA = 2941 11.5m 0% M_SPINLOCK = 2416 9.4m 0% VFD_BT_NODE = 1312 5.1m 0% ALLOCB_MBLK_SM = 1296 5.1m 0% M_DYNAMIC = 1067 4.2m 0%
KMEM_VARFLIST_H = 882 3.4m 0% ALLOCB_MBLK_MH = 780 3.0m 0% M_PREG = 597 2.3m 0% M_REG = 590 2.3m 0% Other = 3872 15.1m 0% Other arenas... Kalloc = 35399 138.3m 2% kalloc() SuperPagePool = 17584 68.7m 1% Kernel superpage cache BufcacheBufs = 11235 43.9m 1% Buffer cache bufs BufcacheHash = 5120 20.0m 0% Buffer cache hash heads Other = 1460 5.7m 0% Other... Eqalloc = 91 364.0k 0% eqalloc()

Checking locked arena free list heads: The following free list is locked: kmem_arena_t 0x0000000040001240 "M_DYNAMIC" kmem_flist_hdr_t 0x0000000040012480 (cpu 3, index 1, size 56) Error while scanning a "M_DYNAMIC" free list: kmem_flist_hdr_t 0x0000000040012480 (cpu 3, index 1, size 56) kfh_head 0x000e40c500000000 (no translation!)

kmeminfo -summary The -summary option prints the memory usage summary (see above default output).

kmeminfo -dynamic The -dynamic option prints the memory usage summary and the kernel dynamic memory usage (see above default output).

kmeminfo -static The -static option prints the details of kernel static memory usage, ie. Pfdat processing: $ kmeminfo -static /padumps/arena_corr Pfdat processing: Scanning 2044581 pfdat entries (be patient) ... -Static kernel memory usage (in page/byte/percent): Static Text Data Bss Tables pfdat = 96457 376.8m 5% Static memory = 2308 9.0m 0% Text = 450 1.8m 0% Data = 482 1.9m 0% Bss = 93217 364.1m 4% System tables = 47919 187.2m 2% pfdat

Static system memory (size in bytes and pages): Name StartEnd Nent Size text 0x00020000-0x00924000 1 9453568 data 0x00924000-0x00ae6000 1 1843200 bss 0x00ae6000-0x00cc89f0 1 1976816 phys_mem_tbl 0x00ce05a0-0x00ce05ac 1 12 sysmap_32bit 0x00defd40-0x00e47f40 22560 360960 sysmap_64bit 0x00e47f40-0x00ea0140 22560 360960 pgclasstab 0x013ce000-0x015cd000 2093056 2093056
mpproc_info 0x015d1000-0x015eae40 8 106048 htbl2_0 0x04000000-0x08000000 2097152 67108864 pfn_to_virt_ptr 0x0962e000-0x0962fff0 511 8176 pfn_to_virt 0x09630000-0x0b620000 2093056 33488896 inode 0x0b628000-0x0ba08000 8192 4063232 file 0x0ba08000-0x0bab8370 8202 721776 ncache 0x0bab8380-0x0bc8c380 13312 1916928 nc_hash 0x0bc8c380-0x0bccc380 16384 262144 nc_lru 0x0bccc380-0x0bcd6380 512 40960 callout_info_array 0x0bcd6380-0x0bcd6c00 8 2176 cfree 0x0bd56400-0x0bd6b3a0 2148 85920 physio_buf_list 0x0bd6b3c0-0x0be39a18 1409 845400 pfdat 0x0bdf3880-0x0be53880 4096 393216 pfdat 0x0bdfdc00-0x1797dc00 2048000 196608000 tmp_save_states 0x0be39a40-0x0be3be40 8 9216 memWindows 0x0be3be40-0x0be3bea0 2 96 quad4map_32bit 0x0be3f880-0x0be41510 457 7312 quad1map_64bit 0x0be41540-0x0be431d0 457 7312 quad4map_64bit 0x0be43200-0x0be44e90 457 7312 pfdat_ptr 0x0be44ec0-0x0be46eb0 511 8176 page_groups 0x0be4b400-0x0be4b4e0 7 224 space_map 0x17980000-0x17988000 262144 32768 kmem_lobj_hdr_tbl 0x40080000-0x400e3d10 25553 408848 Total accounted static memory = 78668 pages

kmeminfo -user [swap,max=<count>,all]
The -user option prints a list of user processes sorted by physical (resident) size, or sorted by swap size when the swap flag is specified. kmeminfo scans the pregions and prints the following information:
virtual   sums up the pregions' p_count.
physical  sums up r_nvalid for private regions, and r_nvalid/r_refcnt for shared regions.
swap      sums up r_swalloc for private regions, and r_swalloc/r_refcnt for shared regions.
When used, r_refcnt is adjusted to skip references from pseudo-pregions. The all flag also includes system daemons (as opposed to just user processes). Some examples:

Top 5 processes using the most physical memory:
$ kmeminfo -user max=5 /padumps/eqalloc.11o
Summary of processes memory usage:
List sorted by physical size, in pages/bytes:
               virtual          physical            swap
  pid   ppid   pages / bytes    pages / bytes    pages / bytes   command
 3073      1   100586 392.9m    11013  43.0m      9421  36.8m   oninit
 3083   3082   100586 392.9m    10827  42.3m      9443  36.9m   oninit
 3084   3082   100586 392.9m    10824  42.3m      9427  36.8m   oninit
 3090   3082   100591 392.9m    10413  40.7m      9417  36.8m   oninit
 3088   3082   100586 392.9m    10410  40.7m      9415  36.8m   oninit
Total:                          53487 208.9m     47123 184.1m

Top 10 processes having reserved the most swap space: $ kmeminfo -user swap,max=10 /iadumps/bbds_panic Summary of processes memory usage: List sorted by swap size, in pages/bytes:
virtual physical swap pid ppid pages / bytes pages / bytes pages / bytes command 24409 24363 268924 1.0g 1275 5.0m 3182 12.4m sim 1085 1 3758 14.7m 1741 6.8m 1658 6.5m dced 1207 1 3591 14.0m 1654 6.5m 1475 5.8m swagentd 1382 1370 269990 1.0g 761 3.0m 1344 5.2m httpd 1381 1370 269990 1.0g 761 3.0m 1344 5.2m httpd 1385 1370 269990 1.0g 761 3.0m 1344 5.2m httpd 1380 1370 269990 1.0g 761 3.0m 1344 5.2m httpd 1384 1370 269990 1.0g 761 3.0m 1344 5.2m httpd 1941 1370 269990 1.0g 761 3.0m 1344 5.2m httpd 1370 1 269974 1.0g 757 3.0m 1343 5.2m httpd Total: 9993 39.0m 15722 61.4m

In the above example, there are a few processes with a virtual set size of 1GB, but those processes are using only a few MB of swap space. This results from IA64 stacks being lazy-swap evaluated; see -pid 24409 below.

kmeminfo -pid <pid> [-prot] [-parse] The -pid option prints the list of the pregions in the vas of the specified process pid. A pid of -1 allows to select all the processes. The type column gives the pregion type. Note that kmeminfo uses pseudo pregion types for shared libraries: SHLDATA is an MMAP pregion with PF_SHLIB set (and PF_VTEXT clear). SHLTEXT is an MMAP pregion with both PF_SHLIB and PF_VTEXT set. The ref column is the region r_refcnt. The virtual column is the pregion p_count. The physical column is the region r_nvalid, ie. number of resident physical memory pages. The swap column is the region r_swalloc, ie. reserved swap space. The total on the last line is computed as described above for the -user option. An example: $ kmeminfo -pid 24409 /iadumps/bbds_panic Process's memory regions (in pages):

Process "sim", pid 24409, 64bit ASL, R_SHARE_MAGIC: type space vaddr ref virt phys swap NULLDREF 0x27ad231.0x0000000000000000 412 1 1 1 TEXT 0x3aa6231.0x4000000000000000 2 36 35 0 DATA 0x0399031.0x6000000000000000 1 704 676 704 UAREA 0x1fe0831.0x8003ffff7fefc000 1 20 16 20 UAREA 0x1fe0831.0x8003ffff7ff10000 1 20 16 20 UAREA 0x1fe0831.0x8003ffff7ff24000 1 20 16 20 UAREA 0x1fe0831.0x8003ffff7ff38000 1 20 16 20 UAREA 0x1fe0831.0x8003ffff7ff4c000 1 20 16 20 ...... RSESTACK 0x1fe0831.0x8003ffffbf7ff000 1 2048 1 2 STACK 0x1fe0831.0x8003ffffc0000000 1 262144 5 5 ...... total 268924 1275 3182

When the -prot option is specified along with -pid, kmeminfo prints two additional columns: the key column gives the pregion p_hdl.hdlprot field
ie. a protection id on PARISC or a protection key on IA64, and the ar column gives the pregion p_hdl.hdlar field ie. access rights. $ ps PID TTY TIME COMMAND 27385 pts/4 0:00 rlogind 27443 pts/4 0:00 ps 27387 pts/4 0:00 sh $ kmeminfo -prot -pid 27387 Process's memory regions (in pages): Process "sh", pid 27387, 32bit ASL, R_SHARE_MAGIC: type space vaddr key ar ref virt phys swap NULLDREF 0x522e800.0x0000000000000000 0x6865 URX 98 1 1 1 TEXT 0x522e800.0x0000000000001000 0x6865 URX 17 45 41 0 DATA 0x8b22c00.0x0000000040001000 0x1e24 URW 1 19 18 19 MMAP 0x8b22c00.0x000000007f7d2000 0x1e24 URWX 1 1 0 1 SHLDATA 0x8b22c00.0x000000007f7d3000 0x1e24 URWX 1 1 1 1 SHLDATA 0x8b22c00.0x000000007f7d4000 0x1e24 URWX 1 9 8 9 MMAP 0x8b22c00.0x000000007f7dd000 0x1e24 URWX 1 14 7 14 MMAP 0x8b22c00.0x000000007f7eb000 0x1e24 URWX 1 2 2 2 SHLDATA 0x8b22c00.0x000000007f7ed000 0x1e24 URWX 1 3 3 3 STACK 0x8b22c00.0x000000007f7f0000 0x1e24 URW 1 8 8 8 SHLTEXT 0xd99dc00.0x00000000c0004000 PUBLIC URX 83 2 2 0 SHLTEXT 0xd99dc00.0x00000000c0010000 PUBLIC URX 98 21 19 0 SHLTEXT 0xd99dc00.0x00000000c0100000 PUBLIC URX 83 299 231 0 UAREA 0xa78c800.0x400003ffffff0000 KERNEL KRW 1 8 8 8 total 433 60 65

kmeminfo -parse
When the -parse option is specified along with -pid, kmeminfo prints an additional pid column in first position. This format allows the output to be parsed when searching for specific patterns. For example, checking whether the shared libraries on the system are properly configured can be done by filtering the output with grep commands:

$ kmeminfo -parse -prot -pid -1 /padumps/java_perf | \
    grep SHLTEXT | grep -v PUBLIC
1595 SHLTEXT 0xc6f6c00.0x00000000c022f000 0x17b5 URX 2 1 1 0
1595 SHLTEXT 0xc6f6c00.0x00000000c025f000 0x524e URX 2 1 1 0
1595 SHLTEXT 0xc6f6c00.0x00000000c027e000 0x500e URX 2 2 2 0
1595 SHLTEXT 0xc6f6c00.0x00000000c02b3000 0x45b1 URX 2 1 1 0
1595 SHLTEXT 0xc6f6c00.0x00000000c02ba000 0x5d89 URX 2 2 2 0
1595 SHLTEXT 0xc6f6c00.0x00000000c02dd000 0x3a09 URX 2 3 3 0
...

ie. the process with pid 1595 is using shared libraries which are not configured with the usual read-only/execute 0555 (chmod a=rx) mode.

kmeminfo -bucket [index,flags] Without flags, the -bucket option prints statistics about kernel memory buckets, such as number of pages allocated per bucket and per cpu, total number of objects, number of free objects, and number of objects in use. Note that if not already printed, the summary and dynamic section are printed. An example: $ kmeminfo -bucket /padumps/buckcorr.11o ... Per cpu kernel dynamic memory usage (in pages): Only the byte buckets are per cpu (bucket 5 to 12, ie. size 32 to 4096)
CPU # 0 = 1488 : ( objects: used, free)
 bucket[ 5] =   35 size 32 bytes   (  4480:  4010,  470)
 bucket[ 6] =   17 size 64 bytes   (  1088:   929,  159)
 bucket[ 7] =   76 size 128 bytes  (  2432:  1163, 1269)
 bucket[ 8] =   93 size 256 bytes  (  1488:   972,  516)
 bucket[ 9] =  188 size 512 bytes  (  1504:  1206,  298)
 bucket[10] =  223 size 1024 bytes (   892:   838,   54)
 bucket[11] =  548 size 2048 bytes (  1096:  1019,   77)
 bucket[12] =  308 size 4096 bytes (   308:   296,   12)

CPU # 1 = 2576 : ( objects: used, free)
 bucket[ 5] =  125 size 32 bytes   ( 16000: 15367,  633)
 bucket[ 6] =   22 size 64 bytes   (  1408:    -1,   -1)
 bucket[ 7] =  153 size 128 bytes  (  4896:  4543,  353)
 bucket[ 8] =  158 size 256 bytes  (  2528:  2319,  209)
 bucket[ 9] =  332 size 512 bytes  (  2656:  2435,  221)
 bucket[10] =  164 size 1024 bytes (   656:   611,   45)
 bucket[11] = 1504 size 2048 bytes (  3008:  2675,  333)
 bucket[12] =  118 size 4096 bytes (   118:   114,    4)
...

ie. we had 17 pages allocated to cpu 0 bucket 6. Those 17 pages represented a total of 1088 objects of 64 bytes: 929 objects were used, and 159 objects were on the bucket's free list. A negative used count is telling that kmeminfo couldn't walk the free list. Below is the list of the possible error codes: -1 means P4_ACCESS_ERR_NO_TRANS -2 means P4_ACCESS_ERR_NO_PHYSMEM -3 means P4_ACCESS_ERR_NOT_SELECTED -4 means P4_ACCESS_ERR_BAD_CORE_ACCESS -5 means P4_ACCESS_ERR_PG0 -99 means that a next pointer wasn't aligned on the bucket size. ie. cpu 1 bucket 6 was likely to be corrupted. Bucket index When specifying a bucket index, kmeminfo prints the list of the objects belonging to the specified bucket. It actually scans the kmemusage array for pages allocated to the bucket, and those pages are sliced into objects, reporting the state of each object, ie. free vs. used. When a bucket free list is corrupted, the state of some of the objects might be reported as n/a, as kmeminfo couldn't walk the free list to properly determine the state of the object: $ kmeminfo -bucket 6 /padumps/buckcorr.11o Error while scanning a bucket_64bit free list: cpu 1, index 6 (64 bytes), head 0x00a55e78 next 0x0000be48 (no translation!) used 0x104004000 used 0x104004040 used 0x104004080 used 0x1040040c0 ... n/a 0x1025fb000 n/a 0x1025fb040 n/a 0x1025fb080 n/a 0x1025fb0c0 ... free 0x109203000 used 0x109203040 free 0x109203080 free 0x1092030c0 ...

Bucket flags
The following flags control the processing of bucket objects:
free          print only free objects.
used          print only used objects.
cpu=number    print only objects from the specified cpu.
skip=number   skip number objects before starting to print objects.
max=number    limit the output to number objects.
dump[=size]   include a hex dump of size bytes, the default size being the size of the object.
type=struct   include a struct print out.
offset=size   specify an offset of size bytes to add to the object address before dumping hex or printing the struct.

$ kmeminfo -bucket 10,dump=32,cpu=0,free,max=3 /padumps/memcorr.11o
bucket_64bit[0][10]:
0x5a701400
0x5a701400 : 0x00000000631c1000 0x2f64617465003200  ....c.../date.2.
0x5a701410 : 0x3200474c2f6c6962 0x2f6c69626f676c74  2.GL/lib/liboglt
0x631c1000
0x631c1000 : 0x000000005e9c7c00 0x000000007f7f0005  ....^.|.........
0x631c1010 : 0x7f7f00117f7f0036 0x7f7f00817f7f00a5  .......6........
0x5e9c7c00
0x5e9c7c00 : 0x0000000057e44800 0x6f6c2f707767722f  ....W.H.ol/pwgr/
0x5e9c7c10 : 0x6461656d6f6e002f 0x636c69656e743135  daemon./client15

as the head of the free list has no translation.

kmeminfo -pgclass
The -pgclass option prints page classification statistics. This can be useful to check for partial selective dumps (dumps where not all the selected pages have been dumped), as in the example below:

$ kmeminfo -pgclass
Page class statistics:
PC_UNUSED  =  73401 excluded,      0 dumped
PC_USERPG  = 361086 excluded,      0 dumped
PC_BCACHE  =  78500 excluded,      0 dumped
PC_KCODE   =   1656 excluded,      0 dumped
PC_USTACK  =   2927 included,      0 dumped
PC_FSDATA  =     99 included,      0 dumped
PC_KDDATA  = 204688 included, 198296 dumped
PC_KSDATA  =  63819 included,  63819 dumped
Total      = 271533 included, 262115 dumped

kmeminfo -pdk_malloc The -pdk_malloc option prints the pinned and unpinned pdk malloc maps, along with other pdk malloc related tables, such as the Translation Registers. Note that the pseudo TRs for the unpinned pdk space are tagged as DATA*, see the example below. This option is most useful when debugging pdk malloc related problems. $ kmeminfo -pdk_malloc Translation Registers: dtr type ar virt_addr range phys_addr range 0 TEXT KRW 0xe000000000000000-0xe000000003ffffff 0x04000000-0x07ffffff 1 DATA KRW 0xe000000100000000-0xe000000103ffffff 0x08000000-0x0bffffff 2 DATA KRW 0xe000000104000000-0xe000000107ffffff 0x0c000000-0x0fffffff 3 SAPIC KRW 0xe000eeeefee00000-0xe000eeeefeefffff 0xfee00000-0xfeefffff 6 VHPT KRW 0xe000000120000000-0xe0000001203fffff 0x00400000-0x007fffff 7 VHPT KRW 0xe000000120400000-0xe0000001207fffff 0x00800000-0x00bfffff 64 DATA* KRW 0xe000000108000000-0xe00000010bffffff 0x10000000-0x13ffffff itr type ar virt_addr range phys_addr range 1 TEXT KRX 0xe000000000000000-0xe000000003ffffff 0x04000000-0x07ffffff 2 DATA KRX 0xe000000100000000-0xe000000103ffffff 0x08000000-0x0bffffff 3 DATA KRX 0xe000000104000000-0xe000000107ffffff 0x0c000000-0x0fffffff PDK Malloc: pinned_pdk_malloc_base = 0xe0000001004e3000 pinned_pdk_malloc_unused_base = 0xe000000100757000 pinned_pdk_malloc_unused_end = 0xe000000106ffffff pinned_pdk_malloc_end = 0xe000000107ffffff unpinned_pdk_malloc_base = 0xe000000108000000 unpinned_pdk_malloc_unused_base= 0xe00000010976f000 unpinned_pdk_malloc_unused_end = 0xe000000109800fff unpinned_pdk_malloc_end = 0xe00000010bffffff unpinned_pdk_malloc_itir.ps = 0x0000000004000000 unpinned_pdk_malloc_va2pa_delta= 0xe0000000f8000000 pinned_pdk_malloc_map: map size vaddr: first last 0xe00000010036a710 12288 0xe0000001004e3000 0xe0000001004e5fff 0xe00000010036a720 64 0xe000000100692000 0xe00000010069203f 0xe00000010036cd00 64 0xe000000107fed880 0xe000000107fed8bf 0xe00000010036cd10 64 0xe000000107ff47c0 0xe000000107ff47ff 0xe00000010036cd20 18688 0xe000000107ffb700 0xe000000107ffffff unpinned_pdk_malloc_map: map size vaddr: first last 0xe000000100371590 2624 0xe00000010806e5c0 0xe00000010806efff 0xe0000001003715a0 3392 0xe0000001088fa2c0 0xe0000001088fafff 0xe0000001003715b0 3392 0xe000000108cf52c0 0xe000000108cf5fff 0xe0000001003715c0 4032 0xe00000010976e040 0xe00000010976efff 0xe0000001003715d0 64 0xe00000010b000fc0 0xe00000010b000fff 0xe0000001003715e0 16142336 0xe00000010b09b000 0xe00000010bffffff

kmeminfo -sysmap
The -sysmap option prints the system virtual address resource maps. Those resource maps track ranges of free virtual addresses for kernel dynamic memory allocations and for buffer cache allocations:

                    PARISC 10.x      PARISC 11.x                IA64 11.x
KERNEL SYSMAP's     sysmap           sysmap_32bit sysmap_64bit  sysmap sextmap
BUFCACHE BUFMAP's   bufmap bufmap2   bufmap bufmap2             bufmap

When free virtual address ranges are getting fragmented, the resource map entries are used up, and the kernel is no longer able to return free addresses to the resource map. In this case, an "rmap ovflo" message is printed and the freed address range is lost. Eventually, if we keep losing addresses, the kernel might no longer be able to allocate virtual space when needed. For the kernel sysmap, this could result in a panic: out of kernel virtual space. For the bufcache bufmap's, this could result in poor performance (hang in bcfeeding_frenzy()). Below is an example from a PARISC 11.11 machine:

$ kmeminfo -sysmap
Resource maps for kernel dynamic virtual addresses:
sysmap_32bit at 0xca2890 (6399 entries max):
 m_addr          m_size  vaddr_range:first  last
 0x0000000000c93      9  0x0000000000c92000 0x0000000000c9afff
 0x0000000000cd6    213  0x0000000000cd5000 0x0000000000da9fff
 0x0000000002135     16  0x0000000002134000 0x0000000002143fff
 0x0000000002148     57  0x0000000002147000 0x000000000217ffff
 0x000000000218c      6  0x000000000218b000 0x0000000002190fff
 0x00000000021a2  17104  0x00000000021a1000 0x0000000006470fff
 0x0000000006473  25949  0x0000000006472000 0x000000000c9cefff
 0x000000000c9da      6  0x000000000c9d9000 0x000000000c9defff
 0x000000000c9ec      4  0x000000000c9eb000 0x000000000c9eefff
 0x000000000ca01   1181  0x000000000ca00000 0x000000000ce9cfff
 0x000000000ce9f      1  0x000000000ce9e000 0x000000000ce9efff
 0x000000000cea1 209248  0x000000000cea0000 0x000000003fffffff
Total size: 253794 (12 entries used)
sysmap_64bit at 0xcbb890 (6399 entries max):
 m_addr          m_size  vaddr_range:first  last
 0x000000004106b     96  0x000000004106a000 0x00000000410c9fff
 0x0000000041191   3696  0x0000000041190000 0x0000000041ffffff
 0x0000000044001 278528  0x0000000044000000 0x0000000087ffffff
Total size: 282320 (3 entries used)
Bitmaps for buffer cache virtual addresses:
bm_t bufmap at 0x40589000

Address space (in pages): Total = 98304 Used = 75006 Free = 23298 bitmap nbit/mask vaddr_range:first last 0x000000004058d100 128 0x8000000008800000 0x800000000887ffff 0x000000004058d110 0x0fffffff 0x8000000008880000 n/a 0x000000004058d130 0xffff0000 0x8000000008980000 n/a 0x000000004058d134 160 0x80000000089a0000 0x8000000008a3ffff ... 0x000000004058edf8 32 0x8000000016fc0000 0x8000000016fdffff 0x000000004058edfc 0x7e7fffff 0x8000000016fe0000 n/a 0x000000004058ee00 1504 0x8000000017000000 0x80000000175dffff 0x000000004058eebc 0xfff3ffff 0x80000000175e0000 n/a _________________________________________________________________________________________________

kmeminfo -alias
This option prints the list of the Physical Addresses aliased to different Virtual Addresses. To do so, kmeminfo searches the pfn to virt table for entries having aliases. Except for the NULLDREF pages, there should not be many aliased pages in the system.

$ kmeminfo -alias
List of alias entries:
Printing all the alias entries ...
PA 0x04e5c000 aliased to VA 0x0204b000.0x0000000000000000
PA 0x04e5c000 aliased to VA 0x03c11c00.0x0000000000000000
PA 0x04e5c000 aliased to VA 0x098f8000.0x0000000000000000
PA 0x04e5c000 aliased to VA 0x0cf9f400.0x0000000000000000
Used aliases:    4 entries
Free aliases:  336 entries
Total aliases: 340 entries (1 pages)
_________________________________________________________________________________________________

kmeminfo -kas
This option prints information about the Kernel Allocator Superpage, aka the kernel superpage pool. The 11.00 implementation of the superpage pool was vulnerable to fragmentation of the physical memory pool. For 11.00, the highest field would typically be used to check for a fragmentation issue causing the kas allocator to perform poorly. The 11.11 implementation fixed the performance issue observed when fragmenting superpages.

$ kmeminfo -kas
Kernel memory Allocation Superpage pool (KAS):
super_page_pool at 0x69b668
kas_total_in_use           = 10372
kas_max_total_in_use       = 15216
kas_force_free_on_coalesce = 1
kas_total_freed            = 0

     size  sp_pool_t           count  free  highest  sp_next
  0   4KB  0x000000000069b668      0   384      825  0x000000010380f000
  1   8KB  0x000000000069b680      0   450      540  0x00000001009f2000
  2  16KB  0x000000000069b698      0   354      363  0x0000000100864000
  3  32KB  0x000000000069b6b0      0   238      241  0x0000000100a68000
  4  64KB  0x000000000069b6c8      0    10       53  0x0000000103410000
  5 128KB  0x000000000069b6e0      0     1       24  0x0000000104b20000
  6 256KB  0x000000000069b6f8      0     1        7  0x0000000104b40000
  7 512KB  0x000000000069b710      0     1        3  0x0000000104b80000
  8   1MB  0x000000000069b728      0     0        2  0x0000000000000000
  9   2MB  0x000000000069b740      0     0        2  0x0000000000000000
 10   4MB  0x000000000069b758      0     1        2  0x0000000104c00000
 11   8MB  0x000000000069b770      0     0        1  0x0000000000000000
 12  16MB  0x000000000069b788      4     0        0  0x0000000000000000
Total number of free page on pools: 6012

kmeminfo -vmtrace [flag,flag,...]
Without any flags, the -vmtrace option causes kmeminfo to dump all the vmtrace logs:
  memory corruption log
  memory leak log
  general memory tracing log

Each record in the corruption log and the general tracing log is printed in the following format: address, size, arena, pid, tid, date/time, stack trace.

Memory log for cpu 0:
0xe0000001400d09c0 56 M_TEMP 68 77 Sep 5 17:29:16
   vmtrace_free+0x1d0
   kfree+0x150
   vx_worklist_process+0x290
   vx_worklist_thread+0x70
...

By default, the memory leak log is printed grouping allocation patterns together and sorting them by increasing occurrences, e.g.:

Vmtrace Leak Log:
Note: "Total allocated memory" is the number of pages allocated since vmtrace was started.
Repeated 752 times, malloc size 24 bytes:
   vmtrace_alloc+0x160
   kmalloc+0x240
   vx_zalloc+0x30
   vx_inode_alloc+0x220
   vx_ireuse+0x340
   vx_iget+0x270
Total allocated memory 5 pages
Latest on Sep 5 17:47:39
Oldest on Sep 5 13:30:57
...

To obtain a detailed output of each memory leak log entry, specify the -verbose option; in that case, the leak log entries are sorted by time.

The following flags can be used to limit the output to specific logs:
bucket=<bucket>  # limit output to the specified bucket index (10.x and 11.0)
arena=<arena>    # limit output to the specified arena (11.11 and beyond)
count=<num>      # limit output to the first <num> log entries
leak             # limit output to the leak log
cor              # limit output to the corruption log
log              # limit output to the general tracing log
parse            # produce output which can be easily parsed
For more information about vmtrace, please visit the vmtrace web site.
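For example, combining the flags above (the dump path is a placeholder):

$ kmeminfo -vmtrace leak /var/adm/crash/crash.0               # leak log only
$ kmeminfo -vmtrace cor,count=20 /var/adm/crash/crash.0       # first 20 corruption log entries
$ kmeminfo -vmtrace arena=M_TEMP,log /var/adm/crash/crash.0   # general tracing log for the M_TEMP arena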

kmeminfo -virtual [<space>.]<vaddr>[,trans,hash,pid=<pid>] The -virtual option prints both translation and ownership for the specified space.vaddr virtual address. Translation When focusing on translation, you may skip the ownership information by specifying the trans flag. Translation hash chains may be printed by specifying the hash flag. When hash is specified, the primary hash chain is printed, but kmeminfo also prints the secondary hash chain on systems supporting dual pdir (ie. post 11.22). When specifying a process id using the pid flag, you may specify only the vaddr offset (skipping the space). The space would be taken from the pregion holding vaddr (if any). A PA2.0 dual pdir example: $ kmeminfo -virtual 0x7ac02300,trans,hash,pid=1802 /padumps/type9 VA 0x813f400.0x7ac02300 translates to PA 0x77802300 Page table entry: hpde2_0_t 0x4b51fa0 Access rights : 0x1f PDE_AR_URW Protection key: 0x4f7a Page size : 4MB Large page details: Addr : virtual physical Start: 0x7ac00000 0x77800000 End : 0x7b000000 0x77c00000
Hashing details:
Primary:
    pdirhash=0x04b51fe0=htbl[370943] vtag=0x0813f400 0x0007ac02
    hpde2_0_t   pde_next    pde_space   pde_vpage
    0x04b51fe0  0x00000000  0xffffffff  0x00000000
Secondary: mask=0x3fffff base=0x7ac00000
    pdirhash=0x04b51fa0=htbl[370941] vtag=0x0813f400 0x0007ac00
    hpde2_0_t   pde_next    pde_space   pde_vpage
    0x04b51fa0  0x00000000  0x0813f400  0x0007ac00

When space is omitted, and a pid not specified, kmeminfo will use the kernel space if the vaddr is within the kernel segment.
An IA64 example:
$ kmeminfo -virtual 0xfffc00005cefa000,trans,hash /iadumps/bbds_panic
VA 0xdead31.0xfffc00005cefa000 translates to PA 0x15fa000
Page table entry: pte_t 0xe000000108ab8820
    Access rights : 0x0c PTE_AR_KRWX
    Protection key: 0xbeef KERNEL/PUBLIC
    Page size     : 4KB
Hashing details:
    thash=0xe000000120220ae0=vhpt[69719] ttag=0x006f56c00005cefa
    pte_t               pte_next            pte_tag
    0xe000000120220ae0  0x00000000107d54c0  0x00a8180000004067
    0xe0000001087d54c0  0x000000001119c340  0x006f56800011cefa
    0xe00000010919c340  0x0000000010ab8820  0x0088680000040087
    0xe000000108ab8820  0x0000000000000000  0x006f56c00005cefa

Ownership
When trans is not specified, kmeminfo also prints the owner of the virtual address.
User private:
$ kmeminfo -virtual 0x813f400.0x7ac02300 /padumps/type9
...
VA belongs to PRIVATE reg_t 0x494cb4c0:
    Region index: 2
    Page valid  : 1
    Page frame  : 0x77802
    dbd_type    : DBD_NONE
    dbd_data    : 0xfffff0c
    Front store : struct vnode 0
    Back store  : struct vnode 0x413ff140
VA belongs to process "a.out" pid 1802, MMAP preg_t 0x4949fb80.
User shared:
$ kmeminfo -virtual 0x0c6f6c00.0xc1381000 /dumps/pa/java_perf
...
VA belongs to SHARED reg_t 0x4a75a600:
    Region index: 0
    Page valid  : 1

    Page frame  : 0x5b211
    dbd_type    : DBD_NONE
    dbd_data    : 0x1fffff0c
    Front store : struct vnode 0
    Back store  : struct vnode 0x401f3e00
List of pregions sharing the region:
    pid   preg_t              type   vaddr               bytes       command
    1551  0x000000004a90e500  SHMEM  0x00000000c1381000  0x00074000  HPSS7
    1550  0x000000004a8c6b00  SHMEM  0x00000000c1381000  0x00074000  ss7waiter.TSC
    1530  0x000000004a7cbd00  SHMEM  0x00000000c1381000  0x00074000  ttlRecover

Kernel buffer cache:
$ kmeminfo -virtual 0xfffc00005cefa000 /iadumps/bbds_panic
...
VA belongs to buffer cache, struct buf 0xe000000118b54500.

Kernel byte bucket:
$ kmeminfo -virtual 0x4a90e520 /padumps/java_perf
...
VA belongs to "bucket_64bit" (cpu 0, index 8, size 256 bytes).
VA is within the object at:
    Start: 0x4a90e500
    Size : 0x00000100  256 bytes
    End  : 0x4a90e600
The object is currently in use.

Kernel page bucket:
$ kmeminfo -virtual 0x4b9f83c0 /padumps/java_perf
...
VA belongs to "page_buckets_64bit" (index 2, size 8 KB).
VA is within the object at:
    Start: 0x4b9f8000
    Size : 0x00002000  2 pages  8 KB
    End  : 0x4b9fa000
The object is free (on the bucket free list at position 3).

Kernel arena:
$ kmeminfo -virtual 0x4bf4a940 /dumps/pa/arena_corr
...
VA belongs to variable arena "M_PREG" (cpu 0, index 4, size 248 bytes).
VA is within the object at:
    Start: 0x4bf4a940
    Size : 0x000000f8  248 bytes
    End  : 0x4bf4aa38
The object is in use.

Kernel super page pool:

$ kmeminfo -virtual 0xe000000145812f00 /iadumps/bbds_panic
...
VA belongs to a free chunk on "super_page_pool.sp_pool_list[11]":
    Start: 0xe000000145800000
    Size : 0x0000000000800000  2048 pages  8 MB
    End  : 0xe000000146000000

Kernel sysmap:
$ kmeminfo -virtual 0xe00000011022d000 /dumps/ia/bbds_panic
VA 0xdead31.0xe00000011022d000 does not have a translation.
Hashing details:
    thash=0xe0000001203b9000=vhpt[121984] ttag=0x006f56800011022d
    pte_t               pte_next            pte_tag
    0xe0000001203b9000  0x0000000000000000  0x006f56c00005022d
VA belongs to a free chunk on "sysmap":
    Start: 0xe00000011022d000
    Size : 0x0000000000003000  3 pages  12 KB
    End  : 0xe000000110230000

shminfo
http://oceanie.grenoble.hp.com/georges/shminfo.html
A tool to troubleshoot shared memory allocation, by Georges Aureau - WTEC HPUX.

Description
shminfo looks at the resource maps of the available shared quadrants (global and/or private) and at the allocated shared memory segments, then it prints a consolidated map of what is free/used. It also prints information about system limitations (shmmax, swap space), and about allocation policies.
shminfo supports the following features/releases:
    10.20 MR...
    10.20 PHKL_8327 SHMEM_MAGIC executables...
    10.20 PHKL_15058 Best-fit allocation policy...
    10.30 MR/LATEST.
    11.00 MR 32bit or 64bit shared space...
    11.00 PHKL_13810 Memory windows...
    11.00 PHKL_16236 BigSpace memory windows...
    11.00 PHKL_20224 Q3 Private executables...
    11.00 PHKL_20995 Fix for Q3 Private breaking memory windows...

Usage
When run without any options, shminfo prints the following 4 sections:
    Global 32-bit shared quadrants.
    Private 32-bit shared quadrants.
    Limits for 32-bit SHMEM allocation.
    Allocation policy for 32-bit shared segments.
If the current directory contains a crash dump (an INDEX file is present) then the crash dump is opened; otherwise it opens /stand/vmunix and /dev/kmem.
The -help (or -h) option gives the full usage:
Usage: shminfo [options ...] [coredir | kernel core]
Default: coredir="." if "INDEX" file present
         else kernel="/stand/vmunix" core="/dev/kmem"

Options:
    -s | -shmem
    -w | -window
    -g | -global
    -p | -private
    -f | -free
    -F | -bigfree
    -l | -limits
    -W | -64bit
    -h | -help
    -v | -verbose
    -a | -async
    -V | -version
    -u | -update
The -global option prints only the maps of the global shared quadrants.
The -private option prints only the maps of the private shared quadrants (memory windows).
The -window <id> option prints the map of the specified memory window.
The -free option prints only the free virtual space of the shared quadrants, and the -F (-bigfree) option prints the largest free chunk within each shared quadrant (very useful when debugging shmem allocation problems).
The -limits option prints only the system limitations for 32bit shared memory segment allocation.
The -shmem <id> option prints the list of the processes currently attached to the specified shmem id.
The -W (-64bit) option prints the maps of 64bit shared quadrants.
The -async option (courtesy of Peter Hryczanek, thanks Pete :-) prints information about the registered async segments.

Shared quadrants
Shared space is allocated from shared quadrants. A shared quadrant can be global or private. Space allocated from a global shared quadrant can be shared by all processes on the system. However, space allocated from a private shared quadrant can only be shared by a group of processes (see memory windows below).

10.20 MR
There are only two global shared quadrants, with a total of 1.75GB of shared space:
    10.20 MR global shared quadrants
         Space     Start       End         Size
    Q3   q3 space  0x80000000  0xc0000000  1GB
    Q4   0         0xc0000000  0xf0000000  .75GB
An application can allocate up to 1.75GB of shared segments. However, the largest individual shared segment is limited to 1GB (the size of a quadrant), ie. to allocate 1.75GB of shared segments, the application would have to get at least 2 individual shared segments.

10.20 PHKL_8327 (SHMEM_MAGIC)
A third global shared quadrant was added in order to support SHMEM_MAGIC executables ("chatr -N", PHSS_8358), thus giving a total shared space of ~2.75GB:
    10.20 LATEST global shared quadrants
         Space     Start       End         Size
    Q2   q2 space  0x40000000  0x7ffe6000  ~1GB
    Q3   q3 space  0x80000000  0xc0000000  1GB
    Q4   0         0xc0000000  0xf0000000  .75GB
An SHMEM_MAGIC application can allocate up to ~2.75GB of shared segments. However, the largest individual shared segment is limited to 1GB (the size of a quadrant), ie. to allocate ~2.75GB of shared segments, the application would have to get at least 3 individual shared segments. Also, note that there is a gap of virtual space between the Q2 and Q3 shared quadrants: this gap is actually used for the UAREA (starting at 0x7ffe6000 on 10.x).

10.20 PHKL_15058 (BEST FIT)
Best-fit allocation policy was introduced to reduce the fragmentation of the shared space resource maps.

11.0 MR
Support for 32bit shared space and 64bit shared space. There are three 32bit global shared quadrants; however, the layout of those quadrants differs between 32bit and 64bit kernels for the end of 32bit Q2. For 32bit kernels, the UAREA sits at the end of 32bit Q2 (starting at 0x7fff0000 on 11.0), thus creating a virtual space gap between the Q2 and Q3 32bit shared quadrants:

    11.0 MR global 32bit shared quadrants on 32bit kernel
         Space     Start       End         Size
    Q2   q2 space  0x40000000  0x7fff0000  ~1GB
    Q3   q3 space  0x80000000  0xc0000000  1GB
    Q4   0         0xc0000000  0xf0000000  .75GB

    11.0 MR global 32bit shared quadrants on 64bit kernel
         Space     Start       End         Size
    Q2   q2 space  0x40000000  0x80000000  1GB
    Q3   q3 space  0x80000000  0xc0000000  1GB
    Q4   0         0xc0000000  0xf0000000  .75GB

There are two 64bit global shared quadrants (64bit kernel only). Those 64bit shared quadrants are used by 64bit applications:
    11.0 MR global 64bit shared quadrants on 64bit kernel
         Space     Start               End                 Size
    Q1   q1 space  0x0000000000000000  0x0000040000000000  4TB
    Q4   q4 space  0xc000000000000000  0xc000040000000000  4TB

Also, note that for 64bit kernels the UAREA sits at the end of 64bit Q2. The table below gives the UAREA location for each kernel width:
    UAREA location
    Kernel  Q2 start/end                               UAREA start
    32bit   0x40000000 / 0x80000000                    0x7fff0000
    64bit   0x40000000'00000000 / 0x40000400'00000000  0x400003ff'ffff0000
As you can see, the end of the 32bit Q2 shared quadrant overlaps with the UAREA on 32bit kernels but not on 64bit kernels. On a 64bit kernel, the 32bit Q2 shared quadrant is actually within the 64bit Q1 quadrant, hence it does not overlap with the UAREA, which sits in the 64bit Q2 quadrant. This is an important point to notice, as it explains why the BigSpace Window and Q3 Private features (described below) are only available on 64bit kernels.

11.0 PHKL_13810 (MEMORY WINDOW)
This patch introduced support for memory windows. In previous HPUX releases, the available shared quadrants were global shared quadrants. With memory windows, each window gives 2 private shared quadrants. Memory windows are only used by 32bit applications (running on either 32bit or 64bit kernels), ie. a memory window provides 32bit shared space. The memory window with id 0 is known as the "global memory window" (or "default memory window"). Each memory window provides ~2GB of private shared space:
    Private shared quadrants for memory window index i on 32bit kernel
         Space           Start       End         Size
    Q2   mw[i] q2 space  0x40000000  0x7fff0000  ~1GB
    Q3   mw[i] q3 space  0x80000000  0xc0000000  1GB

    Private shared quadrants for memory window index i on 64bit kernel
         Space           Start       End         Size
    Q2   mw[i] q2 space  0x40000000  0x80000000  1GB
    Q3   mw[i] q3 space  0x80000000  0xc0000000  1GB

    Global 32bit shared quadrant
         Space     Start       End         Size
    Q4   0         0xc0000000  0xf0000000  .75GB

11.0 PHKL_16236 (BIG SPACE MEMWIN)
BigSpace memory windows ("setmemwindow -b", PHCO_16795), on 64bit kernels only. When configured for "big space", the memory window is set with its q2 space matching its q3 space, thus allowing (on 64bit kernels only) 2GB of virtually-contiguous (same space id, and no gap at the end of Q2 for the UAREA) shared space:
         Space           Start       End         OS     Size
    Q2   mw[b] q2 space  0x40000000  0x80000000  64bit  1GB
    Q3   mw[b] q3 space  0x80000000  0xc0000000  64bit  1GB
The big space window can be viewed as:

            Space        Start       End         OS     Size
    Q2 Q3   mw[b] space  0x40000000  0xc0000000  64bit  2GB
A 32bit SHMEM_MAGIC application can then allocate two 1GB shared memory segments and treat those two segments as a single 2GB segment. Note that the largest 32bit shared memory segment which can be allocated through a single shmget() call is limited by the size of a quadrant, hence 1GB. The Big Space window allows a work-around for this limitation: through 2 shmget() calls, an application is now able to get a "2GB" shared memory segment, as sketched below.
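A minimal C sketch of that idea follows. It assumes a 32bit SHMEM_MAGIC executable started under a BigSpace memory window (e.g. setmemwindow -b -i <id> ./a.out), that shmmax allows 1GB segments, and that the attach addresses shown are illustrative; this is not a supported recipe, just an illustration of the two shmget()/shmat() calls (shmalloc provides a more complete test program):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #define ONE_GB (1024L * 1024L * 1024L)

    int main(void)
    {
        /* two 1GB segments; the shmmax tunable must allow this */
        int id1 = shmget(IPC_PRIVATE, ONE_GB, IPC_CREAT | 0600);
        int id2 = shmget(IPC_PRIVATE, ONE_GB, IPC_CREAT | 0600);

        if (id1 < 0 || id2 < 0) {
            perror("shmget");
            return 1;
        }
        /* Attach one segment in Q2 and one in Q3; under a BigSpace window both
         * quadrants share the same space id, so the two attachments behave like
         * one virtually contiguous 2GB area. The addresses are illustrative. */
        void *p1 = shmat(id1, (void *)0x40000000, 0);
        void *p2 = shmat(id2, (void *)0x80000000, 0);

        printf("segment 1 at %p, segment 2 at %p\n", p1, p2);
        return 0;
    }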

11.0 PHKL_20224... (Q3 PRIVATE)
Q3 private executables ("chatr +q3p enable", PHSS_19866), on 64bit kernels only. Q3 private processes are attached to the Q3 private memory window (id 1). The Q3 private memory window does not hold any virtual space; its resource maps for both its q2 and q3 quadrants are kept empty, thus forcing allocation of shared space from the Q4 global shared quadrant. This essentially gives 3GB of private space (Q1, Q2, Q3 are considered as quadrants for private data). Running shminfo -w 1 (ie. the q3 private window id is 1) reports no free space:
$ shminfo -w 1
libp4 (7.98): Opening /stand/vmunix /dev/kmem
shminfo (3.7)
Shared space from Window id 1 (q3private):
    Space  Start-End                          Kbytes   Usage
    Q2     0xffffffff.0x40000000-0x7fffffff   1048576  OTHER
    Q3     0xffffffff.0x80000000-0xbfffffff   1048576  OTHER

11.0 PHKL_20995 (Q3 PRIVATE FIX)
Fix for Q3 private breaking memory windows. The initial Q3 private patches (PHKL_20227 and PHKL_20836) had a defect causing memory windows (other than the global window) to not be properly initialized. When Q3 private is configured and all the windows except the global window are empty, we are very likely to be running into the Q3 private bug, and shminfo prints the following warning:
WARNING: Q3 Private is enabled but "shminfo" couldn't find a memory window with free virtual space: this is suggesting that one of PHKL_20227 or PHKL_20836 is installed (both are BAD patches). Please, make sure that PHKL_20995 is installed!

shmalloc
Along with the shminfo executable, the ktools shminfo.exe archive provides shmalloc.c, a C program to exercise shared memory allocation:
# shmalloc -h
Purpose: To troubleshoot 32bit shared memory allocation problems.
It should be used along with "shminfo" (v3.0 or greater)...

Usage: shmalloc [options ...]
Options:
    -s size      segment size in KB (default=4, max=1048576=1GB)
    -c number    number of segments (default=1, max=10)

    -g           Global space (IPC_GLOBAL), instead of window space
    -n           Narrow mode (IPC_SHARE32), useful when compiling +DD64
    -l           Lock segments in memory (root or PRIV_MLOCK groups only)
    -t seconds   Touch all the pages during <seconds>
    -h           prints the help
Compiling shmalloc.c:
    SHARED_MAGIC: cc -o shmalloc shmalloc.c
    EXEC_MAGIC:   cc -Wl,-N -o shmalloc shmalloc.c
    SHMEM_MAGIC:  cc -Wl,-N -o shmalloc shmalloc.c; chatr -M shmalloc
Example: Allocating 2 shmem segments of 1GB each:
    Default window : ./shmalloc -c 2 -s 1048576
    Global space   : ./shmalloc -c 2 -s 1048576 -g
    Window id 100  : setmemwindow -i 100 ./shmalloc -c 2 -s 1048576
    BigSpace id 100: setmemwindow -b -i 100 ./shmalloc -c 2 -s 1048576

vmtrace
http://psweb1.cup.hp.com/~projects/vm/tools/vmtrace/index.html

Introduction
Vmtrace is a tool for debugging incorrect use of dynamically allocated kernel memory in HPUX. This is memory allocated using kmem_arena_alloc(), kmem_arena_varalloc(), MALLOC(), kmalloc(), and other related routines and macros.
Vmtrace consists of 3 parts:
    The main portion is built into all kernels since 10.10, and was available as a patch for some earlier releases. It's normally inactive, and must be enabled in order to use it.
    A user space tool called 'vmtrace' is the normal means of enabling vmtrace.
    A perl script called 'vmtrace.pl' is used with Q4 to assist in analyzing information recorded by vmtrace.
Users should be aware that the implementation of vmtrace has changed over time, and many of the early versions of user space vmtrace components lack version information. It's thus possible to get confusing results when using the wrong versions of the tools. They should also be aware that the user interface has changed over time. In particular, the kernel memory allocator was rewritten in 11.11, leading to many changes in the behavior and interface of vmtrace. In order to reduce confusion caused by version incompatibilities, the most recent (11.11) versions of the user space components have built-in backwards compatibility; if they are run on an older kernel, they will emulate the behavior of their 11.10 versions. This means that the 11.11 version of the user space tool will work on all vmtrace kernel implementations, except for a Beta version of vmtrace that existed for about a month during 11.11 development. The 11.11 version of the vmtrace perl script will work for 11.00 and later kernels except, once again, the 11.11 Beta version. Be aware that when operating in compatibility mode, the user interface is significantly different.
Users should also be aware that when built in OSDEBUG mode, 11.11 and later kernels have significant built-in kernel memory allocation debugging capability even without vmtrace. Thus, you may not need vmtrace to find your problem on such a kernel.

VMTRACE - How It Works (Pre 11.11)


Introduction
It has often been the case that one subsystem allocates a chunk of kernel memory through a MALLOC call and, after releasing the memory through a FREE call, incorrectly accesses it. This illegal operation often ends up corrupting other data structures in the kernel that use the same size. The consequences can be very drastic, since there is no way to track down the offending code. The other common case which has been noticed is that a subsystem allocates a chunk of kernel memory but never releases it. Repeated use of this path can cause a drain on system memory and eventually lead to a system hang or very poor performance.
A mechanism has been put in place in the regular kernel to enable tracking of these often difficult bugs. This tool enables an engineer to start online tracing without rebooting. If there is a memory corruption problem, the system will eventually panic and the stack trace should show the culprit. If there is a leak, the system needs to be brought to single user mode and then a memory dump should be taken for further analysis of the log.

What needs to be considered before using vmtrace?
First of all, vmtrace will have an impact on performance as well as on memory consumption. The performance impact typically is not too bad, about 2-3%, but in cases where applications benefit greatly from large pages, the performance impact can be more visible. Memory consumption is more of a concern. When tracing for memory corruption, all sizes less than a page get a full page allocated, so there is increased use of system memory. For example, a 256 byte allocation now uses a page (4096 bytes). So if a particular system uses the 256 byte size very heavily, there is a considerable increase in memory usage. For normal to heavy workloads 64 MB should be sufficient. This was a problem a while ago, but with current systems and their memory sizes there should not be too much concern in using vmtrace.
When tracing for memory corruption, the system will be paniced by vmtrace as soon as a corruption is detected. This is certainly something the customer needs to be prepared for. For all other cases, the system will need to be brought down manually. This can be accomplished by shutting down the system to single user mode and then TOC'ing the system. Make sure that the dump space is large enough to be able to save the majority of the memory contents. To be on the safe side, allow a full dump to be saved.
On 10.20 systems that use PA2.0 processors, usage of superpages has to be switched off before enabling vmtrace. In 11.00 this is not necessary anymore. Here is how to do so with adb:
# adb -w /stand/vmunix

kas_force_superpages_off?W 0x1

$q
Lastly, the list of patches that need to be installed before using vmtrace:
    10.20: PHKL_8377 s800 vmtrace:malloc()
           PHKL_8376 s700 10.20 vmtrace and malloc() patch
    11.00: PHKL_17038 vmtrace:data:page:fault (a dependency on PHKL_18543!)
           PHNE_18486 streams:cumulative
On 11.11 vmtrace operation can be enabled as well as stopped while the system is alive. This allows you to start vmtrace, reproduce the problem, and then disable vmtrace again afterwards. So far no patches are needed to be able to use vmtrace on 11.11.

Common Cases and their symptoms
Memory Corruption
This is the most common and the most difficult to debug. There are several ways in which memory corruption can happen.
Case 1: A subsystem allocates memory, uses it and then releases it. After releasing it, the subsystem continues to write to it. This causes corruption of other data structures which have currently been allocated this chunk of memory.
Case 2: A subsystem allocates a chunk of memory but writes beyond the size allocated to it. This causes corruption of data structures which have chunks of memory adjacent to this chunk.
Case 3: A subsystem allocates a chunk of memory, uses it and then releases it twice. This often causes corruption of the first word since it is used as a link in the free list. The symptoms can be drastic since the same chunk could be allocated to two subsystems simultaneously. It could also lead to severe damage of the free list.
Memory Leaks
In this situation a subsystem allocates a chunk of memory but does not release it when done. If there is continuous use of this path, there is

a constant drain of memory which could lead to a hang or poor performance.
Tracing Mechanism
This mechanism does not change the MALLOC and FREE macros, so it avoids any performance overhead when tracing is disabled. It does have a couple of checks in kmalloc() and kfree(), but since these routines are not called often the performance overhead is almost negligible.

When tracing is enabled for memory corruption, sizes less than a page are always allocated a full page. When the virtual address is released, we remove the translation and keep the virtual address in a backup pool. So if the offending code touches a virtual address which has been released, it should cause a Data fault at the exact location of the illegal access. Also, the remaining portion of the page for sizes less than a page is scrambled with known values before giving the virtual address to the subsystem. When the subsystem returns the memory, we verify that the scrambled portion remains the same. If it has been written to, we panic the kernel. This happens when someone writes beyond their requested size. This situation is detected only when the subsystem releases the memory, not at the exact instruction when someone wrote to the illegal location.
When tracing is enabled for memory leaks, we allocate the usual sizes, but every time before we give the virtual address to the subsystem we enter it into a log. When the address is released we remove it from the log. So the log consists of virtual addresses currently in use. If there is a drain of system memory we can analyze the log and pinpoint the culprit with the information in the log.
When tracing is enabled for general logging, all allocations and deallocations are tracked in a circular log.
Memory and Performance Degradation
When tracing is enabled, there is some memory and performance degradation. When tracing for memory corruption, all sizes less than a page are allocated a page, so there is increased use of system memory. For example, a 256 byte allocation now uses a page (4096 bytes). So if a particular system uses the 256 byte size very heavily, there is considerable loss of memory. For normal to heavy workloads 64MB should be sufficient. But if, for instance, an application memory maps 10,000 segments of 32 pages or less, it could use 10,000 256 byte allocations which get allocated 10,000 pages (~40 MB). Since there is no way to determine the impact of this in advance, the best way is to enable it online, and if the system runs out of memory, add lots of memory and run it again. This should not be a problem in most cases and is definitely worth enabling online until we hit the problem.
When tracing for memory leaks or general tracing, the default size for each log is 2MB. The allocation sizes remain the same, so there is no memory issue as in the case of tracing memory corruption, and additional memory is never needed.
Whenever vmtrace is turned on (any mode), large pages will no longer be allocated for kernel memory, not even for bucket sizes not being traced. This can cause performance degradation on large memory systems.
VMTRACE - How to Use It (11.00 to 11.10)
By default the performance kernel does not do any tracing. However, the kernel contains the necessary code capable of online tracing of buckets. Tracing for chosen buckets and for a chosen type can be enabled through a tool called vmtrace, or by setting certain kernel global variables with adb and rebooting the kernel. The steps to be followed are:
Identify the bucket size from the dump. This can be done in two ways for memory corruption. Given the data structure which is corrupted, find the size of the data structure; this should give the bucket size. The second way is to use the function WhichBucket in the perl script kminfo.pl with q4. For memory leaks you can use the function Dbuckets in the perl script kminfo.pl with q4.
If there is a bucket which has a large number of pages allocated for that size, then it is very probable that the corresponding bucket is leaking. For more information on how to use the perl script kminfo.pl please refer to the documentation in the perl script. Turn on vmtrace. There are several possible ways to do this:

Run the tracing tool vmtrace. This is an interactive tool which would prompt for bucket sizes and type of tracing i.e whether you want to detect memory corruption, memory leaks or just want to log all memory allocation/deallocation calls. This tool can also be run with

command line arguments, bypassing the menus entirely. This can be useful when you want to run it from a script. The syntax for this option is:
    vmtrace -b <bucket map> -f <flags>
where bucket map is a bit map. The corresponding bits need to be set for the appropriate sizes being traced. The chart below shows the mapping of each bit to the corresponding bucket size. The bits are numbered from right to left starting with 0.

    BIT    Description
    0-4    ********            ( NOT VALID )
    5      32 byte bucket      ( VALID )
    6      64 byte bucket      ( VALID )
    7      128 byte bucket     ( VALID )
    8      256 byte bucket     ( VALID )
    9      512 byte bucket     ( VALID )
    10     1024 byte bucket    ( VALID )
    11     2048 byte bucket    ( VALID )
    12     4096 byte bucket    ( VALID )
    13     2 page bucket       ( VALID )
    14     3 page bucket       ( VALID )
    15     4 page bucket       ( VALID )
    16     5 page bucket       ( VALID )
    17     6 page bucket       ( VALID )
    18     7 page bucket       ( VALID )
    19     8 page bucket       ( VALID )
    20     > 8 pages           ( VALID )
    21-31  *******             ( NOT VALID )
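For scripted use, the bucket map can be computed from the chart above. The small C helper below is only an illustration of the bit numbering (it is not part of the vmtrace tool), written from the chart as an assumption about how one might build the -b argument:

    #include <stdio.h>

    /* Map a bucket size in bytes to its bit, following the chart:
     * bit 5 = 32 bytes, ..., bit 12 = 4096 bytes (one page),
     * bits 13-19 = 2..8 page buckets, bit 20 = anything larger. */
    static unsigned int bucket_bit(unsigned int bytes)
    {
        unsigned int bit = 5;
        unsigned int size = 32;

        while (size < bytes && bit < 20) {
            if (size < 4096)
                size *= 2;          /* power-of-2 buckets up to one page */
            else
                size += 4096;       /* then whole-page buckets up to 8 pages */
            bit++;
        }
        return 1u << bit;
    }

    int main(void)
    {
        /* 128 and 256 byte buckets: prints 0x180, matching the
         * "vmtrace -b 0x180 -f 0x1" example below. */
        printf("-b 0x%x\n", bucket_bit(128) | bucket_bit(256));
        return 0;
    }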

The flags parameter should be an OR of the following values:
    1 = Tracing for Memory Corruption
    2 = Tracing for Memory Leaks
    4 = Tracing for Logging
For example, vmtrace -b 0x180 -f 0x1 would enable tracing for the 256 and 128 byte buckets and log for memory corruption. If you type vmtrace -b 0x100 -f 0x7, tracing would be enabled for the 256 byte bucket and log for memory corruption, memory leaks and general logging. Please note that for sizes up to a page the bit corresponds to the size, i.e. you can just OR the sizes.
If you prefer, you can turn on vmtrace by setting global variables and rebooting. You must do this to trace problems that occur before reaching the multi-user prompt. PLEASE NOTE that this tool should be used with caution. In the case of tracing for memory corruption, please understand the memory issues as described in the next section before tracing for multiple buckets.
After the dump is taken, analyze the logs using the perl script vmtrace.pl in conjunction with Q4. Note that it is important to have the right version of this script to match the kernel being debugged.
After tracing is enabled, the following symptoms would be observed:

Memory Corruption
For case 1, the kernel panics with a "Data Fault" exactly at the location of the offending instruction which was accessing an address that was released earlier. By running the Q4 perl script vmtrace.pl you can look at a log and find the stack trace which released this address. This should be sufficient to help one find the bug easily.
For case 2, the kernel panics when it detects that someone wrote beyond their allocated size. The stack trace should show the location where the kernel released this memory. This does not give the exact offending instruction that wrote beyond the allocated size, but gives approximately which data structure was the offending one. Then one needs to match the corresponding MALLOC and find the bug through other variables. If this becomes hard, at least we know that this was the case, and the tracing mechanism in the kernel can be enhanced to detect this case more precisely.
For case 3, the kernel panics in either FREE() or in vmtrace_kfree(). In vmtrace_kfree() we print that there was no translation for the address. This is because the translation for the virtual address was removed by an earlier FREE(). If the kernel panics in FREE() then it panics with a data fault. Again, by looking at the log one can find when it was released earlier and find the bug very easily.

Memory Leaks
In this case the symptom would be a drain in memory. After the system displays an appreciable loss of memory, the system should be brought to single user mode with the "shutdown" command. After the system comes to single user mode, a dump should be taken with a "TOC". The dump can then be analyzed with the Q4 perl script vmtrace.pl to find the culprit. The output of the script should give a list of the outstanding MALLOCs with the corresponding stack traces. If there is one stack trace with very many entries in the log, it is the most probable culprit.
Caveats
When tracing for memory corruption, there could be fragmentation of the virtual address resource map called the sysmap. This could cause it to lose some virtual addresses. Losing virtual addresses does not mean losing physical memory. On 32 bit systems which have large physical memory (> 2GB) we could reach the virtual space limit (approx. 1 GB) under very heavy load. If this limit is reached, the kernel panics with the message "kalloc: out of kernel virtual space". For large 32-bit systems, do not trace multiple buckets if this panic happens. If this panic occurs even when tracing one bucket, the instrumentation needs to be enhanced for this customer to avoid this case.
There's also the possibility of vmtrace for corruption causing a small memory system to use so much extra physical memory that thrashing occurs, producing severe performance degradation.
There are multiple versions of the vmtrace.pl perl script. Some do not work with 64 bit kernels. Others do not work with pre-11.00 kernels.

Vmtrace in 11.11 or Later Kernels
VMTRACE - How It Works
Introduction
Kernel memory corruption has been a recurrent problem in HPUX. This occurs when one caller allocates a chunk of memory, and then either writes beyond the end of the chunk size requested, or frees the memory and continues to use it, or frees it more than once. This is likely to result in corrupting other structures allocated from the same kernel memory arena; sometimes it even spills beyond the specific arena and affects the superpage pool. These problems can be extremely difficult to debug. The resulting panic is likely to occur long after the memory was corrupted, and may affect any code that uses the same arena, or code remotely connected with a user of that arena (such as something called with a parameter taken from an object allocated from that arena). Moreover, some arenas (particularly compatibility mode arenas like M_TEMP) have a very large number of users, not always even from the same subsystem. This can make diagnosis and triage rather difficult.

In pre-11.11 kernels the situation was even worse. There were no arenas; instead, all allocations of approximately the same size (to within a power of 2, eg. 129-258 byte allocations) shared the same "bucket", and could corrupt each other. Another common problem is that a subsystem allocates a chunk of kernel memory but never releases it. Repeated use of this path can lead to a drain in the system memory and eventually lead to a system hang or very poor performance. The arena allocator, when built in OSDEBUG mode, contains some mechanisms intended for fast detection and isolation of memory corruption problems. However, to supplement these, and to handle performance kernels and memory leaks, we have a kernel memory debugging tool called vmtrace. Vmtrace is built into the regular kernel, whether debug or performance. It simply needs to be enabled, using a user space tool called, simply "vmtrace", or (when looking for problems that occur early in system initialization) by setting certain kernel global variables with adb and rebooting. If there is a memory corruption problem, the vmtraced system will panic with a stack trace showing the culprit. In the case of leaks, the system needs to be brought to single user mode and then a memory dump should be taken for further analysis of the vmtrace leak log.

Common Cases and their symptoms
Memory Corruption
This is the most common and the most difficult to debug. There are several ways in which memory corruption can happen.
Case 1 (Stale Pointer):

A subsystem allocates memory, uses it and then releases it. After releasing it, the subsystem continues to write to it. This causes corruption of other data structures which have currently been allocated this chunk of memory.
Case 2 (Overrun):
A subsystem allocates a chunk of memory but writes beyond the size allocated to it. This causes corruption of data structures which have chunks of memory adjacent to this chunk.

Case 3 (Double Free):
A subsystem allocates a chunk of memory, uses it and then releases it twice. This often causes corruption of the first word since it is used as a link in the free list. The symptoms can be drastic since the same chunk could be allocated to two subsystems simultaneously. It also could lead to severe damage of the free list.
Memory Leaks
In this situation a subsystem allocates a chunk of memory but does not release it when done. If there is continuous use of this path, there is a constant drain of memory which could lead to a hang or poor performance.

Tracing Mechanism
This mechanism does not change the kmem_arena_alloc(), kmem_arena_varalloc() or kmem_arena_free() routines or associated macros. So, this avoids any performance overhead when tracing is disabled. Instead, it uses the function pointer associated with each arena free list; if vmtrace is in use on that free list, the function pointer will be set, and a special vmtrace version of the function will be called.

Corruption Tracing Modes
There are 3 different modes for handling memory corruption. These are called, somewhat unimaginatively:
    Light corruption mode
    Standard corruption mode
    Heavy corruption mode
The names are based on the effect on memory usage. The light mode uses only a little extra memory; standard mode uses rather more (at least a page per traced allocation) and heavy mode uses even more than that (at least 2 pages per traced allocation). In particular, the 3 corruption modes do the following:
Light Corruption Mode
This mode is pretty much the same as the built in features of the OSDEBUG kernel, except that it's available in customer/performance kernels.

It detects double frees when the second free occurs.
It detects overruns, but only when the memory is freed.
It cannot detect underruns.
It cannot detect stale pointer references.
Each allocation has a few extra bytes of padding added at the end, with well known contents; these are checked on free to detect overrun. When memory is in use, a bit is set in the object header to indicate this. When it's freed, the bit is checked, and then cleared. This is used to detect double frees.
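The following user-space C sketch illustrates the light corruption mode idea described above (guard bytes checked on free, plus an in-use flag for double frees). The header layout, pattern value and function names are invented for illustration; this is not the kernel's actual implementation:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAD      8
    #define PATTERN  0xA5

    struct hdr {
        size_t size;
        int    in_use;
    };

    static void *traced_alloc(size_t size)
    {
        struct hdr *h = malloc(sizeof *h + size + PAD);
        h->size = size;
        h->in_use = 1;
        memset((char *)(h + 1) + size, PATTERN, PAD);   /* guard bytes after the object */
        return h + 1;
    }

    static void traced_free(void *p)
    {
        struct hdr *h = (struct hdr *)p - 1;
        unsigned char *guard = (unsigned char *)p + h->size;

        if (!h->in_use) {
            fprintf(stderr, "double free of %p\n", p);  /* the kernel would panic here */
            abort();
        }
        for (int i = 0; i < PAD; i++)
            if (guard[i] != PATTERN) {
                fprintf(stderr, "overrun past %p detected on free\n", p);
                abort();
            }
        h->in_use = 0;
        free(h);
    }

    int main(void)
    {
        char *buf = traced_alloc(16);
        buf[16] = 'X';          /* one byte past the requested size */
        traced_free(buf);       /* the guard check reports the overrun */
        return 0;
    }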

Standard Corruption Mode
This mode is particularly useful for detecting stale pointer use.
It detects double frees when the second free occurs.
It detects overruns, but only when the memory is freed.
It detects stale pointer use.
With this mode, sizes less than a page are increased to be a full page. The rest of the page is filled with well known contents; these are checked on free to detect overrun. When memory is freed, the page(s) are protected (made inaccessible) and placed on the free list. Any attempt to access this memory (via a stale pointer) before it's reallocated for some other use will cause a protection fault panic.

Heavy Corruption Mode
This mode is designed to diagnose overruns efficiently. It also gives significantly more information in case of double frees and stale

pointer references, and is the only mode that can give any information at all in case of underrun.
It detects double frees when the second free occurs.
It detects overruns immediately.
It detects stale pointer use.
It can detect some underrun errors.
With this mode, sizes are increased to whole page(s), and then an extra page is added. That extra page is protected (made inaccessible). The caller is given an object that ends just before the beginning of the last page, so that an overrun will immediately panic with a protection fault. Known contents are written to the part of the page before the object and its object header; these might possibly detect an underrun problem. When memory is freed, its translation is deleted. The physical memory is returned to the physical memory allocator, but the virtual addresses are placed in an aging queue, along with information (stack trace, etc.) about when they were freed. A stale pointer reference will cause a data page fault panic; its stack trace will show where the stale reference was made, and the information in the aging queue can be used to find out when the memory was freed.
If a double free occurs, a panic will occur with vmtrace_free_heavy_corruption() in the stack trace, possibly when attempting to delete the non-existent translation, or possibly earlier, e.g. a data page fault in the destructor function.
If an underrun occurs, it might simply corrupt the object header, in which case there's little the code can usefully do. But if it corrupts the early part of the page (before the object header) it will be detected when the object is freed. (This isn't an especially good method, but underruns are extremely rare.)

Other Modes
There are also 2 other modes, which may be enabled individually, or in combination with each other and/or any single corruption mode.

Leak Mode
When tracing is enabled for memory leaks, we allocate the usual sizes, but every time before we give the virtual address to the subsystem we enter it into a log. When the address is released we remove it from the log. So the log consists of virtual addresses currently in use. If there is a drain of system memory we can analyze the log and pinpoint the culprit with the information in the log. Each log record also contains information about the allocation, in particular the stack trace of the code that called the allocation function. To improve performance (when searching for the address being freed) the log is organized as a hash table. This has the side effect that when printing out addresses allocated but not yet freed, the addresses are printed in no particular order.
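A user-space C sketch of this bookkeeping is shown below. The record layout, hash function and names are invented for illustration, and a string stands in for the recorded stack trace; it only demonstrates the idea of a hash table of outstanding allocations:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define NBUCKETS 1024

    struct leak_rec {
        void            *addr;
        size_t           size;
        const char      *caller;        /* stands in for the recorded stack trace */
        struct leak_rec *next;
    };

    static struct leak_rec *leak_log[NBUCKETS];

    static unsigned int log_hash(void *p)
    {
        return (unsigned int)(((uintptr_t)p >> 4) % NBUCKETS);
    }

    static void *leak_alloc(size_t size, const char *caller)
    {
        void *p = malloc(size);
        struct leak_rec *r = malloc(sizeof *r);

        r->addr = p; r->size = size; r->caller = caller;
        r->next = leak_log[log_hash(p)];    /* record every outstanding allocation */
        leak_log[log_hash(p)] = r;
        return p;
    }

    static void leak_free(void *p)
    {
        struct leak_rec **rp = &leak_log[log_hash(p)];

        for (; *rp != NULL; rp = &(*rp)->next)
            if ((*rp)->addr == p) {         /* drop the record when the memory is freed */
                struct leak_rec *r = *rp;
                *rp = r->next;
                free(r);
                break;
            }
        free(p);
    }

    int main(void)
    {
        void *a = leak_alloc(64, "subsystem_a");
        void *b = leak_alloc(128, "subsystem_b");

        leak_free(a);
        /* whatever is still in the log is the "leak": only subsystem_b remains */
        for (int i = 0; i < NBUCKETS; i++)
            for (struct leak_rec *r = leak_log[i]; r != NULL; r = r->next)
                printf("outstanding: %p, %lu bytes, from %s\n",
                       r->addr, (unsigned long)r->size, r->caller);
        leak_free(b);
        return 0;
    }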

Logging Mode
When tracing is enabled for general logging, all allocations and deallocations are tracked in a circular log. To reduce lock contention, each cpu has its own log. Allocations and frees are recorded in the log corresponding to the cpu where the allocation was performed, even though the memory may have been freed on some other cpu. This corresponds to the arena free list used, as these are also organized on a per cpu basis to reduce lock contention.

Memory and Performance Degradation
When tracing is enabled, there is some memory and performance degradation.

Memory Usage
When tracing for memory corruption, standard and heavy corruption modes use significantly more memory for each allocation. An allocation that normally uses 256 bytes would use a whole 4096 byte page in standard corruption mode, and two of these pages (8192 bytes) in heavy corruption mode. This can add up fast. On performance kernels, light corruption mode uses slightly more memory for each allocation. (This is the same increase seen between performance and OSDEBUG kernels.)
Leak Mode, General Logging Mode, and Heavy Corruption Mode all allocate logs or buffers to store the information they record. Leak mode uses 2 MB for its log records, plus a little extra for its hash table. General logging mode uses 2 MB for each cpu. Heavy corruption mode allocates 4096 aging records per cpu. Their size depends on the CPU architecture. On a 64 bit PA-RISC system they are presently 136 bytes each, for a total of approx. .5 MB per cpu, in addition to the extra memory used for each allocation. All of these sizes are only defaults; they can be changed by adb'ing kernel global variables and rebooting.

Large Pages In 11.00 and 11.10 kernels, any use of vmtrace disabled use of large pages for all kernel memory allocations, whether or not they were traced, even those not involving MALLOC. This had significant performance impact, especially on systems with many cpus. In 11.11, large pages are only disabled for allocations where this is needed. In particular, allocations being traced in standard or heavy corruption mode. Allocations from other arenas are not affected. This greatly reduces lock contention; when the lower level allocator allocates a kernel virtual address (needed for every small page allocation) it passes through a choke point where a single lock (not a per cpu lock) is held across an inefficient algorithm.

It also reduces kernel memory fragmentation, and the risk of overflowing the sysmap, which were problems with the earlier versions of vmtrace.
VMTRACE - How to Use It
By default the performance kernel does not do any tracing. However, the kernel contains the necessary code capable of online tracing of arenas. Tracing for chosen arena(s) can be enabled through a tool called vmtrace, or by setting certain kernel global variables with adb and rebooting the kernel. The steps to be followed are:
    Determine the mode(s) to use. There are several modes available; the appropriate mode(s) to use depend on the problems to be diagnosed, and to some extent on the platform and scenario that replicates the problem. See the scenarios below for details.
    Identify the arena(s) to be traced. See the scenarios below for examples of how to do this.
    Determine the size to be traced. Most of the time, it's simplest to trace all sizes (enter zero in response to the query about the size to be traced); if tracing a heavily used variable sized arena in a situation where you believe only one size is affected, you might want to specify the size.
    Run the tracing tool vmtrace. This is an interactive tool. See How to Use the Vmtrace Tool below for details. If you prefer, you can turn on vmtrace by setting global variables and rebooting. You must do this to trace problems that occur before reaching the multi-user prompt.
    After the dump is taken you will probably need to analyze the logs using the perl script vmtrace.pl in conjunction with Q4.
PLEASE NOTE that running vmtrace can have a number of possible effects on the system. Depending on the mode, it can significantly increase kernel memory usage, increase kernel memory fragmentation, and/or introduce additional lock contention into the kernel memory allocation path. It should be used with caution.
Scenarios
Here are some likely problems, and how to use vmtrace to help debug them.
Stale Pointer
This is the situation when some code frees memory but retains a pointer to it, which is later used. In some cases, by the time the stale pointer is used, the memory may already have been reallocated for some other purpose. Depending on the code, you may want information on the code that used the stale pointer, the code that freed the memory, and/or (least common) the code that allocated it originally. You usually discover this situation because of an assertion failure or other panic, traced back to a corrupt dynamically allocated data structure, with contents that appear to be reasonable ... but incorrect. Alternatively, it's noticed while running "standard" or "heavy" corruption mode, which will panic on the access rather than on the data contents.
To get information on this:
    Run heavy or standard corruption mode. Both panic when the stale pointer is used, but heavy mode will store information on the code that did the free as well as producing a panic stack trace that points at the code which uses the stale pointer. If you can panic reliably without vmtrace's assistance, you don't absolutely have to run either of these modes.
    If you need information about the stack trace that freed the memory, run heavy corruption mode or general logging, and extract the information from the core with vmtrace.pl.
    If you need information about the original allocator of the abused memory, run general logging mode, and extract the information from the core with vmtrace.pl.

Overrun
This is the situation when some not so clever piece of code allocates fewer bytes of memory than it needs, and then writes beyond the end of the area allocated. Any corruption mode will detect this and panic when the memory is freed. This includes "light" corruption mode, and the built in debugging features of OSDEBUG kernels. Sometimes you can determine what's wrong simply by looking at the memory contents. In the more common case, you want to panic when the memory is actually overwritten. To do this, run heavy corruption mode; it will panic with a stack trace pointing to the culprit. If you want to know where the memory was allocated, run general logging mode, and extract the information from the core with vmtrace.pl. (Or you could also get the same information from leak mode.) If all you care about is where the memory was allocated, you could simply use general logging mode, and rely on (e.g.) the OSDEBUG kernel features to panic when the overwritten memory is freed. But this is unlikely to be very useful.
Underrun
Double Free
Leak
How to Use the Vmtrace Tool
There are two steps to turning on vmtrace. They must be done in the correct order.

WARNING
Once you select vmtrace modes, the only way to change them is to reboot. If this is inconvenient, be sure you know what modes you really want before using the vmtrace utility. Also, due to a defect (in the tool, not the kernel), once you select the mode(s) you should not exit the vmtrace tool until you are certain that you will not want to enable tracing on any additional arenas. (Yes, the fix is well known and obvious. However, actually making the change hasn't been given sufficient priority by management.)

Main Vmtrace Menu
    1) Start vmtrace (do this before selecting any arenas)
    2) Select an arena to trace
    3) Disable vmtrace (prevents additional memory consumption ONLY)
    4) Reenable vmtrace
    5) DONE
Pick option 1 first. This is where you select modes. If you haven't done this, anything else you attempt will fail with EINVAL.
Selecting Modes in the Vmtrace Tool
Enter action desired [ 1- 5]> 1
    1) Lightweight Corruption
    2) Standard Corruption
    3) Heavyweight Corruption
    4) Leak Detection
    5) General Logging
    6) DONE
See the discussion above to determine which modes to select. Select modes one at a time by number. When finished, select DONE. Once you have done this, your next step will be to select arena(s) to trace.

Warning
Once you select any modes and leave this menu, you cannot select additional modes, or change the ones already selected without rebooting.

Selecting Arenas in the Vmtrace Tool
Enter action desired [ 1- 5]> 2
Enter arena name (hit return for none) > M_TEMP
Enter allocation size to trace (zero for all supported sizes) >

Known Problems
In HP-UX 10.30, and HP-UX 11.00 without PHKL_17038, tracing for leaks can cause corruption on multi-processor systems. The workaround is to trace for both corruption and leaks, rather than tracing for leaks alone.
In releases prior to HP-UX 11.00, vmtrace does not work on machines which support superpages (i.e. Mohawk). On Mohawk machines, for both the 32bit kernel and 64bit kernel, you need to do the following through adb and reboot the kernel:
adb -w /stand/vmunix
kas_force_superpages_off?W 0x1
$q

In the HP-UX 10.20 Release (a.k.a. Davis), vmtrace could cause panics on MP systems. You need the patches PHKL_8376 (Series 700) or PHKL_8377 (Series 800) installed on the systems.
Getting Help
The HPUX kernel virtual memory (VM) group maintains vmtrace. The CHART project is hpux.kern.vm. The VM group's policy is to have designated owners of all VM code. The current list of VM owners can be found at http://integration.cup.hp.com/~cmather/ownership_table.html. At present (07/06/00) the owner of vmtrace is Arlie Stephens. She can be reached at:

Arlie Stephens ESTL Lab Hewlett-Packard Co., MS 47LA2 Cupertino, California 95014

Vmtrace for the arena memory allocator

Usage of the vmtrace utility on 11.11 and later releases is very similar to the usage on McKusic-based bucket allocator OS releases. The main difference is the use of arena names instead of bucket sizes. Let's simply start by looking at the vmtrace menus again:
# ./vmtrace_11.11
1) Start vmtrace (do this before selecting any arenas)
2) Select an arena to trace
3) Disable vmtrace (prevents additional memory consumption ONLY)
4) Reenable vmtrace
5) DONE
Enter action desired [ 1- 5]> 1

As we are told by the tool, we first start vmtrace, enter 1:


1) Lightweight Corruption
2) Standard Corruption
3) Heavyweight Corruption
4) Leak Detection
5) General Logging
6) DONE
Enter type of tracing (one per prompt) [ 1- 6]> 4

Here we are going for leak detection, as we would do for investigating a kernel memory leak.
Enter type of tracing (one per prompt) [ 1- 6]> 4
Enter type of tracing (one per prompt) [ 1- 6]> 6
Enabling vmtrace
1) Start vmtrace (do this before selecting any arenas)
2) Select an arena to trace
3) Disable vmtrace (prevents additional memory consumption ONLY)
4) Reenable vmtrace
5) DONE
Enter action desired [ 1- 5]>

Now we are going to select the arena we want to trace:


Enter action desired [ 1- 5]> 2
Enter arena name (hit return for none) > M_TEMP
Enter allocation size to trace (zero for all supported sizes) > 0
Enabling vmtrace on arena M_TEMP
Enter arena name (hit return for none) >
1) Start vmtrace (do this before selecting any arenas)
2) Select an arena to trace
3) Disable vmtrace (prevents additional memory consumption ONLY)
4) Reenable vmtrace
5) DONE
Enter action desired [ 1- 5]> 5
#

We have selected the M_TEMP arena and then left the vmtrace tool. Remember that you can now easily disable vmtrace without the need to reboot the system. Tracing and logging can later be re-enabled too. There is one caveat: tracing modes cannot be changed on the fly, you will need to reboot for this. So please make up your mind about exactly what to do before running vmtrace. Selection of the corruption detection modes is not that simple; deciding that a certain crash was caused by a corruption issue usually requires some amount of experience in reading dumps.

Tusc NOTE:

Tusc is not an official HP product, and therefore is not currently supported by HP.
It is available along with the other GR8 tools at ftp://hpchs.cup.hp.com/tools/11.X. If you intend on running the kitrace tool, GR8 advises running tusc first.
tusc works with HP-UX 11.* PA-RISC systems, and HP-UX 11i Version 1.5 (Itanium Processor Family) systems. It is not supported on HP-UX 10.20. tusc is a great program for working with Java. It gives you another view into the system activity in addition to Java stack traces, GlancePlus, and HPjmeter. tusc has many options which can be displayed with the command tusc -help.

Below you'll find a list of the available options, plus a few examples of using tusc for debugging and performance tuning. Below is the output from tusc -help:
Usage: tusc [-<options>] <command [args ...]> -OR- <pid [pid ...]>
    -a: show exec arguments
    -A: append to output file
    -b bsize: dump 'bsize' max bytes (-r/-w)
    -c: count syscalls instead of printing trace
    -d [+][!][fd | all]: select only syscalls using fd
    -e: show environment variables
    -E: show syscall entries
    -f: follow forks
    -F: show kernel's ttrace feature level
    -g: don't attach to members of my session
    -h: show state of all processes when idle
    -i: don't display interruptible syscalls
    -I start[/stop]: single-step and show instructions
    -k: keep alive (wait for *all* processes)
    -l: print lwpids
    -n: print process names

    -o [file|fd]: send trace output to file or fd
    -p: print pids
    -Q: be quiet about some warnings
    -r [!][fd | all]: dump read buffers
    -R: show syscall restarts
    -s [!]syscalls: [un]select these syscalls
    -S [!]signals: [un]select these signals
    -t: detach process if it becomes traced
    -T timestamp: print time stamps
    -u: print user thread IDs (pthreads)
    -v: verbose (some system calls only)
    -V: print version
    -w [!][fd | all]: dump write buffers
    -x: print raw (hex) arguments
    -z: only show failing syscalls

Here are a few examples of debugging and performance tuning information you can see with tusc.
thread.interrupt

The Thread.interrupt() call is implemented using SIGUSR1. Hence, if you see SIGUSR1 in the tusc output, the program must be making Thread.interrupt() calls. You can confirm this by making an -Xeprof trace and viewing the data with HPjmeter. It's not necessarily good or bad to use Thread.interrupt(), but you can monitor it with tusc, and it may be helpful information in various performance or correctness situations. Here is an example of a Thread.interrupt(). Threads are identified by their lwp id, shown in the second column. Thread 19729 interrupts thread 19731 with the signal.
1008628500.571138 {19731} write(1, "\n", 1) ...................... = 1
1008628500.571337 {19731} gettimeofday(0x6c258910, NULL) ......... = 0
1008628500.571444 {19731} clock_gettime(CLOCK_REALTIME, 0x6c258a40) = 0
1008628500.571625 {19729} _lwp_kill(19731, SIGUSR1) .............. = 0
1008628500.571737 {19731} Received signal 16, SIGUSR1, in ksleep(), [caught]
1008628500.571757 {19731} Siginfo: sent by pid 10468 (uid 565), si_errno: 0
1008628500.571939 {19731} ksleep(PTH_CONDVAR_OBJECT, 0x1fde70, 0x1fde78, 0x6c258908) = -EINTR
1008628500.572143 {19731} gettimeofday(0x6c258910, NULL) ......... = 0
1008628500.572258 {19801} ksleep(PTH_MUTEX_OBJECT, 0xaae8, 0xaaf0, NULL) = 0
1008628500.572438 {19731} clock_gettime(CLOCK_REALTIME, 0x6c258a40) = 0
1008628500.572522 {19801} kwakeup(PTH_CONDVAR_OBJECT, 0x309580, WAKEUP_ALL, 0x6b6c1848) = 0
1008628500.572611 {19802} ksleep(PTH_CONDVAR_OBJECT, 0x309580, 0x309588, 0x6b640908) = 0
1008628500.572704 {19729} kwakeup(PTH_MUTEX_OBJECT, 0xaae8, WAKEUP_ONE, 0x6c2d978c) = 0
1008628500.572800 {19778} sched_yield() .......................... = 0

Here we used -T "" and -l to show the timestamp in basic format and the lwp id. This time we happened to interrupt a thread sleeping on a pthread_cond_wait call. You can see how it wakes up with EINTR, which causes an InterruptedException in the Java program.
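For reference, here is a minimal Java sketch of the kind of code that produces this pattern (the class and variable names are illustrative only, not taken from the traced application): one thread blocks in Object.wait() and a second thread interrupts it, which on HP-UX shows up in tusc as the _lwp_kill(..., SIGUSR1) and the ksleep() returning -EINTR seen above.

public class InterruptDemo {
    private static final Object lock = new Object();

    public static void main(String[] args) throws Exception {
        Thread sleeper = new Thread(new Runnable() {
            public void run() {
                synchronized (lock) {
                    try {
                        // Blocks here; appears in tusc as ksleep(PTH_CONDVAR_OBJECT, ...)
                        lock.wait();
                    } catch (InterruptedException e) {
                        System.out.println("woken up by interrupt");
                    }
                }
            }
        });
        sleeper.start();
        Thread.sleep(1000);    // give the sleeper time to block
        sleeper.interrupt();   // appears in tusc as _lwp_kill(<lwp>, SIGUSR1) from the other thread
        sleeper.join();
    }
}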
implicit null pointer checks

The hotspot compiled code uses SIGSEGV and SIGBUS to implement implicit null pointer checks, which result in NullPointerExceptions in the Java application, for example when trying to perform a method dispatch while the "this" pointer is null. To a Java programmer it is not particularly important whether the exceptions come from interpreted or compiled code, but it is helpful to understand this for performance tuning. If SIGSEGV or SIGBUS appears in the output, the program must be throwing these exceptions from a frequently called method that has been compiled. The interpreter instead uses SIGFPE for its null pointer checks; if SIGFPE appears in the output, the program is causing these exceptions from interpreted methods. The JVM is designed to execute the normal, non-exception-throwing case as fast as possible, but the exception-throwing case is quite expensive, so to get good performance it is important to eliminate any extra exceptions caused by careless coding. You can use tusc to detect whether this is happening; if you have correct exception-handling routines in your program you might not notice it, but you would have lower overall performance than you could otherwise achieve.
read(24, "\0q \0\006\0\0\0", 8) .......................... = 8
send(42, "\0\0\01706e \0\0\098a5\0\098a5\0".., 23, 0) .... = 23
sigsetstatemask(0x17, NULL, 1135461680) .................. = 0
read(24, "\0\006020105\001\n\0\a\0810102c1".., 105) ...... = 105
Received signal 11, SIGSEGV, in user mode, [0xbaf215a2], partial siginfo
Siginfo: si_code: I_NONEXIST, faulting address: 0x8, si_errno: 0
PC: 0xb92eb3c7, instruction: 0x0eb01096
sigsetstatemask(0x17, NULL, 1135460976) .................. = 0
Received signal 8, SIGFPE, in user mode, [0xbaf215a2], partial siginfo
Siginfo: si_code: I_COND, faulting address: 0x1132ab, si_errno: 0
PC: 0x1132ab, instruction: 0x0a6024c0
send(24, "\00f\0\006\0\0\0\0\00314\00114", 15, 0) ........ = 15
Received signal 11, SIGSEGV, in user mode, [0xbaf215a2], partial siginfo
Siginfo: si_code: I_NONEXIST, faulting address: 0x8, si_errno: 0
PC: 0xb8d73d4b, instruction: 0x0cd01096
sigsetstatemask(0x17, NULL, 1135461104) .................. = 0

In this output, we are not showing the lwp id or timestamp. Here the program has thrown a couple of exceptions in a row. The SIGSEGV will result in NullPointerExceptions from hotspot-compiled code, and the SIGFPE will result in a NullPointerException from interpreted Java code. To get the best performance, avoid throwing exceptions whenever possible. You can measure the count of such exceptions happening at runtime with tusc, then use HPjmeter to determine where they are happening. Throwing an exception requires thousands of machine instructions, while it usually takes little effort up front to minimize them. To determine the source of the NullPointerExceptions, make an -Xeprof trace and view the data with HPjmeter, which has built-in features to examine exception handling.
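As a rough illustration of the coding issue described above, here is a minimal Java sketch (hypothetical method names, not from any traced application) contrasting a hot path that relies on catching NullPointerException with one that tests for null explicitly; only the first version generates the SIGSEGV or SIGFPE signals that tusc reveals.

public class NullCheckDemo {

    // Careless version: every null element throws and catches a
    // NullPointerException, costing thousands of machine instructions
    // (and a SIGSEGV or SIGFPE visible in tusc once the method is hot).
    static int lengthWithException(String s) {
        try {
            return s.length();
        } catch (NullPointerException e) {
            return 0;
        }
    }

    // Cheaper version: an explicit null check costs a couple of
    // instructions and never raises a signal.
    static int lengthWithCheck(String s) {
        return (s == null) ? 0 : s.length();
    }

    public static void main(String[] args) {
        String[] data = { "a", null, "bc", null };
        long total = 0;
        for (int i = 0; i < 1000000; i++) {
            // Swap in lengthWithCheck() to compare the two approaches under tusc.
            total += lengthWithException(data[i % data.length]);
        }
        System.out.println("total = " + total);
    }
}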
return values

Another great thing about tusc is that you can see the system call return values.
close(34) ................................................ = 0
send(20, "\00f\0\006\0\0\0\0\00314\00115", 15, 0) ........ = 15
sigsetstatemask(0x17, NULL, 1134403888) .................. = 0
read(6, 0x43852a68, 5) ................................... ERR#11 EAGAIN
sigsetstatemask(0x17, NULL, 1133867888) .................. = 0
poll(0x4a1af0, 6, -1) .................................... = 1

Here we can see that the read returned with EAGAIN; this kind of information can be useful when diagnosing various problems. Below we can see a new thread being created: first comes the mmap for the thread stack, then the _lwp_create, and finally sigaltstack() installs the signal stack on the new thread.
1008628500.575606 {19792} sched_yield() .......................... = 0
1008628500.575784 {19792} sched_yield() .......................... = 0
1008628500.575952 {19792} sched_yield() .......................... = 0
1008628500.576072 {19616} mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, 0, NULL) = 0x6a8a5000
1008628500.576197 {19792} sched_yield() .......................... = 0
1008628500.576312 {19616} mprotect(0x6a925000, 4096, PROT_NONE) .. = 0
1008628500.576424 {19792} sched_yield() .......................... = 0
1008628500.576588 {19792} sched_yield() .......................... = 0
1008628500.576753 {19616} _lwp_create(0x77ff1c00, LWP_DETACHED|LWP_INHERIT_SIGMASK|LWP_USER_TID, 0x471fd4, 0x77ff20d8) = 0 (19934)
1008628500.576948 {19934} _lwp_self() ............................ = 19934
1008628500.577179 {19792} sched_yield() .......................... = 0
1008628500.577274 {19616} kwakeup(PTH_MUTEX_OBJECT, 0xaae8, WAKEUP_ONE, 0x77ff1a8c) = 0
1008628500.577365 {19869} ksleep(PTH_MUTEX_OBJECT, 0xaae8, 0xaaf0, NULL) = 0
1008628500.577462 {19896} kwakeup(PTH_CONDVAR_OBJECT, 0x45ee20, WAKEUP_ALL, 0x6acad848) = 0
1008628500.577552 {19897} ksleep(PTH_CONDVAR_OBJECT, 0x45ee20, 0x45ee28, 0x6ac2c908) = 0
1008628500.577663 {19778} sched_yield() .......................... = 0
1008628500.577769 {19934} sigaltstack(0x6a8a50b8, NULL) .......... = 0
1008628500.577881 {19792} sched_yield() .......................... = 0
1008628500.578008 {19616} sched_yield() .......................... = 0

The new thread is 19934, and the first thing it does is call _lwp_self(). Remember that the lwp id is also shown in Java stack traces, and in GlancePlus in the Process Thread List window, so you can correlate the tusc data with other data sources.
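For comparison, here is a minimal Java sketch (illustrative names only) of the application-level code behind such a trace; since the HP-UX JVM backs each java.lang.Thread with a kernel lwp, a single start() call is enough to produce the mmap, _lwp_create, and sigaltstack sequence shown above.

public class ThreadStartDemo {
    public static void main(String[] args) throws Exception {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                // The lwp backing this thread is what shows up in tusc and GlancePlus.
                System.out.println("running in " + Thread.currentThread().getName());
            }
        });
        worker.start();   // mmap of the stack, _lwp_create and sigaltstack appear in tusc here
        worker.join();
    }
}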

This course was written by James Tonguet 1/20/04

References
Process Management Whitepaper
Online JFS 3.3 Guide
HP Process Resource Manager User's Guide
by Jan Weaver, Crisis Management Team Engineer: The HPUX Buffer Cache
by Mark Ray, GR8 Engineer: The HFS Inode Cache, The JFS Inode Cache
by Eric Caspole, Enterprise Java Lab: Tusc review
by Ute Kavanaugh, Response Center Engineer: Performance Optimized Page Sizing in HP-UX 11.0 White Paper (KBAN00000849)
by James Tonguet, Response Center Engineer: Introduction to Performance Tuning (UPERFKBAN00000726); Configuring Device Swap (DocId: KBAN00000218); How to configure device swap in VxVM (DocId: VXVMKBRC00005232); How to use adb to get system information (DocId: KBRC00004523); Using GlancePlus for CPU metrics (DocId: UPERFKBAN00001048)
by Markus Ostrowicki, GSE Engineer: The McKusick & Karels bucket allocator and The Arena Allocator, http://wtec.cup.hp.com/~hpux/crash/FirstPassWeb/PA/contents.htm
