RMF

z/OS
Resource Measurement Facility Performance Management Guide

V ersion 1 Release 11
SC33-7992-09
z/OS
Resource Measurement Facility Performance Management Guide

V ersion 1 Release 11
SC33-7992-09
Note Before using this information and the product it supports, be sure to read the general information under Notices on page 213.
| | |
This edition applies to Version 1 Release 11 of z/OS (5694-A01) and to all subsequent releases and modifications until otherwise indicated in new editions. This edition replaces SC33-799208. Order publications through your IBM representative or the IBM branch office serving your locality. Publications are not stocked at the address below. IBM welcomes your comments. A form for readers comments may be provided at the back of this publication, or you may address your comments to the following address: IBM Deutschland Research & Development GmbH Department 3248 Schnaicher Str. 220 D-71032 Bblingen Federal Republic of Germany If you prefer to send comments electronically, use one of the following methods: FAX (RMF Development): Your International Access Code +49+7031+16+4240 Internet: rmf@de.ibm.com Internet Visit our homepage at http://www.ibm.com/servers/eserver/zseries/zos/rmf/ If you would like a reply, be sure to include your name, address, telephone number, or FAX number. Make sure to include the following in your comment or note: v Title and order number of this book v Page number or topic related to your comment When you send information to IBM, you grant IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you. Copyright International Business Machines Corporation 1993, 2009. US Government Users Restricted Rights Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
Figures . . . . . . . . . . . . . . . v About this document . . . . . . . . vii
. vii . vii . vii . viii viii Monitor III Sysplex Summary report . . . . Monitor III Response Time Distribution report. Monitor III Work Manager Delays report . . Postprocessor Workload Activity report . . . Monitor III indicators . . . . . . . . . Identifying the major delay in a response time problem . . . . . . . . . . . . . . Using Monitor III reports . . . . . . . . Monitoring daily performance: System indicators . Summary of major system indicators . . . . . Using Monitor I reports . . . . . . . . Where do you go from here? . . . . . . . . . . . . . . . . . . 26 26 27 27 28 35 36 41 42 43 59
Who should use this document . . . . . . When to use this document . . . . . . . . How this document is organized . . . . . . The z/OS RMF library . . . . . . . . . Using LookAt to look up message explanations
Summary of changes . . . . . . . . . xi
| | | | |
Whats new in z/OS V1.11 . . . . . . . New real storage measurements . . . . Measuring WLMs promotion for workloads holding locks . . . . . . . . . . . Enhanced group capacity reporting . . . History of changes . . . . . . . . . . Whats new in z/OS V1R10 . . . . . . Whats new in z/OS V1R9 . . . . . . Whats new in z/OS V1R8 . . . . . . Whats new in z/OS V1R7 . . . . . . Whats new in z/OS V1R6 . . . . . . Whats new in z/OS V1R4 . . . . . . Whats new in z/OS V1R2 . . . . . . Whats new in z/OS V1R1 . . . . . . . . . . . . . . . . . . . . xi . xi xi xi xi xi . xii . xiii . xiv . xiv . xv . xv . xv . . . .
Chapter 3. Analyzing processor activity 61

Do you have a processor problem? . . . . . Monitor III indicators . . . . . . . . . Monitor I indicators . . . . . . . . . Is your processor idle due to other bottlenecks? DELAY for processor: Guidelines for improvement Determine CPU holders . . . . . . . . Review dispatching priorities . . . . . . Check for CPU delay due to other LPARs . . Check LPAR balancing using the CPC capacity report . . . . . . . . . . . . . . Processor USING: Trimming activities . . . . Check for loops . . . . . . . . . . . Check for slip traps. . . . . . . . . . Reduce I/O . . . . . . . . . . . . Decrease monitor overhead . . . . . . . Tune system address spaces . . . . . . . Tracing . . . . . . . . . . . . . . Calculate your capture ratio . . . . . . . Redesign application . . . . . . . . . Install more CPU power . . . . . . . . Summary . . . . . . . . . . . . . . . 62 . 62 . 68 72 73 . 73 . 74 . 75 . . . . . . . . . . . . 75 76 76 78 78 79 79 79 79 79 79 80
Chapter 1. Performance overview . . . . 1

Defining performance management . . . . . . Setting performance goals . . . . . . . . . Planning system capacity . . . . . . . . . Benefits of capacity planning . . . . . . . Common mistakes in capacity planning . . . Performance management versus capacity planning. . . . . . . . . . . . . . Using RMF . . . . . . . . . . . . . . What sampling cycle and reporting interval should you use? . . . . . . . . . . . Analyzing transaction response time . . . . . General formulas . . . . . . . . . . . GROUP report . . . . . . . . . . . . Analyzing workload characteristics . . . . . . Identifying workloads . . . . . . . . . Measuring resource utilization by workload . Using Monitor I reports to analyze workloads . Analyzing processor characteristics . . . . Analyzing workload I/O characteristics . . . Analyzing processor storage characteristics . . Processor speed indicators . . . . . . . Where do you go from here? . . . . . . . . . . . . 2 2 3 3 3
. 4 . 5 . 6 . 7 . 7 . 8 . 9 . 9 . 10 . 11 . 13 . 16 . 18 . 20 . 22
Chapter 4. Analyzing I/O activity . . . . 81

Do you have an I/O problem? . . . . . . . I/O subsystem health check . . . . . . . . Understand your business priorities . . . . Review CPU and processor storage . . . . Establish a baseline measurement . . . . . Improving I/O performance with the Enterprise Storage Server . . . . . . . . . . . . z/OS Parallel Sysplex I/O management . . . ESS performance features . . . . . . . . Cache performance . . . . . . . . . . . CACHE . . . . . . . . . . . . . DASD . . . . . . . . . . . . . . Cache management with ESS . . . . . . Cache Subsystem Activity report . . . . . Using Monitor III cache reporting . . . . DASD indicators and guidelines . . . . . . Response time components . . . . . . . Spreadsheet reports . . . . . . . . . . . . . . 82 84 84 84 84
Chapter 2. Diagnosing a problem: The first steps . . . . . . . . . . . . . 23

What is a performance problem? . . . . . . Getting started: A top-down approach to tuning . Where to start sysplex monitoring? . . . . .
Copyright IBM Corp. 1993, 2009
. 24 . 25 . 26
. 85 . 85 . 85 . 89 . 89 . 89 . 90 . 92 . 102 . 104 . 104 . 105
iii
General DASD guidelines . . . . . DASD performance in the sysplex . . Analyzing specific DASD problems . . Performance analysis on data set level . Improving your DASD performance. . Tape indicators and guidelines . . . . Identifying tape-bound jbs . . . . . RMF measurements for tape monitoring Response time components . . . . . General tape indicators . . . . . . Improving your tape performance . . Summary. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
107 117 118 122 124 126 126 126 127 128 129 130
Defining logical processors .
. 182
Appendix B. The Intelligent Resource Director . . . . . . . . . . . . . . 185

Overview. . . . . . . . . . . . . . Dynamic channel path management . . . . . Value of dynamic channel path management Reporting of dynamic channel path management . . . . . . . . . . . Channel subsystem priority queuing . . . . Value of channel subsystem priority queuing Reporting of channel subsystem priority queuing . . . . . . . . . . . . . LPAR CPU management . . . . . . . . LPAR cluster . . . . . . . . . . . Value of LPAR CPU management . . . . LPAR CPU management decision controls . . LPAR CPU management controls . . . . . Reporting of LPAR CPU management . . . . 186 . 186 187 . 187 . 189 190 . . . . . . . 190 190 190 192 192 193 193
Chapter 5. Analyzing processor storage activity . . . . . . . . . . 131

Do you have a processor storage problem? Monitor III indicators. . . . . . . Monitor I indicators . . . . . . . General storage recommendations . . . Increase storage . . . . . . . . Auxiliary storage tuning. . . . . . Prioritize access to processor storage . Tune DASD . . . . . . . . . . Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 133 138 144 144 144 144 144 144
Appendix C. Planning considerations for HiperDispatch mode . . . . . . . 197

Introduction to HiperDispatch mode . . How RMF supports HiperDispatch mode . Processing benefits . . . . . . . . . . . . . . . 197 . 198 . 199
Chapter 6. Analyzing sysplex activity

Understanding Workload Activity data for IMS or CICS . . . . . . . . . . . . . . . Interpreting the Workload Activity report . . Problem: Very large response time percentage Problem: Response time is zero . . . . . Problem: More executed transactions than ended transactions . . . . . . . . . Problem: Execution time is greater than response time . . . . . . . . . . . Problem: Large SWITCH percentage in CICS execution phase . . . . . . . . . . Problem: Decreased number of ended transaction with increased response times . . Analyzing coupling facility activity . . . . . Using the Postprocessor Coupling Facility Activity report . . . . . . . . . . . Spreadsheet report . . . . . . . . . Using the Monitor III online reports . . . . Factors affecting Coupling Facility performance
145
. 146 . 146 150 . 151 . 152 . 154 . 154 . 155 . 157 . 157 . 172 . 172 174
Appendix D. Other delays . . . . . . 201

Enqueue delays ENQ report . Major names HSM delays . . JES delays . . OPER delays . Unknown delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 202 202 204 205 205 205
Accessibility . . . . . . . . . . . . 207
Using assistive technologies . . . . . Keyboard navigation of the user interface . z/OS information . . . . . . . . . . . . . . . . 207 . 207 . 207
Glossary . . . . . . . . . . . . . 209 Notices . . . . . . . . . . . . . . 213

Policy for unsupported hardware. . Programming Interface Information . Trademarks . . . . . . . . . . . . . . . . . . . . . . 214 . 214 . 214
Appendix A. PR/SM LPAR considerations . . . . . . . . . . . 179

Understanding the Partition Data report . . . . 180
Index . . . . . . . . . . . . . . . 217
iv
z/OS V1R11.0 RMF Performance Management Guide
Figures
1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. End-to-End Response Time Components . . . 7 Monitor III Group Response Time Report 8 System Resource Summary by Workload 10 Monitor I CPU Activity Report - Part 1 11 Workload Activity Report - Service Class 12 Workload Activity Report - Service Policy 12 Processor Utilization by Workload . . . . . 13 I/O Activity by Workload. . . . . . . . 16 Workload Activity Report - I/O Activity 17 Processor Storage by Workload . . . . . . 18 Workload Activity Report - Processor Storage Use . . . . . . . . . . . . . . . 19 Simplified View of Performance Management 25 Monitor III Sysplex Summary Report . . . . 28 Monitor III Sysplex Summary Report - GO Mode . . . . . . . . . . . . . . 29 Monitor III Response Time Distribution report 32 Monitor III Work Manager Delays Report 34 End-to-End Response Time Components 35 Monitor III Group Response Time Report 37 Monitor III Storage Delays Report . . . . . 38 Monitor III Delay Report . . . . . . . . 38 Monitor III Job Delays Report . . . . . . 39 Monitor III Workflow/Exceptions report 40 Summary of Major System Indicators . . . . 42 Monitor I CPU Activity Report . . . . . . 44 CPU Contention . . . . . . . . . . . 46 Monitor I CPU Activity Report - Partition Data Report Section . . . . . . . . . . . 47 Monitor I Channel Path Activity report 48 Monitor I I/O Queuing Activity Report 49 Monitor I DASD Activity Report . . . . . 50 DASD Summary . . . . . . . . . . . 51 Response Time of Top-10 Volumes . . . . . 52 Paging Activity report - Swap Placement Activity . . . . . . . . . . . . . . 53 Monitor I Page/Swap Data Set Activity Report 54 Workload Activity report . . . . . . . . 55 Monitor I Virtual Storage Activity Report Common Storage Summary . . . . . . . 57 Monitor III Group Response Time Report 63 Monitor III Delay Report . . . . . . . . 64 Monitor III Workflow/Exceptions Report 65 Monitor III Processor Delays Report . . . . 66 Monitor I CPU Activity Report . . . . . . 68 CPU Contention Report . . . . . . . . 70 Monitor I CPU Activity Report - Partition Data Report Section . . . . . . . . . . . 71 Monitor III Job Delays Report . . . . . . 73 Monitor III Address Space State Data Report 74 Monitor III CPC Capacity Report . . . . . 75 Monitor III Processor Delays Report . . . . 76 Monitor III Job Delays Report . . . . . . 77 Monitor III Workflow/Exceptions Report 78 z/OS Traditional Device Serialization . . . . 86 50. Device Queuing in a Parallel Access Volume Environment . . . . . . . . . . . . 88 51. Cache Performance Management: The Key 89 52. Cache Subsystem Activity report - Summary 92 53. Cache Subsystem Activity Report - Top-20 Device Lists . . . . . . . . . . . . 93 54. Cache Subsystem Activity Report - Status and Overview . . . . . . . . . . . . . 94 55. Cache Subsystem Activity Report - Device Overview . . . . . . . . . . . . . 97 56. Cache Subsystem Activity report - RAID rank activity . . . . . . . . . . . . . . 97 57. Cache Subsystem Activity Report - Cache Device Activity (device-level reporting) . . . 98 58. Overview Report for Cached DASD Device 99 59. Cache Hits Overview Report . . . . . . 100 60. Cache Trend Report . . . . . . . . . 100 61. CACHSUM Report. . . . . . . . . . 102 62. CACHDET Report . . . . . . . . . . 103 63. DASD Response Time Components . . . . 104 64. DASD Summary Report . . . . . . . . 105 65. Activity of Top-10 Volumes . . . . . . . 106 66. Response Time of Top-10 Volumes . . . . 107 67. Direct Access Device Activity Report 108 68. Shared DASD Activity Report . . . . . . 117 69. Monitor III System Information Report 118 70. Monitor III Delay Report . . . . . . . . 119 71. Monitor III Device Delays Report . . . . . 120 72. Monitor III Device Resource Delays Report 121 73. Monitor III - Device Resource Delays Report 122 74. Monitor III - Data Set Delays by Volume 122 75. Monitor III - Data Set Delays by Job . . . . 123 76. Monitor III - Data Set Delays by Data Set 124 77. Magnetic Tape Device Activity Report 127 78. Monitor III System Information Report 133 79. Monitor III Delay Report. . . . . . . . 135 80. Monitor III Storage Delays Report . . . . 136 81. Paging Activity report - Central Storage Paging Rates . . . . . . . . . . . . 138 82. PAGING Report - Central Storage Movement Rates / Frame and Slot Counts . . . . . 140 83. Monitor I Page/Swap Data Set Activity Report . . . . . . . . . . . . . . 141 84. Direct Access Device Activity Report 142 85. Hotel Reservations Service Class . . . . . 146 86. Response Time Breakdown of CICSHR accessing DBCTL . . . . . . . . . . 147 87. Point-of-Sale Service Class . . . . . . . 148 88. Response Time Breakdown of CICSPS accessing DBCTL with IMS V5 . . . . . . 149 89. CICS User Transactions Service Class 149 90. Response Time Percentages Greater than 100 150 91. Response Time Percentages all Zero . . . . 152 92. Executed Transactions greater than Ended Transactions . . . . . . . . . . . . 153 93. Execution Time Greater than Response Time 154
94. Large SWITCH Percentage in a CICS Execution Environment . . . . . . . 95. Coupling Facility Activity Report - Usage Summary . . . . . . . . . . . . 96. Coupling Facility Structure Activity . . . 97. Coupling Facility Activity Report Subchannel Activity . . . . . . . . 98. Coupling Facility Activity Report - CF to CF Activity . . . . . . . . . . . . 99. Coupling Facility Structure Activity . . . 100. CFOVER Report . . . . . . . . . 101. CFACT Report . . . . . . . . . .
. 155 . 158 . 163 . 169 . . . . 171 172 173 174
102. 103. 104. 105. 106. 107. 108. 109. 110. 111. 112. 113.
CFSYS Report . . . . . . . CPU Activity Report . . . . . Partition Data Report . . . . . CHANNEL Report . . . . . . IOQUEUE Report . . . . . . LPAR Cluster - Example 1 . . . LPAR Cluster - Example 2 . . . LPAR Cluster Report . . . . . Partition Data Report . . . . . CPU Activity report . . . . . Monitor III Enqueue Delays Report HSM Report . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
174 180 181 188 189 191 191 193 195 198 202 204
vi
About this document

The z/OS Resource Measurement Facility (RMF) is a performance management tool that is designed to measure selected areas of system activity and present the data collected in the form of system management facility (SMF) records, formatted printed reports, or formatted display reports. Use RMF to report on resource usage, evaluate system performance and identify reasons for performance problems. This document offers a practical, task-oriented approach to analyzing performance issues using RMF. Covering the most frequently asked questions, it offers guidelines and suggestions concerning performance management of a z/OS sysplex.
Who should use this document

Anyone involved in understanding, measuring, and improving the performance of z/OS systems. This document assumes that the reader has a basic understanding of RMF and how z/OS systems work. The detailed description of all RMF functions and reports is available in z/OS RMF Users Guide and z/OS RMF Report Analysis.
When to use this document

Use this document when you are trying to improve the performance of your z/OS sysplex. Reasons could include: v You receive complaints from users v Service level objectives are not met v Ongoing performance monitoring showing adverse trends v Resource usage is changing v Furthering your own understanding Do not use this document when your efforts have taken you beyond the scope of RMF and z/OS performance. For example: v Network tuning v Detailed internal subsystem tuning for DB2. Refer to the DATABASE 2 Client/Server Administration Guide (SC09-1909) instead.
How this document is organized

This document handles the following topics: Chapter 1, Performance overview This topic describes performance concepts common to all z/OS systems. Chapter 2, Diagnosing a problem: The first steps This topic helps you analyze a performance problem to determine which resource is constrained, and thus where you should proceed with further analysis. Chapter 3, Analyzing processor activity Here you find information for further analysis of CPU constraints.
vii
Chapter 4, Analyzing I/O activity This information unit concentrates on the analysis of I/O constraints (Cache subsystem, DASD, and tape). Chapter 5, Analyzing processor storage activity The analysis of processor storage constraints (central storage and paging subsystem) will be explained in this information unit. Chapter 6, Analyzing sysplex activity This information unit covers key indicators for workload management and coupling facility performance aspects. Appendix A. PR/SM LPAR considerations Here you find performance considerations for systems running in LPAR mode. Appendix B. The Intelligent Resource Director This information unit describes the functions of the Intelligence Resource Director and its reporting in RMF. Appendix C. Planning considerations for HiperDispatch mode This information unit provides a CPU Activity report sample showing metrics for a system running in HiperDispatch mode. These specific metrics are explained and HiperDispatch processing benefits are discussed. Appendix D. Other delays This information unit discusses some delays shown in Monitor III that have not been covered before.
The z/OS RMF library

This table shows the full titles and order numbers of the books in the RMF library for z/OS.
Table 1. RMF Library Title z/OS RMF Users Guide z/OS RMF Report Analysis z/OS RMF Performance Management Guide z/OS RMF Messages and Codes z/OS RMF Programmers Guide z/OS RMF Reference Summary Order Number SC33-7990 SC33-7991 SC33-7992 SC33-7993 SC33-7994 SX33-9033
Softcopy documentation as part of the z/OS Collection (SK3T-4269 (CD-Rom) and SK3T-4271 (DVD))
Using LookAt to look up message explanations

LookAt is an online facility that lets you look up explanations for most of the IBM messages you encounter, as well as for some system abends and codes. Using LookAt to find information is faster than a conventional search because in most cases LookAt goes directly to the message explanation. You can use LookAt from these locations to find IBM message explanations for z/OS elements and features, z/VM, VSE/ESA, and Clusters for AIX and Linux:
viii
v The Internet. You can access IBM message explanations directly from the LookAt Web site at www.ibm.com/servers/eserver/zseries/zos/bkserv/lookat/. v Your z/OS TSO/E host system. You can install code on your z/OS systems to access IBM message explanations using LookAt from a TSO/E command line (for example: TSO/E prompt, ISPF, or z/OS UNIX System Services). v Your Microsoft Windows workstation. You can install LookAt directly from the z/OS Collection (SK3T-4269) or the z/OS and Software Products DVD Collection (SK3T-4271) and use it from the resulting Windows graphical user interface (GUI). The command prompt (also known as the DOS > command line) version can still be used from the directory in which you install the Windows version of LookAt. v Your wireless handheld device. You can use the LookAt Mobile Edition from www.ibm.com/servers/eserver/zseries/zos/bkserv/lookat/lookatm.html with a handheld device that has wireless access and an Internet browser. You can obtain code to install LookAt on your host system or Microsoft Windows workstation from: v A CD in the z/OS Collection (SK3T-4269). v The z/OS and Software Products DVD Collection (SK3T-4271). v The LookAt Web site (click Download and then select the platform, release, collection, and location that suit your needs). More information is available in the LOOKAT.ME files available during the download process.
About this document
ix
Summary of changes
This document contains information previously presented in z/OS RMF Performance Management Guide, SC33-7992-08. This document includes terminology, maintenance, and editorial changes. The following information describes the enhancements that are being distributed with z/OS Version 1 Release 11. All technical changes or additions to the text are indicated by a vertical line to the left of the change.
| | | | | | | | | | | | | | |
Whats new in z/OS V1.11 New real storage measurements

The Frame and Slot Counts section of the Postprocessor Paging Activity report is enhanced to provide measurements about real storage requests and page faults.
Measuring WLMs promotion for workloads holding locks

The Postprocessor Workload Activity report is enhanced to show the CPU time consumed for work units while they have been promoted by WLM to free a local suspend lock quicker.
Enhanced group capacity reporting

RMF enhances its reporting about capacity groups. The Partition Data Report displays the long term average capacity which is not used by members of the group but would be allowed by the defined limit. The Group Capacity Report displays a column CAPPING WLM% which indicates to what extent the partition is subject to capping. Column CAPPING ACT% is added which displays how capping really limited the usage of processor resources.
History of changes Whats new in z/OS V1R10

Summary of changes for SC33-7992-08 z/OS Version 1 Release 10
Support of HiperDispatch mode

Appendix C. Planning considerations for HiperDispatch mode on page 197 provides an introduction to HiperDispatch mode with RMF report examples.
64-Bit Common Storage measurements

The new Monitor III Storage Memory Objects report (STORM) contains measurement of 64-bit virtual common storage consumption per system and per job. Measurements of the usage of memory objects previously contained in the Monitor III Storage Resource Delays (STORR), Storage Delay Summary (STORS) and Storage Frames (STORF) reports are moved to this new report. Also, the following Postprocessor reports are enhanced to provide 64-bit virtual common storage usage information:
xi
v In the Frame and Slot Counts section of the Paging Activity report, a new Memory Objects and Frames section is added. v In the Private Area Detail section of the Virtual Storage Activity report, the section containing memory allocation values above 2GB is enhanced with new information about memory objects and 1 MB frames. In addition, RMF provides new overview conditions for the Postprocessor based on SMF record 71.
New reports on spin locks and suspend locks

RMF periodically checks all types of system resource locks and provides information about lock holders and lock contention in two new Monitor III reports: v Spin Lock Report v Suspend Lock Report
Support of WLM enqueue management

The Postprocessor Workload Activity report is enhanced to show the CPU times for work units that have been promoted by WLM for reasons of enqueue management and chronic resource management.
Providing DASD volume capacity

The Postprocessor Direct Access Device Activity report shows the volume capacity for DASD devices specified by the number of cylinders. In addition, RMF provides a new overview condition for the Postprocessor based on SMF record 74-1.
Whats new in z/OS V1R9

Measurements about WLMs blocked workload promotion

If the CPU utilization of a system is at 100%, workloads with low importance (low dispatch priority) might not get dispatched anymore. This could cause problems if the work holds a resource and by that holds up more important workloads. Therefore, any address space or enclave which has ready-to-run work units, but does not get CPU service within a certain time interval due to its low dispatch priority, will be temporarily promoted to a higher dispatch priority. RMF supports this function by reporting relevant measurements in the new Blocked Workload Promotion section of the Postprocessor CPU Activity report. The Postprocessor Workload Activity report provides the CPU time, that transactions of a certain service or report class were running at a promoted dispatching priority. In addition, RMF provides new overview conditions for the Postprocessor based on SMF record 72-3.
Enhanced monitoring of FICON performance

The average number of concurrently active I/O operations is provided in the LCU summary line of the Postprocessor I/O Queuing Activity report if at least one FICON channel is connected to the LCU.
xii
HyperPAV support
In HyperPAV mode, PAV-aliases are no longer statically bound to PAV-bases but are bound only for the duration of a single I/O operation. Thus, the number of aliases required for an LCU is reduced. This enables applications to achieve equal or better performance compared to the original PAV feature. RMF supports this new feature by extending the following reports: v The Postprocessor I/O Queuing Activity report shows information about HyperPAV activity. v Several device related and delay related reports show the average number of assigned HyperPAV-aliases.
New Coupling Facility measurements

RMF provides additional data in the Monitor III Coupling Facility reports and the Postprocessor Coupling Facility Activity report. For example, the reports display CF utilization per structure and whether dynamic CF dispatching is turned on. In addition, RMF provides new overview conditions for the Postprocessor based on SMF record 74-4.

Support of IBM System z9 Integrated Information Processors (zIIP)

RMF provides measurements about zIIP activity in various Postprocessor and Monitor III reports.
Terminology: RMF provides a consistent naming for zAAPs and zIIPs. The term IFA which was used for the zAAP in all RMF reports, help panels and manuals is now replaced by AAP. That is, AAP is used for the zAAP and IIP is used for the zIIP as field names.
Capacity groups
A new function of z/OS 1.8 allows customers to apply the defined capacity limit not only to a single LPAR, but also to a group of LPARs running on the same CEC, a so-called capacity group. RMF extends its CPU activity reporting with information about capacity groups. This allows to assess the capacity group definitions and their effect on the system. v A new Group Capacity section is available in the Postprocessor CPU Activity report. This section informs about the share each LPAR in the group can take, the guaranteed minimum MSU share of the LPARs and the overall MSU consumption within the group. v If a partition is a member of a capacity group, the Monitor III CPC Capacity report and the Partition Data Report section of the CPU Activity report include new header information showing the groups name and capacity limit.
Summary of changes
xiii
Summary of changes

Support for IBM System z9 Enterprise Class processors

On z900 and z990 processors, special purpose processors IFLs (Integrated Facility for Linux) and IFAs (Integrated Facility for Applications) are reported as ICFs (Internal Coupling Facility). Starting with IBM System z9 Enterprise Class (z9 EC) processors, IFLs, IFAs and ICFs are reported separately in the Postprocessor Partition Data Report and the Monitor III CPC Report.
Queue length distribution in the CPU Activity report

The presentation of the queue length distribution in the System Address Space Analysis section of the CPU Activity Postprocessor report is changed to reflect the higher number of CPUs available per MVS image.

Whats new in SC33-7992-04

Obsolete information refreshed or deleted: The former Appendix C. Data-in-Memory has been deleted and the Introduction section of Improving I/O performance with the Enterprise Storage Server on page 85 has been reworked. Also, various other minor changes have been implemented to reflect up-to-date information.
Whats new in SC33-7992-03

Support of the zSeries Application Assist Processor: z/OS V1R6 provides the ability to run Java applications on a new type of processor called the zSeries Application Assist Processor (zAAP). The zSeries Application Assist Processor is also known as an IFA (Integrated Facility for Applications). RMF uses the term IFA in the affected reports, messages, and in this and other publications. RMF provides measurements of IFA processor activity in the following reports: v Postprocessor: CPU Activity report and its Partition Data section Workload Activity report v Monitor III: CPC Capacity report Enclave report System information (SYSINFO) report This new functionality is available as SPE and needs to be installed as APAR OA05731.
xiv
Summary of changes

Note: An appendix with z/OS product accessibility information has been added.
State samples breakdown in the WLMGL report

Up to this release, state samples have been reported as a percentage of average transaction response time (response time breakdown). The response time is calculated when a transaction completes. This can result in percentages greater than 100 when samples are included for long running transactions which have not completed in the gathering interval (see also Understanding Workload Activity data for IMS or CICS on page 146). Percentages greater than 100 in the breakdown section are now avoided by showing the state values as percentages of the total transaction samples (state samples breakdown) instead of percentages of response time. This functionality is available as SPE and needs to be installed as APAR OW52227.

Software pricing online support
A new Monitor III CPC Capacity report is available which provides information about defined and consumed processor capacity of all partitions in a CPC (central processor complex). This report is available as SPE and needs to be installed as APAR OW49807.
Reporting of report class periods

Statistics on response times is a key performance metric used for groups of work having response time objectives. WLM report classes allow aggregation of performance data so that you can evaluate performance of applications independent of the definition how they are managed. Now, reporting of report classes is available with the same granularity as for service classes. This can be seen in the Postprocessor WLMGL report, and in several Monitor III reports: GROUP, STORS, SYSINFO, SYSRTD, SYSSUM, and SYSWKM.
Enhanced reporting for the coupling facilities

CF duplexing ensures high application availability in a Parallel Sysplex. The performance management aspects which have to be covered for CF duplexing are provided by RMF in the Postprocessor CF Activity report about new peer CF connectivity. This enables you to evaluate and monitor your CF configuration, and you can apply the necessary changes to tune accommodation of new structure instances resulting from system-managed duplexing.

The following information describes the enhancements that are being distributed with z/OS Version 1 Release 1. It is indicated by a vertical line to the left of the text. There is one release RMF 2.10 that can run in OS/390 2.10 as well as in z/OS V1R1. Nevertheless, some new functions are available only when z/OS is running on a zSeries 900 server.
Summary of changes
xv
Summary of changes
Intelligent Resource Director (IRD)

The Workload Manager is extended to work with PR/SM on zSeries 900 (z900) servers to dynamically expand resources that are available across LPARs. An LPAR cluster is defined as the set of logical partitions in a single CEC that belong to the same parallel sysplex. Based on business goals, WLM can direct PR/SM to enable or disable CP capacity for an LPAR, without human intervention. This combination of WLM working with PR/SM on a z900 server is called IRD. IRD is made up of three parts which work together to help increase your business productivity: v LPAR CPU Management Based on workload resource demand, the Workload Manager is able to dynamically adjust the number of logical processors and the weight of a logical partition. This allows the system to distribute the CPU resource in an LPAR cluster to partitions where the CPU demand is high. The dynamic adjustment of processor resources within the partitions is reflected in the Postprocessor CPU Activity (Partition Data) report which provides LPAR views as well as aggregated views on LPAR cluster level. v Dynamic Channel Path Management Dynamic Channel Path Management provides the capability to dynamically assign channels to control units in order to respond to peaks in demand for I/O channel bandwidth. This is possible by allowing you to define pools of so-called managed channels that are not related to a specific control unit. With the help of the Workload Manager, channels can float between control units to best service the work according to their goals and their importance. All channel and I/O queuing reports have been extended to differentiate static channels from floating channels. v Channel Subsystem Priority Queuing This topic is not reflected directly in any RMF report.
FICON director
RMF offers new reporting capabilities for the FICON director. Due to the different technology and implementation compared to ESCON, the new Postprocessor FICON Director Activity report will provide information about director and port activities. This will assist you in analyzing performance problems and in capacity planning.
Goal mode enhancements

The Workload Manager has implemented some enhancements to help increase production use of goal mode. They are v Classification by System Name v Separation of Production from Test CICS/IMS Regions v Address Space Storage Isolation v Pro-active CPU Protection This is done by new specifications (CPU protection and storage protection) in the service policy. RMF supports these enhancements in the WLMGL Postprocessor report and in several online reports: v Monitor II ARD/ARDJ report v Monitor III SYSWKM report v Monitor III DELAY report
xvi
Summary of changes
v Monitor III JOB report v Monitor III STORF report v Monitor III ENCLAVE report - Details
Summary of changes
xvii
Summary of changes
xviii
Chapter 1. Performance overview
Lets Start with an Overview This chapter contains an overview of key performance topics. It explains the concepts and terms we will be using throughout the book: v Performance Management Definition v Service Level Agreements v Capacity Planning Concepts v Using RMF v Components of Transaction Response Time v Analyzing Workload Characteristics v Processor v Processor Storage v I/O Activity v Processor Speed Indicators
Performance management
Defining performance management

Performance management means monitoring and allocating data processing resources to applications according to Service Level Agreements (SLA) or informal objectives. It involves an ongoing cycle of measuring, planning, and modifying.
The goal of performance management is to make the best use of your current resources, to meet your current objectives without excessive tuning effort.
Setting performance goals

The human view of the performance of a system is often subjective, emotional and difficult to manage to. However, meeting the business needs of the users is the reason the system exists. To match business needs with subjective perception, the concept of SLA was introduced. The SLA is a contract that objectively describes such measurables as: v Average transaction response time for network, I/O, CPU, or total v The distribution of these response times (for example, 90% TSO trivial at less than 0.2 of a second) v Transaction volumes v System availability A transaction is a business unit of work and can be a CICS end user interaction, a batch job, etc. Ideally, the definition of a transaction is from a users point of view.
Planning system capacity

We can define capacity planning as:
A process of planning for sufficient computer capacity in a cost-effective manner to meet the service needs for all users.
Capacity planning involves asking the following questions: v How much of your computer resources are being used? CPU Processor storage I/O Network v Which workloads are consuming the resources (workload distribution)? v What are the expected growth rates? v When will the demands on current resources impact service levels?
Benefits of capacity planning

An effective capacity planning process provides: v A mapping of business objectives (user requirements) into quantifiable information technology (IT) resources. v Management oriented reporting of service, resource usage and cost. In an objective way this makes clear what is involved in providing users with good performance. v Input for making business decisions which involve IT. v A way to avoid surprises.
Common mistakes in capacity planning

Things to avoid include: v Forgetting the concept of balanced systems. Remember that it takes a combination of resources to process work: CPU, processor storage, and I/O. Increasing the capacity of one resource by itself, may not allow more work to be processed. v Ignoring latent demand. Latent demand is work that is hidden and waiting to be unleashed. One way to begin quantifying latent demand is to note your peak-to-average ratio (e.g. peak hour CPU busy compared to prime-shift average CPU busy). This ratio will drop as latent demand builds up. Track it and compare it to your current system. Relieving any bottleneck (for example, upgrading constrained CPU, increasing network bandwidth, or improving I/O subsystem) will release latent demand. v Being out of touch with business forecasts. This one is obvious, but can be difficult to avoid. The capacity planner needs to stay informed of business plans that will impact IT resource requirements (for example, departmental growth, new applications, and mergers). v Ignoring complexity. Factors other than transaction growth will increase demand on IT resources. Enhancements to existing applications, new government regulations, growing databases, etc. - all can result in an increase in resource consumption. This means that the same transaction running today may require 10-15% more resource than it did last year. v Striving for total precision. Many variables will impact your capacity plan (growth projections, resource capacity estimates, measurement precision,
politics). A plan which achieves an accuracy of 10% is considered good. Your time is probably better spent on reasonability checks on your input than on trying to achieve total accuracy.
Performance management versus capacity planning

These two disciplines have much in common. Performance management concentrates on allocating existing resources to meet service objectives. Capacity planning is a means of predicting resources needed to meet future service objectives. Similar approaches can be used for both; here are some examples: v Rules of thumb v Comparison with other systems v Parametric methods Transaction profile (10 read calls, 2 update calls, 8 physical I/Os) Cost of function (CICS: 15ms per physical I/O in a 3390 non-cached) v Analytic (queuing theory) models: For example the IBM capacity planning tool CP90. v Simulation, using a computer program that has the essential logic of the system: for example the Snap/Shot modelling system from IBM. v Benchmarks, Teleprocessing Network Simulator (TPNS). In addition, performance and capacity work tend to require similar input data from the system. In particular, both require resource consumption by workload data (see Analyzing workload characteristics on page 9).
Using RMF
RMF issues reports about performance problems as they occur, so that your installation can take action before the problems become critical. Your installation can use RMF to: v Determine that your system is running smoothly v Detect system bottlenecks caused by contention for resources v Evaluate the service your installation provides to different groups of users v Identify the workload delayed and the reason for the delay v Monitor system failures, system stalls, and failures of selected applications RMF comes with three monitors, Monitor I, II and III. Monitor III with its ability to determine the cause of delay is where we start: Monitor III provides short term data collection and online reports for continuous monitoring of system status and solving performance problems. Monitor III is a good place to begin system tuning. It allows the system tuner to distinguish between delays for important jobs and delays for jobs that are not as important to overall system performance. Monitor I provides long term data collection for system workload and resource utilization. The Monitor I session is continuous, and measures various areas of system activity over a long period of time. You can get Monitor I reports directly as real-time reports for each completed interval (single-system reports only), or you let run the Postprocessor to create the reports, either as single-system or as sysplex reports. Many installations produce daily reports of RMF data for ongoing performance management. In this publication, sometimes a report is called a Monitor I report (for example, the Workload Activity report) although it can be created only by the Postprocessor. Monitor II provides online measurements on demand for use in solving immediate problems. A Monitor II session can be regarded as a snapshot session. Unlike the continuous Monitor I session, a Monitor II session generates a requested report from a single data sample. Since Monitor II is an ISPF application, you might use Monitor II and Monitor III simultaneously in split-screen mode to get different views of the performance of your system. In addition, you can use the Spreadsheet Reporter for further processing the measurement data on a workstation by help of spreadsheet applications. The following chapters provide sample reports including the name of the corresponding macro. You find a detailed description on how to create the reports and records and on how to use the macros in the z/OS RMF Users Guide. There is another function in RMF to exploit the workstation for monitoring and analyzing your system, called PMOS390; (RMF PM). You find a detailed description on how to use this function in the z/OS RMF Users Guide.
This book will discuss key RMF indicators from the different monitors that can be used in a daily report and in problem diagnosis.
What sampling cycle and reporting interval should you use?

For Monitor III: Use the default sampling cycle of 1 second, with the default of 100 seconds for the reporting interval. Adjust this interval if needed to match a problem occurrence that you are investigating. For Monitor I: Again use the default sampling cycle of 1 second. For the reporting interval, the default of 30 minutes is fine to start with. Adjust this if you prefer, or if you need to home in on a problem.
Analyzing transaction response time

To characterize the performance of a transaction you need to understand its different response time components.
User Response Time

Mount Delay Operator Delay
Network Server Response
CPU
Storage I/O Enqueue

Others
T service Twait
Figure 1. End-to-End Response Time Components
The reasons you should care about this are: v For capacity planning you need to know resource consumption. v For performance management you need to break down response time into components, to see where tuning can be done.
General formulas
Response time is made up of service time (the time actual work is done) and waiting time (the time waiting for resource): Tr = Ts + Tw v Tr is Response Time v Ts is Service Time v Tw is Waiting Time (that is Queue time or Delay in RMF Monitor III). Similarly total transaction service and wait times are made up of the individual resource service and wait times. Ts = Ts(CPU) + Ts(I/O) + Ts(TP) + Ts(Other) Tw = Tw(CPU) + Tw(I/O) + Tw(TP) + Tw(Storage) + Tw(Other) The Monitor I Workload Activity report shows some of this data for a group of transactions (service or report class). This applies for TSO and batch work (and potentially CICS, with interval recording). The Monitor III Group Response Time report also shows this data, with a more useful breakdown of response time components (see Figure 2 on page 8).
GROUP report
RMF V1R11 Command ===> Samples: 100 System: PRD1 Date: 04/07/09 Time: 10.32.00 Range: 100 Sec Group Response Time
Class: TSOPROD Period: 1 Description: TSO Production Primary Response Time Component: Storage delay for local paging TRANS Ended Rate 0.2 --- Response Time ----- Ended TRANS-(Sec) WAIT EXECUT ACTUAL 0.000 3.432 3.432
WFL % 50
Users TOT ACT 2 1
Frames %ACT 5
Vector UTIL 0 -AVG PROC 0.10 0.57
EXCP Rate 3.2 USGDEV 0.17 0.97
PgIn Rate 4.1
Total Average Users 0.600 Response Time ACT 3.432
-------------Average Delay-------------PROC DEV STOR SUBS OPER ENQ OTHER 0.02 0.03 0.22 0.00 0.00 0.00 0.06 0.11 0.17 1.25 0.00 0.00 0.00 0.34
Average Users Response Time ACT
---STOR Delay--- ---OUTR Swap Reason--- ---SUBS Delay--Page Swap OUTR TI TO LW XS JES HSM XCF 0.19 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.08 0.17 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Figure 2. Monitor III Group Response Time Report
Report Analysis See Response Time ACT field for a breakdown of the components of response time. Think of USING as service time (Ts) and DELAY as wait time (Tw) v AVG USG: 0.57 + 0.97 = 1.54 sec v Average Delay: 0.11 + 0.17 + 1.25 + 0.34 = 1.87 sec So here, if you were investigating a response time problem, STOR would be the best starting point. Monitor III provides insight into the components of response. With an emphasis on causes of delay, it shows you the resources for which work is being delayed and the address spaces which are holding the resources.
Workload characteristics
Analyzing workload characteristics

Much of performance work is done, not at the individual transaction level, but at a larger workload level. Understanding resource requirements by workload is key to effective performance management. The reasons you should analyze your workload are: v To understand your systems behavior v For setting your SLA v For tuning Where is the pain? Interactions with other workloads v For input to the capacity planning process Workload growth projections Processor requirements Storage requirements DASD requirements Network requirements
Identifying workloads
You will need to identify the different types of work in your system. Usually this is done at a service class level. To assign work to service classes you must classify the various workloads by their unique characteristics and requirements, that is: v Response time needs v Resource consumption (CPU, Storage, I/O) v Priority v Anticipated growth Examples of workload identification at the service class level include: v Trivial TSO (first period) v Non-trivial TSO v Batch v Production CICS v Development CICS v IMS v Graphics application Ideally, you may want to take workload differentiation one level further, matching workloads to true business functions (for example, claims processing and order fulfilment). This may require more detailed data from SMF records. RMF reports data about different workloads grouped into categories which you have defined as workloads and service classes in your service policy. The appropriate grouping of workloads is important. v If you have different applications which should be managed according to the same goals, you should define the same service class for them. Applications with different goals need to be assigned to different service classes.
v If you want to get separate reporting for different applications in the same service class, you can define separate report classes for each of them. Reporting for report classes is possible with the same level of detail (report class period) as for service classes.
Measuring resource utilization by workload
System Resource By Workload
CPU
Storage I/O
Figure 3. System Resource Summary by Workload
This figure illustrates one way to report your resource utilization by workload. You can use the Spreadsheet Reporter to create spreadsheet files from Postprocessor data and to display them graphically. The following chapters show how to use the sample macros for creating the graphics that help you in understanding and analyzing the performance of your system.
10
In a similar fashion (could be graphical or numeric) you will need to record such data to: v Measure resource consumption: CPU utilization Processor storage usage (central storage (CS) and expanded storage (ES)) I/O rates v Understand what makes up response time: Waiting for and using the resources Use Monitor III Group Response Time report v Understand the factors that influence the above Virtual storage is another type of resource. While not as dynamic as I/O, CPU, or processor storage, virtual storage is one of the basic resources that you need to plan, even though you cannot buy it. Virtual storage can have a critical impact on delivery of service to end users: an initial program load (IPL) is required when not enough virtual storage is available. RMF can measure virtual storage. It should be monitored, and the Virtual Storage report helps you verify the size of the different MVS components, as explained in Virtual Storage Activity report on page 57.
Using Monitor I reports to analyze workloads

The CPU Activity report and the Workload Activity report provide the data you can use for a workload analysis.
CPU Activity report - part 1

C P U z/OS V1R11 SYSTEM ID OS04 RPT VERSION V1R11 RMF H/W MODEL E26 A C T I V I T Y PAGE DATE 03/14/2009 TIME 09.29.00 INTERVAL 14.59.996 CYCLE 1.000 SECONDS 1
CPU
2097
MODEL
720
SEQUENCE CODE 00000000000473BF LOG PROC SHARE % 35.0 35.0 35.0 35.0 35.0 35.0 35.0 35.0 35.0 315.0
HIPERDISPATCH=NO --I/O INTERRUPTS-RATE % VIA TPI 320.9 324.0 321.6 329.5 325.9 328.7 338.4 341.1 335.9 2966 0.33 0.22 0.22 0.25 0.33 0.36 0.46 0.51 0.67
---CPU--NUM TYPE
---------------- TIME % ---------------ONLINE LPAR BUSY MVS BUSY PARKED 76.93 76.85 76.86 76.87 76.89 76.87 76.88 76.85 76.82 76.87 99.15 99.13 99.09 98.99 99.01 98.88 98.86 98.73 98.59 98.94 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0 CP 100.00 1 CP 100.00 2 CP 100.00 3 CP 100.00 4 CP 100.00 5 CP 100.00 6 CP 100.00 7 CP 100.00 8 CP 100.00 CP TOTAL/AVERAGE
Figure 4. Monitor I CPU Activity Report - Part 1
The first part of this report provides an overview on all processors belonging to this system.
11
Workload Activity report

W O R K L O A D z/OS V1R11 SYSPLEX SVPLEX3 RPT VERSION V1R11 RMF A C T I V I T Y PAGE DATE 11/28/2009 TIME 12.00.00 INTERVAL 14.59.995 MODE = GOAL 1
POLICY ACTIVATION DATE/TIME 11/03/2008 10.12.11 - WORKLOAD & SERVICE CLASS PERIODS ------------------------------------------------------------------------------------------------------------ SERVICE CLASS(ES) REPORT BY: POLICY=BASEPOL WORKLOAD=STC_WLD SERVICE CLASS=STCHIGH RESOURCE GROUP=*NONE CRITICAL =NONE DESCRIPTION =High priority for STC workloads ---SERVICE--IOC 250903 CPU 2946K MSO 0 SRB 7466 TOT 3204K /SEC 1780 ABSRPTN TRX SERV PERF INDX 4.0 0.3 --TRANSACTIONS--NUMBER-%35 80 9 20 44 100 15K 15K -EX VEL%GOAL ACT 100 30 100 TOTAL USING% 0.0 0.2 SERVICE TIME CPU 14.636 SRB 0.032 RCT 0.024 IIT 0.017 HST 0.000 AAP 0.000 IIP 0.000 ---APPL CP AAPCP IIPCP AAP IIP %--- --PROMOTED-0.82 BLK 0.000 0.00 ENQ 0.000 0.00 CRM 0.000 LCK 0.000 0.00 0.00 ----STORAGE---AVG 679.09 TOTAL 81.44 SHARED 0.00 -PAGE-IN RATESSINGLE 0.0 BLOCK 0.0 SHARED 0.0 HSP 0.0
-TRANSACTIONSAVG 0.12 MPL 0.12 ENDED 44 END/S 0.02 #SWAPS 200 EXCTD 0 AVG ENC 0.00 REM ENC 0.00 MS ENC 0.00 PER IMPORTANCE 1 2 TOTAL 2 4
TRANS-TIME HHH.MM.SS.TTT ACTUAL 5.341 EXECUTION 5.341 QUEUED 0 R/S AFFIN 0 INELIGIBLE 0 CONVERSION 0 STD DEV 7.699
--DASD I/O-SSCHRT 0.1 RESP 2.7 CONN 2.5 DISC 0.0 Q+PEND 0.1 IOSQ 0.0
-------------RESPONSE TIME-------------------GOAL------ ---ACTUAL--TOTAL 00.00.01.000 85% 14.3% 11.4%
-EXE-DELAY% 0.0 0.0
Figure 5. Workload Activity Report - Service Class

W O R K L O A D z/OS V1R11 SYSPLEX SVPLEX3 RPT VERSION V1R11 RMF A C T I V I T Y PAGE DATE 11/28/2009 TIME 12.00.00 INTERVAL 14.59.995 MODE = GOAL 1
POLICY ACTIVATION DATE/TIME 11/03/2008 10.12.11 ------------------------------------------------------------------------------------------------------------ SERVICE POLICY REPORT BY: POLICY=BASEPOL Base Policy -TRANSACTIONSAVG 203.75 MPL 203.70 ENDED 15298 END/S 17.00 #SWAPS 6378 EXCTD 6937 AVG ENC 0.00 REM ENC 0.00 MS ENC 0.00 TRANS-TIME HHH.MM.SS.TTT ACTUAL 16.29.483 EXECUTION 3.092 QUEUED 15.35.046 R/S AFFIN 0 INELIGIBLE 52.969 CONVERSION 4 STD DEV 3.48.42.725 --DASD I/O-SSCHRT 1955 RESP 13.8 CONN 3.0 DISC 9.5 Q+PEND 0.6 IOSQ 0.7 ---SERVICE--IOC 17400K CPU 386675K MSO 19213K SRB 29845K TOT 453133K /SEC 503690 ABSRPTN 2473 TRX SERV 2472 SERVICE TIME CPU 5333.700 SRB 411.700 RCT 5.400 IIT 44.700 HST 0.200 AAP 0.000 IIP 0.000 ---APPL %--CP 644.20 AAPCP 0.00 IIPCP 0.00 AAP IIP 0.00 0.00 --PROMOTED-- ----STORAGE---BLK 0.000 AVG 4711.96 ENQ 0.000 TOTAL 959830 CRM 0.000 SHARED 584.45 LCK 0.000 -PAGE-IN RATESSINGLE 0.0 BLOCK 0.0 SHARED 0.0 HSP 0.0
Figure 6. Workload Activity Report - Service Policy
The Workload Activity report shows workload-related data on resource consumption with different levels of detail. The above examples summarize data for one service class (STCHIGH) and for the total system (service policy).
12
Analyzing processor characteristics
Figure 7. Processor Utilization by Workload. Spreadsheet Reporter macro RMFY9WKL.XLS (Workload Trend Report) Day Utilization
This information unit discusses how to measure CPU utilization by workload. Our approach is to convert application time for each service class into overall percent of CPU used for that workload. The total of all these numbers is the base for the APPL% CP value in the report. The concept of capture ratio (CR) must be understood in order to do this. Simply put, the CPU time reported for the sum of all your workloads never adds up to the total CPU time used by the system. The total CPU time reported for all your workloads will typically account for 85-90% of the total CPU time used. This is not an error. It is the best level of accuracy that the reporting tools can achieve, in a consistent, repeatable manner. The uncaptured time is sometimes, misleadingly, called system overhead. This is incorrect, because most of it is genuine user work it is either a political or philosophical view whether activities like paging are seen as genuine work or just overhead. Thus the capture ratio is:
CR = Captured CPU Time ----------------Total CPU Time
The question now is: how do you account for the uncaptured, though real, CPU time? The answer is to distribute the uncaptured time among your workloads, so that the total CPU time is accounted for. Several approaches how to do this are discussed.
13
How to compute CPU time per workload type

Following are the basic steps for computing CPU time by workload. A graphical view of CPU consumption over time (for example, by hour or by day) can help you to understand your workload flow, as shown in Figure 7 on page 13 as an example of a Spreadsheet Reporter macro. 1. Calculate total CPU time in seconds:
Total CPU time = MVS Busy % * Interval * #CPs
If your system is running in an PR/SM environment, the calculation has to be performed using LPAR BUSY and the number of logical processors. 2. Get the captured time for all address spaces. The Workload Activity report shows the percentage of application time. It is the sum of task control block and preemptible-class SRB (CPU) time, non-preemptible service request block (SRB) time, region control task (RCT) time, I/O interrupt (IIT) time, and hiperspace service (HST) time. Subtract zAAP and zIIP time, because it is contained in CPU time.
(CPU + SRB + RCT + IIT + HST - AAP - IIP) APPL% CP = ----------------------------------------- * 100 Interval Captured time = (APPL% CP / 100) * Interval
This results in the uncaptured CPU time:

Uncaptured CPU time = Total CPU time - Captured time
3. Calculate the capture ratio:

CR = Captured CPU Time ----------------Total CPU Time
4. Get application (captured) percentage for one workload you are interested in: This APPL% CP value represents the percentage based on one processor. You get the percentage based on the total capacity by:
APPL% CP Total Captured % = -------#CPs
5. Distribute uncaptured time based on the capture ratio:

Total Captured % Total % = ---------------CR
There are more ways to distribute the uncaptured time, see Methods of distributing uncaptured time on page 15 for a discussion of this topic.
Report Analysis Figure 4, Figure 5, and Figure 6 give us the following values for standard processors: v Total CPU time = 0.7687 900 9 = 6226.47 sec v Captured time (POLICY) = 6.442 900 = 5797.8 sec v Uncaptured time = 6226.47 - 5797.8 = 428.67 sec v Capture ratio = 5797.8 / 6226.47 = 0.93 v DB2 (Service Class SCONL006) = 64.40 % of one processor Figure 7 is a diagram being created with the Spreadsheet Reporter, it shows the workload utilization for different service classes during one day. The macro provides different ways to distribute the uncaptured time.
14
Methods of distributing uncaptured time

To get a full picture of CPU consumption, the uncaptured time needs to be distributed across the workloads. There are several methods of doing this: 1. Not bad: By workload APPL time - as done in the previous calculation. 2. Better: By workload I/O rate (% of system total) 3. Best: a. Distribute uncaptured time by I/O rate, AND... b. Distribute system address space (AS) time to relevant users, e.g. v VTAM time: to TSO and CICS v JES time: to TSO and BATCH v etc. 4. Probably not worth the effort: Regression analysis, etc.
How to compute CPU time per transaction

There are times when additional detail of CPU utilization is needed. For capacity planning especially, you may need to know the average CPU time consumed per transaction. This is a simple calculation, dividing application time by number of transactions.
Report Analysis Figure 5 on page 12 gives us the following values for DB2 (Service Class SCONL006): v ENDED TRANSACTIONS: 7288 v APPL% CP * INTERVAL : 581.4 sec (Application time) This results in
581400 CPU time per transaction = ------- = 80 msec 7288
15
Analyzing workload I/O characteristics

It can also be useful to know the I/O rate each different workload generates. This is especially useful in capacity planning, but also for performance management - a change in workload I/O rates (or relative I/O content) can significantly change its performance. Base for the following graphic are these OVW statements:
OVW(IOBATCH(SSCHRT(W.WLBATCH))) OVW(IOINFRA(SSCHRT(W.WLINFRA))) OVW(IOONLIN(SSCHRT(W.WLONLINE))) OVW(IOSYSTK(SSCHRT(W.WLSYSTEM))) OVW(IOSYSTM(SSCHRT(W.SYSTEM)))
Figure 8. I/O Activity by Workload
16
SERVICE CLASS=SCONL006 RESOURCE GROUP=*NONE CRITICAL =NONE DESCRIPTION =DB2 Production SERVICE TIME CPU 581.600 SRB 0.000 RCT 0.000 IIT 0.000 HST 0.000 AAP 0.000 IIP 0.000 ---APPL %--- --PROMOTED-- ----STORAGE---CP 64.40 BLK 0.000 AVG 0.00 AAPCP 0.00 ENQ 0.000 TOTAL 0.00 IIPCP 0.00 CRM 0.000 SHARED 0.00 LCK 0.000 AAP 0.00 -PAGE-IN RATESIIP 0.00 SINGLE 0.0 BLOCK 0.0 SHARED 0.0 HSP 0.0
REPORT BY: POLICY=SP000010
WORKLOAD=WLONLINE
-TRANSACTIONSAVG 28.23 MPL 28.23 ENDED 7288 END/SEC 8.10 #SWAPS 0 EXCTD 0 AVG ENC 0.00 REM ENC 0.00 MS ENC 0.00
TRANS-TIME HHH.MM.SS.TTT --DASD I/O-- ---SERVICE--ACTUAL 769 SSCHRT 41.4 IOC 0 EXECUTION 768 RESP 2.9 CPU 42164K QUEUED 0 CONN 2.4 MSO 0 R/S AFFIN 0 DISC 0.0 SRB 0 INELIGIBLE 0 Q+PEND 0.5 TOT 42164K CONVERSION 0 IOSQ 0.0 /SEC 46869 STD DEV .918 ABSRPTN 1660 TRX SERV 1660
REPORT BY: POLICY=SP000010 Default Policy -TRANSACTIONSAVG 203.75 MPL 203.70 ENDED 15298 END/SEC 17.00 #SWAPS 6378 EXCTD 6937 AVG ENC 0.00 REM ENC 0.00 MS ENC 0.00 TRANS-TIME HHH.MM.SS.TTT --DASD I/O-- ---SERVICE--ACTUAL 16.29.483 SSCHRT 1955 IOC 17400K EXECUTION 3.092 RESP 13.8 CPU 386675K QUEUED 15.35.046 CONN 3.0 MSO 19213K R/S AFFIN 0 DISC 9.5 SRB 29845K INELIGIBLE 52.969 Q+PEND 0.6 TOT 453133K CONVERSION 4 IOSQ 0.7 /SEC 503690 STD DEV 3.48.42.725 ABSRPTN 2473 TRX SERV 2472 SERVICE TIME CPU 5333.700 SRB 411.700 RCT 5.400 IIT 44.700 HST 0.200 AAP 0.000 IIP 0.000 ---APPL %--- --PROMOTED-- ----STORAGE---CP 644.20 BLK 0.000 AVG 4711.96 AAPCP 0.00 ENQ 0.000 TOTAL 959830 IIPCP 0.00 CRM 0.000 SHARED 584.45 LCK 0.000 AAP 0.00 -PAGE-IN RATESIIP 0.00 SINGLE 0.0 BLOCK 0.0 SHARED 0.0 HSP 0.0
Figure 9. Workload Activity Report - I/O Activity
The Workload Activity report directly shows the I/O activities by workload in the value SSCHRT. This is the number of start subchannels (SSCH) per second and gives the number of DASD non-paging I/Os.
Report Analysis We see the following numbers in the sample report:

Workload DB2 Production Policy - Total System SSCH/second 41.4 1955
17
Analyzing processor storage characteristics

As with CPU and I/O, it is useful to build a clear picture of processor storage use on your system. The Workload Activity report can give you a good start on the processor storage use by workload, at the workload level. The Spreadsheet Reporter does not contain a macro for displaying storage utilization by workload. But you can create very easily a graphic as shown in the following example. Assuming that you have the workloads WLBATCH, WLINFRA, WLONLINE, and WLSYSTEM, you can use the following OVW statements to create the required spreadsheet data:
OVW(TOTBATCH(STOTOT(W.WLBATCH))) OVW(TOTINFRA(STOTOT(W.WLINFRA))) OVW(TOTONLIN(STOTOT(W.WLONLINE))) OVW(TOTSYSTK(STOTOT(W.WLSYSTEM))) OVW(TOTSYSTM(STOTOT(W.SYSTEM)))
Figure 10. Processor Storage by Workload
Of course, you can display this in more detail by using other OVW conditions, for example STOCEN and STOEXP instead of STOTOT. You find a detailed description of all OVW conditions in chapter Overview and Exception Conditions in the z/OS RMF Users Guide.
18

REPORT BY: POLICY=SP000010 WORKLOAD=OMVS_WLD SERVICE CLASS=OMVSLOW RESOURCE GROUP=*NONE CRITICAL =NONE DESCRIPTION =Low priority for OMVS workloads SERVICE TIME CPU 3.736 SRB 0.086 RCT 0.119 IIT 0.023 HST 0.000 AAP 0.000 IIP 0.003 ---APPL CP AAPCP IIPCP AAP IIP %--- --PROMOTED-- ----STORAGE---0.44 BLK 0.000 AVG 2068.48 0.00 ENQ 0.000 TOTAL 7185.56 0.00 CRM 0.000 SHARED 1.27 LCK 0.000 0.00 -PAGE-IN RATES0.00 SINGLE 0.0 BLOCK 0.0 SHARED 0.0 HSP 0.0
-TRANSACTIONSAVG 3.47 MPL> 3.47 ENDED 135 END/S 0.15 #SWAPS 306 EXCTD 0 AVG ENC 0.00 REM ENC 0.00 MS ENC 0.00
TRANS-TIME HHH.MM.SS.TTT --DASD I/O-- ---SERVICE--ACTUAL 4.912 SSCHRT 1.4 IOC 345125K EXECUTION 4.911 RESP 70.0 CPU 6951K QUEUED 1 CONN 4.3 MSO 6470 R/S AFFIN 0 DISC 0.9 SRB 163 INELIGIBLE 0 Q+PEND 64.9 TOT 352083K CONVERSION 0 IOSQ 0.0 /SEC 391203 STD DEV 11.041 ABSRPTN 113K TRX SERV 113K
Figure 11. Workload Activity Report - Processor Storage Use
Processor Storage Use is given in the fields: AVG TOTAL # of processor frames on average per swapped-in AS in the group. # of processor frames on average for all AS in the group.
TOTAL = AVG * MPL
If your system is running on a zSeries processor, all fields in the RMF reports which are related to expanded storage will be empty. Expanded storage is no longer supported in the zArchitecture. This workload use of processor storage will account for most of the processor storage on your system. To account for the rest, you should also look at: v System area use of storage (see Monitor I Paging Activity report) v Available frames (see Monitor I Paging Activity report) v Swapped-out users (both in CS & ES - see Monitor II ASD report) It can also be useful to know how much paging your individual workloads are doing. From the Workload Activity report, you also get paging data from the fields: SINGLE BLOCK Page-ins per second from DASD Page-ins per second from DASD, for blocked pages
Use this paging information to help decide whether a workloads current processor storage is sufficient (see Chapter 5, Analyzing processor storage activity for more detail on paging). Monitor II and III also provide processor storage use (and paging) data, down to the AS level.
19
Processor speed indicators

This information unit presents some of the terms and formulas used in discussing CPU performance and capacity. v Cycle Time
CPU time = Path length * (Cycles / Instruction) * Cycle time
Path length depends on the instruction set and the compiler Cycles/instruction depends on the design (e.g. microcoding of instructions) Cycle time depends on the technology (for example, CPU model) Thus, cycle time by itself is not a complete indicator of CPU speed. v MIPS This term depends on cycle time and number of cycles per instructions, it is only valid for comparison if the instruction sets are the same. In common usage today, MIPS are used as a single number to reflect relative processor capacity. As with any single-number capacity indicator, your experience may vary considerably. These numbers are often based on vendor announcements. By the way, MIPS means Millions of Instructions Per Second (some people say Misleading Information about Processor Speed). There does not exist a term MIP which often is used erroneously. A small processor has a speed of 1 MIPS, not 1 MIP. v SRM Constant MSU The SRM constant is a number derived by product engineering to normalize the speed of a CPU, so that a given amount of work would report the same service unit consumption on different processors. Service Units (SU) are typically used by the System Resource Manager (SRM) as a basis for resource management.
Service Units = CPU seconds * SRM constant
Many installations who have to charge the users do so on the basis of SUs. This is a reasonable approximation of relative capacity. But of course, any single-number metric for processor capacity comparison can be misleading, and LSPR numbers, based on your own workload mix, are the best to use. You get more details about this in the following section that describes ITR and LSPR. The SRM constants are contained in internal tables within MVS and are published in the z/OS MVS Initialization and Tuning Guide. The SUs are reported in many Monitor I and Monitor III reports. The power of a system can be characterized also in terms of MSUs (millions of service units per hour). These numbers are published in the LSPR documentation.
20
v Internal Throughput Rate (ITR) ITR numbers are measurements by workload type which determine the capability of a machine in terms of the number of transactions per CPU second. This is important, since processor capacity varies according to the workload mix.
Number of transactions ITR = ---------------------CPU time
ITR is a function of: CPU speed Operating system Transaction characteristics (workloads differ) ITRs are derived from the LSPR methodology. See Large Systems Performance Reference for details. ITR provides a reliable basis for measuring capacity and performance. v External Throughput Rate (ETR)
Number of transactions ETR = ---------------------- = ITR * CPU utilization Elapsed time
Any bottleneck, internal or external to the system, will effect the ETR. Examples include: I/O constraints, tape mount delay and paging delay. Thus it is more difficult to get a repeatable measure than with ITR. v Relative Processor Power (RPP) RPP is the ratio of ITRs for a specific workload mix (usually 25% Batch, 25% TSO, 25% CICS, and 25% IMS) for different machines. RPP is normalized to a base machine. So, if a given function takes a certain amount of RPPs, you can estimate how much it would consume on a different machine by using the formula:
Used RPPs * 100 Your utilization = ------------------Your machine RPPs
v MFLOPS MFLOPS = millions of floating point instructions per second. Used only in numerically intensive computing (scientific/technical). In summary, various numbers may commonly be used as rough indicators of CPU capacity. Remember that most of these are very rough approximations only, and your actual capacity may vary significantly. To get the most accurate assessment of CPU capacity for your workload, differences in workload processing characteristics must be taken into account using a methodology such as that described in the Large Systems Performance Reference. You find the most current version via
http://www.ibm.com/servers/eserver/zseries/lspr/zSeries.html
21
Where do you go from here?

Having reviewed some of the main MVS performance concepts, the following chapters will discuss how to recognize and resolve specific performance problems. One thought to bear in mind on your journey: You can only do so much as a system programmer; you need help from your application designers and data base designers. A poorly written application can nullify a lot of system tuning effort. Participate in the design reviews, where you can stress that you need reasonable call structures to data bases, practical data base design etc.
22
Chapter 2. Diagnosing a problem: The first steps
Lets Start the Diagnosis This chapter explains the first steps on your way to performance management: v How to recognize a performance problem v How to find that system in the sysplex that contributes most to the problem v How to find the major cause of the problem (CPU, processor storage, I/O, etc.). Generally, there are two different approaches: One for response time problems (primarily using Monitor III delay reports) And one for system indicators showing stress (primarily using Monitor I/II resource-usage reports) v How to continue by reading the other chapters in this book for further problem analysis and resolution
23
Performance analysis
What is a performance problem?

There are many views on what constitutes a performance problem. Most of them revolve around unacceptably high response times or resource usage, which we can collectively refer to as pain. The need for performance investigation and analysis is for example indicated by: v Bad or erratic response time Service level objectives are being exceeded Users complaining about slow response Unexpected changes in response times or resource utilizations v Other indicators showing stress Monitor III Workflow/Exceptions System resource indicators (for example, paging rates, DASD response) Expected throughput on the system not being attained Ultimately, you will have to decide for yourself whether a given situation is a problem worth pursuing or not. This will be based on your own experience, knowledge of your system, and sometimes politics. We will simply assume for the following discussions that you are trying to relieve some sort of numerically quantifiable pain in your system. Generally, a performance problem is the result of some workload not getting the resource(s) it needs to complete in a timely manner. Or, less common, the resource is obtained but is not fast enough to provide the desired response time. There are three ways to gain the resource needed: v Buy it. v Create the illusion you bought it. This is known as tuning. Capability to do this implies you have been previously wasting resources. Like purchasing, there is a cost. The cost is people; while it may be higher, it is always less visible than purchase. v Steal it (take it from a less important application). Again there is a cost. Here the cost is lower service to the application from which the resources were stolen. If none of these options are technically or financially possible, it will be necessary to change users (and managements) expectations. You may have experienced the situation where you complete an extensive performance analysis only to conclude that no further tuning or stealing of resources can be done. One of the goals of this book is to assist you in determining whether you have reached this point.
24
Getting started: A top-down approach to tuning
Daily Monitoring
No
Pain ?
Yes
SYS1
CPU I/O -
Chapter 3 Chapter 4 Chapter 5 Chapter 6 Appendices
Sysplex
SYSx SYSn
Storage WLM / CF LPAR / IRD -
Figure 12. Simplified View of Performance Management
This book will take the following top-down approach to investigating and resolving performance problems. Figure 12 is a simplified view: 1. Through your daily performance monitoring activities recognize any of the pain indicators as being a problem on your sysplex. 2. Discover the system where you can see the most important performance problems. 3. Analyze the data to determine which component of the response time holds the most promise for relief. 4. Go to the appropriate chapter or appendix for more detailed advice. Daily performance monitoring what does this mean? If you ask ten people, probably you will get eleven answers. Or vice versa, there is not just one answer to the question: How should I do daily performance monitoring? Having in mind customers who want to be satisfied with the service that they receive from the system you are responsible for, the focus should be on observing the service level agreements. If the goals being described there can be attained then you are providing the service that your customers are expecting. In the past, it was not a simple task of monitoring the response time and throughput objectives that are described in the service level agreements. And furthermore, it was not a simple task to adjust all parameters in the Parmlib members IEAIPSxx and IEAICSxx to reach the optimal performance of your system. But now, its easy and simple: you just define your goals in the service policy and let the system run. The workload manager is doing its best in managing the workload in your system for achieving all goals your job is done automatically, be happy. STOP if this would be true we could stop at this point of time. Of course, its not true because there are many aspects which have to be mentioned with regard to performance management.
25
Starting with the sysplex view, you get a high-level overview about the performance indicators of your workloads. This will lead you to the most relevant systems in the sysplex. There, you may start with the largest cause of delay, do what you can to address it, go back and address the next largest cause of delay, and so on. Continue to do this until the service objectives are met, or until you reach a point of diminishing returns, where further effort does not pay off.
In short ... ... use the data available from RMF to ensure that your efforts are focused where they can do the most good. Of course, with any performance problem there are many different approaches you can take to analyze the data. We have selected these two paths: v Response time problems. Follow this path if you have a user or group of users complaining about response time. v System indicators. Follow this path if your ongoing performance monitoring shows signs of a problem, or for a general health check. This path starts on page 41.
Where to start sysplex monitoring?

When using RMF, you are interested in getting a performance overview at a glance. You have defined a service policy that contains several workloads, each of them with service classes and report classes that have goals of different types. Monitor III and the Postprocessor offer you some reports that show how your sysplex is running at different levels of detail. To start with:
Monitor III Sysplex Summary report

This report shows (based on your option selection) all or only selected workloads, service and report classes, and periods with their goals and actual performance values. When you run Monitor III in GO mode, you see the Performance Status Line that summarizes the key performance indicators for up to 80 time ranges, showing you the performance history of your sysplex. To continue with:
Monitor III Response Time Distribution report

You can use this report to see how several systems of the sysplex contribute in servicing one specific service class. This is the entry point from the sysplex to single-system reports, for example: GROUP STOR DELAY JOB WFEX Group Response Time Report Storage Delays Report Delay Report Job Delays Report Workflow/Exceptions Report
26
Sysplex monitoring
Monitor III Work Manager Delays report

Here, RMF provides detailed performance data about the CICS and IMS subsystems. The Work Manager Delays report shows you for CICS and IMS several kinds of delay data about your transactions, being in the begin-to-end or in the execution phase. You get an overview of how the different CICS address spaces (for example, the AOR, TOR, or FOR) provide service to the service classes your transactions belong to.
Postprocessor Workload Activity report

You can use this report to get performance and resource data for your service classes and workloads in the sysplex. You will find a detailed discussion including report samples in Understanding Workload Activity data for IMS or CICS on page 146.
27
Sysplex monitoring
Monitor III indicators

Use the following reports for monitoring performance of your sysplex: v Sysplex Summary report v Response Time Distribution report on page 32 v Work Manager Delays report on page 34
Sysplex Summary report

RMF V1R11 Command ===> WLM Samples: 385 Systems: 3 Sysplex Summary - WTSCPLX1 Line 1 of 72 Scroll ===> HALF Sec
Date: 04/30/09 Time: 09.30.00 Range: 100
>>>>>>>>XXXXXXXXXXXXXXXXXX<<<<<<<< Service Definition: CICSPOL Active Policy: CICSMAJ Installed at: 03/24/09, 11.57.51 Activated at: 03/24/09, 11.58.16 --Avg. Resp. TimeWAIT EXECUT ACTUAL Time Time Time 0.000 0.000 0.580 2.970 0.000 0.563 0.000 0.000 0.000 0.006 0.000 0.000 1.576 2.109 0.000 1.576 0.000 0.000 0.000 3.460 0.000 0.000 2.156 5.079 0.000 2.139 0.000 0.000 0.000 3.466
Name BATCH BATCHLOW CICS CICSDFLT CICSRGN CICUSRTX SYSTEM SYSSTC SYSTEM TSO
------- Goals versus Actuals -------- Trans Exec Vel --- Response Time --- Perf Ended I Goal Act ---Goal--- --Actual-- Indx Rate 12 12 N/A N/A 0.300 AVG 70 70 N/A 0.090 AVG 68 N/A 70 N/A N/A 63 N/A 68 1 0.000 0.000 18.86 16.9 0.111 1.00 0.000 23.8 18.75 0.000 0.000 0.000 0.537 0.09
W S 5 W S 2 S 2 S 2 W S S W
5.079 2.139
AVG AVG
Figure 13. Monitor III Sysplex Summary Report
You can select what should be part of this report by using the report options: v What types should be reported? These can be everything from workload to service class period, as well as report classes. v What performance index should be used as threshold? This allows to show only the data that have an index high enough for you to want to be informed about it. v Is any importance important? If you want to see only high or medium important work - you can select this. v Are you interested in all workload groups you have defined? Then you can display them - otherwise the report shows only active workload groups and their details.
28
Sysplex monitoring
Performance Status Line in the Sysplex Summary Report
RMF V1R11 Command ===> WLM Samples: 385 Systems: 3 Sysplex Summary - WTSCPLX1 Line 1 of 72 Scroll ===> HALF
Date: 04/30/09 Time: 09.30.00 Range: 100 Sec Refresh: 100 Sec
----|||||--|----X--XX-XXXX-XX-XX--||---XXXXX---|--||||||||--XXXX--X--XX----XX Figure 14. Monitor III Sysplex Summary Report - GO Mode
Indicator Field: ----|||||--|----X--XX-XXXX-XX-XXDescription: The performance status line gives a performance indication, by its symbols and colors, for each range when the Monitor III reporter session was in GO mode: Green All goals have been attained for this range Yellow Service class periods with low or medium importance have not attained the goal. Red Service class periods with high importance have not attained the goal.
If you switch from GO mode to STOP mode, the reporter builds no reports for this time, and therefore, no indicator for the status line will be created. When continuing with GO mode, you will see one or several blank fields in the status line. The status line is updated with each refresh of the report in GO mode. Assuming a range and refresh value of 100 seconds, you will see the full status line after 8000 seconds with each interval indicator shifted to the left by one position. You can take this status line to see at a glance the performance status of your sysplex during the previous minutes or hours. If green is the dominant color, your sysplex seems to be in good shape, as more yellow or red indicators appear, you might start investigating the sources of possible problems.
29
Sysplex monitoring
Indicator Field: Perf Indx Description: The performance index shows how well a performance goal could be achieved and is calculated from the goal and actual performance data. Guideline: A value of 1 means that the actual value met the goal exactly, a lesser value indicates that the actual value is better than the goal, a higher value could be seen as an indication of a performance problem. If the goal could not be attained, the lines for the service class period, service class and workload are displayed in red (for high importance) and yellow (for medium and low importance). The same color is also given to the corresponding field in the performance status line. Problem Area: The red and yellow lines can be indicators of performance problems in the sysplex. If it is a one-time event, you might ignore it. If some lines show red continuously, further investigation is recommended. Potential Solution: Depending on the type of the workload you want to study, you can proceed with the Response Time Distribution report to get an impression of how this service class performs on the different systems in the sysplex, or you might choose the Work Manager Delays report that shows you more about CICS and IMS delays. Note: The calculation of the performance index is done with actual values that are averages for the reporting range (could be, for example, 100 seconds for Monitor III, or 30 minutes in the Workload Activity report). You might distinguish between this value and the performance index that is used by the workload manager internally to manage all service classes. This index is calculated every 10 seconds and you find the values in SMF records type 99. Due to the different ranges it is obvious that a difference will be seen.
30
Sysplex monitoring
Indicator Field: Avg. Resp. Time Description: The response time information for all workloads and service classes is given in three fields: 1. ACTUAL Time The average response time for all ended transactions. Note that these response times are for ended transactions only. Thus, if there is a problem where transactions are completely locked out, either while queued or running, you cant see the problem on this report until the locked out transactions end. 2. EXECUT Time For CICS transactions, this includes execution time in AOR and following regions. For IMS transactions, this includes execution time within the MPR. For Batch, TSO, etc., this is the average time that transactions spent in execution. Note: In the Postprocessor Workload Activity report, you see this field as EXECUTION TIME. 3. WAIT Time This time is calculated as the difference between ACTUAL and EXECUT time, as long as ACTUAL time is the bigger value. However, for subsystem data, it can happen that EXECUT time is more than the ACTUAL time. For CICS transactions, this includes not only queuing in the TOR and AOR, but also processing time within the TOR. For IMS transactions, this includes not only queuing for the MPR, but also processing time within the CTL region. Otherwise, this is the average time that transactions spent waiting on a JES or APPC queue. Note that WAIT time may not always be meaningful, depending on how the customer schedules work. For example, if a customer submits jobs in hold status and leaves them until they are ready to be run, all of the held time counts as queued time. The server service classes are blank in the Avg. Resp. Time the and Trans Ended Rate columns, because their transactions are address spaces, and response times are available only for ended transactions. Problem Area / Potential Solution: See chapter Understanding Workload Activity data for IMS or CICS on page 146.
31
Sysplex monitoring
Response Time Distribution report

RMF V1R11 Response Time - WTSCPLX1 Command ===> WLM Samples: 385 Line 1 of 7 Scroll ===> HALF
Systems: 3 Date: 04/30/09 Time: 09.30.00 Range: 100 Sec
Class: CICUSRTX % 60| Period: 1 of | XXXX TRX | |||| XXXX Goal: | |||| XXXX 0.090 sec avg 30| |||| XXXX | |||| XXXX | |||| XXXX | |||| ................................ XXXX Resp. 0|----+//---+---+---+---+---+---+---+---+---+//---- Time <.054 0.063 0.090 0.126 >.135 (sec) --Avg. Resp. Time-- Trx --Subsystem Data-- --Exec Data-System Data WAIT EXECUT ACTUAL Rate Actv Ready Delay Ex Vel Delay *ALL SC47 SC49 SC50 SC52 SC53 SC54 all all none all none none 0.563 1.576 2.139 0.413 1.312 1.725 0.703 1.845 2.548 0.660 1.692 2.352 18.75 8.24 5.16 5.35 1 1 1 1 2 2 2 2 79 71 85 87
Figure 15. Monitor III Response Time Distribution report
This report enables you to analyze the distribution of a response time to see whether a response time goal was met and, if not, how close it came to failing. The report shows how the response time for a specific service or report class period is distributed. Two levels of detail are shown: v A character graphic shows the distribution of response time for all systems in a sysplex which have data available in this period. This graphic is shown only for a periods with a response time goal. v The details of how each system contributed to the overall response time. Using cursor-sensitive control, you can navigate from this report to the GROUP report, which provides a detailed analysis of the response time, or to the SYSINFO report.
32
Sysplex monitoring
Indicator Field: Response Time Distribution Graphic Description: The horizontal axis shows response time (in seconds) with the response time goal in the middle. The middle section of the graph surrounding the goal shows the distribution of transactions that met between 60% and 150% of the goal. Guideline: Easy the more green fields you see the better your system is performing. Problem Area: A big number of transactions on the right of the graphic in red indicates a problem. Potential Solution: There can be different reasons for getting a distribution graphic like the one shown in the sample report: v If you get this graphic more or less permanently, then probably different types of transactions are classified for the same service class. You can analyze the response time of the transactions by classifying them to different report classes, and then as the second step to different service classes to get more meaningful reports. v The problem can also be either an untuned sysplex, a temporary performance problem, or an unrealistic performance goal. Then you need to investigate further.
Indicator Field: System Description: The bottom part of the report shows each system in the sysplex that provides service to the chosen service class. Guideline: You can use this report to evaluate possible anomalies among the different systems that provide service. Problem Area: This part of the report shows whether the any system has problems achieving the required goal. Potential Solution: With cursor-sensitive control you can navigate from this report into the SYSINFO report for the system you are interested in. Then you can continue performance analysis as described in Using Monitor III reports on page 36.
33
Sysplex monitoring
Work Manager Delays report

RMF V1R11 Command ===> WLM Samples: 385 Systems: 3 Work Manager Delays - WTSCPLX1 Line 1 of 7 Scroll ===> HALF Sec
Date: 04/30/09 Time: 09.30.00 Range: 100
Class: CICUSRTX Period: 1 Goal: 0.090 sec average Actual: 2.139 sec average Sub P Type
Avg. Resp. time: 2.139 sec for 1805 TRX. Avg. Exec. time: 1.576 sec for 1558 TRX. Abnormally ended: 0 TRX.
--------------- Response time breakdown (in %) ---------- -Switched-Tot Act Rdy Idle -----------------Delayed by------------ Time (%) LOCK I/O CONV DIST SESS TIME PROD MISC LOC SYS REM 79.0 66.4 0.5 0.5 1.8 0.4 0 0 2.3 1.3 67.0 2.8 21.6 0 0 6.1 0 40.8 0 0 0 0 67 0 0.3 22 0 0 0 0
CICS B CICS X
----------- Address Spaces Serving this Service Class CICUSRTX -------------Jobname M ASID System Serv-Class Service Proc-Usg I/O-Usg Veloc Capp Quies CICSAOR1 CICSAOR2 CICSAOR3 CICSAOR4 CICSCMAS CICSFOR CICSTOR 46 49 50 52 54 39 51 SC47 SC47 SC49 SC52 SC47 SC47 SC47 SYSSTC SYSSTC SYSSTC SYSSTC SYSSTC CICSRGN CICSRGN 100 100 100 100 50 100 100 .61 3.7 4.2 2.1 11 14 14 4 3 1 2 8 6 8 55 74 43 79 54 77 63 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Figure 16. Monitor III Work Manager Delays Report
Indicator Field: Response time breakdown (in %) Description: This part of the report provides performance information for the begin-to-end phase (CICS only) and the execution phase of CICS and IMS. Guideline: This information is the same as in the Postprocessor Workload Activity report with just the difference in the interval length. For details, refer to Understanding Workload Activity data for IMS or CICS on page 146.
Indicator Field: Address Spaces Serving this Service Class Description: You can see in this part of the report what address spaces provide service on different systems in the sysplex. Guideline: The velocity (Veloc) and the processor service (Proc-Usg) for each address space can be seen as initial indicators to know how well the address spaces are providing service to the service classes of your transactions.
34
Response time analysis
Identifying the major delay in a response time problem

You have a response time problem? Want to know where to go next? This section will take you through several methods of analyzing a response time problem to find the major cause. It then points you to the relevant chapter for further diagnosis and problem resolution suggestions. For a response time problem this means understanding the end-to-end response time components as completely as possible. As shown in Figure 17, most transactions can be viewed as having two major response time components: v internal (for example, CPU time, paging, I/O) v external (for example, network delay, server delay from other platforms) RMF can help you analyze the internal response time components, and that is where this book will focus. The external response time components are beyond the scope of RMF, and will require other monitoring tools such as NetView*.
User Response Time

RMF Report
INTERNAL
Mount Delay Operator Delay
EXTERNAL
Network Server Response
CPU
Storage I/O Enqueue

Others
Figure 17. End-to-End Response Time Components
Understanding the relative proportions of these components allows you to see where you have the most leverage to improve response times. For example, if 90% of your response time is in I/O, why start with the CPU? For response time problems, generally the best place to start is with Monitor III. It reports on contention for resources (both hardware and software) and any associated delays. Start by setting the range (or interval) of the data youre looking at to match at least a portion of the time in which the problem occurred. This could be current data in GO mode, or prior data. The choice of the range value must not be too high, which would smooth the peaks, or too low, to have enough samples to provide useful data. The range should contain at least about 100 cycles. If the cycle is one-second (which is a good value, and is the default in RMF) a 1-5 minute time for the interval will generally fit. 100 seconds is the recommended value. Displayed
35
Response time analysis

percentages are then easily converted into elapsed time. If you are unfamiliar with altering these values, see the z/OS RMF Users Guide for more details. If your response problem is sporadic, or due to brief spikes, you may need to adjust your range, moving forward or backward in time to find the problem. If youre having trouble, set the range to 100 seconds and move through time to pinpoint a spike. Do not base your analysis on a single occurrence, as it could be an anomaly. Check over multiple days or times of day, to verify that the problem is real.
Using Monitor III reports

This section will discuss several reports, showing you how to identify the major cause of delay to a workload. Your starting point might be the GROUP report when you used the cursor sensitivity in the SYSRTD report.
36
Performance analysis with Monitor III
GROUP Report
If you have a response time problem for TSO or batch, go to the Group Response Time report. Look at the Primary Response Time Component field to see the largest component of response time.
RMF V1R11 Command ===> Samples: 100 System: PRD1 Date: 04/07/09 Time: 10.32.00 Range: 100 Sec Group Response Time
Class: TSOPROD Period: 1 Primary Response Time Component:
Description: TSO Production Storage delay for local paging EXCP Rate 3.2 PgIn Rate 4.1 TRANS Ended Rate 0.2 --- Response Time ----- Ended TRANS-(Sec) WAIT EXECUT ACTUAL 0.000 3.432 3.432
WFL % 50
Users TOT ACT 2 1
Frames %ACT 5
USGDEV 0.17 0.97
Report Analysis This report has two indicators for storage problems: v Primary Response Time Component: Storage delay for local paging v Response Time ACT: Average storage delay of 1.25 seconds For more information, position the cursor on the field you are interested in and press ENTER. For this example, the reporting link will take you to the Storage Delay report, shown in figure 19.
37
STOR Report
RMF V1R11 Command ===> Samples: 100 System: PRD1 Date: 04/07/09 Time: 10.32.00 Storage Delays Line 1 of 2 Scroll ===> HALF Range: 100 Sec
Jobname BAJU BHOL Report is
Service DLY ------- % Delayed for -----C Class % COMM LOCL VIO SWAP OUTR T TSO 21 0 18 0 3 0 T TSO 1 1 0 0 0 0 for service class TSO only.
-- Working Set -Central Expanded 273 5 1111 4
Figure 19. Monitor III Storage Delays Report
This report shows the address spaces which are affected by storage delays, in this example just for service class TSO.
Report Analysis Address space BAJU is shown with the largest delay.
DELAY Report
If you do not know which group a delayed user belongs to, use the Delay report.
RMF V1R11 Command ===> Samples: 100 System: PRD1 Date: 04/07/09 Time: 10.32.00 Delay Report Line 1 of 43 Scroll ===> HALF Range: 100 Sec
Name BHOLEQB JES2 SMF BAJU RMFGAT BHOL BHOLPRO2
CX B S S T S T B
Service WFL USG DLY IDL UKN ------ % Delayed for ---- Primary Class Cr % % % % % PRC DEV STR SUB OPR ENQ Reason BATCH SYSSTC SYSSTC TSO SYSSTC TSO BATCH 0 0 51 0 49 0 0 2 0 98 0 0 1 0 99 29 9 22 67 2 50 1 1 0 98 78 18 5 73 4 93 93 7 0 0 0 0 0 1 1 1 7 0 0 2 0 1 0 0 21 0 0 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 SYSDSN 0 SYSPAG 0 SYSPAG 0 LOCL 0 BAJU 0 RMFUSR 0 BHOLPRO1
Figure 20. Monitor III Delay Report
The report lists all delayed address spaces sorted in ascending order by workflow percentage.
Report Analysis There are two jobs with significant delays: v BHOLEQB with 51% enqueue delay v BAJU with 21% storage delay We will continue analyzing the storage problem by positioning the cursor at the BAJU address space name for further processing. This leads to the Job Delay report with initial information about possible causes, as shown in figure 21.
38
JOB Report
RMF V1R11 Command ===> Samples: 100 Job: BAJU System: PRD1 Date: 04/07/09 Time: 10.32.00 Job Delays Line 1 of 4 Scroll ===> HALF Range: 100 Sec
Primary delay: Paging of private area storage.
Probable causes: 1) Job may be using excessive central storage. 2) Paging configuration may need tuning. Help panels contain more possible causes. --------------------------- Job Storage Usage Data ---------------------------Average Frames: 300 Working Set: 278 Fixed Frames: 43 Active Frames: 92 Aux Slots: 978 DIV Frames: 0 Idle Frames: 209 Page In Rate: 14.1 ES Move Rate: 0.0 --------------------------- Job Performance Summary --------------------------Service P WFL -Using%- DLY IDL UKN ----- % Delayed for ----- Primary CX ASID Class P Cr % PRC DEV % % % PRC DEV STR SUB OPR ENQ Reason T 0027 TSO * 29 4 5 22 67 2 1 0 21 0 0 0 LOCL TSO 1 33 1 1 4 67 2 0 0 4 0 0 0 SWAP TSO 2 35 2 4 11 0 0 1 0 10 0 0 0 LOCL TSO 3 13 1 0 7 0 0 0 0 7 0 0 0 LOCL Figure 21. Monitor III Job Delays Report
Report Analysis Monitor III has done basic analysis of the status and provides some possible causes. Use these as a starting point, and investigate further to confirm or rule-out these possibilities. In this example, swap trim may also be worth investigating. In this example you would need to investigate processor storage further. See Chapter 5, Analyzing processor storage activity for details.
39
WFEX report
If your help desk uses the Workflow/Exceptions report as a continual system monitor, they would be alerted to exception conditions. These could be your first indicators to a problem. This depends on your having customized the screen to your particular needs and thresholds. See the z/OS RMF Report Analysis for tailoring information. Or, you can use this screen as your first diagnosis aid if investigating a user problem.
RMF V1R11 Command ===> Samples: 100 System: PRD1 Date: 04/07/09 Time: 10.32.00 Workflow/Exceptions Line 1 of 4 Scroll ===> HALF Range: 100 Sec
--------------------------- Speed (Workflow) --------------------------------Speed of 100 = Maximum, 0 = Stopped Average CPU Util: 100 % Name Users Active Speed Name Users Active Speed *SYSTEM 43 3 72 TSOPROD 2 1 50 ALL TSO 2 1 50 BATCHPRD 6 3 75 ALL STC 35 0 84 ALL BATCH 6 3 75 ALL ASCH Not avail *PROC 17 2 93 *DEV 6 0 21 ------------------------------ Exceptions ------------------------------------Name Reason Critical val. Possible cause or action *ECSA* SECS% > 85 90.6 % System ECSA usage 91 %. *STOR TSQAO > 0 784K bytes SQA overflow into CSA 784K. BHOLEQB ENQ -SYSDSN 51.0 % delay ENQ.TEST.DAT HSM Not avail Job HSM is not running. Figure 22. Monitor III Workflow/Exceptions report
Speed, or workflow, can be used as a performance indicator either for the total system or for groups of users or resources. It is an indicator of how well a workload is able to get the system resources it wants. 100% means it gets whatever it wants whenever it wants it; 0% means it never gets what it wants. Speed is also used on a resource level (processors or devices) to reflect contention for that resource. Speed can also be a useful indicator for exception reporting for your workloads. Make note of the workflow values reported when your system is running well. Use these as your own guideline values to set up exception reporting.
Report Analysis The Speed of 50% for all TSO users might suggest a response time problem, if users are complaining, or if your experience has shown that a higher workflow is required to meet service objectives on your system. However, your system may function well at a 50% speed. Use your judgment and review related indicators to determine whether or not a low speed is a problem.
40
DELAY versus USING

So far, we have been concentrating on workload delays. Your workloads dominant state could also be USING. Both cases will be discussed further in the relevant chapters. Performance problem due to DELAY: If the largest single component of response or delay for the workload youre interested in is: v PROC (or processor delays), go to Chapter 3, Analyzing processor activity for further analysis v DEV (or device delays), go to Chapter 4, Analyzing I/O activity for further analysis v STOR (or storage delays), go to Chapter 5, Analyzing processor storage activity for further analysis If some other component (ENQ, IDL, SUBS, OPER, UKN) indicates a problem, see Appendix D. Other delays. Performance problem due to USING: If there are no delays of any significance (at least 5-10%), look at the USING% values for PROC and DEV in the Job Delay report (other reports also have this information). Select the larger, and go to the corresponding CPU or I/O chapter for some thoughts on reducing the workloads need for that resource or speeding up the processing.
Monitoring daily performance: System indicators

As discussed at the beginning of this chapter, there are different kinds of performance problems. Response time complaints or SLA exceptions are two examples. Potential problems may also be indicated by your daily performance monitoring. This section will discuss some RMF indicators which will be of use in your daily monitoring. Here are some of the most frequently used global indicators of MVS system health, along with some rules-of-thumb (ROTs), on what constitutes potential trouble.
As with any ROTs, these should be looked at as a starting point only. There is no magic single number for any indicator that suggests you have a problem.
The most important thing is that your service levels are being met. We will highlight the indicators by using RMF reports, with pointers to the appropriate chapter of this book for more details. As you read these, remember that Monitor III will show you the impact of most of these indicators: v WHO is being delayed? v For HOW LONG? v And BY WHOM? Use this delay information to help you decide whether a given indicator really means trouble or not.
41
System indicators
Summary of major system indicators

|
| | Figure 23. Summary of Major System Indicators. Spreadsheet Reporter macro RMFY9OVW.XLS (System Overview | Report - Summary) The Summary report displays key indicators representing the performance of your system. At a glance, you see whether your system experiences processor or storage contention. A resource contention analysis is done for three indicators: CPU Contention | | Storage Contention The percent of time with at least one job waiting for the CPU. The page fault rate is used as indicators for storage contention.
You can specify a time range and a day to perform the contention evaluation. The calculated values will be compared against a set of defined thresholds: CPU Contention | Storage Contention 20 - 80 10 - 50 (page fault rate)
You can specify the low and high thresholds which are used to set the traffic light to red or green according to your own experiences. Especially, the storage contention threshold depends on the key workload you are running in your system. If it is an online application (for example CICS or DB2), you should decrease the values to 10 - 30 (page fault rate). On the other hand, a TSO system can run with much higher values, here you might specify 20 - 70.
42
System indicators
Please keep in mind that the contention analysis should be done with data for a longer interval, the sample report covers prime shift data (8:00am to 3:59pm), therefore, the thresholds are different to other values in this publication, because those will be taken for interval reports of 15 or 30 minutes.
Using Monitor I reports

This section discusses the use of the following Monitor I reports: v CPU Activity report on page 44 v Partition Data report on page 47 v Channel Path Activity report on page 48 v I/O Queuing Activity report on page 49 v v v v v DASD Activity report on page 50 Paging Activity - Swap Placement Activity on page 53 Page/Swap Data Set Activity report on page 54 Workload Activity report on page 55 Virtual Storage Activity report on page 57
43
System indicators
CPU Activity report

C P U A C T I V I T Y PAGE 1 z/OS V1R11 SYSTEM ID OS04 RPT VERSION V1R11 RMF H/W MODEL E26 DATE 06/05/2009 TIME 09.29.00 INTERVAL 14.59.996 CYCLE 1.000 SECONDS
CPU
2097
MODEL
720
SEQUENCE CODE 00000000000473BF LOG PROC SHARE % 35.0 35.0 35.0 35.0 35.0 35.0 35.0 35.0 35.0 35.0 35.0 385.0 7.1 7.1 14.2
HIPERDISPATCH=NO --I/O INTERRUPTS-RATE % VIA TPI 382.8 249.3 273.4 534.1 519.6 874.6 866.3 847.9 828.6 833.0 816.1 7026 0.38 0.43 0.35 0.20 0.21 0.17 0.18 0.16 0.16 0.16 0.15 0.20
---CPU--NUM TYPE 0 1 2 3 4 5 6 7 8 9 A CP
---------------- TIME % ---------------ONLINE LPAR BUSY MVS BUSY PARKED 92.55 75.66 72.95 70.05 68.05 91.00 90.26 89.30 88.19 86.92 85.46 82.76 1.83 1.51 1.50 94.06 96.81 99.04 96.13 99.66 97.42 96.57 97.19 99.43 98.09 96.01 97.31 1.31 0.84 0.86 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
CP 100.00 CP 100.00 CP 100.00 CP 100.00 CP 100.00 CP 100.00 CP 100.00 CP 100.00 CP 100.00 CP 100.00 CP 80.00 TOTAL/AVERAGE
B AAP 100.00 C AAP 100.00 AAP AVERAGE
0.00 0.00
SYSTEM ADDRESS SPACE ANALYSIS
SAMPLES =
900
--------NUMBER OF ADDRESS SPACES--------QUEUE TYPES IN IN READY OUT READY OUT WAIT LOGICAL OUT RDY LOGICAL OUT WAIT ADDRESS SPACE TYPES BATCH STC TSO ASCH OMVS 42 100 16 0 0 88 185 16 0 38 74.2 138.7 16.0 0.0 15.6 MIN 86 2 0 1 0 32 MAX 181 67 18 48 16 135 AVG 134.5 13.5 0.2 24.3 0.2 85.3
-----------------------DISTRIBUTION OF IN-READY QUEUE-----------------------NUMBER OF ADDR SPACES <= = = = <= <= <= <= <= <= <= <= > N N N N N N N N N N N N N 0 10 20 30 40 50 60 70 80 90 100 !....!....!....!....!....!....!....!....!....!....! >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> >>> >> >>> >>>>> >>> >> >> > >
(%) 58.8 6.4 4.3 3.9 4.9 8.3 4.5 3.7 3.2 1.2 0.8 0.0 0.0
+ + + + + + + + + + + +
1 2 3 5 10 15 20 30 40 60 80 80
N = NUMBER OF PROCESSORS ONLINE ( 10.8 ON AVG)
Figure 24. Monitor I CPU Activity Report
44
System indicators
Rules-of-Thumb BUSY TIME PERCENTAGE A value of over 85%, along with the value for the NUMBER OF ADDRESS SPACES in the distribution of the IN READY QUEUE for line <= N below 60% implies contention for the CPU. If your CPU is more than 85% busy, look at the values for the DISTRIBUTION OF QUEUE LENGTHS (%). If the number of address spaces in line <= N for example is 58.8 , this means that in 58.8% of the time, you had enough CPs to handle all the work on the IN READY queue. In the remaining time you had ready work waiting for CPs. This does not mean that you cannot run a z/OS system at higher utilizations, up to 100% busy. You can. Just be aware that the busier the CPU, the longer the CPU delay for lower-priority workloads will be. The more low-priority work you have, and the fewer non-CPU bottlenecks you have (for example, responsive I/O), the busier you can run your CPU and still maintain good response for your high-priority workloads. If your goal is to run your system to 100% busy, minimize your non-CPU bottlenecks and use the information from the Monitor III Processor Delay report to keep an eye on CPU delay for your low-priority work. BUSY TIME PERCENTAGE Conversely, a low CPU busy could indicate that other bottlenecks in the system are preventing work from being dispatched. A peak-to-average ratio (for example, peak hour busy/prime shift average busy) of less than 1.3-1.4 for a commercial workload could also indicate bottlenecks preventing work from using the CPU. This ratio is inversely correlated to IN READY users: the numbers of IN READY users go up as the peak-to-average ratio comes down. This is due to latent demand and the inability of the system to dispatch all the ready work. OUT READY Number of address spaces swapped out (probably TSO and batch), but ready to execute. If it is greater than 1, and TSO or batch response is an issue, look into this. This could reflect processor storage constraints, and probably you may need to update the WLM service policy.
Report Analysis v BUSY TIME PERCENTAGE is 97.31 % v Distribution of address spaces in IN READY QUEUE <=N: 58.8 % This might point to a problem, because at least 41.2% of the time one or more address spaces are delayed for CPU. The Monitor III Processor Delay report will tell you which AS were delayed, and for how long.
45
System indicators
See Chapter 3, Analyzing processor activity and Chapter 5, Analyzing processor storage activity for further analysis.
Figure 25. CPU Contention. Spreadsheet Reporter macro RMFY9OVW.XLS (System Overview Report - DaysCont)
The CPU Contention report helps you in understanding your processor capacity. It extends the long-term indicator in the Summary report (Figure 23 on page 42) and informs you about the contention of your system in some more detail.
46
System indicators
Partition Data report

P A R T I T I O N z/OS V1R11 SYSTEM ID NP1 RPT VERSION V1R11 RMF D A T A R E P O R T PAGE DATE 11/28/09 TIME 09.30.00 INTERVAL 14.59.678 CYCLE 1.000 SECONDS 2
MVS PARTITION NAME IMAGE CAPACITY NUMBER OF CONFIGURED PARTITIONS WAIT COMPLETION DISPATCH INTERVAL
NP1 100 9 NO DYNAMIC
NUMBER OF PHYSICAL PROCESSORS CP ICF
9 7 2
GROUP NAME LIMIT AVAILABLE
N/A N/A N/A
--------- PARTITION DATA --------------------MSU---DEF ACT 100 0 5 95 50
-- LOGICAL PARTITION PROCESSOR DATA -----DISPATCH TIME DATA---EFFECTIVE TOTAL 00.04.27.302 00.04.27.519 00.00.21.680 00.00.22.083 00.03.35.761 00.03.35.859 01.12.05.405 01.12.06.405 00.24.11.147 00.24.11.311 00.00.03.103 ------------ -----------01.44.41.295 01.44.46.280 00.14.59.611 00.00.00.000 -----------00.14.59.625 00.00.00.000 00.00.00.065 ------------
-- AVERAGE PROCESSOR UTILIZATION PERCENTAGES -LOGICAL PROCESSORS --- PHYSICAL PROCESSORS --EFFECTIVE TOTAL LPAR MGMT EFFECTIVE TOTAL 24.86 0.60 23.97 80.10 40.30 24.92 0.61 23.98 80.15 40.31 0.01 0.01 0.02 0.01 0.01 0.05 -----0.11 0.01 0.00 0.03 -----0.04 4.25 0.35 3.43 68.69 23.03 0.05 ------ -----99.69 99.80 99.96 0.00 0.03 ------ -----99.95 99.99 99.95 0.00 4.24 0.34 3.41 68.68 23.02
NAME NP1 NP2 NP3 NP4 NP5 *PHYSICAL* TOTAL
S A A A A A
WGT 20 1 10 300 200
-CAPPING-- PROCESSORDEF WLM% NUM TYPE 62.2 1.2 0.0 4 3.3 1.0 0.0 6.0 0.0 4 CP CP CP CP CP
10 NO 1 YES 8 NO 155 NO 52 NO
CFC1 A CFC2 A *PHYSICAL* TOTAL CB88 CB89 D D
DED DED
1 1
ICF ICF
99.98 0.00
99.99 0.00
Figure 26. Monitor I CPU Activity Report - Partition Data Report Section
If your system is showing signs of CPU constraint (see the CPU Activity Report from Figure 24 on page 44), check the Partition Data report to be sure that your LP is correctly configured. You would do this to allocate more CPU resource to a particular LP. See Appendix A. PR/SM LPAR considerations for further analysis.
47
System indicators
Channel Path Activity report

| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
C H A N N E L z/OS V1R11 SYSTEM ID CB88 RPT VERSION V1R11 RMF CR-TIME: 16.47.03 P A T H A C T I V I T Y PAGE DATE 11/28/2009 TIME 08.00.00 ACT: POR MODE: LPAR INTERVAL 30.00.007 CYCLE 1.000 SECONDS CPMF: EXTENDED MODE CSSID: 1 1
IODF = 8E ... ... ...
CR-DATE: 01/21/2009
-----------------------------------------------------------------------------------------------------------------------------------DETAILS FOR ALL CHANNELS -----------------------------------------------------------------------------------------------------------------------------------CHANNEL PATH ID TYPE G SHR 12 14 16 20 27 2B 2C 30 31 36 37 38 39 3A 3E 7C 7D 81 82 83 84 85 8C A6 B6 OSD OSD OSD CTC_S CNC_S CNC_S CNC_S FC_S FC_S FC_S FC_S FC_S FC_S FC_S FC_S CNCSM CNCSM FC_S FC_S FC_S FC_S FC_S FC_S FC_SM FC_SM Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y UTILIZATION(%) READ(MB/SEC) WRITE(MB/SEC) PART TOTAL BUS PART TOTAL PART TOTAL 0.00 0.00 0.33 0.00 0.00 1.23 37.13 0.24 0.22 0.00 0.00 0.00 0.00 0.00 0.00 3.08 0.08 1.92 0.07 0.07 0.00 1.66 1.71 0.00 0.00 0.00 0.00 1.10 0.00 0.00 4.90 39.54 32.72 32.27 0.00 0.12 0.26 0.19 0.17 0.29 4.27 0.09 16.35 0.61 0.60 0.00 9.35 11.67 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.47 0.02 0.46 4.13 0.00 0.44 2.79 0.00 0.45 4.14 FICON OPERATIONS RATE ACTIVE DEFER ZHPF OPERATIONS RATE ACTIVE DEFER
5 5 4 4 4 4 4 4
8.49 8.21 0.00 0.03 0.04 0.02 0.02 0.02
0.05 0.05 0.00 0.00 0.00 0.00 0.00 0.00
49.57 47.79 0.00 0.11 0.14 0.03 0.03 0.02
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.24 0.24 0.00 0.02 0.05 0.03 0.03 0.05
208.6 192.6 0.0 10.3 21.2 15.1 14.6 23.0
2.8 1.6 0.0 1.6 2.2 9.0 2.6 2.0
0.6 0.0 0.0 0.0 0.0 1.0 0.1 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 5 5 4 3 3 5 5
4.08 0.21 0.21 0.00 1.02 1.54 0.00 0.00
1.91 0.10 0.10 0.00 0.59 0.70 0.00 0.00
22.77 1.21 1.21 0.00 3.14 6.37 0.00 0.00
0.35 0.02 0.02 0.00 0.23 0.10 0.00 0.00
1.73 0.10 0.10 0.00 1.46 1.05 0.00 0.00
828.9 11.3 11.3 0.1 576.2 803.6 0.0 0.0
2.8 1.0 1.0 1.3 1.2 1.9 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
179.3 45.5 45.5 0.0 214.9 0.0 0.0 0.0
1.3 1.0 1.0 0.0 1.1 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Figure 27. Monitor I Channel Path Activity report
| |
Tip: URL http://www.ibm.com/servers/eserver/zseries/library/techpapers/ gm130120.html provides a document called FICON and FICON Express Channel Performance that helps you with understanding the performance characteristics of the different versions of FICON and FICON Express channels. This paper also explains in detail the benefits of FICON and FICON Express channels in both FC and FCV mode, discusses several channel configurations, compares ESCON and FICON technology and, last not least, recommends how to use FICON RMF information from various reports for I/O configuration performance measurement.
| |
Report Analysis Channel 2C has a partition utilization of 37.13% and a total utilization of 39.54%. This may not be a problem. To find out which logical control unit (LCU) is using this channel, look in the I/O Queuing Activity report (see Figure 28 on page 49). From there, you can go on to check device response times. See Chapter 4, Analyzing I/O activity for further analysis.
48
System indicators
I/O Queuing Activity report

I/O z/OS V1R11 Q U E U I N G A C T I V I T Y PAGE SYSTEM ID SYS1 RPT VERSION V1R11 RMF IODF = 26 CR-DATE: 12/08/2006 DATE 11/28/09 TIME 16.30.00 CR-TIME: 16.09.06 INTERVAL 15.00.036 CYCLE 1.000 SECONDS ACT: ACTIVATE -------- RETRIES / SSCH --------CP DP CU DV ALL BUSY BUSY BUSY BUSY 0.04 0.00 0.01 1.49 0.00 0.00 1.02 0.33 0.85 0.02 0.00 0.01 1.49 0.00 0.00 1.02 0.06 0.85 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.27 0.00 AVG OPEN EXCH 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 DATA XFER CONC 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1
TOTAL SAMPLES =
900
IOP
- INITIATIVE QUEUE ACTIVITY AVG Q RATE LNGTH 1456.345 81.124 1305.816 5469.766 2057.740 253.301 5154.934 112.071 15891.10 0.01 0.00 0.00 0.34 0.01 0.00 0.35 0.00 0.23
------- IOP UTILIZATION ------% IOP I/O START INTERRUPT BUSY RATE RATE 28.06 3.45 2.25 16.33 4.21 0.83 18.62 1.47 9.40 1456.345 81.124 1305.816 5469.766 2057.739 253.301 5154.934 112.071 15891.10 1910.201 213.122 1083.198 5061.730 2079.958 413.870 5498.320 136.763 16397.16 AVG CUB DLY 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-- % I/O REQUESTS RETRIED -CP DP CU DV ALL BUSY BUSY BUSY BUSY 3.9 1.7 0.1 0.1 0.9 0.7 59.8 59.8 0.2 0.0 0.4 0.2 50.6 50.5 24.7 4.4 46.0 45.7 2.2 0.0 0.0 0.0 0.0 0.2 0.0 20.2 0.2 0.0 0.0 0.2 0.0 0.2 0.0 0.1 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 AVG CSS DLY
00 01 02 03 04 05 06 07 SYS
LCU 0011
CU
DCM GROUP CHAN MIN MAX DEF PATHS 90 B0 D7 E3 * 10 11 2 2 6 * 90 B0 D7 E3 * EC 88 D9 96 *
CHPID TAKEN 64.501 83.215 63.353 30.523 241.59 0.034 0.038 0.074 0.147 64.088 82.921 63.831 30.712 241.55 0.020 0.052 0.032 0.036 0.140
% DP BUSY 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
% CU BUSY 0.05 0.03 0.11 0.42 0.11 0.00 0.00 0.00 0.00 0.04 0.03 0.12 0.31 0.09 0.00 0.00 0.00 0.00 0.00
AVG DELAY CMR CONTENTION Q DLY RATE LNGTH 3.1 2.6 3.2 3.5 3.0 0.0 0.0 0.0 0.0 3.0 2.6 3.2 3.5 3.0 0.0 0.1 0.1 0.0 0.1
HPAV WAIT MAX
1400
0.008
3.00
22.2
0.000
0.052
0.001
0012
1500
0.000
0.00
0.1
0.000
0.052
0.001
0013
1600
0.000
0.00
22.2
0.000
0.052
0.001
0014
1700
0.000
0.00
0.2
0.000
0.052
0.001
Figure 28. Monitor I I/O Queuing Activity Report
Report Analysis Channel path 90 is connected to LCUs 0011 and 0013 . The corresponding devices can be seen in the Device Activity report.
49
System indicators
DASD Activity report

D I R E C T z/OS V1R9 A C C E S S D E V I C E DATE 02/10/2007 TIME 09.29.00 CR-TIME: 10.23.17 AVG AVG AVG PEND DISC CONN TIME TIME TIME 0.6 0.3 0.6 0.3 0.5 0.3 0.5 0.3 0.3 0.2 0.7 0.0 0.2 0.0 9.9 4.8 0.0 0.0 0.0 0.1 17.1 0.6 21.5 0.6 2.4 7.1 1.8 0.6 0.6 0.4 % DEV CONN 0.24 0.00 7.12 0.00 0.04 2.63 0.17 0.00 0.00 0.00 % DEV UTIL 0.25 0.00 7.18 0.00 0.18 4.40 0.17 0.00 0.00 0.00 % DEV RESV 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 AVG % % NUMBER ANY MT ALLOC ALLOC PEND 1.3 0.0 1.0 0.0 2.0 3.7 11.0 2.0 0.0 0.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 A C T I V I T Y INTERVAL 14.59.996 CYCLE 1.000 SECONDS
SYSTEM ID UIG1 RPT VERSION V1R9 RMF IODF = 12 CR-DATE: 03/01/09
TOTAL SAMPLES =
1,807
STORAGE DEV DEVICE GROUP NUM TYPE SGGDG SGGDG SGGDG SGGDG SGCCBS SGCCBS SGDBSWN SGDBSWN SGSW : : SGSW SGCRITC 0340 0341 0342 0343 0344 0345 0346 0347 0348 0349 33903 33903 33903 33903 33903 33903 33903 33903 33903 33903
NUMBER VOLUME PAV OF CYL SERIAL 200 200 200 200 200 200 200 200 200 200 IDP386 IDP387 IDP388 IDP389 CCB178 CCB179 DBSW11 DBSW12 IDP302 PLEXP1
DEVICE AVG AVG AVG AVG LCU ACTIVITY RESP IOSQ CMR DB RATE TIME TIME DLY DLY 0056 0056 0056 0056 0056 0056 0056 0056 0056 0056 0.139 0.001 3.311 0.001 0.150 3.703 0.959 0.003 0.003 0.002 18.4 0.9 22.4 0.9 12.8 23.2 2.3 0.9 0.9 0.7 0.0 0.0 0.0 0.0 0.0 11.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.2 0.0 0.0 0.0
037E 33903 037F 33903
200 200
IDP303 IDP076 LCU
0056 0056
0.003 0.003
0.8 0.8 13.0
0.0 0.0 0.5
0.0 0.0 0.0
0.0 0.0 0.2
0.3 0.2 0.7
0.0 0.0 3.6
0.5 0.6 8.2
0.00 0.00 0.57
0.00 0.00 0.82
0.0 0.0 0.0
0.0 1.0 137
100.0 100.0 100.0
0.0 0.0 0.0
0056 44.580
Figure 29. Monitor I DASD Activity Report
Rule-of-Thumb Tape LCU AVG RESP TIME Look at the ratio of LCU DISC + PEND time to CONN time. When the DISC+PEND exceeds CONN for extended periods of time, this is an indication of channel and/or control unit contention.
Report Analysis Several devices 340, 342, 344, or 345 have extraordinary long response times and need further investigation. (Which workloads are using these devices? How much are they being delayed? Are these response times typical for my system?). See Chapter 4, Analyzing I/O activity, on page 81 for further analysis.
50
System indicators
Figure 30. DASD Summary. Spreadsheet Reporter macro RMFR9DAS.XLS (DASD Activity Report - System)
In the DASD Summary report, you see at a glance the key data of your DASD subsystem: performance values as well as capacity data. If you feel that you might see some more details, you can have a look on the ten most busy DASD volumes in your system:
51
System indicators
Figure 31. Response Time of Top-10 Volumes. Spreadsheet Reporter macro RMFR9DAS.XLS (DASD Activity Report Top10RT)
52
System indicators
Paging Activity - Swap Placement Activity

P A G I N G z/OS V1R11 SYSTEM ID OS04 RPT VERSION V1R11 RMF S W A P A C T I V I T Y DATE 06/05/2009 TIME 09.29.00 A C T I V I T Y *---------------- EXPANDED STORAGE ----------------* MIGRATED FROM EXP STOR 0 0.00 0.0% 0 0.00 0.0% 1 0.00 4.2% 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 1 0.00 1.3% LOG SWAP /EXP STOR EFFECTIVE 1,558 1.73 100.0% 2,081 2.31 100.0% 2,917 3.24 100.0% 6 0.01 100.0% 3 0.00 100.0% 8 0.01 100.0% 153 0.17 100.0% 6,726 7.48 100.0% INTERVAL 14.59.996 CYCLE 1.000 SECONDS
OPT = IEAOPT02
MODE=ESAME
P L A C E M E N T
*------ AUX STORAGE -------* *---LOGICAL SWAP--* AUX STOR TOTAL 0 0.00 0.0% 0 0.00 0.0% 1 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 1 0.00 0.0% AUX STOR AUX STOR VIA DIRECT TRANSITION LOG SWAP 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 1 0.00 100.0% 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 1 0.00 100.0% 1,558 1.73 100.0% 2,081 2.31 100.0% 2,918 3.24 100.0% 6 0.01 100.0% 3 0.00 100.0% 0 0.00 0.0% 153 0.17 100.0% 6,719 7.47 99.9%
TOTAL TERMINAL INPUT/OUTPUT WAIT LONG WAIT CT RT % CT RT % CT RT % CT RT % CT RT % CT RT % CT RT % CT RT % 1,558 1.73 23.2% 2,081 2.31 30.9% 2,918 3.24 43.4% 6 0.01 0.1% 3 0.00 0.0% 8 0.01 0.1% 153 0.17 2.3% 6,727 7.48 100.0%
LOG SWAP EFFECTIVE 1,535 1.71 98.5% 2,060 2.29 99.0% 2,894 3.22 99.2% 6 0.01 100.0% 3 0.00 100.0% 0 0.00 0.0% 150 0.17 98.0% 6,648 7.39 98.9%
EXP STOR DIRECT 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 0 0.00 0.0% 8 0.01 100.0% 0 0.00 0.0% 8 0.01 0.1%
EXP STOR TOTAL 23 0.03 1.5% 21 0.02 1.0% 24 0.03 0.8% 0 0.00 0.0% 0 0.00 0.0% 8 0.01 100.0% 3 0.00 2.0% 79 0.09 1.2%
EXP STOR EFFECTIVE 23 0.03 100.0% 21 0.02 100.0% 23 0.03 95.8% 0 0.00 0.0% 0 0.00 0.0% 8 0.01 100.0% 3 0.00 100.0% 78 0.09 98.7%
DETECTED WAIT
UNILATERAL
EXCHANGE ON RECOMMENDATION VALUE TRANSITION TO NONSWAPPABLE OMVS INPUT WAIT TOTAL
AUXILIARY STORAGE - AVERAGE PAGES PER SWAP OUT OCCURRENCES OF TERMINAL OUTPUT WAIT 87
AVERAGE PAGES PER SWAP IN -
257
Figure 32. Paging Activity report - Swap Placement Activity
Rule-of-Thumb LOG SWAP/EXP STOR EFFECTIVE This is the percentage of all swaps that are satisfied from processor storage. You want the TOTAL value under this column to be a high number (e.g. over 95 %) to keep swap delay times low. The percentage under TERMINAL INPUT/OUTPUT WAIT should also be high, as this is where most TSO swaps are reflected.
53
System indicators
Report Analysis The LOG SWAP/EXP STOR EFFECTIVE values for v TOTAL: 100% v TERMINAL INPUT/OUTPUT WAIT: 100% are perfect. See Chapter 5, Analyzing processor storage activity for further analysis.
Page/Swap Data Set Activity report

P A G E / S W A P z/OS V1R11 SYSTEM ID OS04 RPT VERSION V1R11 RMF 898 D A T A S E T A C T I V I T Y INTERVAL 14.59.996 CYCLE 1.000 SECONDS
DATE 06/05/2009 TIME 09.29.00
NUMBER OF SAMPLES =
PAGE DATA SET USAGE ------------------------% PAGE IN TRANS USE TIME 0.11 0.11 2.67 3.01 3.01 3.23 3.12 0.200 0.027 0.004 0.005 0.005 0.005 0.005 V NUMBER PAGES I IO REQ XFER'D O DATA SET NAME 6 38 711 771 821 726 766 5 37 5,697 5,711 5,625 5,523 5,548 PAGE.OS04.APLPA PAGE.OS04.ACOMMON PAGE.OS04.LOCAL01 PAGE.OS04.LOCAL02 PAGE.OS04.LOCAL03 PAGE.OS04.LOCAL04 PAGE.OS04.LOCAL05
PAGE SPACE TYPE PLPA COMMON LOCAL LOCAL LOCAL LOCAL LOCAL
VOLUME SERIAL PAGP11 PAGP10 PAGP12 PAGP13 PAGP14 PAGP15 PAGP16
DEV NUM 0378 0377 0379 037A 037B 038D 039F
DEVICE TYPE 33903 33903 33903 33903 33903 33903 33903
SLOTS ALLOC
---- SLOTS USED --MIN MAX AVG
BAD SLOTS 0 0 0 0 0 0 0
45000 16466 16466 16466 21240 5345 5378 5355 566820 67670 70965 69140 566820 65887 69007 67190 566820 66611 69723 67928 566820 66296 69563 67618 566820 66390 69311 67560
N N Y Y Y
Figure 33. Monitor I Page/Swap Data Set Activity Report
Rules-of-Thumb % IN USE This is really the % busy for the data set. Above 30% you may start to see response times increase. However, you also need to consider other active data sets on the same volume; check the total device utilization on the device activity report. PAGES XFERD The total number of pages transferred to and from a given page/swap dataset. Divide this number by the number of INTERVAL seconds to get the rate in pages/second. Above 30 pages/second for a single data set may mean it is time to add another page pack, dedicate the packs to paging (or better yet, look into more processor storage).
Report Analysis If we check the two highest values in the report, we get: v % IN USE: 3.23% - no problem seen v PAGES XFER'D: 5711 - this is equivalent to 6 pages/second. See Chapter 5, Analyzing processor storage activity for further analysis.
54
System indicators

REPORT BY: POLICY=SPECIAL WORKLOAD=ONLINE SERVICE CLASS=TSO ---SERVICE--IOC 30100 CPU 357.3K MSO 203.7K SRB 56334 TOT 647.4K /SEC 1,075 ABSRPTN 1,244 TRX SERV 856 RESOURCE GROUP=*NONE SERVICE TIME CPU 32.300 SRB 5.100 RCT 0.000 IIT 0.100 HST 0.000 AAP 0.000 IIP 0.000 PERIOD=1 IMPORTANCE=2
-TRANSACTIONS- TRANS-TIME HHH.MM.SS.TTT AVG 1.25 ACTUAL 96 MPL 0.86 EXECUTION 96 ENDED 4,945 QUEUED 0 END/SEC 8.21 R/S AFFIN 0 #SWAPS 4,952 INELIGIBLE 0 EXCTD 0 CONVERSION 0 AVG ENC 0.00 STD DEV 119 REM ENC 0.00 MS ENC 0.00 ... REPORT BY: POLICY=SPECIAL -TRANSACTIONSAVG 60.79 MPL 55.66 ENDED 14,418 END/SEC 23.95 #SWAPS 14.8K EXCTD 0 AVG ENC 0.00 REM ENC 0.00 MS ENC 0.00 TRANS-TIME HHH.MM.SS.TTT ACTUAL 749 EXECUTION 652 QUEUED 96 R/S AFFIN 0 INELIGIBLE 0 CONVERSION 0 STD DEV 5.769
--DASD I/O-SSCHRT 10.0 RESP 15.4 CONN 2.8 DISC 2.5 Q+PEND 10.2 IOSQ 0.0
---APPL %--- --PROMOTED-- ----STORAGE---CP 15.00 BLK 0.000 AVG 124.31 AAPCP 0.00 ENQ 0.000 TOTAL 107.44 IIPCP 0.00 CRM 0.000 SHARED 0.00 LCK 0.000 AAP 0.00 -PAGE-IN RATESIIP 0.00 SINGLE 16.7 BLOCK 6.2 SHARED 0.0 HSP 0.0
--DASD I/O-- ---SERVICE--- SERVICE TIME ---APPL %--SSCHRT 368.7 IOC 1106K CPU 509.900 CP 105.40 RESP 12.4 CPU 5630K SRB 124.900 AAPCP 0.00 CONN 2.4 MSO 9228K RCT 0.200 IIPCP 0.00 DISC 2.1 SRB 1380K IIT 1.300 Q+PEND 7.9 TOT 17343K HST 0.000 AAP 0.00 IOSQ 0.0 /SEC 28,819 AAP 0.000 IIP 0.00 IIP 0.000 ABSRPTN 517 TRX SERV 474
--PROMOTED-- ----STORAGE---BLK 0.000 AVG 152.18 ENQ 0.000 TOTAL 8471.98 CRM 0.000 SHARED 0.00 LCK 0.000 -PAGE-IN RATESSINGLE 0.9 BLOCK 0.5 SHARED 0.0 HSP 0.0
Figure 34. Workload Activity report
Rules-of-Thumb APPL% CP This value tells you how much CPU time was captured for the workload in a given service class. You can use this information for several purposes: v Compute a capture ratio (CR) for your system v Input to capacity planning v Check for a single-CP constraint Remember that the APPL% CP is given as percent of a single engine, so you will need to divide by the number of engines if your system has more than one. Most MVS systems today have capture ratios greater than 80%. If yours is below that, you may want to investigate further. PAGE-IN RATES For any workload that is sensitive to paging (for example CICS), check the SINGLE and BLOCK paging values to be sure that no paging is taking place. The amount of paging that these workloads can sustain is up to you, but the best answer is probably zero.
55
System indicators
Report Analysis v Calculation of the capture ratio:

APPL% CP 105.4 Capture ratio = ---------------- = ----------- = 0.71 MVS BUSY % * #CP 73.88 * 2
This value is somewhat low. If your system is running in an PR/SM environment, the calculation has to be performed using LPAR BUSY and the number of logical processors. v The paging rates for TSO are SINGLE BLOCK 16.7 page-ins per second 6.2 page-ins per second
This may be worth further investigation. See Chapter 3, Analyzing processor activity and Chapter 5, Analyzing processor storage activity for further analysis.
56
System indicators
Virtual Storage Activity report

V I R T U A L S T O R A G E A C T I V I T Y PAGE 1 z/OS V1R11 SYSTEM ID OS04 RPT VERSION V1R11 RMF DATE 06/05/2009 TIME 09.29.00 INTERVAL 14.59.996 CYCLE 1.000 SECONDS
COMMON STORAGE SUMMARY NUMBER OF SAMPLES STATIC AREA EPVT ECSA EMLPA EFLPA EPLPA ESQA ENUC ----- 16 MEG NUCLEUS SQA PLPA FLPA MLPA CSA PRIVATE PSA 90 ALLOCATED CSA/SQA ------------ BELOW 16M -------------- ------- EXTENDED (ABOVE 16M) -------MIN MAX AVG MIN MAX AVG SQA 2076K 09.34.31 2100K 09.35.51 2084K 59.5M 09.35.01 59.7M 09.30.41 59.6M CSA 1744K 09.29.11 1748K 09.29.01 1747K 151M 09.31.42 152M 09.40.31 152M ALLOCATED CSA BY KEY 0 952K 09.29.11 956K 09.29.01 1 44K 09.29.01 44K 09.29.01 2 40K 09.29.01 40K 09.29.01 3 0K 09.29.01 0K 4 28K 09.29.01 28K 09.29.01 5 8K 09.29.01 8K 09.29.01 6 180K 09.29.01 184K 09.32.11 7 432K 09.29.01 432K 09.29.01 8-F 60K 09.29.01 60K 09.29.01 SQA EXPANSION INTO CSA 0K 09.29.01 0K
STORAGE MAP ADDRESS SIZE 17600000 1674M ADA1000 200M A984000 4212K A981000 12K 6B7B000 62.0M 26AD000 68.8M 1000000 22.7M BOUNDARY -----FC9000 220K D2F000 2664K B23000 2096K B22000 4K ABB000 412K 800000 2796K 1000 8188K 0 4K
952K 44K 40K 0K 28K 8K 183K 432K 60K
18.4M 1168K 5024K 60K 7576K 3896K 89.6M 24.9M 976K
09.34.51 09.29.01 09.29.01 09.29.01 09.31.21 09.29.01 09.30.21 09.33.11 09.29.01
18.4M 1180K 5024K 64K 7976K 3904K 90.1M 25.1M 984K
09.34.21 09.41.11 09.29.01 09.33.41 09.42.31 09.37.21 09.30.01 09.40.21 09.39.31
18.4M 1170K 5024K 60K 7673K 3899K 89.8M 25.0M 978K
0K
0K 09.29.01
0K
0K
PLPA INTERMODULE SPACE 6K IN PLPA AND 187K IN EPLPA PLPA SPACE REDUNDANT WITH MLPA/FLPA 2K IN PLPA AND 3053K IN EPLPA
---------- BELOW 16M ---------------MIN MAX AVG CSA FREE PAGES (BYTES) LARGEST FREE BLOCK ALLOCATED AREA SIZE SQA FREE PAGES (BYTES) LARGEST FREE BLOCK ALLOCATED AREA SIZE 1048K 09.29.01 1052K 09.29.11 1048K 1048K 09.29.01 1052K 09.29.11 1048K 1744K 09.29.11 1748K 09.29.01 1747K 564K 09.35.51 360K 09.29.01 2304K 09.29.01 588K 09.34.31 360K 09.29.01 2304K 09.29.01 579K 360K 2304K
--------------- ABOVE 16M ----------MIN MAX AVG 48.3M 09.40.31 48.1M 09.29.01 152M 09.29.01 9352K 09.30.41 8116K 09.29.01 68.8M 09.29.01 49.2M 09.31.42 48.1M 09.29.01 152M 09.29.01 9568K 09.35.01 8116K 09.29.01 68.8M 09.29.01 48.8M 48.1M 152M 9447K 8116K 68.8M
MAXIMUM POSSIBLE USER REGION -
7996K BELOW AND 1666M ABOVE
Figure 35. Monitor I Virtual Storage Activity Report - Common Storage Summary
57
System indicators
Rules-of-Thumb Maximum Allocated Percent Calculate this value for the CSA, the ECSA, the SQA, and the ESQA. For example, for the CSA:
CSA ALLOCATED AREA SIZE Maximum Allocated Percent(CSA)= ------------------------ * 100 CSA SIZE
For CSA and ECSA, this value should be less than 65%. The virtual storage that is most volatile is the SQA and CSA, both above and below the 16MB line. This is a number that is readily tracked on the Virtual Storage report on either a daily or weekly basis. RMF calculates the size of this area as the difference between the highest and lowest address occupied by allocated storage, and includes all free blocks that lie between allocated blocks. Significant segmentation causes this number to be much larger than the amount of storage actually used. If the size allocated for ESQA is too small or is used up, the system attempts to steal pages from the ECSA. When both areas are full, the system allocates space from the SQA and CSA below the 16MB line. SRM will attempt to reduce the demand for resources. If no storage is available, the result is a system failure. If the size of SQA below 16M is too small, additional data will spill into the CSA below the 16MB line. If no storage is available, the result is a system failure. v Is CSA/ECSA allocated percent high? Review ECSA, CSA, ESQA, and SQA sizes Increase CSA size: cleanup LPA, split address spaces (MRO for example), use Version/Release products that exploit ESA, or convert applications to exploit ESA (for example to Cobol II) v Is SQA/ESQA allocated percent low? ESQA or SQA may be too large. v Is SQA EXPANSION INTO CSA low? SQA size may be too large; you may be able to decrease SQA size and increase CSA size. Using the Monitor III Common Storage report is a very good way to understand which AS has the storage.
Report Analysis The maximum allocated percent for the CSA is:
1744K ----- * 100 = 62.4% 2796K
That concludes there is no problem in this area. The SQA EXPANSION INTO CSA is zero, that means the SQA is large enough.
58
System indicators
Where do you go from here?

By now you will have found the major cause of your performance problem. The following chapters will go into more detail to help you further.
As you know, performance tuning is an iterative process and bear in mind: v Take a base measurement as your starting point v Only change one thing at a time v Predict how your key measurements will change v Measure your change against your expectation
59
System indicators
60
Chapter 3. Analyzing processor activity
Lets Analyze Your Processors This chapter discusses how to analyze a processor problem. It takes the following approach: v What RMF values can you look at in a performance problem situation? v What do these fields mean? v What can you do to solve the problem? We make recommendations for two cases: An application being delayed by higher priority work An application using excessive CPU, or overall CPU under stress As you continue through this chapter, bear in mind that tuning the processor is generally a high-effort/low-return activity. Quick fixes are rare. However, we will discuss the tuning activities that can apply.
61
Processor analysis
Do you have a processor problem?

There are many views on what constitutes a processor performance problem. You should conclude that you have a processor performance problem because your: v Service level objectives are being exceeded. v Users are complaining about slow response. v AND, the CPU indicators (discussed in the following section) show signs of stress. Or better still, you see a processor delay directly in Monitor III. We suggest there are two main ways to approach this problem analysis, depending on how you are alerted: v If a user or group of users complains about response times then start with Monitor III. v If your normal daily or weekly performance tracking procedures show a trend or problem developing, then start with Monitor I. Table 2 shows for each monitor, which fields in which RMF reports can detect and resolve a processor performance problem.
Table 2. Processor Indicators in Monitor III and Monitor I REPORTS Monitor III GROUP DELAY WFEX PROC CPC Monitor I CPU Activity Partition Data DELAY Fields Avg Delay PROC %Delayed for PRC Speed *PROC DLY% USG% Appl% CP WLM Capping % BUSY TIME PERCENTAGE PHYSICAL PROCESSOR TOTAL USING Fields AVG USG PROC

This section lists the Monitor III indicators that may be used to alert you to a potential processor problem.
62
Mon III - Processor analysis
GROUP report
RMF V1R11 Command ===> Samples: 100 System: PRD1 Date: 04/07/09 Time: 10.32.00 Group Response Time Scroll ===> HALF Range: 100 Sec
Class: BATCH Period: 1 Primary Response Time Component:
Description: Primary Batch Using the processor TRANS Ended Rate 0.0 --- Response Time ----- Ended TRANS-(Sec) WAIT EXECUT ACTUAL 830.5 152.7 983.3
WFL % 60
Users TOT ACT 30 3
Frames %ACT 25
EXCP Rate 2.1 USGDEV 0.10 4.67
PgIn Rate 0.5
This report presents information for a specific service class period.
DELAY Indicator Field: Average Delay PROC for Response Time ACT Description: This is a component of average response time, showing the average number of seconds, transactions active during the range period in the specific service class period, were delayed by the lack of CPU. It is the CPU queue time. Guideline: Look at the different Average Delay values, in the line Response Time ACT. If the PROC value is the largest, continue your processor investigation. Look at the different service classes (coming back to the Primary Panel and choosing option 3) and start with the service class showing the largest value. Problem Area: This points out processor delays for the service class listed. Potential Solution: v See DELAY for processor: Guidelines for improvement on page 73 v See Processor USING: Trimming activities on page 76
63
USING Indicator Field: AVG USG PROC for Response Time ACT Description: This is a component of average response time, namely the average number of seconds, transactions belonging to the service class were using the CPU during the range period. Guideline: Check all the service classes and start with that one showing the average largest processor consumption. Problem Area: This points out processor consumption. Potential Solution: v See Processor USING: Trimming activities on page 76
DELAY report
Name BHOLEQB CATALOG BAJU BHOLNAM4 JES2 BHOLPRO1 BHOLPRO2 RMF *MASTER* BHOL CONSOLE BHOLSTO9 VTMLCL
Service CX Class B S T B S B B S S T S B S BATCH SYSSTC TSO BATCH SYSSTC BATCH BATCH SYSSTC SYSTEM TSO SYSSTC BATCH SYSSTC
Cr
WFL USG DLY IDL UKN ----- % Delayed for ----- Primary % % % % % PRC DEV STR SUB OPR ENQ Reason 0 0 14 62 67 72 74 75 92 97 100 100 100 51 0 2 31 13 72 74 3 23 29 2 1 1 0 3 12 19 7 28 26 1 2 1 0 0 0 49 0 85 0 0 0 0 0 0 67 0 99 0 0 0 97 0 1 0 32 18 83 1 0 28 0 26 96 1 76 0 3 1 98 0 0 0 99 0 0 0 0 3 0 12 0 1 6 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 51 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 SYSDSN LOCL LOCL BHOLPRO2 SYSPAG BHOLPRO2 BHOLPRO1 BHOLPRO2 SYSPAG CONSOLE
This is a good report to look at for a quick snapshot of Who is delayed for CPU and for how long. The report is sorted by workflow percentage (WFL % column) in ascending order. Check this report to see if any priority work is being delayed.
64
DELAY Indicator Field: % Delayed for PRC Description: This gives for each address space (AS), the percentage of the measured time that a transaction running in that AS was delayed by the lack of CPU. Guideline: Look at the largest %Delayed for PRC value. Compare it with the DLY % value of the same job. If the PRC value is the main part of the delay, look at this job. You can put the cursor on the line describing the job that may be a problem and press ENTER: a new panel is displayed (see Figure 43 on page 73) that lists the main cause of the delay and suggests some possible actions to reduce the delay. Problem Area: This points out a delay for processing. Potential Solution: v See DELAY for processor: Guidelines for improvement on page 73 v See Processor USING: Trimming activities on page 76
WFEX report
--------------------------- Speed (Workflow) --------------------------------Speed of 100 = Maximum, 0 = Stopped Average CPU Util: 100 % Name Users Active Speed Name Users Active Speed *SYSTEM 41 4 63 ALL TSO 2 0 70 ALL STC 32 1 79 ALL BATCH 7 3 59 *PROC 24 3 73 *DEV 6 1 33 *MASTER* 1 0 92 ------------------------------ Exceptions ------------------------------------Name Reason Critical val. Possible cause or action *ECSA* SECS% > 85 90.6 % System ECSA usage 91 %. *STOR TSQAO > 0 784K bytes SQA overflow into CSA 784K. BHOLEQB ENQ -SYSDSN 51.0 % delay ENQ.TEST.DAT Figure 38. Monitor III Workflow/Exceptions Report
65
DELAY Indicator Field: *PROC Speed (Workflow) Description: This is a good global indicator for CPU performance. If equal to 100%, no work has been delayed by the CPU. If equal to 50%, this means that half of the attempts to use CPU were denied because the CPU was busy (that is, five out of ten ready AS were being delayed). This field does not deal with how busy the CPU is, but rather the amount of delay that exists. Guideline: If this value is low (less than 40%), it may be due to CPU delay. If this value is greater than 80%, there is no CPU problem. Between these two values, it depends on the characteristics of the environment. Problem Area: This points out a delay for processing. Potential Solution: v See DELAY for processor: Guidelines for improvement on page 73 v See Processor USING: Trimming activities on page 76
PROC report
RMF V1R11 Processor Delays Command ===> Samples: 60 System: MVS1 CPU DLY Type % CP CP AAP CP AAP CP CP AAP CP CP AAP CP CP IIP 11 4 6 0 6 2 0 2 2 0 2 2 0 0 Date: 11/28/09 USG EAppl % % 46 4 0 0 0 6 2 2 4 0 0 0 78 2 59.4 42.5 98.4 5.3 7.7 0.8 1.9 0.7 1.2 0.1 0.4 0.2 111.0 21.3 Time: 09.10.00 Line 1 of 138 Scroll ===> HALF Range: 60 Sec
Jobname WSWS7 WSP1S2FS WSP1S6FS DBS3DBM1 WSP1S6F U078069 WSP1S4F U078068 DBS3DIST
Service CX Class O OMVS SO WASCR SO WASCR S DB2HIGH SO WASCR O OMVS SO WASCR O OMVS SO DB2HIGH
----------- Holding Job(s) ----------% Name % Name % Name 9 *ENCLAVE 2 DBS3DIST 6 *ENCLAVE 6 *ENCLAVE 2 XCFAS 2 *ENCLAVE 2 WSWS7 2 WSP1S6F 2 XCFAS 7 DBS3DIST 2 WSWS7 7 WSP1S2F 2 VTAM44
2 DBS3DIST
2 WSP1S2F
2 DBS3DIST
2 U078069
2 WSWS7
2 *ENCLAVE
Figure 39. Monitor III Processor Delays Report
This report displays all AS waiting for or using the processor during the range period. The report is sorted by descending overall delay percentages: the first line is the job you need to look at first.
66
DELAY Indicator Field: DLY % Description: The percentage of time an AS is delayed because of contention for the processor during the range period. For the multitask AS, RMF reports only one delay when several tasks are delayed at the same time. Guideline: Use the cursor sensitivity: put the cursor on the name of the AS you want to analyze (usually we start with the most delayed job). Hence the Job Delay report (as in Figure 43 on page 73) is displayed; it allows further investigation as explained later in this chapter. Problem Area: Points out a processor delay. Potential Solution: v Start by identifying the holding jobs that are causing the most delay v See Determine CPU holders on page 73
USING Indicator Field: USG % and EAppl% Description: v USG %: The percentage of time an AS is using the processor during the range period. This is a sampled single state value. v EAppl%: Percentage of CPU time as sum of TCB time, global and local SRB time, preemptable or client SRB time, and enclave CPU time consumed within this address space. This is a measured multi-state value: if the address space is using more than one processor, this value can exceed 100%. Guideline: Start with the AS with the largest EAppl% value, then look at its USG% value: if the USG% value is large, you have a heavy processor consumer. Problem Area: Points out processor consumption. Potential Solution: v See Processor USING: Trimming activities on page 76
67
Mon I - Processor analysis
Monitor I indicators
This section describes the Monitor I indicators that may be used to alert you to a potential processor problem.
CPU Activity report

C P U A C T I V I T Y PAGE 1 z/OS V1R11 SYSTEM ID OS04 RPT VERSION V1R11 RMF H/W MODEL E26 DATE 06/05/2009 TIME 09.29.00 INTERVAL 14.59.996 CYCLE 1.000 SECONDS
CPU
2097
MODEL
720
SEQUENCE CODE 00000000000473BF LOG PROC SHARE % 35.0 35.0 35.0 35.0 35.0 35.0 35.0 35.0 35.0 35.0 35.0 385.0 7.1 7.1 14.2
HIPERDISPATCH=NO --I/O INTERRUPTS-RATE % VIA TPI 382.8 249.3 273.4 534.1 519.6 874.6 866.3 847.9 828.6 833.0 816.1 7026 0.38 0.43 0.35 0.20 0.21 0.17 0.18 0.16 0.16 0.16 0.15 0.20
---CPU--NUM TYPE 0 1 2 3 4 5 6 7 8 9 A CP
---------------- TIME % ---------------ONLINE LPAR BUSY MVS BUSY PARKED 92.55 75.66 72.95 70.05 68.05 91.00 90.26 89.30 88.19 86.92 85.46 82.76 1.83 1.51 1.50 94.06 96.81 99.04 96.13 99.66 97.42 96.57 97.19 99.43 98.09 96.01 97.31 1.31 0.84 0.86 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
CP 100.00 CP 100.00 CP 100.00 CP 100.00 CP 100.00 CP 100.00 CP 100.00 CP 100.00 CP 100.00 CP 100.00 CP 80.00 TOTAL/AVERAGE
B AAP 100.00 C AAP 100.00 AAP AVERAGE
0.00 0.00
SYSTEM ADDRESS SPACE ANALYSIS
SAMPLES =
900
--------NUMBER OF ADDRESS SPACES--------QUEUE TYPES IN IN READY OUT READY OUT WAIT LOGICAL OUT RDY LOGICAL OUT WAIT ADDRESS SPACE TYPES BATCH STC TSO ASCH OMVS 42 100 16 0 0 88 185 16 0 38 74.2 138.7 16.0 0.0 15.6 MIN 86 2 0 1 0 32 MAX 181 67 18 48 16 135 AVG 134.5 13.5 0.2 24.3 0.2 85.3
-----------------------DISTRIBUTION OF IN-READY QUEUE-----------------------NUMBER OF ADDR SPACES <= = = = <= <= <= <= <= <= <= <= > N N N N N N N N N N N N N 0 10 20 30 40 50 60 70 80 90 100 !....!....!....!....!....!....!....!....!....!....! >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>> >>> >> >>> >>>>> >>> >> >> > >
(%) 58.8 6.4 4.3 3.9 4.9 8.3 4.5 3.7 3.2 1.2 0.8 0.0 0.0
+ + + + + + + + + + + +
1 2 3 5 10 15 20 30 40 60 80 80
N = NUMBER OF PROCESSORS ONLINE ( 10.8 ON AVG)
Figure 40. Monitor I CPU Activity Report
68
USING Indicator Field: BUSY TIME PERCENTAGE Description: The percentage of the interval time that the processor was busy executing instructions; it includes the non-captured CPU-Time. Guideline: If this value is greater than 85% AND the IN READY QUEUE value for <= N is below 80%, as is the case in the sample report from Figure 40 on page 68, then you have some work with CPU delay. Note: Depending on your environment, the guidelines may be very different: v In a batch or scientific environment, a 100% CPU busy is usually not a problem. v In a transaction environment, user response times usually increase when CPU busy goes over 80%. The type of the application and the number of CPs available to it can influence how high this CPU busy value can go before response time suffers. Problem Area: Points out processor consumption. Potential Solution: v See Processor USING: Trimming activities on page 76 Note: BUSY TIME PERCENTAGE by itself is not enough as a CPU indicator. You must remember that the distribution of the arrival rate of work demanding CPU is not constant. So even at a low utilization level, a queue may build up. The IN READY value <= N gives the percentage of time when no contention for CPU was detected, that is, the number of AS with a ready process (TCB or SRB) was equal to or less than the number of CPs. In our example, as the total is 58.8%, then 41.2% of the time (100%-58.8%) you have work delayed. This is not to say that you cannot run a z/OS system at higher utilizations, on up to 100% busy. You can. Just be aware that the busier the CPU, the longer the CPU delay for lower-priority workloads will be. The more low-priority work you have, and the fewer non-CPU bottlenecks you have (e.g. responsive I/O), the busier you will be able to drive your CPU and still maintain good response for your high-priority workloads.
69
Figure 41. CPU Contention Report. Spreadsheet Reporter macro RMFY9SUM.XLS (System Overview Report OneCpuCont)
This report provides some more detailed data on CPU contention. In addition to the contention value which is given by the percentage of time when at least one job was ready and waiting for the processor, you see here the percentages when at least two or three jobs were waiting.
70

P A R T I T I O N D A T A R E P O R T PAGE z/OS V1R11 SYSTEM ID NP1 RPT VERSION V1R11 RMF DATE 11/28/09 TIME 09.30.00 INTERVAL 14.59.678 CYCLE 1.000 SECONDS 2
9 7 2
N/A N/A N/A
--------- PARTITION DATA ----------------- -- LOGICAL PARTITION PROCESSOR DATA -----MSU---- -CAPPING-- PROCESSOR- ----DISPATCH TIME DATA---WGT DEF ACT DEF WLM% NUM TYPE EFFECTIVE TOTAL 20 100 1 0 10 5 300 95 200 50 10 1 8 155 52 NO YES NO NO NO 62.2 1.2 0.0 4 3.3 1.0 0.0 6.0 0.0 4 CP CP CP CP CP 00.04.27.519 00.00.22.083 00.03.35.859 01.12.06.405 00.24.11.311 00.00.03.103 ------------ -----------01.44.41.295 01.44.46.280 00.14.59.611 00.14.59.625 00.00.00.000 00.00.00.000 00.00.00.065 ------------ -----------00.04.27.302 00.00.21.680 00.03.35.761 01.12.05.405 00.24.11.147
-- AVERAGE PROCESSOR UTILIZATION PERCENTAGES -LOGICAL PROCESSORS --- PHYSICAL PROCESSORS --EFFECTIVE TOTAL LPAR MGMT EFFECTIVE TOTAL 24.86 0.60 23.97 80.10 40.30 24.92 0.61 23.98 80.15 40.31 0.01 0.01 0.02 0.01 0.01 0.05 -----0.11 0.01 0.00 0.03 -----0.04 4.24 4.25 0.34 0.35 3.41 3.43 68.68 68.69 23.02 23.03 0.05 ------ -----99.69 99.80 99.95 99.96 0.00 0.00 0.03 ------ -----99.95 99.99
S A A A A A
DED DED
1 1
ICF ICF
99.98 0.00
99.99 0.00
Figure 42. Monitor I CPU Activity Report - Partition Data Report Section
This report provides data about configured partitions in LPAR mode. Only partitions active at the end of the duration interval are reported.
USING Indicator Field: PHYSICAL PROCESSORS TOTAL Description: The total amount of time the physical processor resource was assigned to a partition AND to the management of the LPAR itself. The partition identified by the name *PHYSICAL* is not a configured partition: this is a way to report uncaptured time which was used by LPAR but could not be attributed to a specific logical partition. Guideline: If the total physical processor complex is 100% busy for one partition, this LPAR could be constrained by other LPARs. Problem Area: Points out processor consumption. Potential Solution: Check the partition weights, capping, and number of CPs defined, to be sure you are allocating the CP resources the way you intended. v See Appendix A. PR/SM LPAR considerations
71
Processor analysis
Is your processor idle due to other bottlenecks?

If your system shows no processor indicators under stress, that does not mean you dont have a potential processor problem: You may have some other bottleneck in the system preventing work from using the processor. If your daily processor-busy peaks start to plateau, before reaching 100%, this could be an indication of such a bottleneck. The major ones are: v Storage v I/O v ENQ v Operator v Network If you think this might be the case, see the other topics for guidance on finding bottlenecks.
72
Processor analysis
DELAY for processor: Guidelines for improvement

We assume in this section that you have found that your applications primary delay is due to the processor. This means that other workloads are using the processor when your application wanted it. In this section, we will discuss what can be done to decrease processor delay for a workload. The section Processor USING: Trimming activities on page 76 is worth checking when you have a delay problem as well: the delay can be caused by an AS using too much of the processor. Trimming this AS solves the delay problem.
Determine CPU holders

Bring up the Monitor III Processor Delays report (shown in Figure 39 on page 66) and use cursor sensitive control. By positioning the cursor on the line describing the job you want to look at, and pressing ENTER, a new panel is displayed, on which the AS delaying the workload are named, as shown in Figure 43.
JOB report
RMF V1R11 Job Delays Command ===> Samples: 100 Job: BHOLPRO2 Line 1 of 1 Scroll ===> HALF Sec
System: PRD1 Date: 04/07/09 Time: 10.32.00 Range: 100 Primary delay: Job is waiting to use the processor.
Probable causes: 1) Job BHOLPRO1 may be looping. 2) Higher priority work is using the system. 3) Improperly tuned dispatching priorities. ------------------------- Jobs Holding the Processor -------------------------Job: BHOLPRO1 Job: BHOL Job: *MASTER* Holding: 4% Holding: 3% Holding: 1% PROC Using: 98% PROC Using: 3% PROC Using: 1% DEV Using: 0% DEV Using: 8% DEV Using: 4% --------------------------- Job Performance Summary --------------------------Service WFL -Using%- DLY IDL UKN ----- % Delayed for ------ Primary CX ASID Class P Cr % PRC DEV % % % PRC DEV STR SUB OPR ENQ Reason B 0033 BATCH 1 94 94 0 6 0 0 6 0 0 0 0 0 BHOLPRO1F
Figure 43. Monitor III Job Delays Report
Some probable causes of trouble are mentioned as well. This is the best starting point. See which AS are delaying your workload.
73
Processor analysis
Review dispatching priorities

The dispatching priority (DP) of each address space is specified dynamically according to the goals of the respective service class. You should reconsider your service policy definition if the workload delaying your application a has higher dispatching priority. The Monitor III ASD report shows address space dispatching priorities (DP PR).
ASD report
RMF - ASD Address Space Data Report Command ===> CPU= 67/ 55 UIC=2540 PR= 13:43:53 S C R DP JOBNAME SRVCLASS P L LS PR ANTAS000 MASTER* PCAUTH RASP TRACE GRS DUMPSRV CONSOLE ALLOCAS LLA VTAM NETVS NETVIEW SMF RMF33 DFHSM AMSAQTS VLF JES2 SMS JOEUSER STCDEF SYSTEM STCDEF STCDEF STCDEF SYSTEM SYSTEM SYSTEM SYSTEM SYSSTC SYSSTC SYSSTC SYSSTC SYSTEM SYSSTC SYSTEM SYSTEM SYSTEM SYSTEM SYSTEM BTCHPRD 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS WT TI F1 FF 74 FF 76 FF FF FF 71 FC FD FC FC FF FF 7D 74 78 FE 79 FF CS F 871 141 26 48 78 1381 31 62 317 331 518 77 497 65 128 674 37 2297 827 59 764 ESF CS TAR 0 Line 1 of 64 Scroll ===> HALF System= SYSF Total ES TX SWAP RT SC RV WSM RV
TAR X PIN WSS M RT
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.0 0.0 X 0.0 15 X 0.0 15 X 0.0 15 X 0.0 15 0.0 15 X 0.0 15 X 0.0 15 X 0.0 24 0.0 44 X 0.0 32 0.0 30 X 0.0 15 0.0 38 0.0 671 0.0 50 X 0.0 22 0.0 25 X 0.0 24 X 0.0 15
0 0 0 0 0 0 1 1 0 0 1 0 0 0 1 1 0 0 0 0
0 175 0 0 175 175 0 0 0 175 175 175 175 175 175 175 175 175 175 18 -1000
Figure 44. Monitor III Address Space State Data Report
74
Processor analysis
Check for CPU delay due to other LPARs

If you are running as an LPAR sharing CPs with other LPARs, and your MVS system is not 100% busy, you might ask: Are the logical CPs not working more because there is no demand (wait time) or because the LPAR is not getting enough CPU? One way to check if other LPARs are constraining your system is to do the following: Calculate the percentage of time when there was CPU delay by adding up the percentages from lines N + 1 to N + 80 from the Monitor I CPU Activity report Figure 24 on page 44. If this total is more than your CPU BUSY PERCENTAGE, then you have another LPAR constraining you (you have CPU delay % > CPU busy %). If this is the case, check the partition weights, capping, and number of CPs defined, to be sure you are allocating the CP resources the way you intended.
Check LPAR balancing using the CPC capacity report

You can use the Monitor III CPC Capacity report to check whether the capacity being defined for each partition is sufficient to provide the service that all applications are expecting.
RMF V1R11 CPC Capacity Command ===> Samples: 59 System: Z201 Date: 11/28/09 Line 1 of 12 Scroll ===> HALF Time: 13.16.00 Range: 60 sec
Partition: Z2 CPC Capacity: Image Capacity: Partition
937 77
2094 Model 716 Weight % of Max: 74.2 WLM Capping %: 5.4 Proc Num
4h Avg: 80 4h Max: 185
Group: Limit:
CGRP0010 100*
--- MSU --- Cap Def Act Def
Logical Util % Effect Total
- Physical Util % LPAR Effect Total 1.0 0.4 0.1 0.2 0.3 0.1 0.1 0.0 0.0 0.1 10.1 3.4 3.5 1.7 1.5 11.1 3.8 3.6 1.9 1.8 0.1 7.2 7.1 0.0 0.1
*CP TZ1 Z1 Z2 Z3 PHYSICAL *ICF CF1 CF2 PHYSICAL
130 150 50 N/A
122 89 58 0
NO NO NO YES
4.3 5.2 2.1 2.4
11.2 9.3 11.5 8.8
12.5 9.6 12.8 10.6
1 1
99.9 0.0
99.9 0.0
7.1 7.1 0.0
Figure 45. Monitor III CPC Capacity Report
The report provides information about processor resource consumption of each partition, and it displays the 4-hours average consumption. If this average is higher than the defined image capacity, WLM will start capping this partition. In the above example, the 4h average is higher than the image capacity. Therefore, partition Z2 will get less processor resources in the following intervals until the average is below the defined capacity. This might result in processor constraints and performance problems. If this is not just a temporary but a permanent problem, one should consider about increasing the image capacity for this partition.
75
Processor analysis
Processor USING: Trimming activities

This section describes the activities that help decrease the processor consumption. In case of delay, this list of activities can be used to trim the AS that uses too much processor resource.
Check for loops

Check for loops in the following reports: v PROC report v Job report on page 77
PROC report
RMF V1R11 Command ===> Samples: 100 System: PRD1 Date: 04/07/09 EAppl % 9.8 83.7 26.7 20.5 15.3 9.7 12.7 87.9 10.4 3.0 22.0 4.9 5.0 15.8 Time: 10.32.00 Processor Delays Line 1 of 71 Scroll ===> HALF Range: 100 Sec
Jobname BOE1HCP$ BTFL BOE1TBL$ BOE1FIR$ BOE1HCL$ BOE1DEV$ BOE1MGR$ BJPI BOE1HIT$ BOE1HOM$ JES3 BRTV BRMC XXVTAM
Service CX Class B T B B B B B T B B S T T S BATCH TSO 2 BATCH BATCH BATCH BATCH BATCH TSO BATCH BATCH SYSSTC TSO TSO SYSSTC
CPU DLY USG Type % % CP CP CP CP CP CP CP CP CP CP CP CP CP CP 10 9 9 9 7 7 6 4 3 3 2 2 2 1 11 91 28 21 19 11 15 96 13 5 21 6 5 22
----------- Holding Job(s) ----------% Name % Name % Name 9 8 8 8 7 6 6 4 3 3 2 2 2 1 BJPI BJPI BTFL BTFL BJPI BJPI BJPI BTFL BTFL VIBTS BJPI BTFL TESTRTX JES3 8 3 7 7 7 5 5 3 3 3 1 2 1 1 BTFL BOE1MGR$ BJPI BJPI BTFL BTFL BTFL JES3 BJPI BJPI BTFL BJPI CATALOG BOE1TBL$ 4 3 5 7 4 4 3 2 2 2 1 1 1 1 CATALOG BOE1HCL$ JES3 JES3 BOE1MGR$ TESTRTX BOE1HCP$ XXVTAM CATALOG BOE1DEV$ *MASTER* BOE1HCL$ BJPI AKS
Figure 46. Monitor III Processor Delays Report
Detect loops by inspecting DLY % and USG % in this report for individual address spaces. If the sum of both is 100%, you could be in an enable loop. You can check this by moving the cursor to jobname BTFL, and press ENTER. This links to the Job Delays report.
76
Processor analysis
Job report
RMF V1R11 Command ===> Samples: 100 Job: BTFL System: PRD1 Date: 04/07/09 Time: 10.32.00 Job Delays Line 1 of 1 Scroll ===> HALF Range: 100 Sec
Primary delay: Job is waiting to use the processor.
Probable causes: 1) Job BJPI may be looping. 2) Higher priority work is using the system. 3) Improperly tuned dispatching priorities. ------------------------- Jobs Holding the Processor -------------------------Job: BJPI Job: BOE1MGR$ Job: BOE1HCL$ Holding: 8% Holding: 3% Holding: 3% PROC Using: 96% PROC Using: 15% PROC Using: 19% DEV Using: 0% DEV Using: 5% DEV Using: 8% --------------------------- Job Performance Summary --------------------------Service WFL -Using%- DLY IDL UKN ----- % Delayed for ------ Primary CX ASID Class P Cr % PRC DEV % % % PRC DEV STR SUB OPR ENQ Reason T 0148 TSO 2 91 91 0 9 0 0 9 0 0 0 0 0 BJPI Figure 47. Monitor III Job Delays Report
A good indication of a loop is the fact that no other delays than processor delay are shown. This is also detected by Monitor III which points to another job that might be looping. Check with the owner of this address space (in this example, it is a TSO user), before deciding how to proceed - dont cancel it too quickly!
77
Processor analysis
Check for slip traps

--------------------------- Speed (Workflow) --------------------------------Speed of 100 = Maximum, 0 = Stopped Average CPU Util: 81 % Name Users Active Speed Name Users Active Speed *SYSTEM 505 13 54 ALL TSO 433 10 55 ALL BATCH 2 0 42 ALL STC 70 2 55 ALL ASCH Not avail ------------------------------ Exceptions ------------------------------------Name Reason Critical val. Possible cause or action * SLIP * SLIP PER TRAP SLIP ID=SR01 is active. BEVK Rate < 2.0 1.220 /sec Tx rate is 1.220 /s. BSHR STOR-COMM 23.1 % delay CSAHOG JCSA% > 15 18.3 % Job CSA usage 18 %, system 57 %. POK063 DAR > 20 23.22 /sec I/O rate is 23.22 /s on volume POK063. DATAPK Not avail Volume DATAPK is not mounted. Figure 48. Monitor III Workflow/Exceptions Report
Use the workflow/exceptions report to watch for signs of a SLIP PER active trap. Program Event Recording (PER) is a function of S/390 architecture that generates program interrupts and uses the CPU unproductively (when not used to analyze a problem). PER is started by a SLIP trap and has significant overhead. The WFEX report shows any active SLIP traps using PER. Check to see if the trap is still needed. The * SLIP * exception line, in yellow, is always shown as the first exception. This line cannot be suppressed when the SLIP is active.
Reduce I/O
Rules of Thumb v 1000 I/Os per second consume about 7.6% of a 9672 Z87 system. v Every 160 DASD demand pages/sec uses about 1.0% of a 9672 Z87 system. You can avoid I/O by: v Using Data-In-Memory (DIM) techniques v Increasing block sizes v Exploring SAM/E chain scheduling possibilities v Moving I/O bound activities to other shifts v Decreasing the paging Avoiding I/O may help to save CPU resources. Consequently you improve your performance (response and throughput). So there is a chance that your CPU is going to be busier and more productive, because you decrease I/O delay.
78
Processor analysis
Decrease monitor overhead

v v v v v Check for obsolete operations CLISTS. Close down duplicate/redundant monitoring. Reduce scope and/or frequency of other monitors. Control and decrease SMF records generation. Increase RMF cycle, control the use of RMFMON and RMF commands.
Tune system address spaces

The system AS that usually consume the most processor are: v Console AS, due to commands and message traffic in MVS console. Consider message filtering to reduce traffic. v VTAM AS. Consult a TP specialist. v Master Scheduler AS, mainly due to paging activities in SRB mode. See Chapter 5, Analyzing processor storage activity for ways to reduce paging. v JES AS. Decrease batch and some TSO activity (drastic action for small payback).
Tracing
GTF traces can be used to see the use of supervisor code in your TCB time (SVCs and PCs). Sometimes you may see a lot of GETMAIN, EXCP, or CLOSE activities in your GTF trace. This can be a clue for application redesign.
Calculate your capture ratio

If your prime shift capture ratio is below about 80%, check the following: v How much CPU am I spending with swapping (non-captured)? v How much CPU am I spending with paging (some non-captured)?
Redesign application
Sometimes the only answer may be to redesign an application completely or partly, or change the users expectations.
Install more CPU power

When you have done all you can, and the problem does not go away, you can still resort to buying more CPU power.
79
Processor analysis
Summary
To analyze processor problems you should: v Determine if this is a DELAY due to another user or a high USING processor consumption. v Look at the Monitor III Job Delay report first in case of DELAY. v Reduce all the processor overhead that you can: it will usually be a very small amount.
80
Chapter 4. Analyzing I/O activity
Lets Analyze Your I/O Subsystem This chapter discusses how to analyze an I/O problem. We review: v How to perform an overall health check of your I/O subsystem v How to improve I/O performance with the Enterprise Storage Server v How to analyze and solve I/O problems related to v Cache v DASD v Tape We start by reviewing some of the key concepts, terms, and functions that will be used throughout the chapter.
81
I/O analysis
Do you have an I/O problem?

You can conclude that you have an I/O problem if: v Service level objectives are being missed v Users are complaining about response times v AND, the I/O indicators discussed in this chapter show signs of stress. Or better still, you see high DEV DLY or USG for an important workload directly in Monitor III reports (for example, SYSINFO, DELAY, DEV, or DEVR). Here are the primary RMF indicators you can use to assess your I/O subsystem. After listing these indicators, we will go on to discuss each one in more detail, with guidelines on values that may indicate contention and suggestions for further action.
RMF I/O INDICATORS

DASD LCU level I/O rate Response Time v Disconnect v Connect v Queue v Pend Cache Statistics v Read hits v DASD fast write hits v Cache Fast write hits v Staging v DFW retries v Bypass v Inhibit v Off v ASYNC Utilization Cache Statistics v Read hits v DASD fast write hits v Cache Fast write hits v Staging v DFW retries v Bypass v Inhibit v Off v ASYNC DASD Device level I/O rate Response Time v Disconnect v Connect v Queue v Pend Tape I/O rate Response Time v Disconnect v Connect v Queue v Pend
Never start any tuning effort simply because one indicator seems to be a problem. Always check other related indicators to build a good understanding of your I/O subsystem (and overall system). Check to see which workloads are delayed, and by how much. Do all this to make sure you have the big picture well in hand, and only then begin to plan your tuning efforts. As you analyze I/O problems, remember the following: v Manage to service level agreements. No I/O response time, or other indicator, is inherently good or bad. The basic premise of performance management is that you want good I/O response times for important applications. What is good response time? The answer, of course, is that it depends. One way you can determine good response is to measure the response times of the volumes and logical control units (LCUs) when your applications are running well, meeting the formal or informal service level agreements.
82
I/O analysis
v Keep the balanced systems concept in mind. Before embarking on I/O subsystem analysis, review the CPU and processor storage resources. Keeping the big picture in mind will help you make better decisions on matters that affect overall system throughput, and on trade-offs such as using additional processor storage to reduce I/O rates. v Proceed in a top-down manner. This implies three things: If you think you see a problem, start with the basic facts. Which workload has the problem? What are the I/O rates and response times? Are service levels really being impacted? A bad response time may not be worth spending time on, if the I/O rate is low, or the workload has low priority. Analyze your overall subsystem before trying to tune for specific DASD volumes or tape subsystems. Analyze any response-time problem to see which component is greatest (disconnect, queue, etc.), and address that component first. v Optimize DASD data placement. The best response time is achieved if the I/O never happens. This may be accomplished by a smarter program that avoids the I/O, by buffering, or by data-in-memory techniques where the I/O is resolved somewhere in processor storage. The next best response time is achieved by resolving the I/O directly in a cache control unit, a cache hit. The worst response time occurs when a DASD I/O must be resolved directly from a DASD device.
83
I/O analysis
I/O subsystem health check

This section describes how to perform an overall health check of your I/O subsystem. Use this section to help you establish a baseline measurement of I/O subsystem performance, set up your performance monitoring process, and provide a base of knowledge from which to begin performance tuning.
Understand your business priorities

All devices and workloads do not have the same priority. Before starting any tuning effort, be sure you understand the important business applications and their formal or informal service objectives. Know which volumes serve the important workloads.
Review CPU and processor storage

Alleviating an I/O bottleneck will mean that more work will be delivered to the CPU for processing. Keep the balanced system concept in mind, making sure that CPU and processor storage resources are available. Review processor storage to see how much might be available for increased I/O elimination using data-in-memory techniques.
Establish a baseline measurement

Establishing a set of measurements for when your system is running well, provides a good base for later comparison. This will tell you what numbers are good for YOUR environment, which is better than simply following any rule of thumb guideline. Measure the I/O rate, response times, and the cache activity at the LCU and device level. Disconnect time is inversely proportional to cache effectiveness and is the key measurement. Concentrate on the DASD LCUs and individual devices with the highest load or most critical data, and account for all I/Os being resolved either in CACHE or directly to DASD (this gives you the cache hit ratio). Then explode these two categories CACHE and DASD into their components, to gain further insights into how the cache controller is performing and what options may be available to manage the performance of the data. Using the Spreadsheet Report, you can display these measurements by shift and by hour for further evaluation. Using the guidelines and recommendations given in this chapter, analyze these measurements to determine overall DASD subsystem health and to identify areas of further opportunity. See Cache Performance Management for further details and examples.
84
I/O analysis
Improving I/O performance with the Enterprise Storage Server

The Enterprise Storage Server (ESS) is IBMs most powerful disk storage server, developed using IBM Seascape architecture. The ESS provides functions for the eServer family, and also for the non-IBM (that is, Intel-based and UNIX-based) families of servers. Across all of these environments, the ESS features unique capabilities that allow it to meet the most demanding requirements of performance, capacity, and data availability that the computing business may require. The Seascape architecture design has allowed the IBM TotalStorage Enterprise Storage Server to evolve from the initial E models to the succeeding F models, and to the 800 models, each featuring new, more powerful hardware and functional enhancements. RMF has incorporated a number changes to accommodate and report on the advanced ESS functions related to performance, for example, the number of parallel access volumes or information about RAID ranks in the cache subsystem.
z/OS Parallel Sysplex I/O management

In the z/OS Parallel Sysplex, the Workload Manager (WLM) controls where work is run and optimizes the throughput and performance of the total system. Until now, WLM management of the I/O has been limited. With ESS, there are some exciting new functions that allow WLM to control I/O across the sysplex. These functions include parallel access to both single system and shared volumes and the ability to prioritize the I/O based upon the WLM goals. The combination of these features can significantly improve performance in a wide variety of workload environments.
ESS performance features

This chapter covers the performance features of the ESS including PAV, multiple allegiance, and I/O priority queuing.
Concurrent access features

z/OS retains device queuing features that were developed to ensure effective serial access to physical devices. These features were developed at a time when physical devices rather than emulated devices were the norm and long before RAID solutions were developed. They ensured that only one channel program could be active to a disk at any time. This ensured that there was no possibility of interference between channel programs. The traditional queuing method is illustrated in Figure 49 which shows that Application B is processing a request for volume 2000. If applications A and C issue I/O requests to the same volume, they will get a device busy or UCB busy response.
85
I/O analysis
Application A
I/O Request UCB 2000
Application B
Application C
I/O Request UCB 2000 UCB Busy (Request Qeued)
UCB 2000
UCB 2000
UCW Device Busy Device Busy (Pre-I/O Queuing)
2000
One I/O to one Volume at one time
Figure 49. z/OS Traditional Device Serialization
z/OS interacting with the ESS relaxes some of these restrictions. Two features are introduced to allow this. v Multiple allegiance v Parallel Access Volumes (PAV) Although there may be a large number of concurrent operations active to a particular logical volume, the ESS ensures that no I/O operations that have the potential to conflict over access to the same data will be scheduled together. Effectively no data can be accessed by one channel program that has the potential to be altered by another active program. Channel programs that are deemed to be incompatible with an active program are queued within the ESS.
Multiple allegiance
S/390 device architecture has defined that a state of implicit allegiance exists between a device and the channel path group that is accessing it. This allegiance is created in the control unit between the device and a channel path group when an I/O operation is accepted by the device. The allegiance causes the control unit to guarantee access, no busy status presented, to the device for the remainder of the channel program over the set of paths associated with the allegiance. This concept has been expanded to support the ESS with a concept of multiple allegiance. ESSs concurrent operations capability supports concurrent accesses to or from the same volume from multiple channel path groups, system images. The ESSs multiple allegiance support allows different hosts to have concurrent implicit allegiances provided that there is no possibility that any of the channel programs can alter any data that another channel program might read or write.
86
I/O analysis
Multiple allegiance requires no additional software or host support, other than to support the ESS. It is not externalized to the operating system or operator. Multiple allegiance will reduce contention reported as PEND time. Resources that will benefit most from multiple allegiance are: v Volumes that: Have many concurrent read operations Have a high read to write ratio v Data sets that: Have a high read to write ratio Have multiple extents on one volume Are concurrently shared by many users
Parallel access volumes (PAV)

z/OS systems queue I/O activity on a unit control block (UCB) that represents the physical device. High parallel I/O activity can adversely effect performance, traditionally because high accesses usually correlated to high levels of mechanical motion and in subsystems with large caches and RAID arrays because the volumes were treated as a single resource, serially reused. This contention is worst for large volumes with numerous small data sets. The symptom displayed is extended IOSQ time. The operating system can not attempt to start more than one I/O operation at a time to the device. The ESSs concurrent operations capabilities also support concurrent data transfer operations to or from the same volume from the same system. A volume accessed in this way is called a Parallel Access Volume (PAV). Figure 50 illustrates multiple allegiance and PAV allowing concurrent I/O processing. PAV exploitation requires both software enablement and an optional feature on your ESS. PAV support must be installed on each ESS. It enables the issuing of multiple channel programs to a volume from a single system, and allows simultaneous access to the logical volume by multiple users or jobs. Reads, as well as writes to different extents, can be satisfied simultaneously. The domain of an I/O consists of the specified extents to which the I/O operation applies. Writes to the same domain still have to be serialized to maintain data integrity. Support is implemented by defining multiple UCBs for volumes. The UCBs are of two types. Base address This is the actual unit address of the volume. There is only one base address for any volume. Alias address Alias addresses are mapped back to a base device address. I/O scheduled for an alias is physically performed against the base by the ESS. No physical disk space is associated with an alias address, however, they do occupy storage within z/OS. Alias UCBs are stored above the 16MB line.
87
I/O analysis
Application A
Application B
Application C
I/O Request UCB 2000 Alias UCBs
UCB 2000
UCB 2000
PAV Support
Multiple Allegiance
Two or more operations to the same volume at the same time, from different SCPs 2000
Parallel Access Volume

Two operations to the same volume at the same time, from the same SCP
Figure 50. Device Queuing in a Parallel Access Volume Environment
The workloads that are most likely to benefit from the PAV function being available include: v Volumes that have many concurrently open data sets, for example, volumes in a work pool v Volumes that have high read to write ratio per extent v Volumes reporting high IOSQ times Candidate data types are: v High read to write ratio v Many extents on one volume v Concurrently shared by many readers v Accessed using media manager or allocated as VSAM extended format PAVs can be assigned to base UCBs either manually or automatically by WLM. PAVs assigned manually are called static, while those movable by WLM are called dynamic. If WLM is used to manage I/O priorities, then you can use WLM dynamic PAV management as well. Otherwise, WLM can only use dynamic PAVs to balance device utilizations, not to directly help work achieve its goal.
I/O priority queuing

If I/Os cannot run in parallel, for example, due to extent conflicts, the ESS will internally queue I/Os. This reduces operating system overheads incurred by having to post device busy and redriving channel programs. The ESS will queue I/Os in the order in which they are received. This helps to reduce problems that occur when one processor can respond to interrupts faster than a sharing one, and so monopolize a device.
88
I/O analysis
You also have the option to enable priority queuing of I/Os to the ESS. WLM sets a priority bit in the CCW. Priority queuing is within a sysplex and queuing is at a volume level. The ESS queues I/O requests in the order specified by WLM. I/O may be queued in the following situations: v An extent conflict exists for a write operation. v To allow servicing of a cache miss, device will reconnect when data has been staged to cache. v A reserve request is issued and other accesses are current with a different path group ID.
Cache performance
Below are the CACHE and DASD components which you should measure. They are obtained from the Monitor I Cache Subsystem Activity report. In addition, you should collect the Monitor I Device Activity report data for the same period. You also need to know which devices are used by your important workloads: you can get this information from Monitor III DEVR or DEV reports.
Cache Performance Management The KEY

Cache Controller TOTAL I/Os
Async(destage) Off Inhibit Bypass Write Thru Staging (misses) DFW Retry
DASD
Cache
CFW Hits DFW Hits Read Hits
RMF SMF
Figure 51. Cache Performance Management: The Key
CACHE
RHIT DFWHIT CFWHIT Normal and sequential read hits DASD fast write (DFW) hits Cache fast write (CFW) write and read hits
DASD
STAGE Read misses and DFW misses. The record was not found in the cache. It is read or written, and that record with the rest of the track is staged into the cache. In the ESS, stages no longer occur
89
Cache performance
when data for a DFW request is absent from cache. Instead, this is handled as a write promote (and counts as a hit). DFWRETRY DASD fast write retry. The record is in cache and is about to result in a cache hit, but non-volatile storage (NVS) is full. The operation is not really retried but written directly to DASD. Write THRU to DASD. These are writes to devices behind storage controls that are not enabled for DFW. The bypass mode was specified in the define extent. The I/O is sent directly to DASD, the track is not promoted and the record is invalidated in cache. The ESS does not actually bypass the cache, but ensures that data which specifies bypass mode is destaged quickly. The inhibit mode was specified in the define extent and the record was not in the cache. Examples of inhibit include DFDSS for reads and DFSMS dynamic cache management (DCM or DCME) maybe cache and never cache storage classes. The I/O will retrieve the record in cache if it is there; if it is not in the cache, the I/O is sent directly to DASD, but no staging of the rest of the track occurs. If inhibit mode was specified and the result was a cache hit, the I/O would fall into one of the cache hit categories. Device is turned off for caching. Asynchronous destage of records to DASD. This includes anticipatory destage to write new or updated data from cache to DASD if NVS or cache is full. The unit is in tracks per second. All the other values given above are in I/Os per second. Async is not actually part of the I/O rate. It is the consequence of a write hit which must eventually be written to DASD. Counting it in the I/O rate would cause the I/O to be accounted for twice, once on the write hit and again when the track was destaged. Async can offer insight into the DFW.
THRU BYPASS
INHIBIT
OFF ASYNC
It should be noted that the causes of bypass and inhibit listed above may not be all inclusive. In particular, you should consult your software vendor to understand which caching mode, if any, they use. The default mode is normal or sequential. However, bypass and inhibit mode can explicitly be set in software on an I/O basis. For example, it is possible for some I/Os to a data set to use inhibit and some normal caching mode, even within the execution of a particular application program. The same is true for DB2. Within DB2, one plan may access a table space in normal caching mode and another plan may optimize to access in prefetch mode, utilizing the bypass mode to accomplish this.
Cache management with ESS

The ESS provides performance improvements over those provided by raw disk by using caching. Caching algorithms are executed by the storage directors and determine what data occupies cache. Attached hosts can offer information about their data access intentions that the storage directors will use to help select the best algorithm to use for cache management. The following caching features are provided.
Read caching
Read hits occur when all of the data requested for a data access is located in cache. The ESS improves the performance of read caching by storing data in cache tracks that have the greatest probability of being accessed by a read operation.
90
Cache performance
An I/O operation that results in a read hit will not disconnect from the channel and data will be immediately available to the application. If the requested data is not located in cache, a write miss occurs and data is read from disk, returned to the program, and loaded to cache. This is called staging. While records are being read from disk, the channel program is disconnected allowing other applications to access the channel. There are three types of staging: Record staging Only those records accessed by the channel program are staged into cache. Partial track staging The required records and the rest of the track are staged to cache. This is the default mode of cache operation. Full track staging Based on the prediction of sequential I/O processing. This can be predicted either by the previous behavior of the application or by the application signalling in the I/O request that sequential access will be used. Data transferred using inhibit cache load or bypass cache attributes will be loaded into cache but eligible for accelerated destaging. The ESS offers new capabilities to support the optimization of sequential performance; improved and more sensitive pre-fetching algorithms, and new channel commands that improve overheads and provide increased bandwidth.
Write caching
Two forms of write caching are supported: DASD Fast Write (DFW) and Cache Fast Write (CFW). Cache fast write is a z/OS function that is intended for data that does not need to be written to disk. CFW is used only when explicitly requested by the application and should only be selected for transient data, data that is not required beyond the end of the current job step and that can more easily be recreated from scratch than recovered from a checkpoint. An example of this type of data is intermediate work files, such as used by a sort program. DFW is the more usual form of write caching. With DFW the application is told that an I/O operation is complete once data has successfully been written to cache and Non-Volatile Storage (NVS). Data integrity and availability is maintained by retaining two copies of the data until it is hardened to disk, one copy in cache on one cluster and the second in NVS of the other cluster. NVS is protected by battery backup. Normal access to the data is from the copy retained in cache. Destaging of data that is backed up in NVS from cache to disk is based on a Least Recently Used (LRU) algorithm. Data may be retained in cache after being written to disk based on the cache activity. Destaging from NVS and cache is anticipatory and threshold based. The intention is to always have NVS and cache resources available to accept new data. Tracks at the top of the cache LRU list are checked for updates in NVS that have not been destaged to disk. The ESS schedules tracks at the top of the NVS LRU for destaging, so that they can be allocated without a delay during destaging.
Write performance
Caching benefits write performance as almost all writes are at cache speeds. DFW minimizes any potential penalty of RAID 5 generation of parity. This performance benefit is clearly demonstrated by the published results of ESS performance tests.
91
Cache performance
In addition, write performance is enhanced by striping. The ESS automatically stripes logical volumes across all the drives in the RAID array. This provides automatic load balancing across the disk in the array, and an elimination of hot spots. This design should reduce the amount of effort that storage administrators spend, hand placing data, at the same time offering performance improvements. The ESS RAID 5 implementation gives a minimal RAID 5 write penalty for sequential writes. When writing a sequential stripe across the disks in an array, the ESS generates the parity only once for all the disks. This is sometimes called a RAID 3-like implementation, and it provides high performance in sequential operations.
Cache Subsystem Activity report

The report provides cache statistics on a subsystem basis as well as on a detailed device-level basis. You may start the analysis of your cache subsystem with the Summary report.
Subsystem summary
The report allows you a top-down approach to analyze the storage subsystems in your configuration because you can see at a glance the most important data. Looking at this report, the storage subsystems causing problems can be easily identified and analyzed in a second Postprocessor run requesting more details.
C A C H E z/OS V1R11 S U B S Y S T E M S U M M A R Y PAGE SYSTEM ID OS04 RPT VERSION V1R11 RMF NVS I/O OFF RATE RATE 3.1 38.6 0.0 55.3 49.5 56.9 50.7 53.0 42.0 50.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 --CACHE HIT RATEREAD DFW CFW 1.0 16.5 0.0 0.0 0.2 0.2 0.2 0.2 0.3 0.2 2.2 22.0 0.0 55.2 49.3 56.7 50.5 52.8 41.8 50.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 DATE 06/05/2009 TIME 09.30.00 INTERVAL 14.59.996 1
SSID CU-ID
TYPE
CACHE
--------DASD I/O RATE-------STAGE DFWBP ICL BYP OTHER 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 165.5 145.1 169.9 149.0 158.3 123.6 150.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ASYNC TOTAL RATE H/R 0.0 1.7 0.0 486.2 500.8 489.1 465.6 467.6 434.6 454.1 1.000 0.999 N/A 1.000 1.000 1.000 1.000 1.000 1.000 1.000
READ WRITE H/R H/R 1.000 0.997 N/A 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 N/A 1.000 1.000 1.000 1.000 1.000 1.000 1.000
% READ 30.8 42.9 N/A 0.1 0.4 0.4 0.4 0.4 0.6 0.4
0600 06C0 0A80 6801 6802 6803 6804 6805 6806 6807
0600 06E0 0ABE 4000 4100 4200 4300 4400 4500 4600
3990-006 3990-006 3990-006 2105-F20 2105-F20 2105-F20 2105-F20 2105-F20 2105-F20 2105-F20
256 512 512 8192 8192 8192 8192 8192 8192 8192
64 32 32 192 192 192 192 192 192 192
Figure 52. Cache Subsystem Activity report - Summary
Top-20 device lists

The report consists of two top-20 lists of devices, sorted in descending order by DASD I/O rate and by total I/O rate. These two lists allow you to identify the volumes with the highest I/O rates to the lower interface of a subsystem as well as the volumes with the highest I/O rates in total. Solving a possible problem, one of the listed devices would probably be of most benefit to the overall subsystem.
92
Cache performance
C A C H E z/OS V1R11
S U B S Y S T E M
S U M M A R Y PAGE 2 INTERVAL 14.59.996
SYSTEM ID OS04 RPT VERSION V1R11 RMF
DATE 06/05/2009 TIME 09.30.00
*** DEVICE LIST BY DASD I/O RATE *** VOLUME SERIAL PRD440 PPDS14 PRD437 PPD026 PRD554 PRD339 ... DEV NUM 077E 0220 0214 0B76 06B9 0231 SSID % I/O 25.7 6.9 4.5 27.4 59.1 2.4 I/O RATE 53.1 29.1 19.0 71.4 32.4 10.3 ---CACHE HIT RATE-READ DFW CFW 21.3 1.0 12.5 65.9 28.9 7.2 16.7 15.7 2.0 1.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ----------DASD I/O RATE---------- ASYNC STAGE DFWBP ICL BYP OTHER RATE 13.9 0.1 4.4 4.3 3.5 3.1 0.0 0.0 0.0 0.0 0.0 0.0 1.3 12.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.6 2.0 2.0 0.5 0.0 0.0 TOTAL H/R 0.714 0.573 0.765 0.939 0.893 0.698 READ H/R 0.604 0.943 0.738 0.939 0.893 0.698 WRITE H/R 1.000 1.000 0.998 0.975 1.000 N/A % READ 67.9 6.2 89.2 98.4 99.9 0.0
00B1 00CC 00CC 00F1 00FE 00CC
*** DEVICE LIST BY TOTAL I/O RATE *** VOLUME SERIAL PPD026 PRD440 PRD327 PRD343 PBV321 PRD307 ... DEV NUM 0B76 077E 0200 0515 022C 0507 SSID % I/O 27.4 25.7 11.8 17.5 11.2 15.3 I/O RATE 71.4 53.1 49.8 48.9 47.3 42.8 ---CACHE HIT RATE-READ DFW CFW 65.9 21.3 3.0 24.8 47.1 4.1 1.1 16.7 46.8 24.0 0.2 38.2 0.0 0.0 0.0 0.0 0.0 0.0 ----------DASD I/O RATE---------- ASYNC STAGE DFWBP ICL BYP OTHER RATE 4.3 13.9 0.1 0.1 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 1.6 6.3 5.5 0.1 13.0 TOTAL H/R 0.939 0.714 0.998 0.998 1.000 0.988 READ H/R 0.939 0.604 0.973 0.996 1.000 0.891 WRITE H/R 0.975 1.000 1.000 0.999 1.000 1.000 % READ 98.4 67.9 6.2 50.8 99.7 10.8
00F1 00B1 00CC 00E4 00CC 00E4
Figure 53. Cache Subsystem Activity Report - Top-20 Device Lists
Based on these reports, you can decide whether further investigation is necessary. The next step could be the creation of reports which show some more details. This can be done on subsystem level as well as on device level.
Subsystem-level reporting
This generates a report with three sections. v Cache Subsystem Status v Cache Subsystem Overview v Cache Subsystem Device Overview The subsystem-level report gives an overall view of the storage control, that is the amount of cache storage and non-volatile storage installed, as well as the current status of the cache. In addition, the performance analyst finds the number of I/O requests sent to the control unit and their resolution in the cache (hits). The report is completed by a list of all volumes attached to the subsystem, showing their specific utilization of the cache.
Device-level reporting
This generates, in addition to the above described report, a report with two sections: v Cache Device Status v Cache Device Activity The device-level report provides detailed information for each single device attached to the selected control unit. The report is intended to analyze the cache usage in detail on the basis of the information about the applications that access these volumes. Based on the data in the Summary report (Figure 52), one might decide that the subsystem SSID=00B1 needs further investigation.
93
Cache performance
Cache subsystem overview

C A C H E z/OS V1R11 S U B S Y S T E M A C T I V I T Y PAGE SYSTEM ID OS04 RPT VERSION V1R11 RMF CU-ID 074C SSID 00B1 CDATE DATE 06/05/2009 TIME 09.30.00 06/05/2009 CTIME 09.30.03 INTERVAL 14.59.996 1
SUBSYSTEM 3990-06 TYPE-MODEL 9396-001
CINT
15.00
---------------------------------------------------------------------------------------------------------------------------------CACHE SUBSYSTEM STATUS ---------------------------------------------------------------------------------------------------------------------------------SUBSYSTEM STORAGE CONFIGURED AVAILABLE PINNED OFFLINE 1024.0M 1019.9M 0.0 0.0 NON-VOLATILE STORAGE CONFIGURED PINNED 32.0M 0.0 STATUS CACHING NON-VOLATILE STORAGE CACHE FAST WRITE IML DEVICE AVAILABLE ACTIVE ACTIVE ACTIVE YES
---------------------------------------------------------------------------------------------------------------------------------CACHE SUBSYSTEM OVERVIEW ---------------------------------------------------------------------------------------------------------------------------------TOTAL I/O TOTAL H/R CACHE I/O REQUESTS NORMAL SEQUENTIAL CFW DATA TOTAL 186363 0.865 CACHE I/O CACHE H/R 181922 0.886 CACHE OFFLINE 0
-------------READ I/O REQUESTS------------COUNT RATE HITS RATE H/R 116003 16655 0 132658 128.9 18.5 0.0 147.4 95313 16617 0 111930 105.9 18.5 0.0 124.4 0.822 0.998 N/A 0.844
----------------------WRITE I/O REQUESTS---------------------COUNT RATE FAST RATE HITS RATE H/R 15150 34114 0 49264 16.8 37.9 0.0 54.7 15150 34114 0 49264 16.8 37.9 0.0 54.7 15134 34114 0 49248 16.8 37.9 0.0 54.7 0.999 1.000 N/A 1.000
% READ 88.4 32.8 N/A 72.9
-----------------------CACHE MISSES----------------------REQUESTS READ RATE WRITE RATE TRACKS RATE NORMAL SEQUENTIAL CFW DATA TOTAL 20690 38 0 20744 23.0 0.0 0.0 RATE 16 0 0 23.0 0.0 0.0 0.0 12993 870 14.4 1.0
------------MISC-----------COUNT RATE DFW BYPASS 289 0.3 CFW BYPASS 0 0.0 DFW INHIBIT 0 0.0 ASYNC (TRKS) 7239 8.0
------NON-CACHE I/O----COUNT RATE ICL 4435 4.9 BYPASS 6 0.0 TOTAL 4441 4.9
---CKD STATISTICS--WRITE WRITE HITS 268 268
---RECORD CACHING--READ MISSES WRITE PROM 14502 12938
Figure 54. Cache Subsystem Activity Report - Status and Overview
This section provides, at a glance, all relevant data for one subsystem. It distinguishes between I/O requests to be handled by the cache (cachable I/Os) and non-cache I/O requests. Cachable I/O Requests These requests are shown in three categories: NORMAL Cache will be managed by least-recently-used (LRU) algorithm for making cache space available. SEQUENTIAL Tracks following the track assigned in the current CCW chain are promoted, they will be transferred from DASD to cache in anticipation of a near-term requirement. CFW DATA WRITE and READ-AFTER-WRITE requests are processed in cache. The
94
Cache performance
data might not be written to DASD. Because CFW does not use the NVS, the application is responsible for restoring the data after a cache or subsystem failure. Non-Cache I/O Requests These requests that switched off cache processing, are shown in two categories: ICL Inhibit cache load number of I/O requests that inhibited load of data into cache, and the data was not found in the cache. Note: If the data was in the cache, it has been counted as cache hit. Therefore, this is actually the number of ICL misses. BYPASS Number of I/O requests that explicitly bypassed the cache, no matter whether the data is in the cache or not.
I/O Requests You find the following numbers in the sample report: Cachable I/O Requests
TOTAL READS READ HITS TOTAL WRITES WRITE HITS 132658 111930 READ HIT RATE (H/R) WRITE HIT RATE (H/R) CACHE HIT RATE (H/R) 0.844 1.000 0.886
49264 49248
TOTAL CACHE I/O 181922 CACHE HITS 161178
Non-Cache I/O Requests

ICL BYPASS TOTAL 4435 6 4441
Total I/O Requests The sum of these two groups is shown as total number of I/O requests for cached devices of this subsystem:
CACHE I/O NON-CACHE I/O TOTAL I/O CACHE HITS 181922 4441 186363 161178
TOTAL HIT RATE (H/R)
0.865
While hit percentage by itself (CACHE H/R) does not drive the cache performance management process, hit percentage should be derived by dividing the hits by all I/Os (TOTAL H/R). Hit percents derived by dividing the hits by cachable I/Os are misleading. Cachable I/Os refers to those I/Os that are eligible for caching. At first glance, the distinction between cachable I/Os and all I/Os seems rather insignificant since most I/Os are eligible for caching. However, some software (e.g. DFSMS Dynamic Cache Management and DB2) will selectively disable I/O operations for caching, using inhibit and bypass modes of operations. This could result in high cachable hit percentages but lower hit percentages. In this case the disconnect time would be higher than what you would expect for a high cachable hit percent, but the
95
Cache performance
disconnect time would be reasonable for a lower hit percent. Where all I/Os are eligible for caching, the hit percent and cachable hit percent will be equal. You can also get from the report the relationship between READ and WRITE requests as %READ. Sometimes, performance experts work with the READ/WRITE ratio this value is not shown in the report because of arithmetic problems in case of very less (or zero) WRITE requests, but can be calculated easily, otherwise.
READ/WRITE Ratio
TOTAL CACHE I/O TOTAL CACHE READS TOTAL CACHE WRITES 181922 132658 49264 % READ R/W ratio 72.9 2.69
The R/W ratio can also be calculated as

R/W ratio = 72.9 / (100 - 72.9) = 2.69
96
Cache performance
Cache subsystem device overview

C A C H E z/OS V1R11 S U B S Y S T E M A C T I V I T Y PAGE SYSTEM ID OS04 RPT VERSION V1R11 RMF CU-ID 4000 SSID 6801 CDATE DATE 06/05/2009 TIME 09.30.00 06/05/2009 CTIME 09.30.03 INTERVAL 14.59.996 2
SUBSYSTEM 2105-01 TYPE-MODEL 2105-F20
CINT
15.00
-----------------------------------------------------------------------------------------------------------------------------------CACHE SUBSYSTEM DEVICE OVERVIEW -----------------------------------------------------------------------------------------------------------------------------------VOLUME SERIAL DEV NUM RRID % I/O 100.0 0.0 100.0 0.0 17.5 17.9 10.3 17.7 18.1 18.4 I/O RATE 55.3 0.0 55.3 0.0 9.7 9.9 5.7 9.8 10.0 10.2 ---CACHE HIT RATE-READ DFW CFW 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 55.2 55.2 0.0 9.7 9.9 5.6 9.8 10.0 10.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ----------DASD I/O RATE---------STAGE DFWBP ICL BYP OTHER 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 165.5 165.5 0.0 29.0 29.7 16.9 29.4 30.0 30.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ASYNC RATE 486.2 486.2 0.0 39.1 36.3 16.8 39.2 33.8 34.6 TOTAL H/R 1.000 1.000 N/A 1.000 1.000 1.000 1.000 1.000 1.000 READ H/R 1.000 1.000 N/A N/A N/A 1.000 N/A N/A N/A WRITE H/R 1.000 1.000 N/A 1.000 1.000 1.000 1.000 1.000 1.000 % READ 0.1 0.1 N/A 0.0 0.0 0.9 0.0 0.0 0.0
*ALL *CACHE-OFF *CACHE SR4B00 4000 SR4B01 4001 SR4B02 4002 SR4B03 4003 SR4B04 4004 SR4B05 4005 SR4B06 4006
0000 0000 0000 0000 0000 0000 0000
Figure 55. Cache Subsystem Activity Report - Device Overview
This section of the report provides an overview for one control unit with all attached devices. You can use this report for investigating the cache effectiveness of each volume. During this analysis, you should use the values I/O RATE and % I/O as indicators to decide whether a volume has an I/O activity that makes it worth for further performance investigation. If you come to the conclusion that further analysis of a specific volume might be necessary, you should request the report on device level. It has the same structure as the Subsystem Overview report, but provides the details for each volume.
RAID rank activity

C A C H E z/OS V1R11 S U B S Y S T E M A C T I V I T Y PAGE SYSTEM ID OS04 RPT VERSION V1R11 RMF DATE 06/05/2009 TIME 09.30.00 INTERVAL 14.59.996 3
SUBSYSTEM 2105-01 CU-ID 4000 SSID 6801 CDATE 06/05/2009 CTIME 09.30.03 CINT 15.00 TYPE-MODEL 2105-F20 -----------------------------------------------------------------------------------------------------------------------------------RAID RANK ACTIVITY -----------------------------------------------------------------------------------------------------------------------------------ID RAID TYPE DA HDD -------- READ REQ ------RATE AVG MB MB/S RTIME 265 257 7 0.055 14.6 0.055 14.2 0.055 0.4 150 150 148 ------- WRITE REQ ------RATE AVG MB MB/S RTIME 243 125 118 0.111 0.089 0.134 27.0 11.1 15.9 249 266 232 ---------- HIGHEST UTILIZED VOLUMES ----------
*ALL 0000 0001
RAID-5 RAID-5
01 01
14 7 7
SR4B06
SR4B05
SR4B02
SR4B04
SR4B01
SR4B03
Figure 56. Cache Subsystem Activity report - RAID rank activity
This section of the report provides an overview about the activities of all RAID ranks belonging to the storage server. For each rank, you get the list of those six logical volumes which have the highest utilization, this might help you in deciding whether the mapping between RAID ranks and logical volumes is well balanced, or whether some changes should be performed.
97
Cache performance
Cache device activity

C A C H E D E V I C E A C T I V I T Y PAGE 2
z/OS V1R11
SYSTEM ID SYS1 RPT VERSION V1R11 RMF
DATE 11/28/2009 TIME 00.30.00
INTERVAL 15.00.055
SUBSYSTEM 2105-01 CU-ID 1E91 SSID 3007 CDATE 11/28/2009 CTIME 00.30.00 CINT 15.00 TYPE-MODEL 2105-E20 MANUF IBM PLANT 75 SERIAL 000000016374 VOLSER NP1MHI NUM 1E85 RRID 0700 -----------------------------------------------------------------------------------------------------------------------------------CACHE DEVICE STATUS -----------------------------------------------------------------------------------------------------------------------------------CACHE STATUS CACHING DASD FAST WRITE PINNED DATA - ACTIVE - ACTIVE - NONE DUPLEX PAIR STATUS DUPLEX PAIR STATUS DUAL COPY VOLUME - NOT ESTABLISHED - N/A - N/A
-----------------------------------------------------------------------------------------------------------------------------------CACHE DEVICE ACTIVITY -----------------------------------------------------------------------------------------------------------------------------------TOTAL I/O TOTAL H/R CACHE I/O REQUESTS NORMAL SEQUENTIAL CFW DATA TOTAL 32141 0.785 CACHE I/O CACHE H/R 32141 0.785 CACHE OFFLINE N/A
-------------READ I/O REQUESTS------------COUNT RATE HITS RATE H/R 26728 5413 0 32141 29.6 6.0 0.0 35.6 19998 5241 0 25239 22.1 5.8 0.0 27.9 0.748 0.968 N/A 0.785
----------------------WRITE I/O REQUESTS---------------------COUNT RATE FAST RATE HITS RATE H/R 0 0 0 0 0.0 0.0 0.0 0.0 0 0 0 0 0.0 0.0 0.0 0.0 0 0 0 0 0.0 0.0 0.0 0.0 N/A N/A N/A N/A
% READ 100.0 100.0 N/A 100.0
-----------------------CACHE MISSES----------------------REQUESTS READ RATE WRITE RATE TRACKS RATE NORMAL SEQUENTIAL CFW DATA TOTAL 6730 172 0 6902 7.4 0.2 0.0 RATE 0 0 0 7.6 0.0 0.0 0.0 6899 6026 7.6 6.7
------------MISC-----------COUNT RATE DFW BYPASS 0 0.0 CFW BYPASS 0 0.0 DFW INHIBIT 0 0.0 ASYNC (TRKS) 0 0.0
------NON-CACHE I/O----COUNT RATE ICL 0 0.0 BYPASS 0 0.0 TOTAL 0 0.0
---CKD STATISTICS--WRITE WRITE HITS 0 0
---RECORD CACHING--READ MISSES WRITE PROM 0 28
----HOST ADAPTER ACTIVITY--BYTES BYTES /REQ /SEC READ 736 146 WRITE 15.9K 471.6K
--------DISK ACTIVITY------RESP BYTES BYTES TIME /REQ /SEC READ 14.132 63.1K 743.9K WRITE 13.812 50.9K 648.1K
Figure 57. Cache Subsystem Activity Report - Cache Device Activity (device-level reporting)
This report shows the statistics for volume NP1MHI. For example, you might look at this data if users are complaining about the response time of a specific application, and Monitor III reports flag a certain volume as a major cause of delay.
98
Cache performance
If you prefer to see data for more than one interval, you can create an Overview report with the selection of those data you are interested in. Let us assume that you want to get the following data for volume PRD440 (with SSID=00B1 and address 077E): Exception CADRT CADWT CADRHRT CADWHRT CADMT CADDFWB CADNCICL Meaning Total READ I/O rate Total WRITE I/O rate Total READ hit ratio Total WRITE hit ratio Total staging rate Total DFW bypass rate Non-cache ICL rate
You find a listing of all conditions in the z/OS RMF Users Guide. You get the Overview report by specifying the following Postprocessor options:
OVERVIEW(REPORT) OVW(READRATE(CADRT(SSID(00B1),DEVN(077E)))) OVW(WRTRATE(CADWT(SSID(00B1),DEVN(077E)))) OVW(READHR(CADRHRT(SSID(00B1),DEVN(077E)))) OVW(WRTHR(CADWHRT(SSID(00B1),DEVN(077E)))) OVW(STGRT(CADMT(SSID(00B1),DEVN(077E)))) OVW(DFWBPRT(CADDFWB(SSID(00B1),DEVN(077E)))) OVW(ICLRT(CADNCICL(SSID(00B1),DEVN(077E))))
This creates the following report:

R M F z/OS V1R11 O V E R V I E W R E P O R T PAGE SYSTEM ID OS04 RPT VERSION V1R11 RMF DATE 06/05/2009 TIME 09.30.00 INTERVAL 14.59.996 CYCLE 1.000 SECONDS 001
NUMBER OF INTERVALS 5 DATE TIME INT MM/DD HH.MM.SS HH.MM.SS 06/05 06/05 06/05 06/05 06/05 10.44.00 10.59.00 11.14.00 11.29.00 11.44.00 01.15.00 00.14.59 00.15.00 00.14.59 01.14.59 READRATE
TOTAL LENGTH OF INTERVALS 01.14.57 WRTRATE READHR WRTHR STGRT DFWBPRT ICLRT
35.213 2.799 2.769 3.692 0.535
16.671 12.444 0.159 0.225 0.261
0.604 0.514 0.318 0.377 0.696
1.000 1.000 1.000 1.000 0.996
13.957 1.359 1.889 2.301 0.164
0.030 0.089 0.000 0.001 0.003
1.263 0.021 0.063 0.034 0.022
Figure 58. Overview Report for Cached DASD Device
In this report, you see that the volume was very active just in one interval, this means that further investigation is required to highlight the critical volumes of this Cache subsystem. Using the Spreadsheet Reporter, you can create reports with key performance data you are interested in. If you would like to see all SSIDs at glance, you can create the Cache Subsystem Report which displays all important cache characteristics. To get all details about cache rates for a longer period of time, you would use the Cache Trend Report.
99
Cache performance
Figure 59. Cache Hits Overview Report. Spreadsheet Reporter macro RMFR9CAC.XLS (Cache Subsystem Report SSIDOvw)
Figure 60. Cache Trend Report. Spreadsheet Reporter macro RMFX9CAC.XLS (Cache Trend Report - Summary)
100
Cache performance
Notes: 1. Cache controller information is collected by individual device addresses without segregation by issuing system. Therefore, the Monitor I Data Gatherer automatically collects the total for each device behind the storage controller. There is no way to differentiate the I/Os by each sharing system. Thus, data gathering is only required to be activated on one of the sharing systems. 2. The report uses cache storage subsystem ID (SSID) numbers to identify control units. The Monitor I DASD Activity report uses logical control unit (LCU) numbers to report on these same control units. LCU and SSID often will not be the same value, even when referring to the same control unit. You may need to compare the address range of devices on a given controller to match LCU to SSID. 3. Also, I/O rates shown in the Cache Subsystem Activity report may not exactly match I/O rates given in Shared Device Activity report (the sum of device activity from all sharing systems). This is because the storage controller counts each locate in a multiple locate CCW chain, whereas RMF only counts one I/O to start the CCW. An example of a CCW with multiple locates is DB2 list prefetch. Conversely, control unit commands, such as standalone RESERVEs, would get counted for the DASD Activity report but not for the Cache Subsystem Activity report.
Indicator Field: Use the following fields: v TOTAL H/R - CACHE H/R in the Subsystem Overview report and Device Activity report. v TOTAL H/R - READ H/R - WRITE H/R in the Device Overview report Guideline: Check whether all volumes (pay special attention to those that do a lot of I/O) are experiencing a good hit ratio (80% or higher). If this is the case, further analysis is not necessary. Note that the occurrence of a low hit ratio for one or more devices cannot be interpreted as a sure sign of the need for more cache (other analytical tools, such as the IBM Cache Analysis Aid, will be required for more thorough cache sizing). The low hit ratio may be caused by data sets which do not cache well, bypass cache, etc. Problem Area: Low cache hit ratios will mean higher DISC times for that device. Also, devices with low cache hit ratios may use cache storage that other devices could have used more profitably. Potential Solution: If you have DFSMS/MVS and its Dynamic Cache Management Extended (DCME) capability, DCME will dynamically manage cache for those data sets you specify (via storage class). Thus no manual cache tuning should be required on your part. If your data is not SMS-managed, you may wish to identify volumes which are not achieving a good cache hit ratio, and are doing a significant amount of staging, and turn them off for caching. This will improve the use of cache for volumes that benefit more from cache. A trial and error approach will be needed here; you will need to make a judgment call on whether to cache devices that are benefitting marginally from cache (maybe 30-80 % hit ratios).
101
Cache performance
Using Monitor III cache reporting

Monitor III offers two reports which can assist you in monitoring online the performance of your Cache subsystem: CACHSUM CACHDET Cache Summary report Cache Detail report
Both reports provide similar information as you can find in the Postprocessor Cache reports. Therefore, the performance indicators are the same, and you might refer to the description of the Postprocessor reports for further evaluation.
Cache Summary report

The report provides an overview about the activities in the cache subsystem for all SSIDs. You might take this as starting point when analyzing I/O performance to get a first impression about the I/O processing. If you feel that further analysis is required, you may continue with the Cache Detail report.
RMF V1R11 Command ===> _ Samples: 120 Systems: 5 Cache Summary - SYSPLEX Line 1 of 21 Scroll ===> HALF Sec Sec
Date: 04/15/09 Time: 08.25.00 Range: 120 CDate: 04/15/09 CTime: 08.24.55 CRange: 120 Hit Hit % Rate 98.8 100 99.8 100 100 100 99.5 99.5 29.3 99.8 30.9 100 98.6 5.4 90.4 141.9 1.1 0.9 42.7 129.0 2.7 4.3 2.3 0.3 -- Miss --Total Stage 1.2 0.0 0.2 0.0 0.0 0.0 0.2 0.7 0.0 0.0 0.1 0.0 1.2 0.0 0.2 0.0 0.0 0.0 0.2 0.7 0.0 0.0 0.1 0.0
SSID
CUID Type-Mod
Size
I/O Rate 99.8 5.4 90.6 142.0 1.1 0.9 42.9 129.7 9.2 4.3 7.4 0.3
Read Seq Async Off % Rate Rate Rate 99.2 100 100 97.7 68.2 70.4 72.2 98.2 40.3 64.6 17.5 88.9 1.6 0.0 0.0 0.0 0.0 0.0 0.6 0.2 0.0 0.3 0.0 0.0 0.1 0.0 0.0 0.3 0.2 0.1 1.8 0.0 0.4 0.4 0.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0010 0011 0030 0040 0041 0042 0043 0044 0060 0061 0062 0063
071D 0520 0269 0401 0460 05AA 05C6 0200 0627 0654 06B0 06CE
9396-001 9396-001 3990-006 9393-002 9393-002 9393-002 9393-002 3990-006 9393-002 9393-002 9393-002 9393-002
1024M 1024M 256M 1536M 1536M 1536M 1536M 64M 1536M 1536M 1536M 1536M
Figure 61. CACHSUM Report
102
Cache performance
Cache Detail report

The CACHDET report provides detailed information about the activities of one cache subsystem.
RMF V1R11 Command ===> _ Samples: 120 Systems: 5 Cache Detail - SYSPLEX Line 1 of 20 Scroll ===> HALF Sec Sec Async Rate 1.8 0.0 1.8 0.0 0.4 0.0 0.3 0.0 0.1 0.2 0.5 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0
Date: 11/28/09 Time: 16.11.30 Range: 120 CDate: 11/28/09 CTime: 16.11.25 CRange: 119 Hit - Cache Hit Rate - - DASD I/O - Seq % Read DFW CFW Total Stage Rate 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.0 0.2 0.0 0.0 0.0 0.1 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.0 0.2 0.0 0.0 0.0 0.1 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.0 0.6 0.0 0.3 0.0 0.0 0.0 0.1 0.0 0.0 0.1 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Volume /Num SSID
I/O I/O % Rate
*ALL *NOCAC *CACHE SYSPXC SYSSM3 SYSAXC SYSSM6 SYSSMS SYSSM8 SYSSM5 SYSSM2 SYSSMB SYSSMA SYSSM7 SYSOPE SYSSM4 SYSSM9 SYSSM1 SYSSMC MVSWR1
05CB 05C3 05CA 05C6 05C0 05C8 05C5 05C2 05DA 05CD 05C7 05CC 05C4 05C9 05C1 05DB 05FF
0043 0043 0043 0043 0043 0043 0043 0043 0043 0043 0043 0043 0043 0043 0043 0043 0043
100 42.9 99.5 30.8 11.9 0.0 0.0 0.0 0.0 0.0 100 42.9 99.5 30.8 11.9 27.9 12.0 100 10.4 1.6 11.6 5.0 100 3.9 1.1 11.2 4.8 100 3.2 1.6 10.8 4.6 98.9 3.7 0.9 9.7 4.2 97.6 0.4 3.7 6.8 2.9 100 1.6 1.4 5.4 2.3 100 2.3 0.1 3.7 1.6 100 0.6 1.0 3.6 1.6 100 1.4 0.1 3.6 1.5 99.5 1.5 0.0 3.3 1.4 98.2 1.2 0.2 1.6 0.7 97.5 0.6 0.1 0.6 0.3 96.9 0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Figure 62. CACHDET Report
103
DASD performance
DASD indicators and guidelines

The following sections discuss performance topics and tuning guidelines that can help you to resolve DASD I/O throughput issues.
Response time components
DASD Response Time Components

Queue Pend
Delay in:
-channel -director port -storage control -device busy
Connect
Disconnect
Seek Latency RPS Delay Staging
Data Transfer Search Time MVS Internal Queuing

Figure 63. DASD Response Time Components
Response time is a key measure of how well your I/O is performing, and a key indicator of what can be done to speed up the I/O. It is important to understand the components of this response time, and what they mean. DASD response time is the sum of queue, pend, connect and disconnect time. The following lists the response time components and what causes them: Connect Disconnect The part of the I/O during which data is actually transferred, protocol, search and data transfer time. Time that an I/O request spends freed from the channel. This is the time that the I/O positions for the data that has been requested. It includes SEEK and SET SECTOR Moving the device to the requested cylinder and track Latency Waiting for the record to rotate under the head Rotational Position Sensing (RPS) Rotational delay, as the device waits to reconnect to the channel. The term RPS delay traditionally means waiting for an extra disk rotation because a transfer could not occur when the data was under the head. Delays of this type are now obsolete with the ESS. Pend Time that the I/O is delayed in the path to the device. Pend time
104
DASD performance
may be attributable to the channel, control unit, or wait for director (director port delay time), although it is often caused by shared DASD. IOS Queue Represents the average time that an I/O waits because the device is already in use by another task on this system, signified by the devices UCBBUSY bit being on.
For most customers, reporting is based on one or both of the following measurements: Service Time Response Time Connect time plus disconnect time Connect time plus disconnect time plus pend time plus IOSQ time
Spreadsheet reports
In Chapter 2, Diagnosing a problem: The first steps, on page 23, you could see the DASD Summary report (Figure 30 on page 51) to be used for first steps in understanding the DASD performance of your system. If you feel that further investigation is required, you can create additional reports. You might start with a second Summary report that provides data on the most heavily used LCUs and DASD devices:
Figure 64. DASD Summary Report. Spreadsheet Reporter macro RMFR9DAS.XLS (DASD Activity Report - Summary)
105
DASD performance
If you have seen some average data for most active volumes, you might be interested in getting some more details:
Figure 65. Activity of Top-10 Volumes. Spreadsheet Reporter macro RMFR9DAS.XLS (DASD Activity Report Top10Act)
The report shows four values: in addition to the activity rate, you get the values for the I/O Intensity, the Service Time Intensity, and the Path Intensity. I/O Intensity is the product of response time multiplied with the activity rate. The unit of measurement is ms/s. It allows you to examine how many milliseconds per second applications waited for the device. The value can exceed 1000, which is an indicator that the device experiences a very heavy load and requires a long time to satisfy the requests. The value also includes the IOS Queue Time. If the activity rate is very low (e.g. less than 1 SIO/s) and the IOS Queue Time is high, the value is not very meaningful. You should use this measurement for devices with a considerable load (more than 2 IOs/s). I/O Intensity is not a common name in the literature. Other references may use the same measurement with a different name, for example Response Time Volume DASD MPL (typically divided by 1000)
106
DASD performance
If you are interested in getting some details about the response times of the top-10 volumes, you can get this report:
Figure 66. Response Time of Top-10 Volumes. Spreadsheet Reporter macro RMFR9DAS.XLS (DASD Activity Report Top10Rt)
General DASD guidelines

Following are some guidelines for expected values for typical DASD volumes. There are two practical ways to monitor DASD activities with RMF: v The Monitor I DASD Activity report v The Monitor III DEV and DEVR reports The most common procedure is to analyze the Monitor I DASD Activity report. Note: With any RMF reports, the I/O information reported is from one MVS system only. If you are sharing I/O across multiple systems, you will need to review RMF reports from each of the sharing systems in order to see the complete I/O activity to the LCUs and devices (this is not true for the Cache RMF Reporter, however; since CRR data is obtained from the control unit, it does include I/O activity from all sharing systems). Contention from a sharing system will generally be seen as increased pend and disconnect times.
107
DASD performance

D I R E C T z/OS V1R11 A C C E S S D E V I C E A C T I V I T Y PAGE SYSTEM ID UIG1 RPT VERSION V1R11 RMF IODF = 00 CR-DATE: 10/16/2009 DATE 11/28/2009 TIME 11.17.31 CR-TIME: 10.31.31 AVG AVG CMR DB DLY DLY .178 .000 .000 .000 .133 .000 .134 .000 .000 .000 .133 .135 .000 .000 .163 .512 1.41 .000 .384 .256 .640 .000 .000 .000 .000 .037 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 INTERVAL 14.59.998 CYCLE 1.000 SECONDS ACT: ACTIVATE AVG AVG AVG % % PEND DISC CONN DEV DEV TIME TIME TIME CONN UTIL .287 .000 4.58 .000 .000 .000 .274 .052 1.54 .275 .051 1.62 .000 .000 .239 .214 .000 .000 .256 .640 1.54 .000 .512 .384 .768 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .297 2.01 .000 .000 .337 1.15 1.41 .000 1.15 .640 1.28 .000 .000 0.02 0.00 0.21 0.07 0.00 0.00 0.05 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.02 0.00 0.22 0.08 0.00 0.00 0.05 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 % AVG % % DEV NUMBER ANY MT RESV ALLOC ALLOC PEND 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 14.0 14.0 0.0 0.0 4.0 3.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 46.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2
TOTAL SAMPLES =
900
STORAGE GROUP
DEV DEVICE NUM TYPE 732A 33903 733A 33903 733D 33903
NUMBER OF CYL 200 200 200
VOLUME PAV SERIAL SYSXCA SYSAXA SYSDXA LCU
DEVICE AVG AVG LCU ACTIVITY RESP IOSQ RATE TIME TIME 002C 002C 002C 002C 0.037 0.000 1.354 1.391 0.000 0.000 1.707 0.041 0.000 0.000 0.012 0.001 0.001 0.000 0.001 0.001 0.001 0.000 0.000 3.047 4.87 .000 .000 .000 1.86 .000 1.94 .000 .000 .000 .536 2.22 .000 .000 .593 1.79 2.94 .000 1.66 1.02 2.05 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
SMS SMS SMS DB2 DB2 DB2
969F 969F 9691 9692 9693 9694 9695 9696 9697 9699 969A 969B 969C 969E 969F
33909 33909 33909 33909 33909 33909 33909 33909 33909 33909 33909 33909 33909 33909 33909
10017 10017 10017 10017 10017 10017 10017 10017 10017 10017 10017 10017 10017 10017 10017
SYSIMS SYSIMS SYSUSR SYSOPE SYSCDS SYSHM1 SYSSMS SYSST1 SYSST2 SYSHM2 SYSSD1 SYSSD2 SYSSD3 SYSDB2 SYSIMS LCU
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0040 0040 0040 0040 0040 0040 0040 0040 0040 0040 0040 0040 0040 0040 0040 0040
1.05 .365
.143 .036
.259 .002 .428
Figure 67. Direct Access Device Activity Report
In this report, the fields of most interest are highlighted.
108
DASD performance
Indicator Field: AVG RESP TIME Description: The average response time, in milliseconds, for an I/O to this device or LCU. Related Fields: DEVICE ACTIVITY RATE, AVG PEND TIME, AVG DISC TIME, AVG CONN TIME, AVG IOSQ TIME Guideline: For most situations, this measurement is the best way to quickly determine how well a device is performing. AVG RESPONSE TIME can be considered as the overall measure of the health of a devices operation. It is the sum of the AVG IOSQ TIME, AVG PEND TIME, AVG DISCONNECT TIME, and AVG CONNECT TIME. If response time is too high, then you can look further at these response time components and take remedial action. What is too high? If your service level objectives are being met, then your response time is OK. If not, analyze what job is being delayed, and for how long (see prior section). Check the I/O rate to the device: if the I/O rate is insignificant, why bother tuning for response time? Problem Area: I/O response time is a component of user response time; often it is the dominant component. High I/O response can delay user jobs and increase response times to unacceptable levels. Potential Solution: Determine the dominant response time component, and see Improving your DASD performance on page 124 for suggestions.
109
DASD performance
Indicator Field: AVG IOSQ TIME Description: This is the time measured when a request is being delayed in the MVS system. Related Fields: DEVICE ACTIVITY RATE, AVG RESP TIME Guideline: If IOSQ is greater than 1/2 times service time (where service time is DISC + CONN) then it warrants a closer look. In practice, there should be little queue time (less than 5 ms). Problem Area: High IOSQ times contribute to high AVG RESP TIME. Combined with high I/O rates for important workloads, this could become a user response time problem. Potential Solution: If your problem is high IOSQ time then you have the following options. Traditionally IOSQ problems are usually resolved by data movement either by separating active data sets or by moving active data to faster storage, for example, a coupling facility structure. On an ESS subsystem, you have additional options. v If PAV is not enabled for the device, enable it. v If you are using static PAVs, assign more aliases to the device. v If you are using dynamic PAV, then increase the number of PAVs associated in the pool for the subsystem. v Check to ensure that all PAVs that should be bound to the device are online and operational. You can use the DEVSERV QP and DS QP,xxxx,UNBOX commands to do this.
110
DASD performance
Indicator Field: AVG PEND TIME Description: This value represents the average time (ms) an I/O request waits in the hardware (channel subsystem). Related Fields: AVG RESP TIME, DEVICE ACTIVITY RATE Guideline: A high PEND time suggests that the channel subsystem is having trouble initiating the I/O operation. There is a blockage somewhere in the path to the device. That might be due to: v AVG DPB DELAY. Delay due to ESCON director ports being busy. v AVG CUB DELAY. Delay due to the DASD control unit being busy, due to I/O from another sharing MVS system. v AVG DB DELAY. Delay due to the device being busy, due to I/O from another sharing MVS system. v Channel path wait. Whatever PEND time is not accounted for by the above three measures is due to delay for channel paths. This is the measure of channel delay that matters - not channel busy. If you think your channels are too busy, track this component of response time, for volumes that serve important work, to see if it is really a problem. If you think shared DASD causes the problem, look at the DASD Activity report from the other system, taken at the same time. PEND time should generally be less than 4 ms. Problem Area: High PEND times contribute to high AVG RESP TIME. Combined with high I/O rates for important workload, this could become a user response time problem. Potential Solution: High PEND times are usually caused by shared DASD contention or high channel path utilization. On an ESS subsystem, multiple allegiance should reduce PEND times. If problems with PEND times exist you have the following options. v Change the mix of data on the volume to reduce contention. If you can identify one data set that is contributing to most of the problem this may be eligible to be moved to a custom volume or moved into storage. v Check channel utilization, especially if not using eight-path path groups. Changes have been made to CCWs to reduce channel overheads for the ESS. This will tend to lower channel utilization and increase throughput, however, using multiple allegiance and PAV successfully will cause increases in channel utilization.
111
DASD performance
Indicator Field: AVG DISC TIME Description: For cache controllers, DISC includes the time waiting while staging completes for the previous I/O to this particular device or until one of the four lower interfaces frees up from either transferring, staging or destaging data for other devices. DISC can also be caused by a device which looks for the correct location on the disk to read or write the data. Related Fields: AVG RESP TIME, DEVICE ACTIVITY RATE Guideline: Generally, DISC time should be less than 17 ms. If the data is cached with reasonable hit ratios there should be little DISC time (3-5 ms). Once your DISC time gets above 12 ms (about the amount of time you would expect for latency and typical seek times), you are probably experiencing RPS misses due to path contention. Problem Area: High DISC times contribute to high AVG RESP TIME. Combined with significant I/O rates for important workloads, this could become a user response time problem. Potential Solution: If the major cause of delay is disconnect time then you will need to do some further research to find the cause. You might use the ESS Expert for creating more detailed reports. The most probable cause of high disconnect time is having to read or write data from disks rather than cache. You need to check for the following conditions. v High DISK to CACHE transfer rate, check the ESS Expert Disk to Cache Transport report and use the drill-down functions of the ESS Expert to identify the logical volumes that are experiencing response time problems. v High disk utilization, check the ESS Expert Disk Utilization report, if the problem is limited to one physical disk, or array the best solution is to move data to balance performance across the subsystem. The other option is to move data to another subsystem, or to another storage medium, for example data in memory, or to change the way that the application uses this data. v Low CACHE hit ratio seen in the Cache Subsystem Device Overview report. If you are suffering from poor cache hit ratios there is little that you can do, you should look at the ESS as a whole, using ESS Expert and check that activity is balanced across both clusters of the ESS. v NVS full condition, if the NVS is overcommitted you will see an increase in the values reported in the DFW BYPASS field of the Cache Subsystem Overview report. If this is a persistent performance indicator you may want to spread activity across more ESS subsystems. .
112
DASD performance
Indicator Field: AVG CONN TIME Description: This is time as measured by the channel subsystem during which the device is actually connected to the CPU through the path (channel, control unit, DASD) and transferring data. This time is considered good in most cases, because it is transferring data. Related Fields: AVG RESP TIME, DEVICE ACTIVITY RATE Guideline: Connect time is a function of block size and channel speed. Typical connect times often fall in the 2-6 ms range. Problem Area: There are times when doing large searches in catalogs, VTOC, and PDS directories causes the connect time to become excessive. This results in connect time being considered expensive time in that no real data is transferred during this activity. Also during connect time, all parts related to an I/O operation (the entire path) must be available for use by that I/O. So no other user can do I/O to or from a device while that device is in connect mode. One word of caution here - chained scheduling and seldom ending channel programs may generate high connect time which is not indicative of a performance problem. Potential Solution: If you see devices with poor performance and high connect times then the cause is probably application related, high connect times are associated with large data transfers, either due to the use of large block sizes or to activities like DB2 prefetch that schedule large I/O transfers. If the application is not reporting poor response and other users of the volume are not impacted, then no further work is required. High connect time is an indication of large amounts of data being transferred. An application, such as DB2, transfers large I/Os for maximum efficiency.
113
DASD performance
Indicator Field: % DEV UTIL Description: This is the percentage of time the UCB is busy (see IOSQ time). This includes both the time when the device was involved in an I/O operation (CONN and DISC) as well as any time it was reserved but not involved in an I/O operation. Related Field: AVG IOSQ TIME Guideline: A device overload condition is not always easy to detect. The device utilization as reported by RMF is not a measure of goodness or badness. If there is only one active data set on the device, and presuming that this data set cannot be made resident in processor storage, there is nothing wrong with a very high device utilization. However, if multiple online users are all trying to get to data sets on the same device, they can be delayed due to the device being busy (from the other users). This will show up as increased IOSQ time. If the utilization rises above 35%, response times may start to suffer, due to increased queuing for the device. If you have online response-sensitive data on a non-cached volume, you may want to investigate if utilization rises above the 25-30% range. Problem Area: High device utilization can lead to increased IOSQ time, if users are delayed while the device is in use by others. Potential Solution: Potential ways to decrease device utilization include: v Caching the device (if not already cached). With a good hit ratio service time may be reduced significantly, which will reduce device utilization dramatically. v Distribute heavily used data sets onto different volumes. v Remove poor cache users. Turn off caching for data sets or volumes that do not achieve good hit ratios. As discussed before, this frees up cache for users that benefit more.
114
DASD performance
Indicator Field: Multiply DEVICE ACTIVITY RATE by AVG CONN TIME Description: Utilization (% busy) of the paths to DASD devices. Related Fields: AVG PEND TIME, AVG DISC TIME Guideline: The maximum allowable utilization is not a fixed number, but it can be higher as the overall hit ratio increases. In an average 4-path system (DLSE), cached or non-cached, a general guideline is not to exceed 50%, if the goal is to maintain good response for interactive users. If the workload using these paths is not response-time sensitive, you may not care about high path utilization. For example, if you have only four jobs accessing four paths, then there is no delay and high path utilization is desired. Since the storage path utilization is not directly reported, an estimate can be made using the RMF reported LCU I/O rate and the average LCU connect time. This amounts to a simple calculation, which you should execute for each system that accesses devices attached to the storage control unit: 1. Multiply I/O rate by average connect time 2. Divide the result by four (a 3990 has four storage paths) 3. Divide by 10 to convert to a percentage (because I/O rate is given in I/O per second, while connect time is reported in millisecond time units) Problem Area: Higher utilizations will be an issue if they are causing delay to important work. This delay will show up as increased pend and disconnect times for volumes sharing these paths. Potential Solution: If your storage path utilization is above 50% and causing delay to important work, a potential solution is to balance activity across control units by moving data sets.
115
DASD performance
Indicator Field: AVG DISC TIME, ASYNC and STAGE rates Description: These fields give you an indication of lower interface contention. The lower interfaces are the connections between the DASD devices and the DASD control unit. Related Fields: LCU cache hit ratio Guideline: Direct measure of lower interface utilization is not available from any reporting tool, but normally you should not have to worry about the lower interfaces. If the utilization of the storage paths and the hit ratios are within acceptable limits, the lower interface utilization should be fine. The lower interfaces can be overloaded on cache controllers, however, if the hit ratio for the storage control unit falls below about 70% and a high subsystem throughput is imposed. Problem Area: Problems here would be reflected in increased DISC times. If you suspect a problem, check your STAGE and ASYNC rates from the Cache RMF Reporter. Correlate both STAGE and ASYNC rates to your DISC times (for example, plot DISC time vs. STAGE + ASYNC rates). Potential Solution: If you see a significant correlation between STAGE + ASYNC and DISC, investigate cache tuning or workload movement to reduce cache overload.
116
DASD performance
DASD performance in the sysplex

The Shared Device Activity report is a sysplex report that is available for DASD and tape devices. In this context we will discuss the DASD version of this report. Note: The report requires matching device numbers (the physical device must have the same device number on all systems), or self-defining devices to give meaningful results.
Shared DASD Activity report

S H A R E D z/OS V1R11 D I R E C T A C C E S S D E V I C E DATE 04/10/2009 TIME 16.30.00 A C T I V I T Y PAGE INTERVAL 014.59.946 CYCLE 1.000 SECONDS 1
SYSPLEX RMFPLEX1 RPT VERSION V1R11 RMF 900.0 (MAX) = 900.0 (MIN) = 900.0
TOTAL SAMPLES(AVG) =
DEV NUM
DEVICE TYPE
SMF DEVICE AVG AVG AVG AVG AVG VOLUME PAV SYS IODF LCU ACTIVITY RESP IOSQ DPB CUB DB SERIAL ID SUFF RATE TIME TIME DLY DLY DLY DPVOL1 *ALL MVS1 MVS2 MVS3 *ALL MVS1 MVS2 MVS3 14.589 4.550 5.489 4.550 15.186 4.069 6.438 4.679 3.5 0.0 0.0 4.8 0.0 0.0 3.1 0.0 0.0 3.8 1.0 0.0 9.3 0.5 0.0 5.2 2.0 0.0 14.8 0.5 0.0 8.0 2.0 0.0 0.0 0.2 0.0 0.0 0.0 0.3 0.0 0.2 0.0 0.2 0.0 0.8 0.0 0.0 0.0 0.1
AVG AVG AVG PEND DISC CONN TIME TIME TIME 0.4 0.2 0.8 0.3 1.2 1.1 1.0 1.6 0.6 2.5 0.5 4.1 0.6 1.7 0.6 1.9 5.2 2.4 0.4 1.7 10.2 3.1 2.4 2.0
% DEV CONN 3.63 1.86 0.91 0.86
% DEV UTIL 4.45 2.10 1.24 1.11
% AVG DEV NUMBER RESV ALLOC 0.1 0.0 0.1 0.0 651 205 210 236
06AA 3380E
01 02 03
0043 0143 0043
077A 3380E 077A 0872 077A
ERBDAT
01 02 03
0043 0143 0052
3.62 11.48 0.68 0.84 2.00 8.58 0.94 2.06
0.4 1078 0.0 343 0.0 357 0.4 378
Figure 68. Shared DASD Activity Report
This example reports about a sysplex consisting of three systems (MVS1, MVS2, and MVS3). Only two devices are shown. The second device does not have the same device number on all three systems.
Indicator Field: *ALL Description: The system line shows the device activity contributed by all systems in the sysplex. Guideline: The report gives you an overall performance picture of DASD devices that are shared between MVS systems in a sysplex. For each shared DASD device the report contains one line for each system that has access to it. The summary line allows you to identify a bottleneck caused by device delay in the sysplex. Furthermore, it allows you to see each systems share in the bottleneck. The summary device activity rate and the device utilization show the total load on the device. The single-system values show the share of each system.
117
DASD performance
Analyzing specific DASD problems

This section will focus on how to analyze specific DASD problems. It is assumed that you have a good overall knowledge of your DASD subsystem (as discussed in I/O subsystem health check on page 84) before proceeding. One approach to investigating I/O problems can be summarized: 1. Is there a problem? 2. Whos got it? 3. Do I care? 4. What can be done about it? Is there a problem? The Monitor III SYSINFO report shows whether there is any device delay. It also measures the size of the problem by the number of users affected.
SYSINFO report
RMF V1R11 Command ===> Samples: 100 System: PRD1 Date: 11/28/09 Time: 10.32.00 System Information Line 1 of 27 Scroll ===> HALF Range: 100 Sec
Partition: MVS1 CPs Online: 4 AAPs Online: 0 IIPs Online: 0 Group
2094 Model 716 Avg CPU Util%: Avg MVS Util%:
73 84
Appl%: 63 Policy: STANDARD EAppl%: 65 Date: 11/28/09 Appl% AAP: 0 Time: 14.05.07 Appl% IIP: 0 -Average Number Delayed For PROC DEV STOR SUBS OPER ENQ 1.9 0.4 1.4 0.1 0.0 0.0 3.7 1.4 1.4 0.8 0.2 0.4 0.4 0.4 0.3 0.1 0.0 4.1 1.5 1.7 1.0 0.0 0.0 N/A 1.7 1.7 1.6 0.0 0.0 1.5 1.5 1.3 0.1 0.0 7.0 2.0 0.5 4.5 0.0 0.0 0.0 0.5 0.5 0.5 0.0 0.0 2.0 2.0 2.0 0.0 0.0 2.6 0.8 1.8 0.1 0.0 0.0 N/A 1.8 1.8 1.8 0.0 0.0 0.8 0.8 0.8 0.0 0.0 2.0 0.0 1.0 1.0 0.0 0.0 N/A 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 2.0 0.0 0.0 0.0 N/A 2.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
T WFL --Users-- RESP TRANS -AVG USG% TOT ACT Time /SEC PROC DEV 34 664 26 13.95 50 534 8 13.95 26 11 10 0.00 27 115 8 0.00 3 0 0.00 2 0 0.00 5 4 N/A N/A 26 11 10 46.0 0.06 26 11 10 46.0 0.06 23 9 9 27.9 0.06 29 0 0 54.2 0.02 59 1 1 .000 0.00 50 527 8 .759 13.98 50 527 8 .759 13.98 48 526 8 .403 13.98 75 1 1 30.6 0.08 75 0 0 126 0.02 5.1 2.6 1.5 1.1 0.0 0.0 0.2 1.5 1.5 0.9 0.1 0.6 2.6 2.6 2.1 0.3 0.1 5.0 2.1 1.4 1.5 0.0 0.0 N/A 1.4 1.4 1.4 0.0 0.0 2.1 2.1 1.9 0.2 0.0
*SYSTEM *TSO *BATCH *STC *ASCH *OMVS *ENCLAVE PRIMEBAT NRPRIME
W S 1 2 3 PRIMETSO W TSOPRIME S 1 2 3
Figure 69. Monitor III System Information Report
Whos got it? Look at the Delay report shown in Figure 70 on page 119. Who (which address space) has a device delay problem? This is more important than the number of users affected. Remember, I/O service is not democratic. One important job or transaction being delayed can be much more significant than several other transactions or jobs. The Delay report shows where delays exist by specific resource for specific users. This aids in the assessment of performance impact: Who will be helped? How much?
118
DASD performance
DELAY report
Name MISTYDFS BHOLEQB BHOLWTO1 BRHI CONSOLE VTMLCL STARTING CATALOG INIT BHOLDEV3 BTEUDASD BAJU KLSPRINT BHIM BHOLPRO1
Service CX Class B B B T S S T S S B B T B T B NRPRIME NRPRIME NRPRIME TSOPRIME SYSSTC SYSSTC TSOPRIME SYSSTC SYSSTC NRPRIME NRPRIME TSOPRIME NRPRIME TSOPRIME NRPRIME
Cr
WFL USG DLY IDL UKN ----- % Delayed for ----- Primary % % % % % PRC DEV STR SUB OPR ENQ Reason 0 0 0 0 0 0 0 0 0 20 22 24 25 31 52 0 0 0 0 0 0 0 0 0 20 22 24 25 11 13 100 0 0 0 49 0 0 0 33 0 1 0 4 0 0 0 3 0 97 0 2 0 98 0 2 0 7 0 1 0 99 0 1 49 0 0 80 0 0 1 77 0 1 1 75 0 1 0 75 0 0 0 25 64 0 2 12 0 0 12 0 0 0 0 0 0 0 0 0 79 76 75 75 18 0 0 0 0 4 3 2 2 1 1 0 0 0 0 5 0 0 100 0 0 0 33 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 49 0 0 0 0 0 0 0 0 0 0 0 0 0 Message SYSDSN Message COMM LOCL LOCL COMM LOCL LOCL BBVOL1 BBVOL1 BBVOL1 BBVOL1 BBVOL1 BHOLPRO2
Do I care? Who is being delayed, and how much, often determines whether any action is necessary. Very important work requires immediate attention. Less important work? Small delays? Well, maybe you can look at it tomorrow. What can be done about it? To identify the solution to an I/O problem, you must identify where the I/O pain is. Remember, Monitor III can direct you to any device that is causing delays. This point of view is important; just because a device might have a high activity rate or a high response time, it MAY NOT be causing significant delays, or it may be delaying work you do not care about. The reports that are useful here include:
119
DASD performance
DEV report
RMF V1R11 Command ===> Samples: 100 System: PRD1 DLY USG CON % % % 79 76 75 75 18 4 4 1 15 20 23 19 6 11 3 0 4 4 4 1 1 6 0 0 Date: 04/07/09 Time: 10.32.00 Device Delays Line 1 of 9 Scroll ===> HALF Range: 100 Sec
Jobname BHOLDEV3 BTEUDASD KLSPRINT BAJU BHIM BHOL JES2 BHOLNAM4
C Class B B B T T T S B NRPRIME NRPRIME NRPRIME TSOPRIME TSOPRIME TSOPRIME SYSSTC NRPRIME
------------ Main Delay Volume(s) -----------% VOLSER % VOLSER % VOLSER % VOLSER 72 75 75 71 16 9 6 1 BBVOL1 BBVOL1 BBVOL1 BBVOL1 BBVOL1 BBVOL1 SYSPAG BBVOL1
1 430D13
Figure 71. Monitor III Device Delays Report
The Device Delays report shows jobs in your system that are delayed due to device contention. Jobnames (address space names) are listed in descending order in delay percentages. For each jobname, the names of up to 4 device volumes are listed indicating which volumes are causing the delay. If a job is being delayed, you can immediately see which volume is causing the most delay.
120
DASD performance
DEVR report
RMF V1R11 Device Resource Delays Command ===> Samples: 100 Line 1 of 16 Scroll ===> HALF Sec
System: PRD1 Date: 04/07/09 Time: 10.32.00 Range: 100
Volume S/ Act Resp ACT CON DSC PND %, DEV/CU /Num PAV Rate Time % % % Reasons Type BBVOL1 0222 41.8 .099 88 21 66 PND 1 3380K 3990-2
Service USG DLY Jobname C Class % % KLSPRINT BHOLDEV3 BTEUDASD BAJU BHIM BHOL RMFGAT BHOLNAM4 BHOLPRO1 BVAU JES2 BENK BHIM BHOL INIT B B B T T T S B B T S T T T S NRPRIME NRPRIME NRPRIME TSOPRIME NRPRIME TSOPRIME SYSSTC NRPRIME NRPRIME TSOPRIME SYSSTC TSOPRIME TSOPRIME TSOPRIME SYSSTC 6 2 5 3 1 3 0 0 0 6 0 0 0 2 0 92 92 91 91 22 10 1 1 1 14 7 1 1 0 1
SYSPAG 0225 430D13 0220 SYSCAT 0221
10.9 .028 1.1 .032
19 4
6 13 PND 2 2 PND
0 3380K 3990-2 0 3380K 3990-2 0 3380K 3990-2
0.3 .017
0 PND
Figure 72. Monitor III Device Resource Delays Report
The DEVR report is used to evaluate the performance of each volume. It differs from the DEV report in that it lists a volume and all of the jobs that are being delayed for that volume. Thus, if you find a volume that RMF suggests is performing poorly, you can use Monitor III to determine which jobs it is affecting and how badly it is affecting any particular job. You can get similar information from the Monitor I DASD Activity report.
Report Analysis v In this case, from the SYSINFO report (shown in 69) we know that 1.7 jobs in service class NRPRIME are delayed because of a device. v From the Delay report (shown in 70) we know the jobs in NRPRIME are waiting for volume BBVOL1. v From the DEVR report (shown in 72) we identify that BBVOL1 is a 3380K that is not cached. Replacing this disk by a faster device would be appropriate.
121
DASD performance
Performance analysis on data set level

The following sample reports do not show specific problems but can illustrate how performance analysis can be done on data set level. If you need further analysis for a specific volume that you see in the DEVR report, you can navigate through cursor-sensitive control to some reports which provide information on data set level.
RMF V1R11 Device Resource Delays Command ===> _ Samples: 76 System: D3 Line 1 of 19 Scroll ===> HALF Sec
Date: 05/03/09 Time: 12.39.00 Range: 120
Volume S/ Act Resp ACT CON DSC PND %, DEV/CU /Num PAV Rate Time % % % Reasons Type COUPLB S 9191 COUPLP S 91CD SGT103 S 9182 2.4 .115 21 9 0 PND 12 33903 DB 8 3990-3 DPB 3 0 PND 17 33903 DB 15 3990-3 1 PND 2 33903 DPB 1 3990-3
Service USG DLY Jobname C Class % % XCFAS S SYSSTC 16 7
0.8 .224 0.7 .056
18 3
1 0
XCFAS U015074 IXGLOGR CATALOG U015041 CATALOG RMFGAT CATALOG *MASTER*
S SYSSTC T S S T S S S S TSOSLOW SYSTEM SYSTEM TSOSLOW SYSTEM SYSSTC SYSTEM SYSTEM
22 1 1 0 0 0 1 1 1
5 1 1 1 1 4 1 1 0
CMNAF8 S BAF8 HDCAT S 910E
0.5 .105 2.3 .009
5 0
0 0
0 PND DB 0 PND
5 33903 5 3990-6 0 33903 3990-3
Figure 73. Monitor III - Device Resource Delays Report
If you select volume SGT103 and click on Enter, you get this report:
DSNV Report
RMF V1R11 Command ===> _ Samples: 76 System: D3 Date: 05/03/09 Time: 12.39.00 Data Set Delays - Volume Line 1 of 5 Scroll ===> HALF Range: 120 Sec
-------------------------- Volume SGT103 Device Data -------------------------Number: 9182 Active: 3% Pending: 2% Average Users Device: 33903 Connect: 0% Delay DB: 0% Delayed Shared: Yes Disconnect: 1% Delay CU: 0% 0.1 Delay DP: 1% -------------- Data Set Name --------------Jobname ASID DUSG% DDLY% IXGLOGR.ATR.UTCPLXHD.RM.DATA.A0000010.DATA IXGLOGR 0021 1 1 U015074.CALENDAR.DATA U015074 0110 0 1 U015041.COURSEIN.DATA U015041 0125 0 1 SYS1.VVDS.VSGT103 CATALOG 0051 0 1 SYS1.VTOCIX.VSGT103 U015074 0110 1 0 Figure 74. Monitor III - Data Set Delays by Volume
In addition to general information about this volume, you see a list with all data sets which are currently used. Selecting a specific job, you navigate to the next report. The Device Resource Delays report (DEVR) provides USG and DLY values for jobs that are using devices or are waiting for them. This data is gathered in a multistate
122
DASD performance
fashion, this means that there may be several wait records for the same job for the same device. The reporter changes to pseudo multistate, this can result in one USG counter and one DLY counter in parallel within a cycle, but does not take multiple wait records into account. Data gathering for the Data Set Delays reports (DSND, DSNJ, and DSNV) is different. Here, several wait records referring to the same device are not treated as being the same and counted only once because they may refer to different data set names, and have to be counted individually. As a result, the sum of the USG and DLY percentage values in these reports can be different to the USG and DLY percentage values in the DEVR report. Therefore, the three reports contain the headings DUSG% and DDLY% instead of USG% and DLY% to indicate a potential difference to the related values in the DEVR report. It might happen that -- N/A -- is provided instead of a data set name. This happens if only those I/O instructions have been detected for which no data set information is provided by the SMS subsystem, like v I/Os to system data sets (like paging or spooling) v I/Os to any data set which was opened prior to SMS subsystem initialization v I/Os like SENSE or RELEASE v System I/Os not done by an access method
DSNJ report
RMF V1R11 Command ===> _ Samples: 76 Jobname: U015074 ASID 0110 System: D3 Date: 05/03/09 EXCP Rate Time: 12.39.00 Connect: Data Set Delays - Job Line 1 of 2 Scroll ===> HALF Range: 120 0% Sec
0.1
-------------- Data Set Name --------------U015074.CALENDAR.DATA SYS1.VTOCIX.VSGT103
Volume Num DUSG% DDLY% SGT103 9182 0 1 SGT103 9182 1 0
Figure 75. Monitor III - Data Set Delays by Job
This report shows all active data sets for a specific job. If you now are interested to know whether some other job is working with this data set, again, you can navigate to the corresponding report.
123
DASD performance
DSND report
RMF V1R11 Command ===> _ Samples: 76 System: D3 Date: 05/03/09 Time: 12.39.00 Data Set Delays Line 1 of 1 Scroll ===> HALF Range: 120 Sec
Input Data Set Name: SYS1.VTOCIX.VSGT103 -------------- Data Set Name --------------- Volume Jobname SYS1.VTOCIX.VSGT103 SGT103 U015074 Figure 76. Monitor III - Data Set Delays by Data Set ASID DUSG% DDLY% 0110 1 0
In this sample, no other job than U015074 is using the data set.
Improving your DASD performance

This section will suggest actions you may want to consider, to improve the performance of your DASD. v Obviously, the best response time is achieved if the DASD I/O never happens. Eliminate I/O using buffering or data-in-memory techniques wherever possible. v Next best is resolving the I/O in cache. v Last is an I/O to DASD itself. Also remember ways to increase the parallelism of your I/O processing. v This could include the Data Striping capability of DFSMS, where sequential data is distributed across multiple I/O devices to allow concurrent I/O. v As a last resort, parallelism could also mean manually splitting data across volumes, or duplicating data on multiple volumes. In addition, here are actions to consider to address the following symptoms:
High DISC time

v Cache the volume, if it is a suitable cache candidate. v Un-cache the volume, if it is a poor cache candidate. v Tune cache. Best to use DCME to dynamically manage cache for you. If necessary, manually tune cache by turning off poor cache users. v Review block size. Increase where applicable. v Add paths (if not already 4-path). v Use faster device. v Move or copy data sets to other volumes to reduce contention. v Reduce arm movement via data set placement (seek analysis required).
High IOSQ time

v Cache the volume, if it is a suitable cache candidate. v Tune cache. Best to use DCME to dynamically manage cache for you. If necessary, manually tune cache by turning off poor cache users. v Decrease DISC, CONN, PEND times. Decreasing the other response time components will also decrease IOSQ time proportionally. v Move or copy data sets to other volumes to reduce contention. v Run requests sequentially to reduce contention.
124
DASD performance
High CONN time

v v v v Cache the volume, if it is a suitable cache candidate. Use indexed VTOCs. In a cached environment, use faster channels. Use PDSEs if you suspect directory search time is a problem.
High PEND time

v Address the biggest subcomponent of PEND time (see page 111). Use faster channels, or add more channels (if not already 4-path). Move data sets to avoid contention from sharing systems.
125
Tape performance
Tape indicators and guidelines

This section discusses guidelines for analyzing and resolving tape performance issues for the 3480 and 3490 families of tape devices. From a performance perspective, you can think of tape as being like DASD with long connect times. All of the things we used to do for single path DASD are true for tape: short strings, many control units and channels. If the subsystem is configured with one channel per control unit function, the only reason to add channels or cross communication of control units is availability, not performance.
Identifying tape-bound jbs

An indication of a tape-bound job is when a large proportion of the total time for the job is spent in tape activity and a small amount in CPU (or other) activity. The job termination messages, NUMBER OF MOUNTS and AVG MOUNT TIME from the Magnetic Tape Device Activity report can help determine if a job is tape-bound. Look at the job termination messages for the ratio of CPU time to the total elapsed time as indication for tape-bound jobs. You can also use the Monitor III Delay report, to verify that no other (non-I/O) significant delays exist for the job in question and that tape delay is dominant. Once a job is identified as tape-bound then further evaluation and measurements may be justified.
RMF measurements for tape monitoring

The options required to measure and monitor the tape subsystems may not be the default options used in most user environments. The reason is that the jobs and applications using tape are normally batch tasks, more concerned with throughput than response time. Some RMF options that must be enabled to have meaningful RMF data for tape subsystem measurements are discussed. These options may be modified in the ERBRMF00 parmlib member or may be modified online for current sessions:
Monitor I Session Options INTERVAL This value needs to be small (preferably no more than 15 minutes) if the details of the tape subsystem performance are to be evaluated. TAPE must be one of the device types specified. To request I/O queuing activity for tape LCUs, IOQ(TAPE) must be specified. This option specifies that the channel activity is to be measured. The channel utilization may be of interest in evaluating multiple control units on the same channel, or for evaluating the requirement for additional channel paths.
DEVICE IOQ CHAN
126
Tape performance
Magnetic Tape Device Activity report

M A G N E T I C z/OS V1R11 T A P E D E V I C E DATE 11/28/2009 TIME 23.45.00 CR-TIME: 12.03.30 A C T I V I T Y PAGE SYSTEM ID SYSE RPT VERSION V1R11 RMF IODF = 00 CR-DATE: 05/13/2008 INTERVAL 15.00.027 CYCLE 1.000 SECONDS ACT: ACTIVATE % DEV CONN 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 % DEV UTIL 0.00 0.00 0.08 0.03 0.00 0.00 0.00 0.20 0.05 0.00 0.08 0.04 0.19 0.19 % DEV RESV 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NUMBER OF MOUNTS 0 0 0 0 0 0 0 0 0 0 0 0 0 0 AVG MOUNT TIME 0 0 0 0 0 0 0 0 0 0 0 0 0 0 TIME DEVICE ALLOC 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
TOTAL SAMPLES =
810
DEV NUM
DEVICE TYPE
VOLUME SERIAL
DEVICE AVG AVG AVG AVG LCU ACTIVITY RESP IOSQ CMR DB RATE TIME TIME DLY DLY 001D 001D 001D 0.000 0.000 0.005 0.005 0.000 0.000 0.000 0.005 0.005 0.000 0.006 0.006 0.007 0.007 .000 .000 .000 .000 157 .000 157 .000 .000 .000 .000 414 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
AVG AVG AVG PEND DISC CONN TIME TIME TIME .000 .000 .000 .000 .000 .000 .448 156 1.09 .448 156 1.09
1A71 3590L 1A72 3590L 1A73 3590L LCU 1A78 1A79 1A7A 1A7B 3590L 3590L 3590L 3590L LCU 1A7C 3590L 1A7F 3590L LCU 1A85 3490L LCU
001D 001F 001F 001F 001F 001F 0020 0020 0020 0021 0021
.000 .000 .000 .000 .000 .000 .000 .000 .000 .448 413 1.09 .448 413 1.09
414 .000 .000 .000 126 .000 126 .000 260 .000 260 .000
.000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
.000 .000 .000 1.79 122 1.61 1.79 .896 .896 122 1.61 258 1.32 258 1.32
Figure 77. Magnetic Tape Device Activity Report
Response time components

Response time is a key measure of how well your I/O is performing, and a key indicator of what can be done to speed up the I/O. It is important to understand the components of this response and what they mean. Tape response time is the sum of queue, pend, connect and disconnect time (see page 109). The following lists the response time components and what causes them: Queue Time AVG IOSQ TIME is a measure of an I/O request waiting in software for the device to become available to service the request. For tape this usually means a REWIND/UNLOAD is in progress. Pend Time AVG PEND TIME reflects the time an I/O spends waiting to get a path to the control unit. This is the time spent waiting for an available channel path. Disconnect Time AVG DISC TIME is one of the more significant measures of a heavily loaded tape subsystem. Disconnect time is that time spent waiting for the buffer in the control unit to become available to transfer data; it includes the tape rewind time. One example of a disconnected I/O operation is a buffer-full condition. Connect Time AVG CONN TIME is predominantly time spent transferring data between the control unit and the processor. Connect time is directly related to data block size, and to the amount of data transferred for each I/O.
127
Tape performance
General tape indicators

Indicator Field: AVG DISC TIME (LCU), AVG PEND TIME (LCU), AVG CONN TIME (LCU) Description: Ratio of DISC + PEND time to CONN time. Lower ratios indicate less contention for channels and control units. Related Fields: DEVICE ACTIVITY RATE. If the I/O rate to the controller is negligible, dont bother with any further analysis. Guideline: Percent channel busy is not a good indicator of tape performance. Instead look at the ratio of LCU DISC + PEND time to CONN time. If the ratio exceeds 2 to 1 for extended periods of time (with an I/O rate high enough to be of interest), this is an indication of channel and/or control unit contention. Problem Area: This is an indicator that response times are increasing (increased DISC and PEND) due to an overloaded control unit or channels. Potential Solution: If you have significant DISC time and little PEND time, add more or faster control units. If you have significant DISC and PEND times, add more/faster control units and more/faster channels. As a guideline, you probably want to configure no more than four active drives per control unit. Beyond four concurrently active drives behind a single control unit function, data throughput for the additional drives drops off dramatically.
Indicator Field: AVG CONN TIME (LCU) Description: Predominantly, data transfer time between the control unit and the CPU. Related Fields: AVG RESP TIME, DEVICE ACTIVITY RATE Guideline: CONN times in the 20-40 ms range are typical. If your CONN time is significantly lower (for example, 4-5 ms), you may have smaller than optimal block sizes, resulting in more I/Os required to transfer the data. Problem Area: This could mean delay to the application, if a high number of I/Os are being generated to transfer the data. Decreasing the I/O rate by increasing the amount of data transferred per I/O will speed up the application run-time. Potential Solution: Increase block size and/or the number of host buffers where applicable.
128
Tape performance
Improving your tape performance

This section will suggest actions you may want to consider to improve tape performance. The key to good performance is to maximize data transfer, minimize control unit communication and command processing, and minimize PEND and DISC times. This can be accomplished by using large block sizes, allocating no more than four concurrently active drives per control unit and not cross-coupling control units. As a checklist, review the following: v Review block size. Increase where applicable. v Allocate data sets to DASD. TMM (tape mount management) is a method to accomplish this and significantly reduce tape mount requirements. v Increase host buffers for priority jobs. Increase buffers from default of 5, to as high as 20 (but no more than 20). v Use VIO for temporary data sets. v Use faster channels. v Add channels (if not already one channel per control unit function). v Reduce mount delay. use ACLs/ICLs (cartridge loaders) for nonspecific mounts use an automated tape library for specific mounts review physical logistics for manual mount efficiencies (e.g. are tapes stored near the operators, and near the drives?) v Add control unit functions (probably want no more than 4 concurrently active drives per control unit). v Use enhanced-capacity cartridges and tape drives. v Restructure for more parallelism (for example, to allow application to use multiple drives). v Use the Improved Data Recording Capability (IDRC) to reduce the amount of data transfer required.
129
I/O Performance - Summary
Summary
Typically, I/O processing is the largest single component of transaction response time. So, while I/O performance management can be complicated and time consuming, the return in terms of response time and throughput can be significant. Look for ways to avoid I/O altogether: Through optimized use of buffers, newer data-in-memory techniques, and eliminating unnecessary I/O. Cache the I/O that will benefit from cache, and keep cache tuned to the extent that poor cache users are not allowed to get in the way of better cache users.
As with all performance management you need to: v Understand the overall system picture (CPU, processor storage) and the overall I/O subsystem picture before tuning. v Measure the delay to important workloads. v Review related fields before tuning. For example, if response time for a device seems high, also check the I/O rate, type of workload using the device, and % DLY to users, before deciding what (if any) action to take. v Determine where your leverage for improvement exists. Start by seeing which component of response time (DISC, CONN, etc.) is dominant, and take further action accordingly.
130
Chapter 5. Analyzing processor storage activity
Lets Analyze Your Storage This topic discusses how to analyze a processor storage problem. Topics include: v What are the indicators in RMF to review, to verify there is a problem? v Some examples of common problems that RMF can help you resolve. v Potential tuning actions. Simply put, resolving processor storage problems generally means prioritizing which workloads get access to processor storage, finding ways to move fewer pages, or speeding up the movement of pages.
131
Do you have a processor storage problem?

Here are the primary RMF indicators you may wish to use in assessing your processor storage configuration. After listing these indicators, we will then discuss each one in more detail, with guidelines for each value that may indicate contention and suggestions for further action. We will refer to the components of storage as central storage (CS) and auxiliary storage (AUX). Never start any tuning effort simply because one indicator seems to be a problem. Always check other related indicators to build a good understanding of your overall processor storage situation and overall system situation, and only then begin to plan your tuning efforts. For example, if your UIC is low, also look at demand paging from auxiliary storage and storage delays in Monitor III. Table 3 shows for each monitor, which fields in which report can most often help detect and solve a processor storage problem.
Table 3. Processor Storage Indicators in Monitor III and Monitor I MONITOR Monitor III REPORT SYSINFO DELAY STOR Monitor I Paging Activity PAGE MOVEMENT WITHIN CENTRAL STORAGE PAGE-IN EVENTS AVG HIGH UIC TOTAL FRAMES AVAILABLE Page/Swap Activity SLOTS ALLOC SLOTS USED % IN USE DASD Activity AVG RESP TIME FIELDS Average Number Delayed for STOR %Delayed for STR DLY%
132
Storage analysis

This section reviews the most frequently used processor storage indicators in the Monitor III reports. We will start by looking at SYSINFO and DELAY reports, which can lead you to the STOR report.
SYSINFO report
RMF V1R11 Command ===> Samples: 100 System: PRD1 Date: 04/07/09 Time: 10.32.00 System Information Line 1 of 23 Scroll ===> HALF Range: 100 Sec
Partition: MVS1 CPs Online: 4 AAPs Online: 0 IIPs Online: 0 Group
2094 Model 716 Avg CPU Util%: Avg MVS Util%:
73 84
Appl%: 63 Policy: STANDARD EAppl%: 65 Date: 11/28/09 Appl% AAP: 0 Time: 14.05.07 Appl% IIP: 0 -Average Number Delayed For PROC DEV STOR SUBS OPER ENQ 1.9 0.4 1.4 0.1 0.0 0.0 3.7 0.0 0.0 0.0 0.0 1.4 1.4 0.8 0.2 0.4 4.1 1.5 1.7 1.0 0.0 0.0 N/A 0.0 0.0 0.0 0.0 1.7 1.7 1.6 0.0 0.0 7.0 2.0 0.5 4.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0.0 0.0 2.6 0.8 1.8 0.1 0.0 0.0 N/A 0.0 0.0 0.0 0.0 1.8 1.8 1.8 0.0 0.0 2.0 0.0 1.0 1.0 0.0 0.0 N/A 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 2.0 0.0 2.0 0.0 0.0 0.0 N/A 0.0 0.0 0.0 0.0 2.0 2.0 2.0 0.0 0.0
T WFL --Users-- RESP TRANS -AVG USG% TOT ACT Time /SEC PROC DEV 34 664 50 534 26 11 27 115 3 2 5 4 3 3 2 1 26 11 26 11 23 9 29 0 59 1 26 8 10 8 0 0 N/A 0 0 0 0 10 10 9 0 1 13.95 13.95 0.00 0.00 0.00 0.00 N/A 0.00 0.00 0.00 0.00 0.06 0.06 0.06 0.02 0.00 5.1 2.6 1.5 1.1 0.0 0.0 0.2 0.0 0.0 0.0 0.0 1.5 1.5 0.9 0.1 0.6 5.0 2.1 1.4 1.5 0.0 0.0 N/A 0.0 0.0 0.0 0.0 1.4 1.4 1.4 0.0 0.0
*SYSTEM *TSO *BATCH *STC *ASCH *OMVS *ENCLAVE PRIMEAPP APPPRIME
W S 1 2 PRIMEBAT W NRPRIME S 1 2 3
.000 .000 .000 .000 46.0 46.0 27.9 54.2 .000
Figure 78. Monitor III System Information Report
133
Storage analysis
Indicator Field: Average Number Delayed For STOR Description: This shows you the average number of AS delayed for reasons related to processor storage. Guideline: Look at the value for the user community you are trying to help (for example, a specific service class, all TSO, or overall SYSTEM). If the STOR delay is larger than the other delays listed, you probably have some leverage here. If the STOR delay is zero or near zero, then your leverage is elsewhere (for example, I/O or CPU). Problem Area: This points out delays for paging, swapping, VIO, and out-ready. Potential Solution: See the Storage Delays report to find out which AS are delayed, and the type of storage delay they are having. You can get there by positioning the cursor on the storage value you are interested in, and pressing ENTER.
134
Storage analysis
DELAY report
Name BHOLEQB JES2 SMF BAJU RMFGAT BHOL BHOLPRO2 BHOLPRO1 *MASTER* CATALOG BHOLSTOA PCAUTH
Service CX Class B S S T S T B B S S B S BATCH SYSSTC SYSSTC TSO SYSSTC TSO BATCH BATCH SYSTEM SYSSTC BATCH SYSSTC
Cr
WFL USG DLY IDL UKN ----- % Delayed for ----- Primary % % % % % PRC DEV STR SUB OPR ENQ Reason 0 0 0 29 50 78 93 95 96 100 100 0 51 0 49 0 2 0 98 0 1 0 99 9 22 67 2 1 1 0 98 18 5 73 4 93 7 0 0 95 5 0 0 24 1 0 76 2 0 0 98 1 0 99 0 0 0 0 100 0 0 0 1 1 1 7 5 0 0 0 0 0 0 2 0 1 0 0 21 0 0 3 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 SYSDSN 0 SYSPAG 0 SYSPAG 0 LOCL 0 BAJU 0 RMFUSR 0 BHOLPRO1 0 BHOLPRO2 0 SYSPAG 0 0 0
Indicator Field: % Delayed for STR Description: This will give you the percentage of time the job was delayed for processor storage related reasons. Guideline: See if the STR delay is larger than the others, and is significant (over 10%). If so, go to the Storage Delays panel (enter STOR at the command line) and see the guidelines described there. Problem Area: Delays due to paging, swapping, VIO, and out-ready. Potential Solution: See the STOR report on page 136 topic of this information unit.
135
Storage analysis
STOR report
RMF V1R11 Command ===> Samples: 100 System: PRD1 Service Class TSO TSO SYSTEM SYSSTC SYSSTC SYSSTC SYSSTC SYSSTC SYSSTC SYSSTC SYSSTC SYSSTC SYSSTC SYSSTC SYSSTC BATCH DLY % 21 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Date: 04/07/09 Time: 10.32.00 Storage Delays Line 1 of 44 Scroll ===> HALF Range: 100 Sec
Jobname BAJU BHOL *MASTER* PCAUTH RASP TRACE XCFAS GRS SMXC SYSBMAS DUMPSRV CONSOLE ALLOCAS SMF LLA BHOLSTO9
C T T S S S S S S S S S S S S S B
------- % Delayed for -----COMM LOCL VIO SWAP OUTR 0 18 0 3 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-- Working Set -Central Expanded 273 1111 73 30 60 35 173 68 20 21 33 74 26 47 213 0
Figure 80. Monitor III Storage Delays Report
136
Storage analysis
Indicator Field: Check DLY % first, then find biggest component (e.g. COMM, LOCL...). Description: This report shows AS in descending order of delay. Guideline: Check if the value in the DLY % field is higher than 10% before proceeding. The DLY % fields in the various Monitor III reports help you to estimate how much you can reduce response time by eliminating delay. For example, by entirely eliminating 10% delay, you can improve response time by about 10%. Problem Area: Paging, swapping, VIO, etc. Potential Solution: v COMM See Auxiliary storage tuning on page 144 v LOCL See Prioritize access to processor storage on page 144 - storage isolation and criteria age See Auxiliary storage tuning on page 144 If swappable, consider swap set size tuning v VIO See Prioritize access to processor storage on page 144 - storage isolation and criteria age See Auxiliary storage tuning on page 144 v SWAP See Prioritize access to processor storage on page 144 - storage isolation and criteria age See Auxiliary storage tuning on page 144 Swap set size tuning v OUTR Check RTO setting
137
Storage analysis
Monitor I indicators
This section will review the most frequently used processor storage indicators in the Monitor I reports.
Paging Activity - Central Storage Paging Rates

P A G I N G A C T I V I T Y PAGE 1 z/OS V1R11 SYSTEM ID OS04 RPT VERSION V1R11 RMF DATE 06/05/2009 TIME 09.29.00 INTERVAL 14.59.996 CYCLE 1.000 SECONDS
OPT = IEAOPT02 MODE=ESAME CENTRAL STORAGE PAGING RATES - IN PAGES PER SECOND -------------------------------------------------------------------------------------------------------------------------------PAGE IN -----------------------------------NON % NON SWAP, OF SWAP, NON TOTAL TOTL SWAP BLOCK BLOCK RATE SUM PAGE OUT ---------------------------% OF NON TOTAL TOTL SWAP SWAP RATE SUM
CATEGORY PAGEABLE SYSTEM AREAS (NON VIO) LPA CSA SUM ADDRESS SPACES HIPERSPACE VIO NON VIO SUM TOTAL SYSTEM HIPERSPACE VIO NON VIO SUM SHARED
0.00
0.01
0.01
0 0.00 0.00 0 ------ ------ ---0.00 0.00 0
0.00 0.04 ------ -----0.00 0.05
0.04 0 ------ ---0.05 0
0.06 0.00 0.28 7.01 2.01 ------ ------ -----0.28 7.07 2.01
0.06 0.00
1 0
0.00 0.00
0.00 0.00
0 0
9.30 99 ------ ---9.37 100
0.00 0.01 0.01 100 ------ ------ ------ ---0.00 0.01 0.01 100
0.06 0.00 0.28 7.01 2.06 ------ ------ -----0.28 7.07 2.06 0.00
0.06 0.00
1 0
0.00 0.00
0.00 0.00
0 0
9.35 99 ------ ---9.41 100 0.00 20.84 5.6 1.26 3.31
0.00 0.01 0.01 100 ------ ------ ------ ---0.00 0.01 0.01 100 0.00 0.00 0.0
PAGE MOVEMENT WITHIN CENTRAL STORAGE AVERAGE NUMBER OF PAGES PER BLOCK BLOCKS PER SECOND PAGE-IN EVENTS (PAGE FAULT RATE)
PAGE MOVEMENT TIME %
Figure 81. Paging Activity report - Central Storage Paging Rates
Indicator Field: PAGE MOVEMENT WITHIN CENTRAL STORAGE Description: Rate of page movement above and below 16MB in CS. For example, to keep long term fixed pages out of reconfigurable storage. Guideline: This is rarely a problem. If you see large numbers here (maybe over 100/sec), it may be worth looking at. Or, if PAGE MOVEMENT TIME % reports a high value, this may indicate a problem. Problem Area: Some CPU time is required to move these pages. Generally not enough to worry about. Potential Solution: Review the setting of your RSU parameter for reconfigurable storage. Reduce if possible (but not at the expense of your disaster recovery planning).
138
Storage analysis
Indicator Field: PAGE-IN EVENTS (PAGE FAULT RATE) Description: This is the rate of demand paging in from auxiliary storage. The AS waits while these pages are obtained. Guideline: No overall system guideline for this number would make much sense. More important than the single-number here, is the amount of paging delay to important work (see Monitor III Storage Delays report). You may wish to track this one over time, or turn to this for specific MPL tuning. Problem Area: Problems here would show up as paging delay to important work. Potential Solution: Paging problems are probably best addressed by analyzing which workloads are paging, then addressing the problem for those workloads specifically (see Monitor III Storage Delays report). Use Monitor III to get to the AS level, or the Workload Activity report to get to the service class. There, the following solutions could apply: v Isolate storage v Adjust swap set size v Adjust the criteria age value v As a last resort, see Auxiliary storage tuning on page 144 for ways to speed up the paging.
139
Storage analysis
Paging Activity - Central Storage Movement Rates / Frame and Slot Counts
P A G I N G z/OS V1R11 SYSTEM ID TRX2 RPT VERSION V1R11 RMF A C T I V I T Y PAGE DATE 11/28/2009 TIME 16.15.00 INTERVAL 15.00.000 CYCLE 1.000 SECONDS 2
OPT = IEAOPTL9 MODE = ESAME CENTRAL STORAGE MOVEMENT RATES - IN PAGES PER SECOND -----------------------------------------------------------------------------------------------------------------------------------HIGH UIC (AVG) = 65535 (MAX) = 65535 (MIN) = 65535
WRITTEN TO CENTRAL STOR HIPERSPACE PAGES VIO PAGES RT 0.00
READ FROM CENTRAL STOR 0.00
*--- CENTRAL STORAGE FRAME COUNTS ----* MIN MAX AVG 2 2 2
RT
0.00
0.00
FRAME AND SLOT COUNTS -----------------------------------------------------------------------------------------------------------------------------------CENTRAL STORAGE LOCAL PAGE DATA SET SLOT COUNTS --------------------------------------------------------------------------MIN MAX AVG MIN MAX AVG (91 SAMPLES) AVAILABLE SQA LPA CSA LSQA REGIONS+SWA TOTAL FRAMES
793,610 9,652 4,894 4,515 23,847 204,865 1048576
796,682 9,660 4,895 4,530 24,059 207,724 1048576
795,793 9,656 4,894 4,521 23,922 205,681 1048576
AVAILABLE SLOTS VIO SLOTS NON-VIO SLOTS BAD SLOTS TOTAL SLOTS
1,797,432 0 5,058
1,797,641 0 5,267
1,797,526 0 5,173
0 0 0 1,802,699 1,802,699 1,802,699
FIXED FRAMES ---------------------------NUCLEUS 2,603 2,603 2,603 SQA 8,766 8,774 8,769 LPA 79 79 79 CSA 172 172 172 LSQA 9,184 9,356 9,243 REGIONS+SWA 7,709 8,451 7,853 BELOW 16 MEG 37 37 37 BETWEEN 16M-2G 17,519 18,245 17,662 TOTAL FRAMES 28,533 29,260 28,720
SHARED FRAMES AND SLOTS -----------------------------------------------CENTRAL STORAGE 2,835 2,843 2,836 FIXED TOTAL FIXED BELOW 16 M AUXILIARY SLOTS TOTAL 16 0 0 2,881 24 0 0 2,889 16 0 0 2,882
| | | | | | | |
GETMAIN REQ FRAMES BACKED FIX REQ < 2 GB FRAMES < 2 GB REF FAULTS 1ST NON-1ST
STORAGE REQUEST RATES ---------------------------555 154 5 81 326 0
MEMORY OBJECTS AND FRAMES -----------------------------------------------OBJECTS COMMON 6 6 6 SHARED 0 0 0 LARGE 0 0 0 FRAMES COMMON 1,364 1,364 1,364 COMMON FIXED 0 0 0 SHARED 0 0 0 1 MB 0 0 0
Figure 82. PAGING Report - Central Storage Movement Rates / Frame and Slot Counts
140
Storage analysis
Indicator Field: HIGH UIC (AVG) Description: Amount of time a page remains in central storage without being referenced. This measures stress on central storage and ranges from a minimum of 0 to a maximum of 65535. Guideline: Small values might indicate storage constraints, however, some systems run perfectly at low UICs. Problem Area: Problems here would show up as delays caused by demand paging or swapping, or increased CPU overhead due to page movement. Also throughput of the system could drop, due to MPL adjustment. Potential Solution: Decrease storage demand.
Page/Swap Data Set Activity report

P A G E / S W A P D A T A S E T A C T I V I T Y PAGE 1 z/OS V1R11 SYSTEM ID OS04 RPT VERSION V1R11 RMF 898 DATE 06/05/2009 TIME 09.29.00 INTERVAL 14.59.996 CYCLE 1.000 SECONDS
NUMBER OF SAMPLES =
PAGE DATA SET USAGE ------------------------SLOTS ---- SLOTS USED --ALLOC MIN MAX AVG 45000 21240 566820 566820 566820 566820 566820 16466 5345 67670 65887 66611 66296 66390 16466 5378 70965 69007 69723 69563 69311 16466 5355 69140 67190 67928 67618 67560 % PAGE V IN TRANS NUMBER PAGES I USE TIME IO REQ XFER'D O DATA SET NAME 0.11 0.11 2.67 3.01 3.01 3.23 3.12 0.200 0.027 0.004 0.005 0.005 0.005 0.005 6 38 711 771 821 726 766 5 37 5,697 5,711 5,625 5,523 5,548 PAGE.OS04.APLPA PAGE.OS04.ACOMMON PAGE.OS04.LOCAL01 PAGE.OS04.LOCAL02 PAGE.OS04.LOCAL03 PAGE.OS04.LOCAL04 PAGE.OS04.LOCAL05
PAGE SPACE TYPE PLPA COMMON LOCAL LOCAL LOCAL LOCAL LOCAL
VOLUME DEV DEVICE SERIAL NUM TYPE PAGP11 PAGP10 PAGP12 PAGP13 PAGP14 PAGP15 PAGP16 0378 0377 0379 037A 037B 038D 039F 33903 33903 33903 33903 33903 33903 33903
BAD SLOTS 0 0 0 0 0 0 0
N N Y Y Y
Figure 83. Monitor I Page/Swap Data Set Activity Report
Note: Starting with OS/390 2.10, swap data sets are no longer supported. Therefore, the report contains information only about page data sets.
Indicator Field: SLOTS ALLOC and AVG SLOTS USED Description: This tells you how much of your auxiliary storage space is used. Guideline: Keep the percent of space used to around 30% (always below 50%) of the space allocated.
141
Storage analysis
Indicator Field: % IN USE Description: This is the busy percentage for the data set. Guideline: For page data sets, you may see response time increase if this number rises above 30%. If you have swap data sets defined, you may see response time increases if they are above 15%. Problem Area: Increased time to satisfy a page request from DASD. Potential Solution: See Auxiliary storage tuning on page 144

D I R E C T z/OS V1R11 A C C E S S D E V I C E DATE 05/27/2008 TIME 06.00.00 CR-TIME: 10.31.31 A C T I V I T Y PAGE SYSTEM ID SYSF RPT VERSION V1R11 RMF IODF = 00 CR-DATE: 10/16/2009 INTERVAL 14.59.998 CYCLE 1.000 SECONDS ACT: ACTIVATE AVG AVG AVG PEND DISC CONN TIME TIME TIME .287 .000 4.58 .000 .000 .000 .274 .052 1.54 .275 .051 1.62 .284 .239 .214 .000 .000 .256 .640 1.54 .000 .512 .384 .768 .000 .000 .005 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .549 .297 2.01 .000 .000 .337 1.15 1.41 .000 1.15 .640 1.28 .000 .000 % DEV CONN 0.02 0.00 0.21 0.07 0.07 0.05 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 % DEV UTIL 0.02 0.00 0.22 0.08 0.07 0.05 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 % DEV RESV 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 AVG % % NUMBER ANY MT ALLOC ALLOC PEND 0.0 0.0 14.0 14.0 38.0 4.0 3.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 46.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2
TOTAL SAMPLES =
900
STORAGE GROUP
DEV DEVICE NUM TYPE 732A 33903 733A 33903 733D 33903
NUMBER OF CYL 200 200 200
VOLUME PAV SERIAL SYSXCA SYSAXA SYSDXA LCU
LCU
DEVICE AVG AVG AVG AVG ACTIVITY RESP IOSQ CMR DB RATE TIME TIME DLY DLY 0.037 0.000 1.354 1.391 1.281 1.707 0.041 0.000 0.000 0.012 0.001 0.001 0.000 0.001 0.001 0.001 0.000 0.000 3.047 4.87 .000 .000 .000 1.86 .000 1.94 .000 1.71 .536 2.22 .000 .000 .593 1.79 2.94 .000 1.66 1.02 2.05 .000 .000 .867 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .178 .000 .000 .000 .133 .000 .134 .000 .154 .133 .135 .000 .000 .163 .512 1.41 .000 .384 .256 .640 .000 .000 .038 .037 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
002C 002C 002C 002C 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0040 0040 0040 0040 0040 0040 0040 0040 0040 0040 0040 0040 0040 0040 0040
SMS SMS SMS DB2 DB2 DB2
9690 9691 9692 9693 9694 9695 9696 9697 9699 969A 969B 969C 969E 969F
33909 33909 33909 33909 33909 33909 33909 33909 33909 33909 33909 33909 33909 33909
10017 10017 10017 10017 10017 10017 10017 10017 10017 10017 10017 10017 10017 10017
SYSLIB SYSUSR SYSOPE SYSCDS SYSHM1 SYSSMS SYSST1 SYSST2 SYSHM2 SYSSD1 SYSSD2 SYSSD3 SYSDB2 SYSIMS LCU
1.05 .365
.143 .036
.259 .002 .428
Figure 84. Direct Access Device Activity Report
142
Storage analysis
Indicator Only DASD volumes used for paging and swapping are of interest in this section. See Chapter 4, Analyzing I/O activity for analysis of other I/O. Field: AVG RESP TIME Description: This will show overall response time (ms) for these volumes. Related Field: DEVICE ACTIVITY RATE, Monitor III page/swap delay % values. Guideline: Typical response times for page/swap volumes are in the 30-40 ms range. If your response times are significantly higher you may want to investigate. Problem Area: Problems here would add page or swap delay to your important workloads. Potential Solution: See Auxiliary storage tuning on page 144
143
Storage analysis
General storage recommendations

This section gives actions you may want to consider to improve your processor storage performance. Before you start, bear in mind the following; v The first priority of tuning is to make the best use of your resources. v After this, prioritizing workload is even more critical as you will be stealing from one workload to give to another.
Increase storage
v Take central storage from another partition, or install more.
Auxiliary storage tuning

Reduce paging (see above). Dedicate volumes to page data sets. Distribute page data sets among channel paths and control units. Limit use of VIO=YES. Add more LOCALs. Make the sum of all the page space 2 to 4 times the number of slots used. v Use faster devices. v v v v v
Prioritize access to processor storage

v Make better use of the PLPA by removing unused modules and verify that concatenated libraries are needed (LPALSTxx). v Decrease the page fixing level. RMF tells you the cause of the problem at PRIV-FF in the ARD Monitor II report. v Use the storage protection specification in the service policy.
Tune DASD
v Tuning normal DASD I/O reduces storage requirements. Transactions will complete faster, needing less virtual and processor storage.
Summary
To analyze storage problems you should: v View the system as a whole and not rely on individual guidelines. As we have said before the main thing is: Are your SLA targets being met? If not, and storage is the problem then: v Ensure the storage is being used fully and start looking at prioritizing access. v Use the recommendation checklist with the z/OS MVS Initialization and Tuning Reference. v Only change one value at a time and monitor its effect before making any other change.
144
Lets Analyze Your Sysplex This information unit discusses some special performance aspects in a sysplex: v How to understand CICS and IMS workload reports v Analyzing coupling facility activities
145
Understanding WLM
Understanding Workload Activity data for IMS or CICS

This information unit provides an explanation of some Workload Activity report data for work managers. RMF provides data for the subsystem work managers that support workload management. In z/OS, these are IMS and CICS. You find these values either in the Monitor III Work Manager Delays report or in the Postprocessor Workload Activity report. The following examples will describe the Workload Activity report. This section provides some sample reports for CICS and IMS, followed by some potential problem data and explanations. Based on the explanations, you may decide to alter your service class definitions. In some cases, there may be some actions that you can take. In those cases, you can follow the suggestion. In other cases, the explanations are provided only to help you better understand the data.
Interpreting the Workload Activity report

The following samples of a Workload Activity report with CICS transactions are discussed: v CICS service class accessing DBCTL v CICS service class accessing DBCTL with IMS V5 on page 148 v CICS accessing data via FORs on page 149
CICS service class accessing DBCTL

Figure 85 shows an example of the work manager state section for a service class representing CICS transactions accessing Database Control (DBCTL).
REPORT BY: POLICY=HPTSPOL1 WORKLOAD=PRODWKLD SERVICE CLASS=CICSHR RESOURCE GROUP=*NONE PERIOD=1 IMPORTANCE=1
-TRANSACTIONS- TRANS-TIME HHH.MM.SS.TTT AVG 0.00 ACTUAL 114 MPL 0.00 EXECUTION 78 ENDED 216 QUEUED 36 END/S 0.24 R/S AFFINITY 0 #SWAPS 0 INELIGIBLE 0 EXCTD 216 CONVERSION 0 AVG ENC 0.00 STD DEV 270 REM ENC 0.00 MS ENC 0.00 RESP TIME (%) 93.4 67.0 -------------------------------- STATE SAMPLES BREAKDOWN (%) ------------------------------- ------STATE-------ACTIVE-- READY IDLE -----------------------------WAITING FOR----------------------------- SWITCHED SAMPL(%) SUB APPL CONV PROD LOCAL SYSPL REMOT 10.9 0.0 0.0 0.0 89.2 0.0 89.2 0.0 0.0 19.7 0.0 10.6 0.0 0.0 69.7 0.0 0.0 0.0
SUB TYPE CICS CICS
P BTE EXE
Figure 85. Hotel Reservations Service Class
The fields in this report describe the CICSHR service class. CICS transactions have two phases: v The begin-to-end phase (CICS BTE) takes place in the first CICS region to begin processing a transaction. Usually this is the terminal owning region (TOR). The TOR is responsible for starting and ending the transaction. The ENDED field shows that 216 hotel reservation transactions completed. The ACTUAL time shows that the 216 transactions completed in an average transaction time of 0.114 seconds. v The execution phase (CICS EXE) can take place in an application owning region (AOR) and a file owning region (FOR). In this example, the 216 transactions were routed by a TOR to an AOR.
146
Understanding WLM
The EXCTD field shows that the AORs completed 216 transactions in the interval. The EXECUTION time shows that on average it took 0.078 seconds for the AORs to execute the 216 transactions. The EXECUTION time applies only to the EXCTD transactions. While executing these transactions, the CICS subsystem records the states the transactions are experiencing. RMF reports the states in the STATE SAMPLES BREAKDOWN (%) section of the report. Because there is a CICS BTE and a CICS EXE field, you can assume that the time spent in the TOR represents the BTE phase, and the time spent in the AOR represents the EXE phase. There is one EXE phase summarizing all the time spent in one or more AORs. Figure 86 shows the actual response time break down of the CICSHR service class.
Figure 86. Response Time Breakdown of CICSHR accessing DBCTL
The CICS BTE field shows that the TORs have information covering 93.4% of the response time. RMF does not have information covering 100% of the 0.114 seconds response time, because it takes some time for the system to recognize and assign incoming work to a service class before it can collect information about it. For most of the 93.4% of the time, the transactions did not run in the TOR, but had been routed locally to an AOR on the same MVS image. You can see this by the SWITCHED SAMPL(%) LOCAL field, which is 89.2% of the total state samples. This value accounts for 83.3% of the response time, because 100% of the total state samples correspond to 93.4% of the response time ( 89.2 x 93.4 / 100 = 83.3%). This value of 89.2% is close, if not equal, to the WAITING FOR CONV field, which indicates that there is no delay in the TOR once the AOR has returned the transactions. The total execution time is some percentage of the total response time. It is the EXECUTION transaction time (0.078), divided by ACTUAL transaction time (0.114), which is 68.4%. The CICS execution phase (CICS EXE field) covers 67% of the response time. Some of that time the work is active in the AOR, sometimes it is waiting behind another task in the region, but 69.7% of the total state samples in the PROD field (which corresponds to 69.7 x 67 / 100 = 46.7% of the response time) were found outside of the CICS subsystem, waiting for another product to provide some service to these transactions. Based on the configuration of the system, the transactions are accessing DBCTL.
147
Understanding WLM
The LOCL, SYSP and REMT state percentages appear in the WAITING FOR section if greater than zero and show the percentages of the total state samples the service class was delayed in these states when CICS was waiting to establish a session. The STATE SWITCHED SAMPL(%) fields LOCAL, SYSPL, and REMOT show the percentage of the state samples in which transactions were routed via MRO, MRO/XCP, or VTAM connections.
CICS service class accessing DBCTL with IMS V5

Figure 87 shows an example of a Workload Activity report for a CICS service class representing point-of-sale transactions (CICSPS). This example builds on the example in Figure 85 on page 146. It shows how the RMF report changes when IMS is upgraded to Version 5.
REPORT BY: POLICY=WSTPOL02 -TRANSACTIONSAVG 0.00 MPL 0.00 ENDED 579 END/S 0.95 #SWAPS 0 EXCTD 559 AVG ENC 0.00 REM ENC 0.00 MS ENC 0.00 RESP TIME (%) 96.6 29.2 43.7 WORKLOAD=CICS SERVICE CLASS=CICSPS RESOURCE GROUP=*NONE PERIOD=1 IMPORTANCE=1
TRANS-TIME HHH.MM.SS.TTT ACTUAL 503 EXECUTION 399 QUEUED 104 R/S AFFINITY 0 INELIGIBLE 0 CONVERSION 0 STD DEV 847
SUB TYPE CICS CICS IMS
P BTE EXE EXE
-------------------------------- STATE SAMPLES BREAKDOWN (%) ------------------------------- ------STATE-------ACTIVE-- READY IDLE -----------------------------WAITING FOR----------------------------- SWITCHED SAMPL(%) SUB APPL CONV PROD I/O LOCK LOCAL SYSPL REMOT 2.5 0.0 0.1 0.0 97.4 0.0 0.0 0.0 97.4 0.0 0.0 48.6 0.0 14.0 0.0 0.0 23.6 13.7 0.0 0.0 0.0 0.0 92.2 0.0 0.0 0.0 0.0 0.0 0.0 7.8 0.0 0.0 0.0
Figure 87. Point-of-Sale Service Class
The ENDED field shows that 579 point-of-sale transactions completed. The ACTUAL time shows that those transactions completed in an average response time of 0.503 seconds. The EXCTD field shows that the AORs completed 559 transactions in the interval. This is less than the ENDED value because a TOR may not always route all transactions to an AOR. Those non-routed transaction are not counted as EXCTD. The EXECUTION field shows that on average it took the AORs 0.399 seconds to execute the 559 transactions. While executing these transactions, the CICS subsystem reports the states the transactions are experiencing. Figure 88 on page 149 shows the response time breakdown of the CICSPS service class.
148
Understanding WLM
Figure 88. Response Time Breakdown of CICSPS accessing DBCTL with IMS V5
The CICS BTE field shows that the TORs have information covering 96.6% of the response time. Most of that time, the transactions were in fact not being run in the TOR, but had been routed locally to an AOR on the same MVS image. You can see this by the SWITCHED SAMPL(%) LOCAL field, which is 97.4%. This value accounts for 94.1% of the response time (97.4 x 96.6 / 100 = 94.1%) The EXECUTION transaction time (0.399 seconds) divided by the ACTUAL transaction time (0.503 seconds) is 79.32%. There are two execution phases shown: a CICS EXE, and an IMS EXE. The CICS execution phase covers only 29.2% of the response time. That is lower than in the previous example, because now IMS is providing information. The IMS EXE row shows the 43.7% of the response time for which it is responsible.
CICS accessing data via FORs

Figure 89 shows an example of a Workload Activity report for some CICS user transactions which have data accessed via file owning regions.
REPORT BY: POLICY=EQUALIMP -TRANSACTIONSAVG 0.00 MPL 0.00 ENDED 14119 END/S 47.17 #SWAPS 0 EXCTD 13067 AVG ENC 0.00 REM ENC 0.00 MS ENC 0.00 WORKLOAD=CICS SERVICE CLASS=CICUSRTX RESOURCE GROUP=*NONE PERIOD=1 IMPORTANCE=1
SUB TYPE CICS CICS
P BTE EXE
RESP -------------------------------- STATE SAMPLES BREAKDOWN (%) ------------------------------TIME --ACTIVE-- READY IDLE -----------------------------WAITING FOR----------------------------(%) SUB APPL I/O CONV LOCK 90.5 3.0 0.0 3.9 0.2 4.0 88.8 0.0 67.2 7.0 0.0 3.9 0.0 89.0 0.0 0.1
------STATE-----SWITCHED SAMPL(%) LOCAL SYSPL REMOT 92.4 0.0 0.0 60.9 0.0 0.0
Figure 89. CICS User Transactions Service Class
In this example, 14,119 transactions completed with a response time of 0.093 seconds. Only 13,067 transactions were executed by AORs in the interval. Others ran completely in the TOR.
149
Understanding WLM
The CICS execution phase contains information about the transactions while they were running in either the AOR or FOR. This shows that 89.0% of the state samples for these transactions were waiting for I/O completion. This accounts for 59.8% of the response time (67.2 x 89.0 / 100 = 59.8%). It is not possible to describe whether this time was for I/O initiated by the FOR, or for I/O initiated within the AOR. However, the SWITCHED SAMPL(%) LOCAL value in the CICS EXE line says that for 60.9% of the state samples (which is 40.9% of the response time, that is, 60.9 x 67.2 / 100), one of those regions was waiting for the other region to complete some processing for the transaction before the original region could proceed.
Problem: Very large response time percentage

Figure 90 shows an example of a work manager state section for the CICSPROD service class. In column RESP TIME (%), both the CICS EXE and the CICS BTE rows show inflated percentages: 78.8K and 140.
REPORT BY: POLICY=HPTSPOL1 -TRANSACTIONSAVG 0.00 MPL 0.00 ENDED 1648 END/S 1.83 #SWAPS 0 EXCTD 1009 AVG ENC 0.00 REM ENC 0.00 MS ENC 0.00 WORKLOAD=PRODWKLD SERVICE CLASS=CICSPROD RESOURCE GROUP=*NONE PERIOD=1 IMPORTANCE=1
SUB TYPE CICS CICS
P BTE EXE
RESP -------------------------------- STATE SAMPLES BREAKDOWN (%) ------------------------------- ------STATE-----TIME --ACTIVE-- READY IDLE -----------------------------WAITING FOR----------------------------- SWITCHED SAMPL(%) (%) SUB APPL MISC PROD CONV I/O LOCAL SYSPL REMOT 78.8K 0.2 0.0 0.3 2.5 96.7 0.0 0.3 0.0 0.3 0.0 0.0 140 65.6 0.0 2.2 0.0 0.0 32.4 0.0 0.1 0.0 0.0 0.0
Figure 90. Response Time Percentages Greater than 100
Possible Explanations
Long-running transactions: The report above shows how long-running transactions can inflate the value for RESP TIME (%). While the following example does not explain the exact values in the figure, it explains why this is possible. Suppose 100 transactions have ended within 1 second, and one transaction has been running for 5 minutes and is still executing when the RMF interval expires. The ACTUAL transaction time shows an average response time of 1 second, and RMF shows the breakdown into the states recorded by CICS or IMS. The subsystem, however, recorded a total of 6 minutes and 40 seconds (5 minutes plus 100 seconds) worth of data. That is an average of 4 seconds worth of data for each completed transaction which is 4 times the 1 second response time. The state samples breakdown, however, shows information representing 100% of the state samples. Also, when the one long-running transaction completes, it could easily distort the average response time during that interval. The RMF standard deviation and distribution of response times emphasizes when this occurs. The long-running transactions could be either routed or non-routed transactions. Routed transactions are transactions that are routed from a TOR to any AOR. Long-running routed transactions could result in many samples of WAITING FOR CONV (waiting for a conversation) in the CICS BTE phase, as well as states recorded from the AOR in the execution phase.
150
Understanding WLM
Long-running non-routed transactions execute completely in a TOR, and have no CICS EXE phase data, and could inflate any of the state data for the CICS BTE phase. Never-ending transactions: Never-ending transactions differ from long-running transactions in that they persist for the life of a region. For CICS, these could include the IBM reserved transactions such as CSNC and CSSY, or customer defined transactions. Never ending transactions are reported similarly to long-running transactions explained in Long-running transactions on page 150. However, for never-ending CICS transactions, RMF might report high percentages in the IDLE, WAITING FOR TIME, or the WAITING FOR MISC (miscellaneous) fields. Conversational transactions: Conversational transactions are considered long-running transactions. CICS marks the state of a conversational transaction as IDLE when the transaction is waiting for terminal input. Terminal input often includes long end-user response time, so you might see percentages close to 100% in the IDLE state for completed transactions. Service class includes dissimilar work: A service class that mixes customer and IBM transactions, short and long or never-ending transactions, routed and non-routed transactions, or conversational and non-conversational transactions can expect to have RMF reports showing that the total states sampled account for more than the average response time. This could be true for both IMS and CICS, and can be expected if the service class is the subsystem default service class. The default service class is defined in the classification rules. It is the service class to which all work in a subsystem is assigned that is not assigned to any other service class.
Possible actions
Group similar work into service classes: Make sure your service classes represent a group of similar work. This could require creating additional service classes. For the sake of simplicity, you can have only a small number of service classes for CICS or IMS work. If there are transactions for which you want the RMF state samples breakdown data, consider including them in their own service class. Do nothing: For service classes representing dissimilar work such as the subsystem default service class, understand that the response time percentage could include long-running or never-ending transactions. RMF data for such service classes may not make immediate sense.
Problem: Response time is zero

Figure 91 on page 152 shows an example of a work manager state section for the CICSLONG service class. The RESP TIME (%) fields shows a 0.0 value.
151
Understanding WLM
REPORT BY: POLICY=HPTSPOL1
WORKLOAD=PRODWKLD
SERVICE CLASS=CICSLONG
RESOURCE GROUP=*NONE
PERIOD=1 IMPORTANCE=1
CICS Long Running Internal Trxs -TRANSACTIONSAVG 0.00 MPL 0.00 ENDED 0 END/S 0.00 #SWAPS 0 EXCTD 0 AVG ENC 0.00 REM ENC 0.00 MS ENC 0.00 TRANS-TIME HHH.MM.SS.TTT ACTUAL 0 EXECUTION 0 QUEUED 0 R/S AFFINITY 0 INELIGIBLE 0 CONVERSION 0 STD DEV 0
SUB TYPE CICS CICS
P BTE EXE
RESP -------------------------------- STATE SAMPLES BREAKDOWN (%) ------------------------------- ------STATE-----TIME --ACTIVE-- READY IDLE -----------------------------WAITING FOR----------------------------- SWITCHED SAMPL(%) (%) SUB APPL CONV I/O PROD DIST REMT LOCK LOCAL SYSPL REMOT 0.0 70.8 0.0 1.4 0.7 11.2 9.2 0.3 5.3 1.2 0.0 11.2 0.0 0.0 0.0 43.2 0.0 0.2 0.1 31.8 10.4 8.7 0.0 2.9 2.8 0.0 0.0 0.0
Figure 91. Response Time Percentages all Zero
Possible explanations
No transactions completed: While a long-running or never-ending transaction is being processed, RMF stores the service class state samples in SMF 72.3 records. But when no transactions have completed, the average response time is 0. However, the calculations for the state samples breakdown will result in values greater than 0. RMF did not receive data from all systems in the sysplex: The Postprocessor may have been given SMF records from only a subset of the systems running in the sysplex. The report may represent only a single MVS image. If that MVS image has no TOR, its AORs receive CICS transactions routed from another MVS image or from outside the sysplex. Since the response time for the transactions is reported by the TOR, there is no transaction response time for the work, nor are there any ended transactions, on this MVS image.
Possible actions
Do nothing: You may have created this service class to prevent the state samples of long-running transactions from distorting data for your production work. Combine all SMF records for the sysplex: When a single MVS image that does not have TORs is combined with another MVS image that does have TORs and therefore does report response times, the states and response time from the first image will be combined by RMF with the states and response time from the second.
Problem: More executed transactions than ended transactions

In an IMS TM environment, the control region counts the number of ended transactions, and provides the actual response time information. The message processing program (MPP) counts the number of executed transactions, and provides the execution time. IMS provides only EXE phase information. There is no BTE phase information for IMS work. Figure 92 on page 153 shows an example of a work manager data section for the IMSTRX service class. The EXCTD field shows that 5312 transactions have finished executing while the ENDED field shows that only 5212 have been ended by the CTL
152
Understanding WLM
region.
REPORT BY: POLICY=WSTPOL01 -TRANSACTIONSAVG 0.00 MPL 0.00 ENDED 5212 END/S 5.79 #SWAPS 0 EXCTD 5312 AVG ENC 0.00 REM ENC 0.00 MS ENC 0.00 WORKLOAD=IMS SERVICE CLASS=IMSTRX
SUB TYPE IMS
P EXE
RESP -------------------------------- STATE SAMPLES BREAKDOWN (%) ------------------------------- ------STATE-----TIME --ACTIVE-- READY IDLE -----------------------------WAITING FOR----------------------------- SWITCHED SAMPL(%) (%) SUB APPL LOCK LOCAL SYSPL REMOT 82.4 98.1 0.0 0.0 0.0 1.9 0.0 0.0 0.0
Figure 92. Executed Transactions greater than Ended Transactions
IMS program-to-program switches: When an IMS transaction makes a program-to-program switch, the switched work can be considered either a new transaction, or a continuation of the originating transaction. The system checks the classification rules for the service class to be associated with the switched work. If the service class is the same as the originating transaction, it is processed in the MPP, and it is counted as an executed transaction (EXCTD). When the response is sent back to the network, RMF shows two executed transactions: the originating and the switched work, but only a single ended transaction (ENDED). Here, the number of executed transactions exceeds the number of ended transactions. If the service class is different from the original transaction, then RMF counts it as a new transaction with its own EXCTD and ENDED. Here, the executed transaction value and ended value agree. RMF processed only part of the entire sysplex data: It is possible to invoke the Postprocessor, giving it SMF records from just part of a sysplex for example a single MVS image. Suppose that MVS image has no TOR, but its AORs just receive CICS transactions routed from another MVS image. Since the transaction ends in the TOR, but is executed in the AOR, there will be no ended transactions for the CICS service class from this MVS image. Snapshot data is not always complete: The Workload Activity report contains snapshot data, which is data collected over a defined interval. For IMS, in a given interval, several transactions may have already executed in an MPP, but the control region may not have reported their completion yet. Similarly for CICS, the AOR could have finished executing a transaction, but the TOR may not have completed processing and reporting the completion. Thus, RMF may show more executed transactions than completions.
Possible actions
Classify IMS transactions uniquely: When an IMS transaction makes a program-to-program switch, the system checks the classification rules for the service class to be associated with the transaction. You can classify the switched transactions to a service class different from the service class of the originating transaction. RMF then counts a new transaction, with its own execution and end.
153
Understanding WLM
This way, the execution and ended values agree, and the RMF data is consistent with the data reported in the IMS Performance Analysis and Reporting System (IMSPARS). IMSPARS includes transaction timings, analysis reports, and detailed reports tracing individual transaction and database change activity. Classifying the program-to-program switched transaction to a unique service class might be useful for asynchronous program-to-program switches. Combine all SMF records for the sysplex: When a single MVS image that does not have TORs is combined with another MVS image that does, the executed transactions from the first image will be combined by RMF with the ended transactions from the second so that the sysplex-wide report is consistent.
Problem: Execution time is greater than response time

Figure 93 shows an example of a work manager state section for the CICSPROD service class. In the example, there is a response time of .091 seconds, but an execution time of .113 seconds. The example also shows 1731 ENDED transactions, yet the EXCTD field shows that only 1086 have been executed.
REPORT BY: POLICY=HPTSPOL1 -TRANSACTIONSAVG 0.00 MPL 0.00 ENDED 1731 END/S 1.92 #SWAPS 0 EXCTD 1086 AVG ENC 0.00 REM ENC 0.00 MS ENC 0.00 WORKLOAD=PRODWKLD SERVICE CLASS=CICSPROD RESOURCE GROUP=*NONE PERIOD=1 IMPORTANCE=1
Figure 93. Execution Time Greater than Response Time
Possible explanation
Mixing routed and non-routed CICS transactions: The AORs may have recorded states which account for more time than the average response time of all the transactions. The non-routed transactions do not show up in the EXE phase. In addition, most non-routed transactions end very quickly, and decrease the actual response time. The response time (ACTUAL field) shows 0.091 seconds as the average of all 1731 transactions, while the AORs can only describe the execution of the 1086 transactions they participated in.
Possible actions
Classify routed and non-routed transactions to different service classes: This would keep the numbers consistent with the expectation.
Problem: Large SWITCH percentage in CICS execution phase

Figure 94 on page 155 shows a work manager state data section for a CICSPROD service class. The LOCAL value in the SWITCHED SAMPL(%) section shows a value of 7092.
154
Understanding WLM
REPORT BY: POLICY=HPTSPOL1 -TRANSACTIONSAVG 0.00 MPL 0.00 ENDED 3599 END/S 4.00 #SWAPS 0 EXCTD 2961 AVG ENC 0.00 REM ENC 0.00 MS ENC 0.00
WORKLOAD=PRODWKLD
SERVICE CLASS=CICSPROD
RESOURCE GROUP=*NONE
PERIOD=1 IMPORTANCE=1
SUB TYPE CICS CICS
P BTE EXE
RESP -------------------------------- STATE SAMPLES BREAKDOWN (%) ------------------------------TIME --ACTIVE-- READY IDLE -----------------------------WAITING FOR----------------------------(%) SUB APPL MISC PROD CONV I/O 26.8K 0.3 0.0 0.4 2.5 96.3 0.0 0.6 0.0 93.7 41.2 0.0 6.0 0.0 0.0 52.7 0.0 0.1
------STATE-----SWITCHED SAMPL(%) LOCAL SYSPL REMOT 0.6 0.0 0.0 7092 0.0 0.0
Figure 94. Large SWITCH Percentage in a CICS Execution Environment
Distributed transaction processing: If a program initiates distributed transaction processing to multiple back-end sessions, there can be many AORs all associated with the original transaction. Each of these multiple back-end regions can indicate they are switching control back to the front-end region (SWITCH to another region on the LOCAL MVS image, or to a region on another MVS image in the sysplex). Thus, with a one-to-many mapping like this, there are many samples of the execution phase of requests switched long enough to exceed 100% of the response time of other work completing in the service class. Distributed program link (DPL): The distributed program link function from CICS/ESA 3.3 builds on the distributed transaction functions available in CICS by enabling a CICS program (the client program) to call another program (the server program) in another CICS region. While the server program is running, the client program will reflect that it is switched to another CICS region.
Possible Actions
- None -
Problem: Decreased number of ended transaction with increased response times

The Workload Activity report shows increased response times, and a decrease in the number of ended transactions over a few days.
Possible explanation
Conversion from ISC link to MRO: When two CICS regions are connected via a VTAM inter-system communication (ISC) link, they behave differently than when they are connected via multi-region (MRO) option. One key difference is that, with ISC, both the TOR and the AOR are receiving a request from VTAM, so each believes it is starting and ending a given transaction. So for a given user request routed from the TOR via ISC to an AOR, there would be 2 completed transactions. Let us assume they have response times of 1 second and 0.75 seconds, resulting in an average of 0.875 seconds.
155
Understanding WLM
When the TOR routes via MRO, the TOR will describe a single completed transaction taking 1 second, and the AOR will report its 0.75 seconds as execution time. Therefore, converting from an ISC link to an MRO connection, for the same workload, as shown in this example, could result in half the number of ended transactions and an increase in the response time reported by RMF. The difference could be much more significant if the AOR to FOR link was converted. Migration of an AOR to CICS 4.1: CICS 4.1 has extended the information it passes between regions involved in processing a given transaction. If all the regions are at the 4.1 level, this allows RMF to report the number of ended transactions as just those that have completed in the TOR. But if there is a mixture of release levels involved, this is not guaranteed. Let us assume that the TOR and FOR have been upgraded to release 4.1, but the AOR has not yet been upgraded. When the TOR receives a transaction, it will determine the service class. Then when it calls the AOR, it will pass that service class information. But since the AOR is not yet at a 4.1 level, the AOR has no code that passes this service class along to the FOR. So when the FOR is invoked, it believes it is starting a new transaction and must classify the request to a service class. When the FOR finishes processing, it states that it has completed the transaction. Later the TOR also states that it has completed the transaction. This results in RMF showing multiple completions for a given end-user transaction (one completion from the TOR, and one completion for each invocation of the FOR), with an average response time less than the value the TOR alone is reporting. When the AOR is migrated to CICS 4.1, it will pass the service class to the FOR. The FOR recognizes that it is participating in executing some portion of the original end-user transaction. When the FOR returns to the AOR, and the AOR returns to the TOR, there will be just one ended transaction recorded for RMF. Its response time will be the time from reception in the TOR until the TOR has completed. So RMF data will show a reduced number of transactions executed and a larger response time after the AOR is migrated to CICS 4.1 even when the end-user request is processed in exactly the same elapsed time.
Possible action
Increase CICS transaction goals: Do this prior to your conversion to an MRO connection, or prior to migrating your AOR to CICS 4.1, if the FOR transactions are classified to the same service class as your end-user transactions.
156
Coupling facility performance
Analyzing coupling facility activity

This information unit presents the reports offered by RMF that assist you in monitoring coupling facility activity. For this purpose, you can use the Postprocessor Coupling Facility Activity report, a spreadsheet macro, or the various interactive Monitor III reports.
Using the Postprocessor Coupling Facility Activity report

This topic gives guidelines for some indicators when monitoring the coupling facility.
157
Usage Summary
C O U P L I N G z/OS V1R11 SYSPLEX UTCPLXJ8 RPT VERSION V1R11 RMF F A C I L I T Y DATE 11/28/09 TIME 12.00.00 A C T I V I T Y PAGE INTERVAL 030.00.000 CYCLE 01.000 SECONDS 1
-----------------------------------------------------------------------------------------------------------------------------COUPLING FACILITY NAME = CF1 TOTAL SAMPLES(AVG) = 1781 (MAX) = 1799 (MIN) = 1671 -----------------------------------------------------------------------------------------------------------------------------COUPLING FACILITY USAGE SUMMARY -----------------------------------------------------------------------------------------------------------------------------STRUCTURE SUMMARY -----------------------------------------------------------------------------------------------------------------------------% OF CF STOR 2.1 2.1 2.1 0.6 % OF ALL REQ 0.9 0.9 0.0 34.8 % OF CF UTIL 2.0 3.0 1.0 25.0 AVG REQ/ SEC 1.96 1.96 0.00 72.21 LST/DIR DATA LOCK DIR REC/ ENTRIES ELEMENTS ENTRIES DIR REC TOT/CUR TOT/CUR TOT/CUR XI'S 41K 168 9915 154 1024 1 7730 11 2032 0 15K 1 0 0 390 0 28 0 81K 483 20K 288 1000 17 16K 96 0 0 0 0 0 0 75 0 17 0 N/A N/A N/A N/A N/A N/A N/A N/A 131K 27 1049K 650 66K 3215 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A 0 0 0 0
TYPE LIST
STRUCTURE NAME DB2D_SCA DB2E_SCA DB2H_SCA IXCPLEX_PATH1
STATUS CHG ACTIVE PRIM ACTIVE PRIM ACTIVE PRIM ACTIVE PRIM ACTIVE PRIM ACTIVE PRIM ACTIVE PRIM ACTIVE ACTIVE
ALLOC SIZE 16M 16M 16M 5M
# REQ 1766 1765 0 64993
LOCK
DB2D_LOCK1 DB2E_LOCK1 DB2T_LOCK1
31M 31M 125M
4.0 4.0 16.2
1770 1770 82436
0.9 0.9 44.1
5.0 5.0 30.0
1.97 1.97 91.60
CACHE
IRRXCF00_B001 SYSIGGCAS_ECS
2M 5M ------420M
0.2 0.6
221 27907
0.1 14.9
1.0 16.0
0.25 31.01
STRUCTURE TOTALS
----- ------- ------- ----- -----54.4 186917 100 88.0 207.69
STORAGE SUMMARY -----------------------------------------------------------------------------------------------------------------------------ALLOC SIZE TOTAL CF STORAGE USED BY STRUCTURES TOTAL CF DUMP STORAGE TOTAL CF STORAGE AVAILABLE TOTAL CF STORAGE SIZE 263M 6M 2G ------2G ALLOC SIZE TOTAL CONTROL STORAGE DEFINED TOTAL DATA STORAGE DEFINED 2G 0K % OF CF STORAGE 12.9 0.3 86.8 ------- DUMP SPACE ------% IN USE MAX % REQUESTED
0.0
0.0
% ALLOCATED
13.2 0.0
PROCESSOR SUMMARY -----------------------------------------------------------------------------------------------------------------------------COUPLING FACILITY AVERAGE CF UTILIZATION 2084 (% BUSY) MODEL B16 0.9 CFLEVEL 15 LOGICAL PROCESSORS: DYNDISP OFF DEFINED 5 SHARED 2 EFFECTIVE AVG WEIGHT
0.9 50.0
Figure 95. Coupling Facility Activity Report - Usage Summary
The Usage Summary part of the report gives summary data for the structures, the storage and the processors.
158
Indicator Fields:
LST/DIR ENTRIES TOT/CUR DATA ELEMENTS TOT/CUR LOCK ENTRIES TOT/CUR
Description: The first column reports on list entries defined and in use OR directory entries defined and in use, dependent on the type of structure. Similarly, the second column reports on the data elements defined and in use in the structure. The third column reports on lock entries. The information presented on this report can be used to validate the size of various types of structures. In most cases, some knowledge of the structure is required to interpret the data. v Cache structures The CUR in-use values will indicate if the structure is overlarge (if it never fills up) or misproportioned in terms of entry to element ratio (one never fills up but the other does). It will not help in knowing if the structure is too small (i.e. 100% full is normal). To know if the structure is too small you have to look at reclaiming and perhaps hit ratios. v List structures Some list structures, like the JES2 checkpoint structure, are very static in size if not content/static data. For these structures, the TOT can be only slightly larger than CUR. Other functions, like the MVS logger, off-load data as the structure fills up. The threshold which triggers the Offload is based on parameters associated with the function. The ratio of CUR to TOT should not be higher than this threshold. Other list structures, such as the XCF structure and the VTAM Generic Resource structure, must handle large peaks of data. For these structures, the TOT should be much larger than the CUR in-use to prevent backups of signals during spikes. v Lock structures Lock structures, like the IRLM lock structures, are divided in two parts: The actual lock table contains the number of entries in the LOCK ENTRIES TOT column. The number of LOCK ENTRIES CUR is a sampled value of the number in use at the end of the interval. The best way to evaluate the size of this table is to look at the FALSE CONTENTION data for the lock structure on the Coupling Facility Structure Activity report, as described later in this document. The second part of the lock structure contains RECORD DATA entries. The number of these entries is reported in the LIST ENTRIES column. If there are not enough of these entries, you will not be able to obtain locks, and transactions will start failing at that point. IRLM provides no external means of apportioning the structure storage between record data and locks; you can only obtain adequate entries for both parts by judiciously choosing the size of the lock structure. The easiest way to do this is to use the lookup tables in INFOSYS Q662530.
159
Indicator Field: DIR RECLAIMS - DIR RECLAIMS XI'S Description: All cache structures have directory entries and may or may not have data entries (for example, IMS OSAM and VSAM cache structures have only directory entries; while RACF and DB2 cache structures have both directory and data entries). N/A, for not applicable, will be displayed in the directory reclaim column for list and lock structures. A cache structure can be over committed by the data base managers. This occurs when the total number of unique entities cached by the data base managers exceeds the number of directory entries in the cache structure. Whenever a shortage of directory entries occurs, the coupling facility will reclaim in-use directory entries associated with unchanged data. These reclaimed items will be used to satisfy new requests for directory entries by the data base manager. (Directory entries are used by the data base manager to ensure data validity). For the coupling facility to reclaim a directory entry, all users of the data item represented by the directory entry must be notified their copy of the data item is invalid. As a consequence, when the data base manager needs access to the now invalidated data item, the item must be re-read from DASD and registered with the coupling facility. When there are insufficient directory entries, the directory reclaim activity can lead to thrashing. (The situation is analogous to real storage shortages and page stealing in MVS.) Directory reclaim activity can result in the following: v Increased read I/O activity to the data base to re-acquire a referenced data item. v Increased CPU utilization caused by registered interest in data items having to be re-acquired. v Elongated transaction response times whenever a read miss occurs in the local buffer pool. Directory reclaim activity can be managed by increasing the number of directory entries for a particular structure. This can be accomplished by: v Increasing the size of the structure. For directory only structures, only the number of directory entries is affected (i.e. IMS OSAM and VSAM structures). For structures with data elements and directory entries, both will increase in the ratio specified by the structure user (i.e. RACF structures). v Changing the proportion of the structure space used for directory entries and data elements. This action is dependent on the structure users implementation. Some cache structure exploiters allow the installation to specify the ratio of directory entries to data entries, which it internally maps to a ratio of directory entries to data elements (a data entry may be composed of multiple data elements). DB2 provides this capability. On the other hand, RACF is hard coded to organize the structure in a 1:1 ratio of directory entries to data elements. If the cache structure directory to data element ratio is installation specifiable AND there are ample data elements, then one can increase the number of directory entries at the expense of data elements without a performance impact, and without increasing the structure size. Determining the impact of decreasing the number of data items is at best an inexact science. Unless the Structure Summary report indicates a consistent difference between the total and current number of data elements, it is difficult to estimate the impact of changing the
160

directory entry to data element ratio. The CUR value reported by RMF is not a peak value or average value during the interval, it is the number in use at the end of the interval. The usual causes for directory reclaim activity are: 1. Initial structure size is insufficient to support the amount of local buffering. 2. Increasing the size of the local buffer pools without an accompanying increase in structure size. 3. Increasing the number of users of the structure without an accompanying increase in structure size. The advice here is to monitor trends. If directory reclaims are increasing and response time is impacted, take appropriate action. The second value in this column is the number of reclaims that caused an XI (see 168). A high value is an indicator for a performance problem in this structure.
Indicator Field: STRUCTURE SUMMARY - TOTAL CONTROL/DATA STORAGE DEFINED Description: The amount of coupling facility storage that is allowed to be occupied by control information (CONTROL STORAGE) or data (DATA STORAGE). Guideline: Each structure plus the dump area is allocated some control storage and some data storage. The coupling facility defines an area called control storage; structure control information is restricted to that area. The remaining storage is called data storage and is used for structure data. If the data storage area becomes full, structure data can then be allocated from the control storage area. If TOTAL DATA STORAGE DEFINED is zero, it means control information can reside anywhere on the coupling facility and there are no allocation restrictions. If % ALLOC field for control storage shows a percentage approaching 100, it means the control storage is close to being completely allocated even though the CF SPACE AVAILABLE field may still show an amount of total free space. Possible actions include: v Changing structure preference lists in the coupling facility policy specification to direct some structures away from this facility v Adding another coupling facility to the sysplex
Indicator Fields: AVERAGE CF UTILIZATION (%BUSY) LOGICAL PROCESSORS DEFINED / EFFECTIVE Description: These fields report on the average CPU utilization of the CPs supporting the coupling facility partition, and in addition, on the number of logical processors assigned to the partition and the effective number of logical processors active during the RMF interval.
161

For example, if a CEC contains six CPs, and a test coupling facility LPAR has two logical CPs, but is capped at 5% of the CEC, then the LOGICAL PROCESSORS DEFINED will be 2 and the EFFECTIVE LOGICAL PROCESSORS number will be 0.3 (if the partition is uncapped the EFFECTIVE LOGICAL PROCESSORS number will be 2). CFCC does not enter a wait state when there are no commands to be processed. CFCC continues to search for work until LPAR management removes processor resources or work is found (the active wait phenomenon). Note: The above statement is true for LPARs providing real coupling facility functionality. It is NOT true for LPARs providing integrated coupling facility (ICMF) function. LPAR and CFCC have been enhanced to allow an ICMF LPAR with nothing to do to give up its CP resources. In this environment, use the Monitor I Partition Data Report, rather than the Coupling Facility Usage Summary report, to determine the utilization of the ICMF LPAR. Guideline: These fields will assist you in diagnosing the following types of problems which have occurred in customer environments. v High coupling facility processor utilization and excessively high service times when only one CP out of six in the CEC was assigned to the coupling facility LPAR. v High coupling facility processor utilization and excessively high service times caused by two LPARs in the same CEC with active CFs. (Production coupling facility weight of 80, test coupling facility LPAR weight of 50, all CPs shared). The test coupling facility consumed 5/13 of the available CP resources even though it was unused. Coupling facility processor utilization does influence structure access service times and thus the responsiveness and CPU utilization of the coupling facility exploiter. Please, see Factors affecting Coupling Facility performance on page 174 for additional information.
162
Coupling Facility Structure Activity
C O U P L I N G z/OS V1R11 SYSPLEX PLEXPERF RPT VERSION V1R11 RMF
F A C I L I T Y
A C T I V I T Y PAGE 3 INTERVAL 030.00.000 CYCLE 1.000 SECONDS
DATE 04/11/2009 TIME 13.39.00
-----------------------------------------------------------------------------------------------------------------------------COUPLING FACILITY NAME = CF5 -----------------------------------------------------------------------------------------------------------------------------COUPLING FACILITY STRUCTURE ACTIVITY -----------------------------------------------------------------------------------------------------------------------------STRUCTURE NAME = COUPLE_CKPT1 # REQ -------------SYSTEM TOTAL # NAME AVG/SEC REQ J80 8463 4.70 SYNC ASYNC CHNGD 2927 5535 1 TYPE = LIST REQUESTS ------------% OF -SERV TIME(MIC)ALL AVG STD_DEV 5.8 306.6 127.8 11.0 1502.1 1263.2 0.0 INCLUDED IN ASYNC STATUS = ACTIVE -------------- DELAYED REQUESTS ------------REASON # % OF ---- AVG TIME(MIC) ----REQ REQ /DEL STD_DEV /ALL NO SCH 1240 PR WT 0 PR CMP 0 DUMP 0 22.4 0.0 0.0 0.0 508.4 0.0 0.0 0.0 597.7 0.0 0.0 0.0 113.9 0.0 0.0 0.0
EXTERNAL REQUEST CONTENTIONS REQ TOTAL REQ DEFERRED 8339 123
... -----------------------------------------------------------------------------------------------------------------------------TOTAL 50370 SYNC 17K 34.3 342.2 124.5 NO SCH 8797 26.6 1047 1245 278.4 REQ TOTAL 50K 27.98 ASYNC 33K 65.5 557551G 0.0 PR WT 0 0.0 0.0 0.0 0.0 REQ DEFERRED 497 CHNGD 69 0.1 PR CMP 0 0.0 0.0 0.0 0.0 DUMP 0 0.0 0.0 0.0 0.0 ... STRUCTURE NAME = IRLMLOCK1 TYPE = LOCK STATUS = ACTIVE # REQ -------------- REQUESTS -------------------------- DELAYED REQUESTS ------------SYSTEM TOTAL # % OF -SERV TIME(MIC)REASON # % OF ---- AVG TIME(MIC) ----- EXTERNAL REQUEST NAME AVG/SEC REQ ALL AVG STD_DEV REQ REQ /DEL STD_DEV /ALL CONTENTIONS ... JF0 128K 71.26 SYNC ASYNC CHNGD 128K 0 0 7.7 221.9 37.5 0.0 0.0 0.0 0.0 INCLUDED IN ASYNC NO SCH PR WT PR CMP DUMP 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 REQ TOTAL REQ DEFERRED -CONT -FALSE CONT 128K 80 80 25
... -----------------------------------------------------------------------------------------------------------------------------TOTAL 1676K SYNC 1676K 100 198.1 44.7 NO SCH 0 0.0 0.0 0.0 0.0 REQ TOTAL 1674K 930.9 ASYNC 0 0.0 0.0 0.0 PR WT 0 0.0 0.0 0.0 0.0 REQ DEFERRED 906 CHNGD 0 0.0 PR CMP 0 0.0 0.0 0.0 0.0 -CONT 906 DUMP 0 0.0 0.0 0.0 0.0 -FALSE CONT 225 ... STRUCTURE NAME = DSNDB1G_GBP3 TYPE = CACHE STATUS = ACTIVE PRIMARY # REQ -------------- REQUESTS -------------------------- DELAYED REQUESTS ------------SYSTEM TOTAL # % OF -SERV TIME(MIC)REASON # % OF ---- AVG TIME(MIC) ----NAME AVG/SEC REQ ALL AVG STD_DEV REQ REQ /DEL STD_DEV /ALL ... JA0 1032 0.57 SYNC ASYNC CHNGD 1025 0 7 19.1 448.8 84.3 0.0 4508.9 1409.1 0.1 INCLUDED IN ASYNC NO SCH PR WT PR CMP DUMP 7 0 0 0 100 0.0 0.0 0.0 2060 0.0 0.0 0.0 1671 0.0 0.0 0.0 2060 0.0 0.0 0.0
... -----------------------------------------------------------------------------------------------------------------------------TOTAL 5356 SYNC 5342 100 432.8 89.0 NO SCH 14 100 1510 1374 1510 -- DATA ACCESS --2.98 ASYNC 0 0.0 3685.2 1634.2 PR WT 0 0.0 0.0 0.0 0.0 READS 1331 CHNGD 14 0.3 PR CMP 0 0.0 0.0 0.0 0.0 WRITES 1709 DUMP 0 0.0 0.0 0.0 0.0 CASTOUTS 44 XI'S 1296
Figure 96. Coupling Facility Structure Activity
This section of the report provides detailed information on the frequency of structure operations and the associated service/queue times.
163
Indicator Fields:
-------------- DELAYED REQUESTS ------------REASON # % OF ---- AVG TIME(MIC) ----REQ REQ /DEL STD_DEV /ALL
Coupling Facility exploiters make structure manipulating requests through the APIs provided by Cross-system extended services (XES). In some instances, the API requests can specify that the operation be performed synchronously or asynchronously relative to requesting task. XES will execute the API request through some number of commands to the Coupling Facility. The coupling facility commands are in turn executed synchronously or asynchronously, based on the type of operation and the API specification. RMF reports on the operations to the coupling facility and the structures within. The only data on API requests is reported in CONTENTION data for lock and serialized list structures. For an operation to be initiated against a coupling facility structure, one of the subchannels associated with the Coupling Facility must be available. When no subchannel is available, XES must either queue the operation for later scheduling, or spin waiting for a subchannel to become available. XES takes one action or the other, depending on the type of structure and operation. Most SYNC operations which cannot be started due to unavailable subchannels are changed to ASYNC operations (RMF reports them as CHNGD). Certain SYNC operations are not changed from SYNC to ASYNC when no subchannels are available. Most notably, lock requests are not changed to ASYNC. For these operations, the processor spins until a subchannel becomes available. RMF reports on ASYNC operations to a structure which are delayed due to subchannel unavailability. RMF does NOT report on SYNCH operations to a structure which are delayed (spinning) due to subchannel unavailability. The amount of delay for SYNC operations must estimated from the subchannel activity report. Subchannel unavailable condition will manifest itself in one of the following ways: 1. A non-zero percentage of CHNGD requests relative to the total number of ASYNC requests. 2. A non-zero percentage of delayed ASYNC requests. Depending on the structure either one or both of the above phenomena may be observed, not necessarily both. In evaluating response times for ASYNC requests, the following should be noted: v ASYNC command processing is performed primarily by the I/O processor (IOP or SAP). v ASYNC requests generally take longer (elapsed time) than SYNC requests. v The reported service time (not shown in the above example, but present on the ASYNC line) does not include delay time. To determine the average elapsed time for an ASYNC operation add the average service time to the average amortized delay time (/ALL column) for all ASYNC operations during the interval.
164

v ASYNC delay times (/DEL column) can be an indication of inadequate processing capacity in the coupling facility as well as insufficient number of subchannels. In evaluating response times for SYNC requests, the following should be noted: v SYNC command processing is performed by the CP. v The reported service time (not shown in the above example, but present on the SYNC line) does not include the delay spent spinning awaiting subchannel busy if OW12415 is installed. v A SYNC request delayed due to unavailable subchannel is using CPU resources. Subchannel unavailability has the following negative consequences: v ASYNC requests will be queued, resulting in delayed access to a coupling facility structure and corresponding diminished responsiveness by the exploiter. v Non-immediate SYNC requests will be transformed to ASYNC requests, resulting in delayed access to a coupling facility structure and corresponding diminished responsiveness by the exploiter. v Immediate SYNC requests will spin waiting for an available subchannel, resulting in increased CPU usage, delayed access to a coupling facility structure and corresponding diminished responsiveness by the exploiter. Normally, none of these effects are desirable. The following activities can minimize the detrimental effects: v Make certain CPU resource is sufficient in the coupling facility LPAR. v Redistribute structures among the CFs to minimize the delay. v Add additional coupling facility links between the MVS processor and the coupling facility. Each path will contribute two subchannels. v Influence the coupling facility activity rates (downward) of coupling facility exploiters if possible. For example, you can reduce the number of XCF signals by reducing lock contention or tuning XCF to eliminate signals related the expansion of transport classes sizes.
Indicator Fields:
EXTERNAL REQUEST CONTENTIONS REQ TOTAL REQ DEFERRED EXTERNAL REQUEST CONTENTIONS REQ TOTAL REQ DEFERRED -CONT -FALSE CONT
Description: These fields provide information on serialized list contention. A serialized list structure is a list structure which has an associated lock structure within. For example, JES2 and VTAM structures are serialized list structures; the XCF structure is simply a list structure. A serialized list structure can provide the structure user with increased control (exclusive usage) over the entire structure, a portion of the entries or even a single list entry. The use of the lock structure to control access is completely under the control of the exploiter.
165

There are a number of operations against a serialized list structure. Some operations against a serialized list structure succeed if the designated lock entry is not held by any other user. Additionally, these requests are queued for FIFO processing by XES whenever the command cannot be processed due to lock contention. (Specifically these are UNCONDITIONAL SET and NOTHELD requests). XES queues the request in a sysplex-wide FIFO queue until the lock is available. Whenever the lock is released, the first queued request is restarted on the appropriate processor. XES is providing multi-system FIFO queuing on behalf of the users to the serialized list structure. RMF reports on the number of accesses to the serialized list structure and the number of accesses deferred due to lock contention. This value is reported from the perspective of the indicated connector, the value is not a global value across all connectors to the structure. Having XES queue and manage deferred requests requires some amount of CPU usage and XCF signalling usage. Excess contention is most likely a reflection on the usage of the structure itself by the exploiter or poor tuning specifications by the installation. For example, consider JES2 using a checkpoint structure. Each JES2 member of the MAS acquires and holds the checkpoint for some period of time (HOLD) and restrains itself from attempting to acquire control for some period of time upon release (DORMANCY). If all members of the MAS specify short dormancy values, then each member (except the current owner) will attempt to acquire the checkpoint lock. XES will happily manage this queue of requestors in a FIFO manner at some cost. It may be worth the effort to adjust these values so members of the MAS receive the necessary service without the overhead of XES managing the queue of requestors. The average time a serialized list request with lock contention is queued by XES is not tracked and thus not reported by RMF. APAR OW11789 corrected a reporting problem which added the queue time into the ASYNC service time. This resulted in inflated and widely varying service times being reported.
Indicator Fields:
-- DATA ACCESS --READS WRITES CASTOUTS XI'S
Description: These fields provide information on cache structure accesses. Data access counts for a cache structure have been added to the report. Prior to describing what READS, WRITES, etc. counts actually count, the following should be noted: v The information is acquired from counters in the coupling facility and is global in nature. The information cannot be broken down to give an individual connection contributions to this global count. Thus, the information is reported in the TOTAL section of the RMF report. v Depending on the cache structure implementation one or more of these counter values may be zero.
166

v Most importantly, only the cache structure user (i.e. data base manager) knows how efficiently the local buffers and cache structure are being used. If the data base manager does not provide usage information, then tuning is a best guess proposition. The coupling facility maintains a number of structure level counters accessible to RMF. RMF reports on a subset of these counters (to be described momentarily) and stores all the counters in the SMF74 subtype 4 records. Additional information on these counters can be obtained in z/OS MVS Programming: Sysplex Services Guide in the section labelled IXLCACHE REQUEST=READ_STGSTATS. RMF provides four cache structure usage counters. READS This is a count of the number of times the coupling facility returned data on a read request by any connector. Directory only caches will always have a zero value reported since the structure contains no data. WRITES This is a count of the number of times a connector placed changed or unchanged data into the coupling facility structure. (Changed/unchanged is an attribute assigned to the data when written by the connector. From a performance/capacity view point, the importance of the attribute is: changed data cannot be reclaimed from the structure should directory or data elements become scarce.) Just as directory only cache structures always have READS counts of zero, they have WRITES counts of zero as the structure contains no data. The condition to be concerned with is a large number of WRITEs and a small number of READS. This condition may indicate: v Insufficient structure space allocated, and data entries (and perhaps directory entries) are being discarded by the coupling facility space management routines. v Inappropriate allocation of the ratio of directory entry to data elements is causing the data entries to be discarded by the coupling facility space management routines. Directory entry reclaim activity is reported by RMF as discussed earlier; however, data element reclaim activity is not reported. CASTOUTS This is a count of the number of times a connector retrieved a changed data entry, wrote the data to DASD (discarded it, whatever), and caused the changed attribute to be reset to unchanged. This process is known as casting out changed data from the structure. This counter is of interest for store-in cache structures (i.e. DB2 global buffer pool structures) in determining the volume of changed data being removed from the structure. This counter is not an indicator of the number times castout processing was performed during the RMF interval. A large amount of castout activity on a single structure may warrant additional cache structures and redirecting locally buffered data to different cache structure. Castout processing by the connectors must keep pace with the rate at which changed data is placed in the structure. When all
167

directory or data elements are associated with changed data, no new data items can be registered or written to the structure. This condition is not desirable and will adversely effect the data base manager and/or the user of the data base manager. XIS This is the number of times a data item residing in a local buffer pool was marked invalid by the coupling facility during the RMF interval. To the cache structure user, this means the data item must be re-acquired from DASD or perhaps the coupling facility structure, and interest in the item must be re-registered in the coupling facility structure. There are several XI counts obtained from the coupling facility which are consolidated into this value. They are: v XI for Directory Reclaim v XI for Write v XI for Name Invalidation v XI for Complement Invalidation v XI for Local Cache Vector Entry Replacement Consult section IXLCACHE REQUEST=READ_STGSTATS in the z/OS MVS Programming: Sysplex Services Guide for additional information on these XI categories. It is important to understand what the XIS counter really counts. This is most easily explained by example. Suppose there are 5 connectors (i.e. #1, #2, #3, #4, and #5). to a cache structure. Further suppose data item A is locally cached in connectors #1, #2, #4, and #5 and registered in the coupling facility structure. Now suppose connector number #5, changes data item A and issues the command to the coupling facility to have all other copies invalidated (i.e. XI for complement invalidation). In response to the command, the coupling facility would cause the copies of data item A in the local buffer pools of connectors #1, #2 and #4 to be marked invalid. At the completion of the XI operation, the XIS count value would be incremented by 3! XIS count values are seen for directory, store-in and store-thru caches. This count reflects both the amount of data sharing among the users of the cache and the amount of write/update activity against the data bases.
168
Subchannel Activity
C O U P L I N G z/OS V1R11 SYSPLEX PLEXPERF RPT VERSION V1R11 RMF F A C I L I T Y A C T I V I T Y PAGE DATE 04/11/2009 TIME 13.39.00 INTERVAL 030.00.000 CYCLE 1.000 SECONDS 6
------------------------------------------------------------------------------------------------------------------------COUPLING FACILITY NAME = CF1 ------------------------------------------------------------------------------------------------------------------------SUBCHANNEL ACTIVITY ------------------------------------------------------------------------------------------------------------------------# REQ ----------- REQUESTS ----------- -------------- DELAYED REQUESTS -------------SYSTEM TOTAL -- CF LINKS -- PTH # -SERVICE TIME(MIC)# % OF ----- AVG TIME(MIC) ---NAME AVG/SEC TYPE GEN USE BUSY REQ AVG STD_DEV REQ REQ /DEL STD_DEV /ALL *W04 30740 51.2 ICP SUBCH 3 42 3 21 0 SYNC 957 26.8 95.5 ASYNC 22791 736.7 2811 CHANGED 0 INCLUDED IN ASYNC UNSUCC 0 0.0 0.0 SYNC 790 43.3 116.5 ASYNC 8100 937.8 3543 CHANGED 1 INCLUDED IN ASYNC UNSUCC 0 0.0 0.0 LIST/CACHE LOCK TOTAL 2 0 2 0.0 174.0 0.0 0.0 0.0 1797 69.5 50.9 0.0 348.0 0.0
*W05
9793 9.6
CFS SUBCH
2 4
2 354 4
LIST/CACHE 1537 19.1 LOCK 4 0.5 TOTAL 1541 17.3
5626 15.6
2761K 278.0
Figure 97. Coupling Facility Activity Report - Subchannel Activity
The following fields can be used as a quick way of determining which systems are generating the most activity for a given facility, which in turn indicates where to focus tuning or load balancing efforts.
Indicator Field: CF LINKS Description: The subchannel configuration is described by these fields: TYPE This column describes the channel path type. GEN USE The number of subchannels that are defined, and that MVS is currently using for coupling facility requests. The description of the channel path type can help in better analyzing the performance of the different channel path types. There are differences between two types of coupling facility channels: CFS /CFR - CBS /CBR - ICS / ICR Two subchannels per path. CFP - CBP - ICP Seven subchannels per path.
169
Indicator Field: PTH BUSY Description: Path busy - the number of times a coupling facility request was rejected because all paths to the coupling facility were busy. Guideline: A high count combined with lengthy service times for requests indicates a capacity constraint in the coupling facility. If coupling facility channels are being shared among PR/SM partitions, the contention could be coming from a remote partition. Identifying path contention: There can be path contention even when the path busy count is low. In fact, in a non-PR/SM environment where the subchannels are properly configured, # DELAYED REQUESTS, not PTH BUSY, is the indicator for path contention. If PTH BUSY is low but # DELAYED REQUESTS is high, it means MVS is delaying the coupling facility requests and in effect gating the workload before it reaches the physical paths. PR/SM environment only: If coupling facility channels are being shared among PR/SM partitions, PTH BUSY behaves differently. You potentially have many MVS subchannels mapped to only a few coupling facility command buffers. You could have a case where the subchannels were properly configured (or even underconfigured), subchannel busy is low, but path busy is high. This means the contention is due to activity from a remote partition. If a coupling facility capacity constraint is suspected, the first action to consider is adding more paths. If the coupling facility is already fully configured for paths, you need to determine whether some structures can be off-loaded to another coupling facility or whether another coupling facility should be added to the configuration. Use the Coupling Facility Usage Summary reports to balance workloads across coupling facilities based on activity rates and storage usage. In a PR/SM environment, you may need to increase or adjust the configuration of shared coupling facility channels to reduce path contention.
170
Indicator Fields:
--------------- DELAYED REQUESTS ----------# % OF ------AVG TIME(MIC)----REQ REQ /DEL STD_DEV /ALL LIST/CACHE LOCK TOTAL
Description: Field TOTAL # DELAYED REQUESTS is the same as BUSY COUNT SCH that was shown in previous versions of this report. It is the number of times an immediate request was delayed because subchannel resources were not available. Immediate requests (such as locking operations) could not be completed because all the subchannels were busy. These requests are not queued. They are processed before the non-immediate requests which are reported in the Structure Activity report under QUEUED requests. To get a complete picture of subchannel activity, look at both fields. Guideline: If this count is high, you should ensure that sufficient subchannels are defined. See PTH BUSY above for suggested actions to relieve a coupling facility path constraint. In a data sharing environment, lock structure requests account for a large percentage of coupling facility accesses. By estimating the amortized delay for SYNC operations on lock structure accesses, one can determine the time delay due to subchannel unavailable on lock structure operations and the impact on CPU busy. Note: Requests which are delayed because a structure was being dumped were removed from this report because they are not related to subchannel contention.
CF to CF Activity
C O U P L I N G z/OS V1R11 SYSPLEX PLEXPERF RPT VERSION V1R11 RMF F A C I L I T Y A C T I V I T Y PAGE DATE 04/11/2009 TIME 13.39.00 INTERVAL 030.00.000 CYCLE 1.000 SECONDS 6
-----------------------------------------------------------------------------------------------------------------------------COUPLING FACILITY NAME = CF1 -----------------------------------------------------------------------------------------------------------------------------CF TO CF ACTIVITY ---------------------------------------------------------------------------------------------------------------------# REQ ---------- REQUESTS ------------------------ DELAYED REQUESTS -------------PEER TOTAL -- CF LINKS -# -SERVICE TIME(MIC)# % OF ---- AVG TIME(MIC) --CF AVG/SEC TYPE USE REQ AVG STD_DEV REQ REQ /DEL STD_DEV /ALL CF2 86830 86.2 CBP ICP 1 1 SYNC 86830 53.0 33.8 SYNC 19 0.2% 14.5 2.7 0.0
Figure 98. Coupling Facility Activity Report - CF to CF Activity
This section of the Coupling Facility Activity report provides information about the activities between coupling facilities.
171
Indicator Fields:
--------------- DELAYED REQUESTS ----------# % OF ------AVG TIME(MIC)----REQ REQ /DEL STD_DEV /ALL
Description: Field # DELAYED REQUESTS is the number of signals of all types which have experienced a delay in being sent from the subject CF to this remote CF. It is an contention indicator for the peer-to-peer communication between both coupling facilities.
Spreadsheet report
If you are interested to display the data of the Coupling Facility Activity report graphically, you can use the Spreadsheet Reporter with an interval report as input. One of the several charts that you can get is the following one:
Figure 99. Coupling Facility Structure Activity. Spreadsheet Reporter macro RMFR9CF.XLS (CF Report)
The graphic shows the total number of requests for structure ISGLOCK, and you can modify the graphic by selecting a specific type of data in the drop-down list.
Using the Monitor III online reports

Three Monitor III reports enable you to analyze bottlenecks or problems in the coupling facility area that might result in a performance degradation in the parallel
172

sysplex. To avoid critical situations for production data bases and transaction processing systems, customers can see the immediate state of the coupling facility and the structures, and they have all relevant data at a glance in a granularity they can choose. In addition, they can see the results of tuning actions in this area, for example after having removed a coupling structure from an overloaded coupling facility to another one.
The Coupling Facility Overview report - CFOVER

The Coupling Facility Overview report (CFOVER) gives you information about all coupling facilities which are connected to the sysplex.
RMF V1R11 Samples: 120 Systems: 2 CF Overview Date: 11/28/09 - TRXPLEX Time: 14.40.00 Line 1 of 2 Range: 120 Sec
CF Policy: SYSPOL1
Activated at: 11/08/06 09.16.04
----- Coupling Facility ------ ------ Processor ------- Request - Storage -Name Type Model Lvl Dyn Util% Def Shr Wgt Effect Rate Size Avail CF01 CF02 2094 2094 S18 S18 15 15 OFF ON 9.6 91.3 1 2 1 100 1 50 1.0 1.0 9585 4140 423M 423M 171M 232M
Figure 100. CFOVER Report
The Coupling Facility Activity report - CFACT

The Coupling Facility Activity report (CFACT) gives you information about the activities in each structure.
173
RMF V1R11 Samples: 120 CF: ALL Structure Name BIGONE LOCK LOCK LOCK IGWLOCK00 LOCK LOCK LOCK LOCK LOCK LOCK IXCPLEX_PATH1 LIST LIST LIST JES2CKPT1 LIST LIST LIST LIST LIST LIST SYSZWLM_6AAD2094 CACHE CACHE CACHE CACHE CACHE CACHE Figure 101. CFACT Report AP *ALL TRX1 TRX2 AP *ALL TRX1 TRX2 AS *ALL TRX1 TRX2 A *ALL TRX1 TRX2 AP *ALL TRX1 TRX2 AS *ALL TRX1 TRX2 AP *ALL TRX1 TRX2 AS *ALL TRX1 TRX2 Systems: 2 Type
CF Activity Date: 11/28/09
- TRXPLEX Time: 14.40.00
Line 1 of 78 Range: 120 Sec
ST System
CF --- Sync ----------- Async -------Util Rate Avg Rate Avg Chng Del % Serv Serv % % 1.0 0.4 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.1 0.1 <0.1 0.0 <0.1 0.0 0.0 0.0 0.0 0.0 0.0 330 330 0 0 0 0 0 0 0 0 0 0 30 7 53 595 0 595 0 0 0 0 0 0 30.1 30.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.9 0.5 0.4 2.0 1.1 0.9 1.5 0.8 0.7 14.5 7.8 6.7 7.4 4.0 3.4 492 492 0 0 0 0 0 0 0 406 98 786 570 374 802 494 432 566 406 293 537 753 566 974 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.2
0.1
0.1
0.2
0.0
0.2
0.0
The Coupling Facility Systems report - CFSYS

The Coupling Facility Systems report (CFSYS) gives you information about the distribution of coupling facility requests among the systems and about the activities in the subchannels and paths attached to the coupling facilities in the sysplex.
RMF V1R11 Samples: 120 CF Name System Systems: 2 Subchannel Delay Busy % % 0.0 0.0 0.0 0.0 2.3 0.0 5.9 0.0 CF Systems Date: 11/28/09 -- Paths -Avail Delay % 4 4 4 4 0.0 0.0 0.0 0.0 - TRXPLEX Time: 14.40.00 Line 1 of 4 Range: 120 Sec
-- Sync --- ------- Async ------Rate Avg Rate Avg Chng Del Serv Serv % % 6359 2.0 1116 0.0 34 9 83 0 2778 9.1 2984 4.0 147 606 519 928 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
CF01 CF02
TRX1 TRX2 TRX1 TRX2
Figure 102. CFSYS Report
Factors affecting Coupling Facility performance

This section is intended to assist one in detecting and assessing processing delays which will impact applications utilizing the coupling facility. Processing delays
174

may manifest themselves as increased in CPU time, increased in elapsed time due to waiting, or some combination of both. There are several approaches to achieving better application or system performance: v Perform work as quick as possible. v Dont create additional work. v Dont do unnecessary work.
Perform work as quick as possible

Conditions elongating coupling facility request response (service + delay) times include the following: v Insufficient processor power allocated to the coupling facility LPAR. The most common cause of this condition is the sharing of CPs among coupling facility LPARs. The changes to RMF and CFCC will allow the installation to accurately determine the amount of available processing power being used. Elongation of SYNC service times against directory only cache structures is a good indicator of this condition. Alternatively, use the lock structure SYNC service times after removing the no subchannel delay component. v Insufficient number of paths between the CECs containing multiple MVS images and a coupling facility RMF does not report on the length of the delay due to path busy conditions, only the number of requests encountering the condition are reported. Requests encountering path delay will have elongated service times. The most common condition to cause path busy condition is EMIFing the sender ISC links among the MVS LPARs. The current rule of thumb is that the number of requests delayed for path busy (as reported on the Subchannel Activity report in the PTH field under the BUSY COUNTS column) should be less than 10% of total requests (on the same report under the # REQ TOTAL column). Note: In the case of EMIFed ISC links, be certain to take into account all the MVS LPARS utilizing a common physical connection(s). The solution to reducing the incidence of path busy conditions is to add ISC links between the processors or refrain from using EMIF, which will probably require additional links as well. v Insufficient number of subchannels between the MVS image and the coupling facility RMF reports on delay times and service times at the structure level and the coupling facility level (as previously discussed). Consequences of insufficient subchannels are: Increased MVS CPU consumption when the SYNC operation cannot be converted to an ASYNC operation. Increased service time for the SYNC operation which cannot be converted to an ASYNC operation. Requests being converted from SYNC to ASYNC. Increased queue time for ASYNC requests. The effect of the last two items is difficult to quantify. The effect depends on what the exploiter can do while a request is queued (i.e. was the delay anticipated and the request made asynchronously). Either of these two events will result in an elapsed time increase for some function. Whether this affects only one user or many users is dependent on the exploiter and its use of the structure. For example, if all cache structure castout operations are delayed, then it is possible for the cache to become full of changed data and is not a desirable condition.
175

There are two indicators to be monitored. 1. Increased SYNC operation service times (particularly against the lock structure) and the amount of spin time associated with the service time. 2. The percentage of operations delayed. Use the same rules and values previously discussed. The only way to increase the number of subchannels is to add more physical paths (i.e. ISC links). There are two subchannels per ISC link. Additional ISC links will help alleviate path busy conditions in an EMIF environment as well. v Insufficient IOP/SAP capacity on the CEC MVS image and the coupling facility The IOP/SAP on the CEC housing the MVS image handles IO to DASD, CTC traffic and ASYNC operations. As the number of ASYNC operations increases, the workload on the IOP/SAP increases. If the IOP/SAP becomes overloaded, the service time for ASYNC operations will elongate. ASYNC operations can be quite long and have a large standard deviation, thus changes are somewhat difficult to detect. Perhaps the easiest indicator to track is the Avg Q Length on the I/O Queuing Activity Report; if this value gets large it is an indicator that ASYNC requests as well as I/O requests are being delayed in IOP/SAP processing. v Hardware problems causing ISC link degradation. Monitor the hardware incidents in LOGREC and relevant console messages. Get the problem fixed.
Do not create additional work

XES and the coupling facility must work within the constraints imposed by the people that setup the parallel sysplex. When structures are not of sufficient size, XES, the coupling facility, and the exploiter will cope as best they can. For example if a directory only cache structure is being utilized and insufficient structure space is defined, directory entries will be reclaimed according to a least recently used algorithm. This will result in locally cached data items being marked invalid. Whenever these locally cached, but invalid, items are referenced, they will have to be read from DASD and registered with the coupling facility. In essence, unnecessary work has been created due to an improper definition. CPU cycles are required to perform the I/O and to register the data item with the coupling facility. Furthermore, the entity accessing the data item was delayed (elapsed time increased) while these two events occurred. RMF now provides information allowing the installation to detect, and perhaps alleviate, these conditions on a structure by structure basis. v Lock Structures Lock structure request service times should be short. Lock contention should be avoided, as contention must be resolved with the assistance of XES lock services requiring time and CPU cycles. Lock contention also generates additional XCF signalling. False lock contention should be avoided, it is expensive to resolve and is avoidable. RMF reports on the number of lock requests, the number delayed due to contention and the number delayed due to false contention. False lock contention can be alleviated by increasing the size of the lock table. To reduce true lock contention may require a change to the application, the data base, or the workload mix during the period of time the conflict occurs. One should attempt to manage true contention to less than 1.0 percent and false contention to less than 0.1 percent of the requests. v Cache Structures
176

Cache structures are not meant to replace local caches, they are intended to augment their use by: 1. Providing a mechanism for local buffer coherency 2. Providing additional high speed caching for frequently changed data. Use of the cache structure should not negatively impact the use of local cache. Prevent invalidation of a local cache entity unless the entity has been changed. This means avoid doing directory entry reclaims. This activity may be avoided by specifying a larger structure. RMF now reports on this activity to assist the installation in verifying the cache structure contains sufficient directory entries to keep track of locally cached data. If the cache structure user keeps data in the structure, attempt to keep frequently referenced data in the structure. Caching data in the coupling facility is only a performance boost when the data resides in the structure. It is important to have the structure sufficiently large and of the correct entry/data item configuration to support coupling facility caching of data without hindering performance. The RMF changes discussed previously will assist the installation in tracking anomalies or changes in data access patterns. RMF data is not sufficient in determining optimal read hit ratios.
Avoid unnecessary work

This activity is difficult to assess. The servers using coupling facility structures for message passing, locking, caching, state saving, etc. normally do very little coupling facility accesses unless prompted by some user of the server. Normally applications are causing the servers to perform some task involving coupling facility accesses. In extensive data sharing environments, the data base manager and the lock manager together will account for over half of the accesses to the coupling facility. The applications using the data base manager are normally critical to the business. Thus the first two activities are most often pursued. When all coupling facility tuning efforts have failed to meet the applications response time objectives, be it CPU time and/or elapsed time, then you should examine the applications. The causes of contention, excessive locking, cache disruption, etc. must be determined and reduced.
177
178
Appendix A. PR/SM LPAR considerations
Considerations in PR/SM Environments This appendix presents performance considerations for systems running in LPAR mode, including: v Interpretation of RMF reports v Tuning of an LPAR configuration
179
PR/SM environment
Understanding the Partition Data report

Using the Postprocessor report option REPORTS(CPU), you get the CPU Activity report. This report is divided into these sections: CPU ACTIVITY Information about processors and address spaces PARTITION DATA REPORT Information about all configured partitions LPAR CLUSTER REPORT Information about LPAR clusters Although this is an inconsistency, the first section is typically referred to as CPU Activity report, too. The third section is new, it is available with the Intelligent Resource Director (IRD) of the zSeries processors and is described in Appendix B. The Intelligent Resource Director.
C P U z/OS V1R11 SYSTEM ID NP1 RPT VERSION V1R11 RMF H/W MODEL E26 A C T I V I T Y PAGE DATE 11/28/09 TIME 09.30.00 INTERVAL 14.59.678 CYCLE 1.000 SECONDS 1
CPU
2097
MODEL
720
SEQUENCE CODE 00000000000473BF LOG PROC SHARE % 22.0 22.0 44.0
HIPERDISPATCH=NO --I/O INTERRUPTS-RATE % VIA TPI 0.00 7.98 7.98 0.00 2.48 2.48
---CPU--NUM TYPE
---------------- TIME % ---------------ONLINE LPAR BUSY MVS BUSY PARKED 16.77 26.49 24.92 100.0 99.88 99.90 0.00 0.00
2 CP 19.33 3 CP 100.00 CP TOTAL/AVERAGE
Figure 103. CPU Activity Report
The first part of the CPU Activity report provides information about the processors assigned to that partition which is gathering the data.
Report Analysis v ONLINE TIME PERCENTAGE The total online time percentage for all processors is 119.33% which is equivalent to 1.2 processors. This value is shown in the Partition Data report in the field PROCESSOR NUM. v LPAR BUSY TIME PERC The average utilization of the processors is 24.92%, this value can be found in the Partition Data report in the field LOGICAL PROCESSORS TOTAL for partition NP1. The LPAR BUSY time is (with Wait Completion = NO) the dispatch time of the processors that are assigned to the partition, and the percentage is based on their online time. v MVS BUSY TIME PERC This field shows the MVS view of the CPU utilization. The MVS BUSY time is the difference between online time and wait time. If the MVS system is busy when the partition is losing control, it stays in busy mode until the partition will be dispatched again, then the current task can continue, and wait state will be reached later. Therefore, the MVS BUSY time can be higher than the LPAR BUSY time, and the difference between both values is an indicator for CPU constraints in the system. The Partition Data report shows measurement data for all configured partitions. The line *PHYSICAL* is for reporting purposes only; it does not reflect a real partition.
180
PR/SM environment
PHYSICAL PROCESSORS TOTAL DISPATCH TIME DATA TOTAL (Partition 1) DISPATCH TIME DATA EFFECTIVE MVS Captured Time MVS Uncaptured Time Partition LPAR Mgmt Time ... DISPATCH TIME DATA TOTAL (Partition n) DISPATCH TIME DATA EFFECTIVE MVS Captured Time MVS Uncaptured Time Partition LPAR Mgmt Time ... *PHYSICAL*
... ...
... ... LPAR Time not attributed
P A R T I T I O N z/OS V1R11 SYSTEM ID NP1 RPT VERSION V1R11 RMF NP1 100 9 NO DYNAMIC
D A T A
R E P O R T PAGE 2 INTERVAL 14.59.678 CYCLE 1.000 SECONDS
DATE 11/28/09 TIME 09.30.00
9 7 2
N/A N/A N/A
--------- PARTITION DATA -----------------
-- LOGICAL PARTITION PROCESSOR DATA --
-- AVERAGE PROCESSOR UTILIZATION PERCENTAGES -LOGICAL PROCESSORS --- PHYSICAL PROCESSORS --EFFECTIVE TOTAL LPAR MGMT EFFECTIVE TOTAL 24.86 0.60 23.97 80.10 40.30 24.92 0.61 23.98 80.15 40.31 0.01 0.01 0.02 0.01 0.01 0.05 -----0.11 0.01 0.00 0.03 -----0.04 4.24 4.25 0.34 0.35 3.41 3.43 68.68 68.69 23.02 23.03 0.05 ------ -----99.69 99.80 99.96 0.00 0.03 ------ -----99.95 99.99 99.95 0.00
NAME
WGT 20 1 10 300 200
----MSU---- -CAPPING-- PROCESSOR- ----DISPATCH TIME DATA---DEF ACT DEF WLM% NUM TYPE EFFECTIVE TOTAL 100 0 5 95 50 10 NO 1 YES 8 NO 155 NO 52 NO 62.2 1.2 0.0 4 3.3 1.0 0.0 6.0 0.0 4 CP CP CP CP CP 00.04.27.302 00.00.21.680 00.03.35.761 01.12.05.405 00.24.11.147 00.04.27.519 00.00.22.083 00.03.35.859 01.12.06.405 00.24.11.311 00.00.03.103 ------------ -----------01.44.41.295 01.44.46.280 00.14.59.625 00.00.00.000 00.00.00.065 ------------ -----------00.14.59.611 00.00.00.000
NP1 A NP2 A NP3 A NP4 A NP5 A *PHYSICAL* TOTAL CFC1 A CFC2 A *PHYSICAL* TOTAL CB88 CB89 D D
DED DED
1 1
ICF ICF
99.98 0.00
99.99 0.00
Figure 104. Partition Data Report
Report Analysis v PROCESSOR NUM The number of online processors for partition NP1 is 1.2 , this reflects the online time percentages of 19.33% + 100% = 119.33% in the CPU report. v LOGICAL PROCESSORS - TOTAL DISPATCH TIME
DISPATCH TIME ONLINE TIME LP TOTAL UTIL 4.27.519 = 268 sec 119.33% of 14.59.678 = 1074 sec (268/1074)*100 = 24.92 %
The value of 24.92% is shown as LPAR BUSY TIME PERC in the CPU report.
181
PR/SM environment
Report Analysis v PHYSICAL PROCESSORS - TOTAL DISPATCH TIME

DISPATCH TIME INTERVAL TIME (7 CPSs) LP TOTAL UTIL 4.27.519 = 268 sec 7 * 14.59.678 = 6298 sec (268/6298)*100 = 4.25 %
Partition NP1 has a processor utilization of 4.25% of the total 2064-109 system. v PHYSICAL PROCESSORS - LPAR MGMT Each partitions CPU consumption for LPAR management is calculated as the difference between total and effective dispatch time. It is possible that the total dispatch time is smaller than the effective dispatch time. This situation occurs when partitions get overruns in their dispatch intervals caused by machine delays. The most typical form of this is caused by an MVS partition trying to talk to a coupling facility but getting significant delays or time-outs. It is sometimes symptomatic of recovery problems on the machine. In this case, field LPAR MGMT is filled with ****. v *PHYSICAL* The Physical Management Time is collected and reported by RMF in this line. The partition named *PHYSICAL* does not exist, this line is created for reporting purposes.
Report Analysis v LOGICAL PROCESSORS - UTILIZATION CF PARTITION PARTITION CFC1 99.99% UTILIZATION CF partitions are actually always busy 100% of the time. For purposes of reporting CF utilization, the CPU time is accumulated in two buckets: busy and idle. Idle is the CPU time spent looking for work (the polling loop), busy is all other time (time spent processing a command or background work). From these times, the CF Utilization reported in the Coupling Facility Activity report is 100*busy/(busy+idle).
Defining logical processors

Obviously, there is a cost to operate in LPAR Mode. One of the key aspects for defining an LPAR configuration is the number of logical partitions. Defining more logical partitions than are required to handle a partitions workload affects LPAR management time. Define as few LPARs as are needed to support your business needs.
Rule-of-Thumb Number of Logical Processors Number of Logical Processors 2 * Number of Physical Processors
182
PR/SM environment
The following example demonstrates the impact of the definition of logical processors on the total utilization of a system. In this sample, the processor is an 8-way system. For a system with 8 physical processors, 9 partitions have been defined with a total of 70 logical processors which is far away from the rule-of-thumb that has been mentioned above.
LPAR # 1 2 3 4 5 6 7 8 9 LPAR Name PROD1 PROD2 PROD3 PROD4 PROD5 TEST PROD6 PROD7 PROD8 # LP 8 8 8 6 8 8 8 8 8 Weights 80 50 40 12 50 7 106 40 5
With this configuration, the LPAR management time percentage was 22.83% which is nearly the capacity of 2 processors in this 8-way system. Due to significant performance problems, the configuration has been changed with defining 20 LPs for the same number of partitions, which is close to the 2:1 rule. The weighting factor was taken to estimate the number of LPs for each partition.
LPAR # 1 2 3 4 5 6 7 8 9 LPAR Name PROD1 PROD2 PROD3 PROD4 PROD5 TEST PROD6 PROD7 PROD8 # LP 4 2 1 2 3 1 4 2 1 Weights 80 50 40 12 50 7 126 40 5
By changing the number of LPs from 70 to 20, the LPAR management time percentage could be reduced by 20% of the total capacity (from 22.83% to 2.87%) for an 8-way processor which is equivalent to the capacity of 1.6 processors.
Intelligent Resource Director Considerations about the number of logical processors are relevant and required only for partitions which are not under control of WLM LPAR management. This is a function of the Intelligent Resource Director which is available on z900 servers, and it is described in the following chapter.
183
PR/SM environment
184
Appendix B. The Intelligent Resource Director
Learning about the Intelligent Resource Director This appendix describes the functions of the Intelligent Resource Director and its reporting in RMF, including: v Dynamic Channel Path Management v Channel Subsystem Priority Queuing v LPAR CPU Management
185
Intelligent Resource Director
Overview
The Intelligent Resource Director (IRD) is a functional enhancement exclusive to IBMs zSeries family. It is provided in z900 servers and is an extension of IBMs industry leading clustering architecture, the Parallel Sysplex. IRD uses LPAR clusters (see LPAR cluster on page 190 for a description) that will efficiently balance processor and I/O resources between multiple applications based on quality of service goals defined by the customer. These enhancements will ensure that the unpredictable needs of new e-transaction processing workloads can be managed dynamically according to business requirements. The current implementation addresses three separate but mutually supportive functions: v Dynamic Channel Path Management (DCM) This feature enables customers to have channel paths that dynamically and automatically move to those I/O devices that have a need for additional bandwidth. The benefits are enhanced by the use of clustered LPARS. v Channel Subsystem Priority Queuing Channel subsystem priority queuing on the z900 allows the priority queuing of I/O requests within the channel subsystem and the specification of relative priority among LPARs. v LPAR CPU Management WLM dynamically adjusts the number of logical processors online to an LPAR and the processor weight based on the WLM policy. The ability to move the CPU resources across an LPAR cluster provides processing power to where it is most needed based on WLM service policy.
Dynamic channel path management

There is no typical workload. The requirements for processor capacity, I/O capacity and other resources vary throughout the day, week, month and year. Traditional methods of monitoring and tuning result in configurations that attempt to address the majority of your workload characteristics, leaving resources under or over utilized at various times. Even a well tuned DASD subsystem can quickly become inappropriately configured as data is migrated to and from different subsystems. A better approach would involve tuning decisions being made more responsively, using the capabilities of the operating system to self-monitor and self-adjust. DCM provides the ability to have the system automatically manage the number of paths available to disk subsystems. By making additional paths available where they are needed, this increases the effectiveness of your installed channels, and potentially reduces the number of channels required to deliver a given level of service. DCM also provides availability benefits by attempting to ensure that the paths it adds to a control unit have as few points of failure in common with existing paths as possible, and configuration management benefits by removing unused paths from over-configured control units. Where paths can be shared by Multiple Image Facility (MIF), DCM will coordinate its activities within an LPAR cluster. Where several channels are attached from a central processor complex (CPC) to a switch, the channels can be considered as a resource pool for accessing any of the control units attached to the same switch. To achieve this without DCM would
186

require deactivating paths, performing a dynamic I/O reconfiguration and activating new paths. DCM achieves the equivalent process automatically, using those same mechanisms. Channels managed by DCM are referred to as managed channels. With HCD, it has to be defined how many channel paths of a control unit are to be managed. At least one channel path must be defined as static.
Value of dynamic channel path management

v Improved system performance Improved system performance is achieved by automatic path balancing and service policy. Reassignment is done in real-time or on WLM adjustment interval. A customer may reassign once a year, WLM does it every few seconds. v More effective utilization of installed hardware Channels are automatically balanced, providing opportunities to use fewer I/O paths to service the same workload. v Simplified I/O definition The connection between managed channels and control units does not have to be explicitly defined, but at least one static path is required for each control unit. v Reduced skills required to manage z/OS Managed channels and control units are automatically monitored, balanced, tuned and reconfigured.
Reporting of dynamic channel path management

All RMF reports about channel activities and I/O queuing activities provide information about DCM. The following examples show Monitor III reports.
187
Channel Path Activity report

| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
RMF V1R11 Command ===> Samples: 60 Channel Path ID No G Type 4 4 12 14 16 20 27 2B 2C 30 31 37 38 39 3A 3E 7C 7D 81 82 83 84 85 8C A6 B6 E0 E1 E2 E3 *CNCSM *FC_SM OSD OSD OSD CTC_S CNC_S CNC_S CNC_S FC_S FC_S FC_S FC_S FC_S FC_S FC_S CNCSM CNCSM FC_S FC_S FC_S FC_S FC_S FC_S FC_SM FC_SM IQD IQD IQD IQD System: CB88 Date: 11/28/09 Time: 08.00.00 Channel Path Activity Line 1 of 69 Scroll ===> HALF Range: 60 Sec
Utilization(%) Part Tot Bus 0.1 0.0 0.0 0.0 0.4 0.0 0.0 1.3 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.3 0.1 1.1 0.1 0.0 0.0 0.4 0.6 0.0 0.0 0.3 0.0 0.0 0.0 1.3 0.0 0.0 5.2 0.5 32.5 33.3 0.5 0.5 0.1 0.1 0.0 0.8 0.1 14.2 0.4 0.4 0.0 6.8 10.8 0.0 0.0
Read(B/s) Write(B/s) FICON OPS zHPF OPS Part Tot Part Tot Rate Actv Rate Actv
5 5 4 4 4 4 4
3 5 5 4 3 3 5 5
Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
0.0 0.0 0.0 0.0
0 0 0 0 2K 19K 0 0 5K 531K 511K 514K 511K 5M 3M 5M
8.9 8.5 0.1 0.1 0.0 0.0 0.0
205 429 0 0 374 365 0
52M 50M 619K 613K 23K 21K 10K
205 330 0 0 0 0 0
235K 249K 42K 73K 31K 32K 1K
186 185 24 30 8 7 3
1 2 1 2 1 1 1
0 0 0 0 0 0 0
0 0 0 0 0 0 0
3.2 0.2 0.2 0.0 0.7 1.4 0.0 0.0
801K 37K 36K 25 62K 344K 0 0
18M 147K 1M 870K 28K 86K 887K 27K 83K 101 0 0 2M 61K 1M 6M 61K 801K 0 0 0 0 0 0 0 315K 0 0 0 0 0 0
738 7 8 0 420 720 0 0
2 1 1 1 1 2 0 0
132 36 36 0 157 0 0 0
1 1 1 0 1 0 0 0
Figure 105. CHANNEL Report
For all channels that are managed by DCM, additional information is available. These channels are not assigned to a specific control unit permanently, but belong to a pool of channels. Based on workload requirements in the system, these channels will be assigned dynamically by DCM. On top of the report, there is a consolidated data section for managed channel paths displaying the total number of channel paths for each type and the average activity data. The character M as suffix of the acronym for the channel path type is an indicator that the channel is managed by DCM.
188
I/O Queuing Activity report

RMF V1R11 Command ===> Samples: 30 System: S5C Date: 11/28/09 Time: 03.23.30 I/O Queuing Activity Line 1 of 54 Scroll ===> HALF Range: 30 Sec
Path D7 D6 B0 B1 B2 B3 95 B0 B1 B2 B3 PF PF NP NP PF NP NP PF PF
DCM Group DCM CTL Units MN MX DEF LCU 5F00 5F00 8000 8000 8000 8000 8000 8100 8100 8100 8100 0048 0048 0048 0069 0069 0069 0069 0069 0069 006A 006A 006A 006A
Cont Del Q AVG Rate Lngth CSS
CHPID %DP %CU AVG AVG Taken Busy Busy CUB CMR 1.13 0.97 2.10 82.17 83.83 0.00 0.00 83.17 249.17 0.00 0.00 124.53 124.87 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ----0.0 0.0 ----0.0 0.0 0.2 0.2 0.2 0.2 0.2 ----0.2 0.2 ----0.2 0.2
0.0
0.00
0.3
0.0
0.00
0.4
Figure 106. IOQUEUE Report
The values in columns MIN MAX report the minimum and maximum number of DCM managed channels for one LCU (in this interval). DEF is the maximum number of managed channels for each LCU as it has been defined with HCD. The line with these values is available only for LCUs with DCM managed channels. It also contains the I/O activity rate, director port contention, and control unit contention of all DCM managed channels. These values may include also measurements of managed channels which were partially online.
Channel subsystem priority queuing

Channel subsystem (CSS) priority queuing is a new function available on z900 processors either in basic or LPAR mode that allows the operating system to specify a priority value when starting an I/O request. When there is contention causing queuing in the channel subsystem, the request will be prioritized by this value. z/OS sets the priority based on the WLM service policy. This complements the priority management that sets I/O priority for IOS UCB queues, and for queuing in the storage subsystem. CSS priority queuing uses different priorities calculated in a different way than the I/O priorities used for UCB and control unit queuing.
189
Value of channel subsystem priority queuing

v Improved performance I/O from work that is not meeting its goals may be prioritized ahead of I/O from work that is meeting its goals, providing the Workload Manager with an additional method of adjusting I/O performance. Channel subsystem priority queuing is complementary to UCB priority queuing and control unit priority queuing, each addressing a different mechanism that may affect I/O performance. v More effective use of providing I/O bandwidth Because interference from other I/O is reduced, the total I/O bandwidth required to meet the goals of high importance work may possibly be reduced. Previously, bandwidth had to be acquired to service the whole workload in order to meet the performance requirements of important work. v Reduced skills required to manage z/OS Monitoring and tuning requirements are reduced because of the self tuning abilities of the channel subsystem.
Reporting of channel subsystem priority queuing

There is no specific reporting of this function in RMF, but you may see a reduced number of I/O delay samples in the Workload Activity report.
LPAR CPU management

LPAR CPU management allows WLM to manage the processor weighting and number of logical processors across an LPAR cluster. CPU resources are automatically moved toward LPARs with the most need, by adjusting the partitions weight. The sum of the weights for the participants of an LPAR cluster is viewed as a pooled resource which can be apportioned among the participants to meet the goals defined in the WLM service policy. The installation can place limits on the processor weight range. WLM also manages the available processors by varying unneeded CPs offline (more logical CPs implies more parallelism, and less weight per CP). LPAR overheads are decreased by reducing the number of logical processors.
LPAR cluster
An LPAR cluster is a group of logical partitions that are resident on the same physical server and in the same sysplex.
190
CPC
Sysplex 1 LPAR Cluster 1

Partition 1 z/OS V1R1 Partition 2 z/OS V1R1

Figure 107. LPAR Cluster - Example 1
The next figure is a more complex configuration, with four LPAR clusters grouped into two sysplexes across two CPCs:
CPC 1

CPC 2
LPAR Cluster 3

LPAR Cluster 4
Figure 108. LPAR Cluster - Example 2
191
Value of LPAR CPU management

v Logical CPs are performing at the fastest uniprocessor speed available. This results in the number of logical CPs tuned to the number of physical CPs of service being delivered by the LPARs current weight. If the LPAR is getting 4 equivalent physical CPs of service and has 8 logical CPs online to z/OS, then each logical CP only gets half of an equivalent physical CP. For example, if a CP delivers 200 MIPS, half of it will deliver 100 MIPS. This occurs because each logical CP gets fewer time slices. v Reduces LPAR overhead. There is an LPAR overhead for managing a logical CP. The higher the number of logical CPs in relation to the number of physical CPs, the higher the LPAR overhead. This is because LPAR has to do more processing to manage the number of logical CPs which exceeds the number of physical CPs. v Gives z/OS more control over how CP resources are distributed. z/OS is able to change the assigned CP resources (LPAR weights) and place them where they are required for the work. Making both of these adjustments is simple from an operator perspective, but what is difficult is identifying when the changes are required and whether the changes have had a positive effect. LPAR CPU management v identifies what changes are needed and when v projects the likely results on both the work it is trying to help and the work that it will be taking the resources from v performs the changes v analyzes the results to ensure the changes have been effective.
LPAR CPU management decision controls

There are two parts of LPAR CPU management: v WLM LPAR Weight Management v WLM LPAR Vary CPU Management
WLM LPAR weight management

WLM LPAR weight management involves extending the receiver/donor logic for CPU delays to CP resource being used by other LPARs in the LPAR cluster. Until now, WLM could only move CP resource from one service class period to another on the same system by adjusting the dispatching priority of the receiver and/or the donor. When using WLM LPAR weight management, WLM can now alter the weights of LPARs in the LPAR cluster, effectively moving CP resource from one LPAR to another. In the past, an operator was required to change LPAR current weights manually. However, an operator would first have to determine that there was a CPU delay and then choose which other LPAR the weight units should be taken from. This approach was not practical.
WLM LPAR vary CPU management

WLM LPAR vary CPU management does not look at service class periods when making adjustments. It ensures the number of online logical CPs is approximately equal to the number of physical CPs required to do the work. It is more like a tuning action for the CP resources available to the system. Prior to LPAR CPU management, there was no function in WLM to tune the available CP resources in relation to the physical CP resources being used by the system. Once again if this
192

function is to be performed manually, an operator first needs to identify this is a problem before adjusting the number of logical CPs.
LPAR CPU management controls

The IBM z900 CPC and CPU management introduce new terms for LPAR weights. The following are used when discussing LPAR weights. v Initial processing weight This weight becomes the LPARs actual weight immediately following an IPL. v Minimum and maximum processing weight This is the minimum and maximum weight that WLM LPAR weight management will assign as an LPARs actual weight. v Actual weight An LPARs weight at a point in time when the systems is running. This is the weight that WLM LPAR weight management and WLM LPAR vary CPU management use when performing their primary functions. v Enable work load management must be checked to enable LPAR CPU management. Recommended usage is to use large (2 or 3 digits) values for CPU weights to allow WLM a better granularity.
Reporting of LPAR CPU management

The Postprocessor CPU Activity report has been extended by a new section called LPAR Cluster report.
LPAR Cluster report

L P A R z/OS V1R11 C L U S T E R R E P O R T PAGE SYSTEM ID NP1 RPT VERSION V1R11 RMF DATE 11/28/09 TIME 09.30.00 INTERVAL 14.59.678 CYCLE 1.000 SECONDS ------ STORAGE STATISTICS ------- CENTRAL ---- EXPANDED -3
------ WEIGHTING STATISTICS -------- DEFINED --- ---- ACTUAL ----INIT MIN MAX AVG MIN % MAX % 20 20 40 20 100 0.0 1 1 10 10 50 10 100 0.0 ---------------------------------31 300 100 500 300 0.0 0.0 200 200 ---------------------------------500
---- PROCESSOR STATISTICS ------- NUMBER --DEFINED ACTUAL -- TOTAL% -LBUSY PBUSY
CLUSTER SVPLEX1
PARTITION
SYSTEM
NP1 NP1 NP2 NP2 NP3 NP3 --------------------------TOTAL SVPLEX2 NP4 NP4 NP5 NP5 --------------------------TOTAL
4 1.2 24.92 3.30 4 4 0.73 0.27 4 1.0 23.99 2.67 -----------------------------12 49.64 6.24 6 4.2 80.15 53.43 4 4 40.33 17.92 -----------------------------10 120.5 71.36
1024 N/A 1024 N/A 1024 N/A ------------------------------3072 N/A 1024 N/A 1024 N/A ------------------------------2048 N/A
Figure 109. LPAR Cluster Report
This new report, as well as the Partition Data report (also part of the CPU Activity report), provides information about LPAR CPU management.
193
Report Analysis v CLUSTER PARTITION The three partitions NP1, NP2, and NP3 belong to the LPAR cluster SVPLEX1, but only NP1 and NP3 are under control of LPAR CPU management. v DEFINED INIT MIN MAX NP1 has an initial weight of 20 which also is the defined minimum, and it has a defined maximum weight of 40 . v ACTUAL AVG The actual average weight in the interval for NP1 is 20 . v ACTUAL MIN % MAX % These values provide the percentage of the interval when the partitions weight was within 10% of the defined minimum or maximum weight. NP1 was within 10% of the specified minimum (20) for the entire interval (MIN%= 100 ). v NUMBER DEFINED ACTUAL NP1 has been defined with 4 logical processors. In the current interval, WLM LPAR CPU management has assigned 1.2 processors to this partition in average. This value can be seen also in the Partition Data report (PROCESSOR NUM) and reflects the total ONLINE TIME PERCENTAGE in the CPU Activity report. v TOTAL% LBUSY PBUSY These values give the utilization of the logical processors assigned to the partition (based on its online time) and of all physical processors of the CPC (based on the interval time). They are also shown in the Partition Data report. v STORAGE STATISTICS The amount of central and expanded storage assigned to each partition is given in these columns.
194

P A R T I T I O N z/OS V1R11 SYSTEM ID NP1 RPT VERSION V1R11 RMF D A T A R E P O R T PAGE DATE 11/28/09 TIME 09.30.00 INTERVAL 14.59.678 CYCLE 1.000 SECONDS 2
9 7 2
N/A N/A N/A
--------- PARTITION DATA --------------------MSU---DEF ACT
-- LOGICAL PARTITION PROCESSOR DATA -----DISPATCH TIME DATA---EFFECTIVE TOTAL 00.04.27.302 00.04.27.519 00.00.21.680 00.00.22.083 00.03.35.761 00.03.35.859 01.12.05.405 01.12.06.405 00.24.11.147 00.24.11.311 00.00.03.103 ------------ -----------01.44.41.295 01.44.46.280 00.14.59.611 00.00.00.000 -----------00.14.59.625 00.00.00.000 00.00.00.065 ------------
-- AVERAGE PROCESSOR UTILIZATION PERCENTAGES -LOGICAL PROCESSORS --- PHYSICAL PROCESSORS --EFFECTIVE TOTAL LPAR MGMT EFFECTIVE TOTAL 24.86 0.60 23.97 80.10 40.30 24.92 0.61 23.98 80.15 40.31 0.01 0.01 0.02 0.01 0.01 0.05 -----0.11 0.01 0.00 0.03 -----0.04 4.25 0.35 3.43 68.69 23.03 0.05 ------ -----99.69 99.80 99.96 0.00 0.03 ------ -----99.95 99.99 99.95 0.00 4.24 0.34 3.41 68.68 23.02
S A A A A A
WGT
-CAPPING-- PROCESSORDEF WLM% NUM TYPE 62.2 1.2 0.0 4 3.3 1.0 0.0 6.0 0.0 4 CP CP CP CP CP
20 100 1 0 10 5 300 95 200 50
10 NO 1 YES 8 NO 155 NO 52 NO
DED DED
1 1
ICF ICF
99.98 0.00
99.99 0.00
Figure 110. Partition Data Report
The Partition Data report contains in the fields WGT and PROCESSOR NUM the actual values for weighting and number of logical processors, either as dynamically assigned by LPAR CPU management (for example, NP1 or NP3) or as statically defined by the customer (for example, NP2). Workload License Charges (WLC) Workload License Charges (WLC) is a pricing model that provides more flexibility, supports the variability and growth of e-business workloads, and improves the cost of computing. Customers will have the flexibility to configure their system to match workload requirements and be charged for the products used by these workloads at less than the full machine capacity, allowing customers to pay for what they need. The Postprocessor CPU Activity (Partition Data) report will show CPU resource consumption within an LPAR in terms of millions of service units (MSUs) and the corresponding LPAR MSU defined capacity. This helps you to understand how much of the defined capacity an LPAR is consuming. In addition, new overview criteria are available for the Postprocessor.
195
Report Analysis The following values in the report are related to Workload License Charges (WLC): v IMAGE CAPACITY A new value IMAGE CAPACITY in this report gives information about the CPU capacity which is available to the partition. It is measured in MSUs (millions of CPU service units) per hour. There are these alternatives: The partitions defined capacity set via the Hardware Management Console, if any. The report shows that NP1 has a defined capacity of 100 MSUs. The capacity based on logical processors defined for the partition, if the partition is uncapped and has no defined capacity. Example:
MSU of CPC # physical processors # logical processors Image capacity 290 9 4 (4/9) * 290 = 129
The capacity at the partitions weight, if the partition is capped via the Hardware Management Console. v MSU DEF ACT For each partition, the defined and actual MSU values are given. For the partition which is gathering the data, here partition NP1, DEF MSU ( 100 ) is identical to IMAGE CAPACITY while the actual MSU consumption in the interval is 10 MSUs (calculated as MSUs per hour). A value of zero in field DEF indicates there is no defined capacity set for this partition. ACT MSU contains the actual consumption of service units in MSUs per hour. This value does not correlate to the value of service units given in the Workload Activity report. Service Units in Workload Activity Report In the Service Policy page you find for each processor the value SU/SEC. This value is based on the number of logical processors assigned to the partition. If you have 4 logical processors in a z900 server, then this value relates to a 2064-104 (with 148 MSUs).
1 LP 4 LPs = 10289 SU/SEC = 41156 SU/SEC = 148 MSU/H
Service Units in Partition Data Report Service unit consumption in a WLC context is based on the number of physical processors in the CPC. In a 9-way z900 server 2064-109 with 290 MSUs, this value is taken:
1 CP 9 CPs = 8964 SU/SEC = 80676 SU/SEC = 290 MSU/H
v CAPPING WLM% The concept of WLC is to allow a partition a limited resource consumption - the defined capacity limit. If the partition is using more service units than defined, WLM will cap this partition. This is not done directly on the current consumption but it is based on a long-term average. So peaks exceeding the defined limit are possible. The field CAPPING WLM% shows the percentage of the reporting interval when a partition has been capped by WLM.
196
Appendix C. Planning considerations for HiperDispatch mode

With z/OS V1R10, workload management and dispatching have been enhanced to take advantage of the IBM System z10 hardware design. A dispatching mode called HiperDispatch has been implemented to provide additional processing efficiencies. With HiperDispatch the intent is to align work to a smaller subset of processors in order to maximize the benefits of the processor cache structures, and thereby, reduce the amount of CPU time required to execute work.
Introduction to HiperDispatch mode

For all levels of z/OS, a TCB or SRB may be dispatched on any logical processor of the type required (general purpose processor, zAAP zIIP). A unit of work starts on one logical processor and subsequently may be dispatched on any other logical processor. The logical processors for one LPAR image receive an equal share for equal access to the physical processors under PR/SM LPAR control. For example, if the weight of a logical partition with four logical processors results in a share of two physical processors, or 200%, the LPAR hypervisor manages each of the four logical processors with a 50% share of a physical processor. All logical processors are used if there is work available, and they typically have similar processing utilizations. With HiperDispatch mode, the intention is to manage work across fewer logical processors. A new concept of maintaining a working set of processors required to handle the workload is introduced. In the previous example of a logical partition with a 200% processor share and four logical processors, two logical processors are sufficient to obtain the two physical processors worth of capacity specified by the weight; the other two logical processors allow the partition to access capacity available from other partitions with insufficient workload to consume their share. z/OS limits the number of active logical processors to the number needed based on partition weight settings, workload demand and available capacity. z/OS also takes into account the processor topology when dispatching work, and it works with enhanced PR/SM microcode to build a strong affinity between logical processors and physical processors in the processor configuration. The logical processors for a partition in HiperDispatch mode fall into one of the following categories: v Some of the logical processors for a partition may receive a 100% processor share, meaning this logical processor receives an LPAR target for 100% share of a physical processor. These would be viewed as having a high processor share. Typically, if a partition is large enough, most of the logical partitions share is allocated among logical processors with a 100% share. PR/SM LPAR establishes a strong affinity between the logical processor and a physical processor, and these processors provide optimal efficiencies in HiperDispatch mode. v Other logical processors may have a medium amount of physical processor share. The logical processors would have a processor share greater than 0% and up to 100%. These medium logical processors have the remainder of the partitions share after the allocation of the logical processors with the high share. LPAR reserves at least a 50% physical processor share for the medium processor assignments, assuming the logical partition is entitled to at least that amount of service.
197
v Some logical processors may have a low amount, or 0%, of physical processor share. These discretionary logical processors are not needed to allow the partition to consume the physical processor resource associated with its weight. These logical processors may be parked. In a parked state, discretionary processors do not dispatch work; they are in a long term wait state. These logical processors are parked when they are not needed to handle the partitions workload (not enough load) or are not useful because physical capacity does not exist for PR/SM to dispatch (no available time from other logical partitions).
How RMF supports HiperDispatch mode

When examining a CPU Activity report for a system running in HiperDispatch mode, you may see very different processor utilizations across different logical processors of a logical partition. Figure 111 shows a report with an example of this behavior, and it also shows two columns related to HiperDispatch: TIME % PARKED and LOG PROC SHARE %.
C P U z/OS V1R11 SYSTEM ID S59 RPT VERSION V1R11 RMF H/W MODEL E40 A C T I V I T Y PAGE DATE 11/28/2009 TIME 16.45.00 INTERVAL 14.59.997 CYCLE 1.000 SECONDS 1
CPU
2097
MODEL
732
SEQUENCE CODE 00000000000DC6CE LOG PROC SHARE % 100.0 100.0 100.0 100.0 100.0 70.0 70.0 0.0 640.0
HIPERDISPATCH=YES --I/O INTERRUPTS-RATE % VIA TPI 5.80 4.59 5.10 2.40 8435 20.74 14.15 0.00 8488 48.75 55.30 55.18 53.75 10.05 4.95 19.39 0.00 10.14
---CPU--NUM TYPE
---------------- TIME % ---------------ONLINE LPAR BUSY MVS BUSY PARKED 96.33 95.96 95.79 95.46 95.08 73.92 74.33 13.84 80.09 97.34 97.07 96.84 96.68 96.41 96.86 97.13 98.89 96.94 0.00 0.00 0.00 0.00 0.00 0.00 0.00 85.78
0 CP 100.00 1 CP 100.00 2 CP 100.00 3 CP 100.00 4 CP 100.00 5 CP 100.00 6 CP 100.00 7 CP 100.00 TOTAL/AVERAGE ... ... ...
Figure 111. CPU Activity report
In this example, the logical processor share for the partition of 640.0% was allocated across five logical processors with a high share of 100%, two logical processors with a medium share of 70%, and one discretionary logical processor, CP 7, with a low share of 0% which was parked 85.78% and thus unparked 14.22% of the online interval time. During this same interval, CP 7 was busy processing 13.84% of the time. With HIPERDISPATCH=NO, the logical processor share would be 80% for each of the 8 logical processors. The TIME % MVS BUSY field in the CPU Activity report reflects the effective used capacity for the logical processors and the entire logical partition. The figures are calculated from the difference between online time and MVS wait time to provide an operating system perspective of busy time. Parked processors in HiperDispatch mode will generally reflect unavailable capacity at high physical processor utilizations. The formula for MVS BUSY TIME % has been changed with HiperDispatch mode to exclude Parked Time to show how busy the logical processor was when not parked. The formula is now:
MVS BUSY TIME % = Online Time - (Wait Time + Parked Time) ----------------------------------------- * 100 Online Time - Parked Time
HiperDispatch mode changes the effect of the CPU management provided by the Intelligent Resource Director (IRD). The WLM LPAR Weight Management functions
198
as before. However, the Vary CPU Management is replaced by the parked / not parked aspect of discretionary processors. As with the IRD function, the initial specification for the number of logical processors in HiperDispatch mode is simple: define as many as there will probably be required. The MSU calculation associated with a logical partition does not change with the presence of parked processors.
Processing benefits
HiperDispatch can improve efficiency in both hardware and software functionality: v Work may be dispatched across fewer logical processors, therefore reducing the multi-processor (MP) effects and lowering the interference among multiple partitions. v Specific z/OS tasks may be dispatched to a small subset of logical processors which PR/SM will tie to the same physical processors thus improving the hardware cache re-use and reference performance such as reducing the rate of cross-book communication. Therefore, the potential improvement from HiperDispatch depends from: v the number of physical processors v the size of the z/OS images in the configuration v the logical to physical processor ratio v the memory reference pattern or storage hierarchy characteristics of the workload. Generally, a configuration where the largest z/OS image fits within a book will see minimal improvement. Workloads which are fairly CPU-intensive (like batch applications) will see only small improvements even for configurations with larger z/OS images since they typically have long-running tasks that tend to stick on a logical engine anyway. Workloads that tend to have common tasks and high dispatch rates, typical for transactional applications, may see larger improvements, again depending on the size of the z/OS images involved. LPAR configurations that are over-committed, that is, they have higher logical to physical ratios, may see some improvement although the benefit of dispatching to a reduced number of logical processors overlaps with benefits already available with IRD and various automation techniques that tend to reduce the number of online logical processors to match capacity needs. The range of benefit may vary from 0% to 10% depending on the circumstances as described above; specifically, configurations with z/OS images small enough to fit in a book or running batch-like workloads will tend to achieve only small benefit. Multibook configurations with z/OS images in the 16way to 32way range and running transactional workloads will tend to achieve medium benefit, and very large multi-book configurations with very large z/OS images and running workloads with intense memory reference patterns will tend to achieve benefit at the high end of the range. For further topics related to HiperDispatch (for example, an introduction to HiperDispatch or WLM policy considerations) refer to the white paper:
Planning Considerations for HiperDispatch Mode
Appendix C. Planning considerations for HiperDispatch mode
199
200
Appendix D. Other delays
Some Other Delays This information unit discusses the delays shown by Monitor III that have not been covered yet: v Enqueue delays v HSM delays v JES delays v OPER delays v Unknown delays
201
Other delays
Enqueue delays
Here is an approach to investigating ENQ delays using Monitor III.
ENQ report
RMF V1R11 Command ===> Samples: 100 DLY % 49 System: PRD1 Date: 04/07/09 Time: 10.32.00 ENQ Delays Line 1 of 2 Scroll ===> HALF Range: 100 Sec
Jobname BHOLEQB
-------------- Resource Waiting -------------% STAT Major/Minor Names (Scope) 49 EW SYSDSN SYS3.BB.DATA (SYS)
---- Holding ---% Name/SYS STAT 49 BTEUPRT
Figure 112. Monitor III Enqueue Delays Report
Look at the ENQ report for the user or user group. Use the cursor to select the resource name with most delay and press ENTER to get the ENQR report: v Use this report to view all the contentions. v What is the major name contributing to most of the delay? v Find the largest holding user for this resource, then go through the dialog with this jobname. | | | The HELP function (PF1) on the ENQJ report provides description of some different MAJOR/MINOR names. Also check the z/OS MVS Diagnosis: Reference for a description of the MAJOR/MINOR names. If the name does not exist there, it may be a local name. Knowing the resource or the application provides a greater understanding of this problem.
Major names
SYSDSN Enqueue, look for the following: v Submitted job needs a data set that the TSO user has v Jobs actually have to run sequentially SYSIGGV2 / SYSCTLG Enqueue, look for the following: v Large LISTCATs running concurrently with a DELETE or HDELETE v Massive deletes occurring from another address space v Allocating and deleting but not using the ISPF log data set Try to find out what the largest holder of SYSIGGV2 was doing at the time of the problem. If this is not possible, and the user is a TSO user, he may have been doing massive deletes or a LISTCAT. If the problem occurred while exiting from ISPF, the delete of the Log data set could have caused the catalog access. If so, consider making the primary and secondary allocation of the log data set zero. This causes the log data set never to be allocated when entering ISPF.
202
Other delays
SYSVTOC Enqueue, look for the following: Try to determine who is allocating space and why it is taking so long. Possibly a job or user is frequently acquiring and freeing the space. If this is the case, acquire it once and do not free it. If the space is for a temporary data set, try using VIO instead. SYSZRACF Enqueue SYSZRACF is enqueued exclusively when updating the RACF data set. Find out why it is taking so long and reduce the time. SYSIEFSD Enqueue SYSIEFSD has several minor names: CHNGDEVS, DDRTPUR, ALLOCTP, DDRDA, Q4, Q6, QlO, RPL, STCQUE, VARYDEV, and TSOQUE. If the minor name is CHNGDEVS, look for a large DEVICE USING or DEVICE DELAY for the HOLDING address space. For example DEVICE DELAY may be near 100%. Obtain the dominant volumeID involved. If the time duration spans several intervals look at the volumeIDs for each interval within the problem time range. Browse the SYSLOG for mount or vary activity at or about this time. Determine if anything can be done to prevent this in the future.
203
Other delays
HSM delays
RMF V1R11 Command ===> Samples: 100 DLY % 94 82 77 System: MVS1 Date: 11/28/09 Time: 10.03.20 HSM Delays Line 1 of 3 Scroll ===> HALF Range: 100 Sec
Jobname AUDTRPTZ APETER TJSMITH
--------------------- Main Delay Reason -----------------------% F-Code Explanation 94 82 77 3 3 3 Dataset recall from auxiliary storage. Dataset recall from auxiliary storage. Dataset recall from auxiliary storage.
Figure 113. HSM Report
Here are some general problems to look for: v HSM is down v HSM is partially down (HSM device not responding) v HSM backed up because: Not enough primary space Not enough level one space HSM doing housekeeping Display the HSM address space with the DELAYJ command. Is the HSM data normal? Ask yourself why, when HSM data is unusual. You can compare this data to times when HSM delays were considered acceptable. Typical things to look for are: v PROC TCB + SRB at 0.0% (HSM may be dead) v PROC DELAY very high (HSM may have a poor dispatching queue position) v DEVICE DELAY or DEVICE USING excessive (see DEVR report and determine why) v A non-typical volume is in heavy use (modules being fetched?) v Excessive ENQ DELAY time because of CATALOG or VTOC contention Do a HLIST PVOL TERMINAL for volumes under HSM or ISMF control. The AGE column under SPACE-MGMT is the number of days a data set on this volume must be inactive before it is eligible for the type of space management indicated under TYPE. Different volumes may have different ages. MIN AGE shows the inactive age of the data set that most recently migrated from the volume. Pick the one of the following that seems to be most probable: v Something is broken (HSM in loop, HSM in infinite wait, control unit down, HSM address space terminated, STOR LOCL is 100%). v HSM address space is running slowly because of contention with others (high PROC DELAY. high STOR LOCL, high DEVICE DELAY, high ENQ DELAY). v HSM is in house cleaning mode (HSM is reorganizing data sets to reclaim space. There will be a high amount of PROC TCB + SRB and DEVICE USING for the amount of work in progress). v There is a shortage of primary or level-one space (for example migration age less than 7 days). v There is excessive traffic (migration age greater than 7 but many recall requests outstanding).
204
Other delays
JES delays
Here are some general problems to look for: v JES is down v JES is partially down (JES device not responding) v JES backed up Check the SYSLOG for more information. Look for excessive amount of I/O in the JES address space.
OPER delays
In RMF terms, the operator is another resource hence can appear as a delay. The Monitor III Delay report has a field of OPER%. This means the job is delayed by a mount request or is waiting for a reply to a WTOR message. Where practicable, batch jobs or applications should not issue WTORs.
Unknown delays
The Monitor III Delay report has a field of UKN%. RMF considers jobs that are not delayed for a monitored resource, or not in an idling state to be in an unknown state. Examples of unknown state delays are: v AS waiting for I/O other than DASD or tape v Idle address spaces which use an unmonitored mechanism for determining when they are active. Most STCs show as unknown when they are idle. v AS waiting for a request from another AS to be satisfied.
205
Other delays
206
Accessibility
Accessibility features help a user who has a physical disability, such as restricted mobility or limited vision, to use software products successfully. The major accessibility features in z/OS enable users to: v Use assistive technologies such as screen readers and screen magnifier software v Operate specific or equivalent features using only the keyboard v Customize display attributes such as color, contrast, and font size
Using assistive technologies

Assistive technology products, such as screen readers, function with the user interfaces found in z/OS. Consult the assistive technology documentation for specific information when using such products to access z/OS interfaces.
Keyboard navigation of the user interface

Users can access z/OS user interfaces using TSO/E or ISPF. Refer to z/OS TSO/E Primer, z/OS TSO/E Users Guide, and z/OS ISPF Users Guide Vol I for information about accessing TSO/E and ISPF interfaces. These guides describe how to use TSO/E and ISPF, including the use of keyboard shortcuts or function keys (PF keys). Each guide includes the default settings for the PF keys and explains how to modify their functions.
z/OS information
z/OS information is accessible using screen readers with the BookServer/Library Server versions of z/OS books in the Internet library at:
http://www.ibm.com/systems/z/os/zos/bkserv/
207
208
Glossary A
AS. address space auxiliary storage (AUX). All addressable storage, other than main storage, that can be accessed by means of an I/O channel; for example storage on direct access devices. CICS. Customer Information Control System contention. Two or more incompatible requests for the same resource. For example, contention occurs if a user requests a resource and specifies exclusive use, and another user requests the same resource, but specifies shared use. CP. Central processor. criteria. Performance criteria set in the WFEX report options. You can set criteria for all report classes (PROC, SYSTEM, TSO, and so on). CRR. Cache RMF Reporter CPU speed. Measurement of how much work your CPU can do in a certain amount of time. Customer Information Control System (CICS). An IBM licensed program that enables transactions entered at remote terminals to be processed concurrently by user-written application programs. It includes facilities for building, using, and maintaining data bases. cycle. The time at the end of which one sample is taken. Varies between 50 ms and 9999 ms. See also sample
B
balanced systems. To avoid bottlenecks, the system resources (CP, I/O, storage) need to be balanced. basic mode. A central processor mode that does not use logical partitioning. Contrast with logically partitioned (LPAR) mode. bottleneck. A system resource that is unable to process work at the rate it comes in, thus creating a queue. BLSR. Batch LSR Subsystem. See also LSR
C
cache fast write. A storage control capability in which the data is written directly to cache without using nonvolatile storage. This 3990 Model 3 Storage Control extended function should be used for data of a temporary nature, or data which is readily recreated, such as the sort work files created by DFSORT. Contrast with DASD fast write. cache hit. Finding a record already in the storage (cache) of the DASD control unit (3990). capture ratio. The ratio of reported CPU time to total used CPU time. captured storage. See shared page group. central processor (CP). The part of the computer that contains the sequencing and processing facilities for instruction execution, initial program load, and other machine operations. central processor complex (CPC). A physical collection of hardware that consists of central storage, one or more central processors, timers, and channels. CFWHIT. Cache fast write and read hits channel path. The channel path is the physical interface that connect control units and devices to the CPU.
D
DASD fast write. An extended function of the 3990 Model 3 Storage Control in which data is written concurrently to cache and nonvolatile storage and automatically scheduled for destaging to DASD. Both copies are retained in the storage control until the data is completely written to the DASD, providing data integrity equivalent to writing directly to the DASD. Use of DASD fast write for system-managed data sets is controlled by storage class attributes to improve performance. Contrast with cache fast write. data sample. See sample DCM. See Dynamic Channel Path Management DCME. Dynamic cache management extended delay. The delay of an address space represents a job that needs one or more resources but that must wait because it is contending for the resource(s) with other users in the system. DFWHIT. DASD fast write hit DFWRETRY. DASD fast write retry DIM. Data-In-Memory
209
direct access storage device (DASD). A device in which the access time is effectively independent of the location of the data. DLY. Delay DP. Dispatching priority dynamic channel path management. Dynamic channel path management provides the capability to dynamically assign channels to control units in order to respond to peaks in demand for I/O channel bandwidth. This is possible by allowing you to define pools of so-called floating channels that are not related to a specific control unit. With the help of the Workload Manager, channels can float between control units to best service the work according to their goals and their importance.
H
high-speed buffer (HSB). A cache or a set of logically partitioned blocks that provides significantly faster access to instructions and data than provided by central storage. HS. hiperspace HSB. High-speed buffer
I
IBM System z9 Application Assist Processor (zAAP). A special purpose processor configured for running Java programming on selected zSeries machines. IBM System z9 Integrated Information Processor (zIIP). A special purpose processor designed to help free-up general computing capacity and lower overall total cost of computing for selected data and transaction processing workloads for business intelligence (BI), ERP and CRM, and selected network encryption workloads on the mainframe. IFA. Integrated Facility for Applications IMS. Information Management System Information Management System (IMS). A database/data communication (DB/DC) system that can manage complex databases and networks. Synonymous with IMS/VS. Intelligent Resource Director (IRD). The Intelligent Resource Director (IRD) is a functional enhancement exclusive to IBMs zSeries family that can continuously and automatically reallocate resources throughout the system in response to user requirements and is based on a customers business priorities. ITR. Internal throughput rate, see also ETR.
E
EMIF. ESCON multiple image facility enclave. An enclave is a group of associated dispatchable units. More specifically, an enclave is a group of SRB routines that are to be managed and reported on as an entity. EPDM. Enterprise Performance Data Manager/MVS, new name of the product is Performance Manager for MVS execution velocity. A measure of how fast work should run when ready, without being delayed for processor or storage access. ESCON multiple image facility (EMIF). A facility that allows channels to be shared among PR/SM logical partitions in an ESCON environment. ETR. External throughput rate, see also ITR exception reporting. So that you dont have to watch your monitor all the time to find out if there is a performance problem, you can define performance criteria. RMF will then send a report when the criteria are not being met (an exception has occurred).
L
Latency. The time waiting for a particular record on a track to rotate to the read/write head. This usually averages one half the rotation time. LCU. Logical control unit. Logical control units are also called Control Unit Headers (CUH). For details about LCU/CUH please refer to the applicable System z Input/Output Configuration Program Users Guide for ICP IOCP (SB10-7037). local shared resources. An option for sharing I/O buffers, I/O-related control blocks, and channel programs among VSAM data sets in a resource pool that serves one partition or address space. logically partitioned (LPAR) mode. A central processor mode that is available on the Configuration
G
generalized trace facility (GTF). An optional OS/VS service program that records significant system events, such as supervisor calls and start I/O operations, for the purpose of problem determination. GO mode. In this mode, the screen is updated with the interval you specified in your session options. The terminal cannot be used for anything else when it is in GO mode. See also mode. GTF. generalized trace facility
210
frame when using the PR/SM feature. It allows an operator to allocate processor unit hardware resources among logical partitions. Contrast with basic mode. logical partition (LP). A subset of the processor hardware that is defined to support an operating system. See also logically partitioned (LPAR) mode. LP. Logical partition. LPAR. Logically partitioned (mode). LPAR cluster. An LPAR cluster is the subset of the systems that are running as LPARs on the same CEC. Based on business goals, WLM can direct PR/SM to enable or disable CP capacity for an LPAR, without human intervention. LSPR methodology. Recommended methodology for assessing processor power. LSR. Local shared resources LUE. Low utilization effect
P
partitioned data set (PDS). A data set in direct access storage that is divided into partitions, called members, each of which can contain a program, part of a program, or data. Synonymous with program library. PDS. partitioned data set peak-to-average ratio. The ratio between the highest CPU utilization and the average utilization (peak hour busy/prime shift average busy). PER. Program event recording performance management. (1) The activity which monitors and allocates data processing resources to applications according to goals defined in a service level agreement or other objectives. (2) The discipline that encompasses collection of performance data and tuning of resources. PR/SM. Processor Resource/Systems Manager. Processor Resource/Systems Manager (PR/SM). The feature that allows the processor to run several operating systems environments simultaneously and provides logical partitioning capability. See also LPAR. program event recording (PER). A hardware feature used to assist in debugging programs by detecting and recording program events.
M
migration rate. The rate (pages/second) of pages being moved from expanded storage through central storage to auxiliary storage. mintime. The smallest unit of sampling in Monitor III. Specifies a time interval during which the system is sampled. The data gatherer combines all samples gathered into a set of samples. The set of samples can be summarized and reported by the reporter. mode. Monitor III can run in various modes: The GO mode displays a chosen report, and updates it according to the interval you have chosen, the STOP mode, which is the default mode. The GRAPHIC mode presents the reports in graphic format using the GDDM* product. You can also use TABULAR mode by setting GRAPHIC OFF. MPL. Multiprogramming level MTW. Mean time to wait
R
range. The time interval you choose for your report. RMA. The 3990 RPS miss avoidance feature. RHIT. Read hit
S
sample. Once in every cycle, the number of jobs waiting for a resource, and what job is using the resource at that moment, are gathered for all resources of a system by Monitor III. These numbers constitute one sample. SCP. System control program seek. The DASD arm movement to a cylinder. A seek can range from the minimum to the maximum seek time of a device. In addition, some I/O operations involve multiple imbedded seeks where the total seek time can be more than the maximum device seek time. service level agreement (SLA). A written agreement of the information systems (I/S) service to be provided to the users of a computing installation.
N
NVS. Nonvolatile storage nonvolatile storage (NVS). Additional random access electronic storage with a backup battery power source, available with a 3990 Model 3/6 Storage Control, used to retain data during a power failure. Nonvolatile storage, accessible from all storage directors, stores data during DASD fast write and dual copy operations.
Glossary
211
Service Level Reporter (SLR). An IBM licensed program that provides the user with a coordinated set of tools and techniques and consistent information to help manage the data processing installation. For example, SLR extracts information from SMF, IMS, and CICS logs, formats selected information into tabular or graphic reports, and gives assistance in maintaining database tables. service rate. In the system resources manager, a measure of the rate at which system resources (services) are provided to individual jobs. It is used by the installation to specify performance objectives, and used by the workload manager to track the progress of individual jobs. Service is a linear combination of processing unit, I/O, and main storage measures that can be adjusted by the installation. shared page groups. An address space can decide to share its storage with other address spaces using a function of RSM. As soon as other address spaces use these storage areas, they can no longer be tied to only one address space. These storage areas then reside as shared page groups in the system. The pages of shared page groups can reside in central, expanded, or auxiliary storage. SLA. service level agreement SLIP. serviceability level indication processing SLR. Service Level Reporter SMF. System management facility speed. See workflow SRB. Service request block SRM. System resource manager SSCH. Start subchannel sysplex. A complex consisting of a number of coupled z/OS systems. system control program (SCP). Programming that is fundamental to the operation of the system. SCPs include z/OS, z/VM and z/VSE operating systems and any other programming that is used to operate and maintain the system. Synonymous with operating system.
THRU. These are DASD writes to devices behind 3990s that are not enabled for DFW. TP. Teleprocessing TPNS. Teleprocessing network simulator TSO. Time Sharing Option, see Time Sharing Option/Extensions Time Sharing Option Extensions (TSO/E). In z/OS, a time-sharing system accessed from a terminal that allows user access to z/OS system services and interactive facilities.
U
UIC. Unreferenced interval count uncaptured time. CPU time not allocated to a specific address space. using. Jobs getting service from hardware resources (PROC or DEV) are using these resources.
V
velocity. A measure of how fast work should run when ready, without being delayed for processor or storage access. See also execution velocity. VLF. Virtual Lookaside Facility VTOC. Volume table of contents
W
WLM. Workload Manager workflow. (1) The workflow of an address space represents how a job uses system resources and the speed at which the jobs moves through the system in relation to the maximum average speed at which the job could move through the system. (2) The workflow of resources indicates how efficiently users are being served. workload. A logical group of work to be tracked, managed, and reported as a unit. Also, a logical group of service classes. WSM. Working Set Manager
T
TCB. Task control block threshold. The exception criteria defined on the report options screen. throughput. A measure of the amount of work performed by a computer system over a period of time, for example, number of jobs per day.
Z
zAAP. see IBM System z9 Application Assist Processor. zIIP. see IBM System z9 Integrated Information Processor.
212
Notices
This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the users responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing IBM Corporation North Castle Drive Armonk, NY 10504-1785 U.S.A. Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact: IBM Corporation Mail Station P300 2455 South Road Poughkeepsie New York 12601-5400 U.S.A. Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee. The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us. For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to: IBM World Trade Asia Corporation Licensing 2-31 Roppongi 3-chome, Minato-ku Tokyo 106, Japan The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION AS IS
213
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. If you are viewing this information softcopy, the photographs and color illustrations may not appear.
Policy for unsupported hardware

Various z/OS elements, such as DFSMS, HCD, JES2, JES3, and MVS, contain code that supports specific hardware servers or devices. In some cases, this device-related element support remains in the product even after the hardware devices pass their announced End of Service date. z/OS may continue to service element code; however, it will not provide service related to unsupported hardware devices. Software problems related to these devices will not be accepted for service, and current service activity will cease if a problem is determined to be associated with out-of-support devices. In such cases, fixes will not be issued.
Programming Interface Information

This book is intended to help you work with the reports RMF provides to understand your systems performance and to plan for appropriate tuning activities. The book documents information that is Diagnosis, Modification or Tuning Information provided by RMF. Warning: Do not use this Diagnosis, Modification or Tuning Information as a programming interface.
Trademarks
The following terms are trademarks of the IBM Corporation in the United States and/or other countries:
IBM The IBM logo ibm.com CICS CICS/ESA DB2 DFSMS DFSMS/MVS DFSORT Enterprise Storage Server ESCON eServer FICON Parallel Sysplex PR/SM Processor Resource/Systems Manager RACF RAMAC RMF Seascape System z System z9 System z10 S/390 S/390 Parallel Enterprise Server Versatile Storage Server
214
GDDM Hiperbatch IMS MVS NetView OS/390
VTAM z9 z10 z10 EC z/OS zSeries
IBM, the IBM logo, ibm.com and DB2 are registered trademarks of International Business Machines Corporation in the United States, other countries, or both. Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States and/or other countries. Other company, product, and service names may be trademarks or service marks of others.
Notices
215
216
Index Numerics
64-bit common virtual storage measurements xi CICS 55 workload activity 146 collection, long term data 5 collection, short term data 5 components of response time 7 computing CPU time per workload type 14 constraint, single-CP 55 contention device, report 120 for resources, Monitor III report 35 continuous monitoring of system status 5 control unit, logical 48 coupling facility activity report 157 CPU Activity report, Monitor I 45 decreasing delay 73 idle due to bottleneck 72 indication of bottleneck 72 indicators of problem 62 measuring utilization 13 Monitor I indicators of problem 68 problem with 61 problem, delay for 73 speed, calculating 20 time per transaction 15 time per workload type, computing 14 time, calculating 20 CPU Activity report, Monitor I 69 CPU Activity report, partition data, Monitor I 71, 79 CR 55 CS 132 cycle, sampling 6 delay (continued) unknown 205 using Monitor III to find 35 delay due to paging 139 delay for CPU, problem 73 Delay report, Monitor III 64 demand, latent 3 DEV 41 device contention, report 120 DEVICE UTILIZATION, %, rule-of-thumb 114 DFWHIT 89 DFWRETRY 90 diagnosing a problem 23 DIM 78 disability 207 dispatching priority 74 distributing uncaptured time 15 DLY %, rule-of-thumb 137 DP 74 Dynamic Cache Management Extended 101 Dynamic Channel Path Management 186
A
access to processor storage, prioritize 144 accessibility 207 analyzing workload characteristics 9 APPL% CP, rule-of-thumb 55 AUX 132 auxiliary storage 132 auxiliary storage space 141 auxiliary storage tuning 144 AVG CONN TIME, rule-of-thumb 113 AVG DISC TIME, rule-of-thumb 112 AVG HIGH UIC, rule-of-thumb 141 AVG IOSQ TIME, rule-of-thumb 110 AVG PEND TIME, rule-of-thumb 111 AVG RESP TIME, rule-of-thumb 109, 143 AVG SLOTS ALLOCATED, rule-of-thumb 141 AVG SLOTS USED, rule-of-thumb 141
E
ENQ 72 enqueue delay 202 Enterprise Storage Server 85 ESS (Enterprise Storage Server) 85 exception reporting 40 guideline values 40 speed 40 Workflow/Exceptions report, Monitor III 66
B
balanced systems 3 batch 37 blocked workload promotion BUSY TIME PERCENTAGE, rule-of-thumb 45, 69 BYPASS mode 90 xii
C
cache hit ratio, rule-of-thumb 101 Cache Subsystem Activity report 97 CACHE, what to measure 89 capacity groups xiii capacity planning approaches for 4 benefits 3 common mistakes 3 definition 3 input to 55 questions to ask 3 resources 3 capture ratio computing 55 definition 13 central storage 132 CF (coupling facility activity) 157 CFWHIT 89 Channel Path Activity report, Monitor I 48 Channel Subsystem Priority Queuing 189 Copyright IBM Corp. 1993, 2009
D
daily monitoring, system indicators 41 DASD guidelines, general 107 response time components 104 subsystem health check 84 system resource indicator 24 what to measure 89 DASD Activity report, Monitor I 143 DASD guidelines 107 data collection, long term 5 data collection, short term 5 DCM 186 DCME 101 delay causes of 8 enqueue 202 finding major 35 HSM 204 JES 205 OPER 205 report, Monitor III 38 tuning approach 26
F
FICON performance monitoring xii
G
GO mode 35 group capacity limit xiii Group Response Time report, Monitor III 37 GTF 79
H
HiperDispatch xi HiperDispatch mode introduction 197 processing benefits 199 support in RMF 198 hit ratio, low, cache 101 HSM delay 204 HyperPAV support xiii
217
I
I/O I/O Queuing Activity report, Monitor I 49 indicators for contention 82 rate of workload 16 resolving specific problems 118 IBM System z9 Application Assist Processor 210 IBM System z9 Integrated Information Processor xiii, 210 IFA xiv IMS workload activity 146 indicator sysplex monitoring, Monitor III 28 indicators of problems CPU problem 62 for storage problems 37 of overcommitted tape buffers 50 processor storage, Monitor I 132 processor storage, Monitor III 132, 133 speed 40 swap 53 workflow 40 INHIBIT mode 90 Integrated Facility for Applications xiv Intelligent Resource Director Channel Subsystem Priority Queuing 189 Dynamic Channel Path Management 185 LPAR CPU Management 190 Overview 185 interval 35 reporting 6 IRD (Intelligent Resource Director) 185 ITR 21
LPAR (continued) CPU constraints 47 Monitor I Channel Path Activity report 48 LPAR Cluster 190 LPAR CPU Management 190
M
measuring CPU utilization 13 online 5 resource consumption 11 message retrieval tool, LookAt viii MFLOPS 21 MIPS 20 Monitor III processor storage indicators 133 Monitor I reports Channel Path Activity report 48 CPU Activity report 45, 69 CPU Activity report, partition data 47, 71, 79 DASD Activity report 50, 108, 143 I/O Queuing Activity report 49 Magnetic Tape Device Activity report 127 Page/Swap Data Set Activity report 54, 141 Paging Activity report 53, 138, 141 Monitor II reports Address Space State Data report 74 Monitor III reports Delay report 38, 64, 119 Device Delays report 120 Device Resource Delays report 121 Group Response Time report 37, 63 Job Delays report 39, 73, 77, 135 Processor Delays report 66, 76 Storage Delays report 38, 137 System Information report 118, 134 Workflow/Exceptions report 40, 66, 78 monitoring of system status, continuous 5
J
JES delay 205 Job Delays report, Monitor III 39, 135
K
keyboard 207
O
online measurements 5 OPER delay 205 OUT READY, rule-of-thumb 45 out-ready delay, indicator 134
Paging Activity report, Monitor I 53, 138 paging delay 134, 139 parallel access volumes 87 partition data report section, Monitor I CPU Activity report 47 partition, logical 47 PAV 87 peak-to-average ratio 3, 45 PER 78 performance management approaches for 4 definition 2 performance problem auxiliary storage tuning 144 CPU 61 definition 24 delay for CPU 73 diagnosing 23 finding major delay 35 I/O, resolving specific problems 118 indicators of CPU problem 62 investigating 25 Monitor I indicators of CPU problem 68 processor storage 131 resolving 25 PR/SM LPAR considerations 179 prioritize access to processor storage 144 problem indicators CPU problem 62 for storage problems 37 of overcommitted tape buffers 50 processor storage, Monitor I 132 processor storage, Monitor III 132, 133 speed 40 swap 53 workflow 40 PROC 41 Processor Delays report, Monitor III 66 processor storage indicators of 131 Monitor I 132, 138 Monitor III 132, 133 prioritizing access to 144 problem 131 use by workload 18 Program Event Recording 78 promoting blocked workload xii
L
latent demand 3 LCU 48 LCU AVG RESP TIME, rule-of-thumb LOG SWAP/EXP STOR EFFECTIVE, rule-of-thumb 53 logical control unit 48 logical partition 47 long term data collection 5 LookAt message retrieval tool viii loops 76 low hit ratio, cache 101 LP 47 LPAR 47 constraints, what to do 75 50
R
range 35 rate, SSCH 17 ratio capture 13 capture, computing 55 ratio, low hit, cache 101 ratio, peak-to-average 3 recommendations 144 reconfigurable storage 138 Relative Processor Power 21 report on device contention 120 reporting interval 6 reports, using Monitor III 36 resolving specific I/O problems 118 resource consumption, measuring 11
P
PAGE MOVEMENT WITHIN CENTRAL STORAGE, rule-of-thumb 138 PAGE-IN EVENTS, rule-of-thumb 139 PAGE-IN RATES, rule-of-thumb 55 page/swap 143 Page/Swap Data Set Activity report, Monitor I 54, 141 PAGES XFERD, rule-of-thumb 54 paging 55 paging activity 19 Paging Activity report 141
218
response time batch 37 components of 7 computing 7 end-to-end 35 external response time 35 finding major delay 35 internal response time 35 Monitor I Workload Activity report 7 Monitor III Group Response Time report 7 TSO 37 where to start 35 Response Time Distribution report description 26 RHIT 89 RMF monitors 5 tasks 5 RPP 21 RSU 138 rule-of-thumb % Delayed for STR 135 % IN USE 54, 142 APPL% CP 55 AVG CONN TIME 113 AVG DISC TIME 112 AVG HIGH UIC 141 AVG IOSQ TIME 110 AVG PEND TIME 111 AVG RESP TIME 109, 143 AVG SLOTS ALLOCATED 141 AVG SLOTS USED 141 BUSY TIME PERCENTAGE 45 BUSY TIME PERCENTAGE, CPU 69 cache hit ratio 101 DASD guidelines 107 DEVICE UTILIZATION, % 114 DLY % 137 LCU AVG RESP TIME 50 LOG SWAP/EXP STOR EFFECTIVE 53 OUT READY 45 PAGE MOVEMENT WITHIN CENTRAL STORAGE 138 PAGE-IN EVENTS 139 PAGE-IN RATES 55 PAGES XFERD 54 STOR 134
STAGE 89 STOR 41 STOR, rule-of-thumb 134 storage See also storage problem auxiliary 132 central 132 delay report, Monitor III 137 reconfigurable 138 space, auxiliary 141 tuning, auxiliary 144 Storage Delays report, Monitor III 38 storage problem indicators for 37 processor 131 storage use by workload, processor 18 swapping delay, indicator 134 swaps, TSO 53 Sysplex Summary report description 26 system indicators 26 system indicators, daily monitoring 41 System Information report, Monitor III 134 system status, continuous monitoring of 5 systems, balanced 3
workflow 40 Workflow/Exceptions report, Monitor III 40, 66 workload analyzing 9 computing CPU time per type 14 grouping of 9 I/O rate of 16 measuring CPU utilization 13 measuring resource utilization 10 paging rate of 19 processor storage use by 18 types of 9 Workload Activity report 27, 55 Workload License Charges 195 workload type, compute CPU time per 14
X
XES structure activity 158
Z
z9 EC xiv zAAP xiv, 210 zIIP xiii, 210 zSeries Application Assist Processor xiv
T
TCB 79 throughput external throughput rate 21 internal throughput rate 21 THRU 90 time per transaction, CPU 15 time, distributing uncaptured 15 tracing 79 transaction computing CPU time 15 definition 2 response time, components of 7 TSO 37 TSO swaps 53 tuning 24 top-down approach 25 where to begin 5 types of workload 9
S
sampling cycle 6 service level agreement contents of 2 definition 2 Service Level Agreement 2 short term data collection 5 shortcut keys 207 single-CP constraint 55 SLA 2, 41 SLIP 78 slip trap 78 space, auxiliary storage 141 speed 40 SRB 55 SSCH rate 17
U
uncaptured time, distributing 15 unknown delay 205 use by workload, processor storage using Monitor III reports 36 utilization, measuring CPU 13 18
V
VIO, delay, indicator 134
W
WLC 195 WLM blocked workload promotion xii Index
219
220
Readers Comments Wed Like to Hear from You

z/OS Resource Measurement Facility Performance Management Guide Version 1 Release 11 Publication No. SC33-7992-09 We appreciate your comments about this publication. Please comment on specific errors or omissions, accuracy, organization, subject matter, or completeness of this book. The comments you send should pertain to only the information in this manual or product and the way in which the information is presented. For technical questions and information about products and prices, please contact your IBM branch office, your IBM business partner, or your authorized remarketer. When you send comments to IBM, you grant IBM a nonexclusive right to use or distribute your comments in any way it believes appropriate without incurring any obligation to you. IBM or any other organizations will only use the personal information that you supply to contact you about the issues that you state on this form. Comments:
Thank you for your support. Submit your comments using one of these channels: v Send your comments to the address on the reverse side of this form. v Send a fax to the following number: FAX (Germany): 07031+16-3456 FAX (Other Countries): (+49)+7031-16-3456 v Send your comments via e-mail to: s390id@de.ibm.com v Send a note from the web page: WORLD WIDE WEB: http://www.ibm.com/servers/eserver/zseries/zos/rmf If you would like a response from IBM, please fill in the following information:
Name Company or Organization Phone No.
Address
E-mail address
___________________________________________________________________________________________________
Readers Comments Wed Like to Hear from You

SC33-7992-09
Cut or Fold Along Line
Tape do not Fold and _ _ _ _ _ _ _Fold _ _ _and ___ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _Please _____ __ _ _ staple _____________________________ ___ _ _ Tape ______ NO POSTAGE NECESSARY IF MAILED IN THE UNITED STATES
BUSINESS REPLY MAIL

FIRST-CLASS MAIL PERMIT NO. 40 ARMONK, NEW YORK POSTAGE WILL BE PAID BY ADDRESSEE
IBM Deutschland Research & Development GmbH Department 3248 Schoenaicher Strasse 220 D-71032 Boeblingen Federal Republic of Germany
_________________________________________________________________________________________ Fold and Tape Please do not staple Fold and Tape
SC33-7992-09
Cut or Fold Along Line
Program Number: 5694A01
Printed in USA
SC33-7992-09
Spine information:
z/OS
Version 1 Release 11
SC33-7992-09

RMF

Uploaded by

Copyright:

Available Formats

RMF

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

RMF

Uploaded by

Copyright:

Available Formats

z/OS

Resource Measurement Facility Performance Management Guide

Resource Measurement Facility Performance Management Guide

Chapter 3. Analyzing processor activity 61

Chapter 1. Performance overview . . . . 1

Chapter 4. Analyzing I/O activity . . . . 81

Chapter 2. Diagnosing a problem: The first steps . . . . . . . . . . . . . 23

. 85 . 85 . 85 . 89 . 89 . 89 . 90 . 92 . 102 . 104 . 104 . 105

Defining logical processors .

Appendix B. The Intelligent Resource Director . . . . . . . . . . . . . . 185

Chapter 5. Analyzing processor storage activity . . . . . . . . . . 131

Appendix C. Planning considerations for HiperDispatch mode . . . . . . . 197

Chapter 6. Analyzing sysplex activity

Appendix D. Other delays . . . . . . 201

Glossary . . . . . . . . . . . . . 209 Notices . . . . . . . . . . . . . . 213

Appendix A. PR/SM LPAR considerations . . . . . . . . . . . 179

z/OS V1R11.0 RMF Performance Management Guide

Copyright IBM Corp. 1993, 2009

. 155 . 158 . 163 . 169 . . . . 171 172 173 174

z/OS V1R11.0 RMF Performance Management Guide

About this document

Who should use this document

When to use this document

How this document is organized

Copyright IBM Corp. 1993, 2009

The z/OS RMF library

Using LookAt to look up message explanations

z/OS V1R11.0 RMF Performance Management Guide

About this document

z/OS V1R11.0 RMF Performance Management Guide

Whats new in z/OS V1.11 New real storage measurements

Measuring WLMs promotion for workloads holding locks

Enhanced group capacity reporting

History of changes Whats new in z/OS V1R10

Support of HiperDispatch mode

64-Bit Common Storage measurements

New reports on spin locks and suspend locks

Support of WLM enqueue management

Providing DASD volume capacity

Whats new in z/OS V1R9

Measurements about WLMs blocked workload promotion

Enhanced monitoring of FICON performance

z/OS V1R11.0 RMF Performance Management Guide

New Coupling Facility measurements

Whats new in z/OS V1R8

Support of IBM System z9 Integrated Information Processors (zIIP)

Whats new in z/OS V1R7

Support for IBM System z9 Enterprise Class processors

Queue length distribution in the CPU Activity report

Whats new in z/OS V1R6

Whats new in SC33-7992-04

Whats new in SC33-7992-03

z/OS V1R11.0 RMF Performance Management Guide

Whats new in z/OS V1R4

State samples breakdown in the WLMGL report

Whats new in z/OS V1R2

Reporting of report class periods

Enhanced reporting for the coupling facilities

Whats new in z/OS V1R1

Intelligent Resource Director (IRD)