For the DB2 for z/OS professional the two most common systems tuning scenarios are tuning a DB2 data sharing group or tuning a series of application SQL statements. The data sharing group environment can involve multiple hardware installations and many other cross-system features and functions such as coupling facilities and management policies. Resource constraint analysis is a useful tool in both situations.
Resource constraint analysis involves monitoring, analyzing, and managing resources. It can be done at many levels, from resource management across the IT enterprise down to that of a single DB2 subsystem or instance or even to a specific SQL statement. It can be done on many levels simultaneously.
This method of resource management can be done independent of hardware platform or DB2 flavor (z/OS or Linux / Unix / Windows).
This article will concentrate on tuning a single DB2 for z/OS subsystem and its applications. This is the ultimate goal; however, to get there we must both understand overall system resource usage and address any current resource constraints. Only then can we take a system-wide view in an efficient manner.
There are many ways to get an introduction to the basics of systems tuning including IBM and other vendor training, conference presentations, IBM manuals, RedBooks, and so forth. Along with this knowledge, you will need some common tools and techniques. Here are the initial tasks you must perform.
Step 1: Select Resources and Monitoring Tools
Begin by reviewing your monitoring and performance reporting tools. Most tools have ways of measuring resources. Some typical resources and measurements you want to consider are the following.
CPU cycles. This includes CPU usage by the DB2 address spaces and CPU usage by thread and connection type. On an application level, these times are displayed in the Accounting Short Report of the IBM DB2 Performance Expert for z/OS (TM) as "Class 2 CPU Time".
One common metric is CPU Time per DML Statement. This can be calculated by dividing the Class 2 CPU Time by the number of DML statements issued by the application. Further refinements may be made by dividing the CPU time by number of inquiry statements (Select, Fetch), or assigning weights to DML statement types.
Central storage (memory). This includes memory usage, especially in the DBM1 address space.
Disk storage (I/O). This includes time for I/Os, device-specific reporting, random miss ratio, system miss ratio, time per sync I/O, and time per GetPage. One common metric is I/O Wait Time per DML Statement. Much of the I/O wait time appears in one or more entries on the Accounting Long Report of the IBM DB2 Performance Expert for z/OS (TM) under Class 3 Suspensions, with the most relevant being Synchronous I/O Wait. Another common metric is Asynchronous Prefetches per DML Statement, calculated from the Total Prefetch count on the Accounting Short Report.
Object access (locking, thread wait, enqueues). This includes lock/latch waits and in-DB2 and not-in-DB2 wait times. Many of these wait times are reported on the Accounting Long Report under Class 3 Suspensions.
Application throughput. This includes SQL statements per unit of time and application DB2 Class 2 elapsed time. Some of these measurements can be derived from the Accounting Reports, while others are usually developed via user-written algorithms that analyze job output messages or application-created performance measures.
The IBM Resource Measurement Facility(TM) is an excellent tool for measuring overall DB2 subsystem CPU, memory, and DASD resources, especially when used with the RMF Spreadsheet Reporter. See the IBM web site reference at the end of this article for more information.
Step 2: Choose Key Areas of Inquiry
Next, consider the various areas where you will concentrate your initial attention. Base these choices on issues that are either urgent or important.
Urgent issues are those you must react to immediately or in the near term; for example, your nightly batch cycle is running too long and interfering with morning on-line transactions.
Important issues are more long-range and have the potential to negatively affect budgets, service levels, or customer satisfaction. For example, you may be concerned that quickly rising CPU usage will force your department to add processor capacity sooner than anticipated.
Step 3: Choose Levels for Initial Measurements
Systems tuning seems to imply an emphasis on only system-level objects like the DBM1 address space. I recommend that your first measurements and tuning concentrate on critical objects at multiple levels. This will give you experience in creating metrics, implementing period monitoring, and analyzing results. It also has the advantage of allowing you to find "resource hogs" that can be addressed to get immediate improvements.
Critical Tables and Indexes. Based on your own criteria determine your most critical table and index objects. These may be your largest, the most volatile, the most heavily-referenced, or those accessed by one or more business-critical applications. They may also be the ones that may benefit most from the table-based partitioning introduced in DB2 for z/OS V8, where you can partition based on one key and cluster data based upon another.
With the critical objects identified, use the DB2 catalog (or other means) to determine what application SQL statements access these objects. Concentrate some specific monitoring on these statements.
Critical Applications. On another level, determine your most critical applications and set up performance monitoring for them. These may be business-critical, such as order entry or on-line customer inquiry applications, or mission-critical such as an enterprise resource planning (ERP) suite.
Critical Subsystem Components. Your choice of subsystem components to measure and tune will depend in large part upon your environment. As companies and IT enterprises vary greatly across industries there is no one set of components that can be deemed most important. I recommend reviewing the DB2 performance Redbooks to determine where to concentrate your attention.
Typical examples of subsystem components to monitor are the virtual pool sizes and thresholds (including the global buffer pools in a data sharing environment), DBM1 address space memory usage (if you have virtual storage constraint issues), DBM1 address space I/O performance, and CPU usage attributed to enclaves initiated by the distributed data facility (DDF).
Critical Environments. The next level up involves environments. In the z/OS environment, this usually means DB2 subsystems or data sharing groups. Begin with your most critical environment (again based on your own criteria). It may be an on-line production system, a data warehouse, or even your nightly batch cycle.
Each type of environment has its own resource performance profile. For example, a data warehouse will usually involve a heavy I/O load due to queries during its interrogation period (the on-line day), but heavy I/O load due to extract-transform-load processes during its update period. For another example, consider a long-running batch cycle. In this case, the most common resource constraint is time. Hence, performance management usually involves trading CPU and I/O resources in order to shorten job elapsed times.
Step 4: Implement Period Monitoring and Reporting
After you have familiarized yourself with your resource measurement toolset and chosen areas to concentrate your efforts you then construct and implement regular resource period monitoring. This is monitoring and gathering data about specific categories of objects over time. The purpose is to allow you to analyze objects in the context of their time dependence (if any) to resource constraints.
Perform Resource Constraint Analysis
Now, based on your measurements, follow this procedure.
Use the period monitoring reports to identify resource constraints. For example, you determine that CPU usage during the on-line day is approaching 100% in a critical production region.
Use the reports to identify resources that are available or underutilized. At the same time as CPU usage is high, you determine that GetPage rates are relatively low.
Analyze resource usage looking for balancing opportunities. You will be looking to reduce or eliminate a resource constraint by re-directing the application to use a different resource. For example, you determine that you have a CPU constraint during your nightly batch window. Upon further analysis, you note that many applications access a large partitioned tablespace using CPU parallelism. You can reduce the CPU constraint by inhibiting parallelism for those applications; however, the applications may run longer, thus reducing application throughput.
As another example, you measure I/O service times for a critical table and find them to be excessive. You can reduce this I/O bottleneck by allocating a virtual pool exclusively for this object. The trade-off is faster I/O (as successive references to the table will most likely find it in memory) but greater use of central storage.
So, by identifying available resources (application throughput and central storage in the above examples), you can develop strategies for reducing resource bottlenecks. Some possibilities might be:
* Analyze critical tables and indexes; consider them to be constraints and develop ways of using excess CPU, DASD, or other resources to reduce contention
* Analyze critical applications; consider them to be constraints and develop methods of using excess resources to increase application throughput
* Analyze the DB2 subsystem; determine the resource bottlenecks and develop techniques for using excess resources to relieve the constraints