UPDATE 2009.02.17: I received a response from IBM which I have tested and which appears to resolve the problem (described below). It involves the DB2 registry variable DB2ENVLIST and an environment variable, DB2_HMON_DFT_STACK_SZ, which appears to be undocumented. Please scroll to the bottom of the article for the update.

Description of the problem: I have several DB2 version 9 Fixpack 4 instances installed on Solaris 10, and have several serious, unresolved Health Monitor-related problems. Many DB2 customers appear to have had similar problems since Health Monitor was introduced. My recommendation as of version 9 Fixpack 4 is that Health Monitor should be left deactivated, and that IBM needs to put much more effort into understanding why HMON is as buggy as it is.

My own experiences with Health Monitor occurred January – March 2008 and were resolved, as for so many customers, by deactivating Health Monitor. Many reports of possibly related issues go back to version 8.1 and 2004. One thing I find suspect is that in many of the Health Monitor APARs (listed below), IBM shows the APAR as resolved simply by advising customers to deactivate Health Monitor. They didn’t fix the problem at all. Yet general IBM documentation advises the use of the Health Monitor. I believe IBM docs should mention some of the risks involved in activating Health Monitor.

I have provided full system configuration, db2support and db2trc output to IBM, and will be happy to provide those details and PMR numbers to any IBM engineers who are interested in tracking the progress of resolving these issues.

Note About HMON Instance And Database Capture

DB2 documentation in most places states that deactivating Health Monitor only entails setting the DBM CFG parameter HEALTH_MON to OFF. However, that only deactivates the database-level health capture associated with the db2hmon process. Health monitoring involves more than just db2hmon. Health capture will continue to restart every two hours via the process db2acd, which captures instance-level health information and performs tasks related to automated maintenance. The only way to deactivate db2acd at present is to eliminate fenced memory through this procedure. This is not generally desirable because it interferes with use of the db2expln utility and with functions that require fenced memory.
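As a minimal sketch of the database-level step described above, the following wraps the standard CLP commands for switching HEALTH_MON off and then checks whether db2acd is still present. The function names are illustrative only, and the sketch assumes the instance's db2profile has already been sourced so that `db2` is on the PATH.

```shell
# Sketch only: function names are hypothetical, not part of DB2.
# Assumes the instance environment (db2profile) is already sourced.

# Turn off database-level health capture and show the resulting setting.
disable_health_mon() {
    db2 update dbm cfg using HEALTH_MON OFF || return 1
    db2 get dbm cfg | grep -i 'HEALTH_MON'
}

# Return 0 if a db2acd process is still present.
# The [d] trick stops grep from matching its own command line.
db2acd_running() {
    ps -ef | grep '[d]b2acd' >/dev/null
}

# Usage (run as the instance owner):
#   disable_health_mon
#   db2acd_running && echo "db2acd is still running"
```

Seeing db2acd still running after HEALTH_MON is OFF would be consistent with the behavior described above, since the parameter only governs db2hmon.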

Not all the problems I detail here are necessarily related to db2hmon; some may be related to db2acd or other functionality.

To eliminate one factor: we received early advice from IBM to disable Self-Tuning Memory Manager (STMM), and we did so. This did not resolve the Health Monitor issues.

Problem Summary

I have two instances on two separate Solaris Zones and separate Zone Clusters, both experiencing different Health Monitor issues. They are:

Hang-Wait, Appearance of Heavy Lock Contention

When Health Monitor was active (DBM CFG parameter HEALTH_MON ON), every two hours, when Health Monitor activated, we would see all database processes grind to a halt, which caused an application processing pileup and an eventual application crash. I saw this with an application that averages 3,500 UOW (commits) per minute, with occasional spikes of 38,000 transactions per minute. The SQL execution delays resembled table lock contention, but no significant lock wait was seen at these times. It seems clearly to be a hang-wait. The application was designed to deal with processing delays by spawning more database connections, which it did, worsening the problem and eventually exceeding MAXAPPLS or other resource limits.

My guess is that this site, among all IBM customers, experienced the Health Monitor issue so starkly because the high transaction volume caused immediate and dramatic crashes for the application. Another shop with low transaction volume might not notice the possible bi-hourly hang-waits caused by Health Monitor.

I have had an unresolved PMR about this problem in IBM’s hands for over six months now. Reading other APARs related to Health Monitor, I can make some SWAGs about what might be happening. APAR IY40216 [http://www-01.ibm.com/support/docview.wss?uid=swg1IY40216] notes a possible problem with db2hmon’s termination routine, which might fail to notify the DB2 engine system controller that it has completed its work. It is possible that db2hmon acquires a latch without releasing it properly, or fails to issue a termination semaphore.

Another symptom of these process delays was that every two hours, when Health Monitor ran and caused the application pile-up, DB2 would gain control of about 0.5 GB of memory and swap. So every four hours we would lose about a gigabyte of free memory. Eventually performance would be degraded by running excessively on swap, and, left untended, the instance would crash (actually during this period we did regular restarts of the instance, which freed the memory).
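To put rough numbers on that growth, here is a small sketch of the arithmetic: at about 0.5 GB lost per two-hour Health Monitor cycle, it estimates how long a given amount of free memory would last. The 16 GB starting figure is purely hypothetical and is not taken from our servers.

```shell
# Rate observed above: roughly 0.5 GB (512 MB) lost per 2-hour cycle.
LEAK_MB_PER_CYCLE=512
CYCLE_HOURS=2

# Estimate hours until the given free memory (in MB) is used up.
hours_until_exhausted() {
    free_mb=$1
    echo $(( free_mb * CYCLE_HOURS / LEAK_MB_PER_CYCLE ))
}

# Hypothetical example: 16 GB free at instance start.
hours_until_exhausted 16384   # prints 64
```

In practice performance degrades well before that point, once the system starts running on swap, which is why the regular restarts were needed.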

The bihourly process halts were solved instantly by deactivating Health Monitor.

Health Monitor Crashes

On another DB2 database instance with much lower transaction activity, I do not see the above behavior. Instead, db2hmon simply crashes when HEALTH_MON is ON, with the following errors in db2diag.log:

2008-03-13-00.18.44.059066-300 E876A408           LEVEL: Warning
PID     : 10499                TID  : 1           PROC : db2sysc 0
INSTANCE: db2                  NODE : 000
FUNCTION: DB2 UDB, routine_infrastructure, sqlerReturnFmpToPool, probe:999
DATA #1 : String, 22 bytes
Removing FMP from pool
DATA #2 : Hexdump, 16 bytes
0xFFFFFFFF7FFFE160 : 0000 0000 0000 0000 0000 084F 0000 0000    ...........O....

2008-03-13-00.18.44.076838-300 E1285A456          LEVEL: Error
PID     : 10499                TID  : 1           PROC : db2sysc 0
INSTANCE: db2                  NODE : 000
FUNCTION: DB2 UDB, routine_infrastructure, sqlerRemoveAllIPCforRow, probe:10
DATA #1 : String, 32 bytes
Freeing IPC resource explicitly:
DATA #2 : SQLO_PID, PD_TYPE_SQLO_PID, 4 bytes
2127
DATA #3 : Hexdump, 4 bytes
0x0000000200447FE0 : 0000 0000                                  ....

2008-03-13-00.18.44.077523-300 E1742A345          LEVEL: Error
PID     : 10499                TID  : 1           PROC : db2sysc 0
INSTANCE: db2                  NODE : 000
FUNCTION: DB2 UDB, routine_infrastructure, sqlerRemoveAllIPCforRow, probe:20
DATA #1 : String, 22 bytes
IPC resources Address:
DATA #2 : Pointer, 8 bytes
0x0000000210010080

2008-03-13-00.18.44.078221-300 E2088A1046         LEVEL: Error
PID     : 10499                TID  : 1           PROC : db2sysc 0
INSTANCE: db2                  NODE : 000
FUNCTION: DB2 UDB, routine_infrastructure, sqlerRemoveAllIPCforRow, probe:30
DATA #1 : String, 29 bytes
Number of IPC resource found:
DATA #2 : signed integer, 4 bytes
1
DATA #3 : String, 29 bytes
Number of IPC resource freed:
DATA #4 : signed integer, 4 bytes
1
CALLSTCK:
  [0] 0xFFFFFFFF7C88B804 __1cXsqlerRemoveAllIPCforRow6FpnLsqlerFmpRow_b_i_ + 0x89C
  [1] 0xFFFFFFFF7C881A68 __1cXsqlerRemoveFmpFromTable6FpnLsqlerFmpRow_b_i_ + 0x220
  [2] 0xFFFFFFFF7C884A6C __1cUsqlerReturnFmpToPool6FccpnOsqlerFmpHandle_pnNsqle_agent_cb__i_ + 0x15DC
  [3] 0x000000010000AD9C __1cOsqleRunSysCtlr6F_i_ + 0x3D4
  [4] 0x0000000100006A64 __1cLsqleSysCtlr6F_i_ + 0x384
  [5] 0xFFFFFFFF7AF75CDC __1cHDBGTerm6F_i_ + 0x1F94
  [6] 0xFFFFFFFF7AF76984 sqloRunInstance + 0x5E4
  [7] 0x0000000100006524 main + 0x924
  [8] 0x0000000100004E5C _start + 0x17C
  [9] 0x0000000000000000 ?unknown + 0x0

2008-03-13-00.18.44.091581-300 I3135A433          LEVEL: Severe
PID     : 10499                TID  : 1           PROC : db2sysc 0
INSTANCE: db2                  NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:15
DATA #1 : Hexdump, 31 bytes
0x000000010000DB20 : 4865 616C 7468 204D 6F6E 6974 6F72 2050    Health Monitor P
0x000000010000DB30 : 726F 6365 7373 2063 7261 7368 6564 2E      rocess crashed.

2008-03-13-00.18.44.092207-300 I3569A340          LEVEL: Severe
PID     : 10499                TID  : 1           PROC : db2sysc 0
INSTANCE: db2                  NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:16
DATA #1 : Hexdump, 4 bytes
0xFFFFFFFF7FFFE170 : 0000 084F                                  ...O

2008-03-13-00.18.44.092769-300 I3910A340          LEVEL: Severe
PID     : 10499                TID  : 1           PROC : db2sysc 0
INSTANCE: db2                  NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:17
DATA #1 : Hexdump, 4 bytes
0xFFFFFFFF7FFFE174 : 0000 0101                                  ....

2008-03-13-00.18.44.093349-300 I4251A340          LEVEL: Severe
PID     : 10499                TID  : 1           PROC : db2sysc 0
INSTANCE: db2                  NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:18
DATA #1 : Hexdump, 4 bytes
0xFFFFFFFF7FFFE178 : FFFF FFFF                                  ....

2008-03-13-00.18.44.110896-300 I4592A282          LEVEL: Warning
PID     : 10499                TID  : 1           PROC : db2sysc 0
INSTANCE: db2                  NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqleRunSysCtlr, probe:63
MESSAGE : Health Monitor Process restarted.

We got this error every time we started the instance: it recurred every 15 minutes for a total of two hours after starting DB2. At that point DB2 appears to stop attempting to restart the Health Monitor.
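One way to confirm that restart pattern is to pull the timestamps of the "Health Monitor Process restarted" records out of db2diag.log. The following is a rough sketch: the function name is made up for illustration, the log path in the usage comment is an assumption about a typical instance layout, and it expects the raw log file, where each record starts with a timestamp in the first column.

```shell
# Sketch only: hmon_restart_times is a hypothetical helper name.
# Prints the timestamp of every db2diag.log record whose MESSAGE line
# reports a Health Monitor restart.
hmon_restart_times() {
    awk '
        # Remember the timestamp that opens each record.
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-/ { ts = $1 }
        # Print it when the record turns out to be an HMON restart.
        /Health Monitor Process restarted/ { print ts }
    ' "$1"
}

# Usage (path is an assumption; adjust to your DIAGPATH):
#   hmon_restart_times "$HOME/sqllib/db2dump/db2diag.log"
#   hmon_restart_times "$HOME/sqllib/db2dump/db2diag.log" | wc -l
```

With one attempt every 15 minutes over two hours, one would expect roughly eight such timestamps after each instance start.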

I am posting these issues as a gift to the open community, to invite comments, not necessarily to solicit advice or solutions. Even with full db2support and db2trc output, IBM has not been able to resolve these issues in over six months. I have two open PMRs, and if anyone from IBM laboratories would like more information, please feel free to contact me at consulting AT ebenner DOT com.

Update February 17, 2009 from IBM

Set the registry variable for the instance once.

db2set  DB2ENVLIST=DB2_HMON_DFT_STACK_SZ

And set the environment variable before db2start every time.

export DB2_HMON_DFT_STACK_SZ=655360
db2start
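Because the environment variable has to be set before every db2start, it may be convenient to wrap the two steps together. This is only a sketch: the function name is hypothetical, the value is the one IBM supplied above, and it assumes the one-time db2set of DB2ENVLIST has already been done for the instance.

```shell
# Sketch only: start_db2_with_hmon_stack is a hypothetical wrapper name.
# Assumes "db2set DB2ENVLIST=DB2_HMON_DFT_STACK_SZ" was already run once.
start_db2_with_hmon_stack() {
    # Value taken from IBM's response; the variable appears undocumented.
    DB2_HMON_DFT_STACK_SZ=655360
    export DB2_HMON_DFT_STACK_SZ
    db2start
}

# Usage (as the instance owner):
#   db2set -all | grep DB2ENVLIST    # confirm the registry variable is set
#   start_db2_with_hmon_stack
```

Using the wrapper in place of a bare db2start reduces the chance that someone restarts the instance from a shell where the variable was never exported.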

This does indeed appear to end the HMON crashes, in a test server where we were able to replicate the condition. It is unlikely that we will be allowed to implement the suggested change in Production, since we are not using Health Monitor functionality, and because the risk to the Production system is still an unknown – even if Health Monitor no longer crashes, HMON may impose performance penalties on the high transaction volume application.

IBM Health Monitor-related APARs

Other web reports about similar Health Monitor Issues
