UPDATE 2009.02.17: I received a response from IBM, which I have tested and which appears to resolve the problem described below. It involves the DB2 registry variable DB2ENVLIST and an environment variable, DB2_HMON_DFT_STACK_SZ, which appears to be undocumented. Please scroll to the bottom of the article for the update.
Description of the problem: I have several DB2 version 9 Fixpack 4 instances installed on Solaris 10, with several serious, unresolved Health Monitor-related problems. Many DB2 customers appear to have had similar problems since Health Monitor was introduced. My recommendation as of version 9 Fixpack 4 is to leave Health Monitor deactivated, and IBM needs to put much more effort into understanding why HMON is as buggy as it is.
My own experiences with Health Monitor occurred January to March 2008 and were resolved, as for so many customers, by deactivating Health Monitor. Reports of possibly related issues go back to version 8.1 and 2004. One thing I find suspect is that in many of the Health Monitor APARs (listed below), IBM shows the APAR as resolved simply by advising customers to deactivate Health Monitor. They didn’t fix the problem at all. Yet general IBM documentation advises the use of Health Monitor. I believe the IBM docs should mention some of the risks involved in activating it.
I have provided full system configuration, db2support and db2trc output to IBM, and will be happy to provide those details and PMR numbers to any IBM engineers who are interested in tracking the progress of resolving these issues.
Note About HMON Instance And Database Capture
DB2 documentation in most places states that deactivating Health Monitor only entails setting the DBM CFG parameter HEALTH_MON to OFF. However, that only deactivates the database-level health capture associated with the db2hmon process. Health monitoring involves more than just db2hmon. Health capture will continue to restart every two hours via the db2acd process, which captures instance-level health information and performs tasks related to automated maintenance. The only way to deactivate db2acd at present is to eliminate fenced memory through this procedure. This is not generally desirable, because it interferes with the db2expln utility and with functions that require fenced memory.
Not all of the problems I detail here are necessarily related to db2hmon; some may be related to db2acd or other functionality.
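As a rough sketch of the distinction above, the following shell functions turn off the DBM CFG parameter and then show which health-related processes are still running. It assumes the commands run as the instance owner with the DB2 command line processor on the PATH. The function names are illustrative and are not part of DB2.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not DB2 commands.

# Turn off database-level health capture, then show the resulting setting.
disable_health_mon() {
    db2 update dbm cfg using HEALTH_MON OFF &&
    db2 get dbm cfg | grep -i 'HEALTH_MON'
}

# Read `ps -ef` style output on stdin and keep only health-related processes.
# db2hmon is expected to be gone after HEALTH_MON OFF; db2acd may still appear.
health_procs() {
    grep -E 'db2(hmon|acd)' || true
}

# Usage (not run automatically):
#   disable_health_mon
#   ps -ef | health_procs
```

If db2acd is still listed after HEALTH_MON is OFF, that matches the behavior described above: only the db2hmon side has been deactivated.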
To eliminate one factor: we received early advice from IBM to disable Self-Tuning Memory Manager (STMM), and we did so. This did not resolve the Health Monitor issues.
Problem Summary
I have two instances in two separate Solaris Zones and separate Zone Clusters, each experiencing a different Health Monitor issue. They are:
Hang-Wait, Appearance of Heavy Lock Contention
When Health Monitor is active (DBM CFG parameter HEALTH_MON ON), it runs every two hours, and each time it ran we saw all database processes grind to a halt, which caused an application processing pileup and eventually an application crash. I saw this with an application that averages 3,500 UOW (commits) per minute, with occasional spikes of 38,000 per minute. The SQL execution delays resembled table lock contention, but no significant lock wait was seen at these times. It seems clearly to be a hang-wait. The application was designed to deal with processing delays by spawning more database connections, which it did, worsening the problem by eventually exceeding MAXAPPLS or other resource limits.
My guess is that this site, among all IBM customers, experienced the Health Monitor issue so starkly because the high transaction volume caused immediate and dramatic crashes for the application. Another shop with low transaction volume might not notice the possible hang-waits caused by Health Monitor every two hours.
I have had an unresolved PMR about this problem in IBM’s hands for over six months now. Reading other APARs related to Health Monitor, I can make some SWAGs about what might be happening. APAR IY40216 [http://www-01.ibm.com/support/docview.wss?uid=swg1IY40216] notes a possible problem with db2hmon’s termination routine, which might fail to notify the DB2 engine system controller that it has completed its work. It is possible that db2hmon acquires a latch without releasing it properly, or fails to issue a termination semaphore.
Another symptom of these process delays was that every two hours, when Health Monitor ran and caused the application pile-up, DB2 would gain control of about 0.5 GB of memory and swap. So every four hours we would lose about a gigabyte of free memory. Eventually performance would be degraded by running excessively on swap, and, left untended, the instance would crash (actually during this period we did regular restarts of the instance, which freed the memory).
The two-hourly process halts were solved instantly by deactivating Health Monitor.
Health Monitor Crashes
On another DB2 database instance with much lower transaction activity, I do not see the above behavior. Instead, db2hmon just crashes when HEALTH_MON is ON, with the following errors in db2diag.log:
2008-03-13-00.18.44.059066-300 E876A408 LEVEL: Warning
PID : 10499 TID : 1 PROC : db2sysc 0
INSTANCE: db2 NODE : 000
FUNCTION: DB2 UDB, routine_infrastructure, sqlerReturnFmpToPool, probe:999
DATA #1 : String, 22 bytes
Removing FMP from pool
DATA #2 : Hexdump, 16 bytes
0xFFFFFFFF7FFFE160 : 0000 0000 0000 0000 0000 084F 0000 0000 ...........O....

2008-03-13-00.18.44.076838-300 E1285A456 LEVEL: Error
PID : 10499 TID : 1 PROC : db2sysc 0
INSTANCE: db2 NODE : 000
FUNCTION: DB2 UDB, routine_infrastructure, sqlerRemoveAllIPCforRow, probe:10
DATA #1 : String, 32 bytes
Freeing IPC resource explicitly:
DATA #2 : SQLO_PID, PD_TYPE_SQLO_PID, 4 bytes
2127
DATA #3 : Hexdump, 4 bytes
0x0000000200447FE0 : 0000 0000 ....

2008-03-13-00.18.44.077523-300 E1742A345 LEVEL: Error
PID : 10499 TID : 1 PROC : db2sysc 0
INSTANCE: db2 NODE : 000
FUNCTION: DB2 UDB, routine_infrastructure, sqlerRemoveAllIPCforRow, probe:20
DATA #1 : String, 22 bytes
IPC resources Address:
DATA #2 : Pointer, 8 bytes
0x0000000210010080

2008-03-13-00.18.44.078221-300 E2088A1046 LEVEL: Error
PID : 10499 TID : 1 PROC : db2sysc 0
INSTANCE: db2 NODE : 000
FUNCTION: DB2 UDB, routine_infrastructure, sqlerRemoveAllIPCforRow, probe:30
DATA #1 : String, 29 bytes
Number of IPC resource found:
DATA #2 : signed integer, 4 bytes
1
DATA #3 : String, 29 bytes
Number of IPC resource freed:
DATA #4 : signed integer, 4 bytes
1
CALLSTCK:
[0] 0xFFFFFFFF7C88B804 __1cXsqlerRemoveAllIPCforRow6FpnLsqlerFmpRow_b_i_ + 0x89C
[1] 0xFFFFFFFF7C881A68 __1cXsqlerRemoveFmpFromTable6FpnLsqlerFmpRow_b_i_ + 0x220
[2] 0xFFFFFFFF7C884A6C __1cUsqlerReturnFmpToPool6FccpnOsqlerFmpHandle_pnNsqle_agent_cb__i_ + 0x15DC
[3] 0x000000010000AD9C __1cOsqleRunSysCtlr6F_i_ + 0x3D4
[4] 0x0000000100006A64 __1cLsqleSysCtlr6F_i_ + 0x384
[5] 0xFFFFFFFF7AF75CDC __1cHDBGTerm6F_i_ + 0x1F94
[6] 0xFFFFFFFF7AF76984 sqloRunInstance + 0x5E4
[7] 0x0000000100006524 main + 0x924
[8] 0x0000000100004E5C _start + 0x17C
[9] 0x0000000000000000 ?unknown + 0x0

2008-03-13-00.18.44.091581-300 I3135A433 LEVEL: Severe
PID : 10499 TID : 1 PROC : db2sysc 0
INSTANCE: db2 NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:15
DATA #1 : Hexdump, 31 bytes
0x000000010000DB20 : 4865 616C 7468 204D 6F6E 6974 6F72 2050 Health Monitor P
0x000000010000DB30 : 726F 6365 7373 2063 7261 7368 6564 2E rocess crashed.

2008-03-13-00.18.44.092207-300 I3569A340 LEVEL: Severe
PID : 10499 TID : 1 PROC : db2sysc 0
INSTANCE: db2 NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:16
DATA #1 : Hexdump, 4 bytes
0xFFFFFFFF7FFFE170 : 0000 084F ...O

2008-03-13-00.18.44.092769-300 I3910A340 LEVEL: Severe
PID : 10499 TID : 1 PROC : db2sysc 0
INSTANCE: db2 NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:17
DATA #1 : Hexdump, 4 bytes
0xFFFFFFFF7FFFE174 : 0000 0101 ....

2008-03-13-00.18.44.093349-300 I4251A340 LEVEL: Severe
PID : 10499 TID : 1 PROC : db2sysc 0
INSTANCE: db2 NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:18
DATA #1 : Hexdump, 4 bytes
0xFFFFFFFF7FFFE178 : FFFF FFFF ....

2008-03-13-00.18.44.110896-300 I4592A282 LEVEL: Warning
PID : 10499 TID : 1 PROC : db2sysc 0
INSTANCE: db2 NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqleRunSysCtlr, probe:63
MESSAGE : Health Monitor Process restarted.
We got these errors every time we started the instance. They recurred every 15 minutes for a total of two hours after starting DB2, at which point DB2 appears to stop attempting to restart the Health Monitor.
I am posting these issues as a gift to the open community, to invite comments and not necessarily to solicit advice or solutions. Even with full db2support and db2trc output, IBM has not been able to resolve these issues in over six months. I have two open PMRs, and if anyone from IBM laboratories would like more information, please feel free to contact me at consulting AT ebenner DOT com.
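To check how often the restart cycle occurs on a given instance, the restart records can be pulled out of db2diag.log. The sketch below assumes records are separated by blank lines, as in the excerpt above, and that the log path is passed in explicitly. The function name is illustrative. It matches on the plain-text restart message rather than the crash message, because the crash text only appears split across hexdump lines.

```shell
#!/bin/sh
# Sketch only: hmon_restarts is a hypothetical helper, not a DB2 tool.

# Print the timestamp of every db2diag.log record that reports a
# Health Monitor restart. $1 is the path to the log file.
hmon_restarts() {
    # RS="" puts awk in paragraph mode: each blank-line-separated record
    # is one unit, and $1 is its leading timestamp.
    awk 'BEGIN { RS = "" } /Health Monitor Process restarted/ { print $1 }' "$1"
}

# Usage (path is an assumption; adjust to the instance's diagnostic directory):
#   hmon_restarts "$HOME/sqllib/db2dump/db2diag.log"
#   hmon_restarts "$HOME/sqllib/db2dump/db2diag.log" | wc -l
```

With the pattern described above, one restart every 15 minutes for two hours, the count after an instance start would be expected to be in the neighborhood of eight. On Solaris the default awk may not support paragraph mode, so nawk or /usr/xpg4/bin/awk may be needed.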
Update February 17, 2009 from IBM
Set the registry variable for the instance once.
db2set DB2ENVLIST=DB2_HMON_DFT_STACK_SZ
And set the environment variable before db2start every time.
export DB2_HMON_DFT_STACK_SZ=655360
db2start
This does indeed appear to end the HMON crashes, in a test server where we were able to replicate the condition. It is unlikely that we will be allowed to implement the suggested change in Production, since we are not using Health Monitor functionality, and because the risk to the Production system is still an unknown – even if Health Monitor no longer crashes, HMON may impose performance penalties on the high transaction volume application.
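Because the export has to precede every db2start, one option is to wrap the two steps from IBM in small shell functions so the variable is not forgotten. This is a sketch under assumptions: the function names are hypothetical, 655360 is simply the value IBM supplied, and db2set is assumed to replace any existing DB2ENVLIST value, so names already in that list would need to be repeated.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not DB2 commands.

# One-time step per instance: register the variable name, then print
# the registry value to confirm it was stored.
register_hmon_stack_var() {
    db2set DB2ENVLIST=DB2_HMON_DFT_STACK_SZ &&
    db2set DB2ENVLIST
}

# Every start: export the stack size, then start the instance.
# An optional first argument overrides the value IBM supplied.
start_db2_with_hmon_stack() {
    DB2_HMON_DFT_STACK_SZ="${1:-655360}"
    export DB2_HMON_DFT_STACK_SZ
    db2start
}

# Usage (not run automatically):
#   register_hmon_stack_var        # once
#   start_db2_with_hmon_stack      # instead of plain db2start
```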
