29 May 2010
27 May 2010
A couple of times a year I’ve encountered a strange problem on our ColdFusion servers where sessions mount up and aren’t removed after they should have expired. Just today we had hundreds of sessions left in memory, with all of their session scope variables still there at the end of the day, hours after they should have been deleted. Automated session housekeeping ceased to be.
Two other symptoms, which must surely be related, drew my attention to the problem. Emails stopped being sent: the spool directory filled up without any cfmail files leaving. And on the website CF sporadically threw the error “The session is invalid”, which was temporarily resolved by closing the browser and logging into the app again.
Restarting the CF services doesn’t resolve the situation because the service refuses to stop if asked politely. A full reboot is the only way to restore normality with confidence.
You’re probably wondering what good it does to complain now, since 8.0.1 isn’t the current release and we should upgrade to 9.x. Well, how do we know that 9.x has fixed this problem? Are Adobe aware of the issue? We could run a 30-day trial of 9.x on a test server, but given how rarely the bug appears we’d have to run it for at least 6 months under a constant load that mimics our production server.
If you have encountered this problem before or know how to fix it then please let me know.
25 May 2010
Good news! Microsoft has issued us with a public release of the hotfix, which will be known as KB982210. The fix will only work on Windows Server 2008 R2 and not with any previous releases of the OS.
So how does the fix work? First let me explain the problem more clearly than in previous blog entries. Whenever a device is attached to Windows, the Plug & Play Manager creates an entry for that device in the registry. If it’s a USB device, for example, and you unplug it, its entry remains in the registry so that when it’s plugged back in the computer recognises it, along with any settings that were previously set up for it. The same is true for snapshots created by VSS, the Volume Shadow Copy Service. A snapshot is treated like a device, so once a new snapshot is created so too is a registry entry for the device.
Now here’s the problem. The registry entries are not removed. Ever. While many users will never have a problem with that, there are a number of power users who generate thousands of snapshots over a short period of time. In our case, for example, we use Windows Server Backup (WSB) with a backup scheduled every 30 minutes, which includes backing up 14 VHDs used by several VMs on Hyper-V. VSS creates 14 snapshots (one for each VHD) every time a backup is run. That’s 14 snapshots every 30 minutes, which is 672 a day and over 20,000 per month. See how quickly they mount up, and none of the device entries in the registry are being removed.
Severe problems manifest when the host server is rebooted and the registry is processed: analysing tens of thousands of devices makes the server look as if it has hung. Ours froze for 2 or 3 hours, and it would probably have been longer if I had let the server carry on taking backups.
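As a rough sketch of that arithmetic, here it is in Python. The function names are mine, purely for illustration, and a month is assumed to be 30 days.

```python
import math

MINUTES_PER_DAY = 24 * 60


def snapshots_per_day(vhd_count, interval_minutes):
    """Snapshots created per day when every backup run snapshots each VHD once."""
    runs_per_day = MINUTES_PER_DAY // interval_minutes
    return vhd_count * runs_per_day


def days_to_reach(target_entries, vhd_count, interval_minutes):
    """Whole days of backups needed before the entry count reaches the target."""
    return math.ceil(target_entries / snapshots_per_day(vhd_count, interval_minutes))


per_day = snapshots_per_day(vhd_count=14, interval_minutes=30)
per_month = per_day * 30  # assumption: a 30-day month

print(per_day)    # 672
print(per_month)  # 20160
```

Changing `vhd_count` or `interval_minutes` shows how quickly a different backup schedule would fill the registry.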
The registry key you need to check in Windows 2008 R2 is:
There should be 10 to 50 entries in there for a normal healthy machine, depending on how many devices you have attached. On our server we had 28,000 entries!
Now, how does the fix work?
The new hotfix from Microsoft makes a change to the Plug & Play Manager that adds a timestamp (or a tombstone date as they prefer to call it) for each new snapshot device that’s created. This means Windows is now aware of exactly when a snapshot was created and can make a decision to fully remove the device’s entry from the registry after a certain period of time. I’m not sure how long it waits but from our experience it can be counted in minutes rather than days or weeks.
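To illustrate the idea only, here is a toy model in Python. It is not Microsoft's implementation: the dictionary standing in for the registry, the `purge_expired` function and the 30-minute retention period are all assumptions of mine, since I don't know the real values.

```python
from datetime import datetime, timedelta

# Assumption: the real retention period is unknown; 30 minutes is a placeholder.
RETENTION = timedelta(minutes=30)


def purge_expired(entries, now, retention=RETENTION):
    """Keep only the snapshot entries whose timestamp is still within the retention period.

    `entries` maps a hypothetical device name to the time its entry was stamped.
    """
    return {
        name: stamped
        for name, stamped in entries.items()
        if now - stamped < retention
    }


now = datetime(2010, 5, 25, 12, 0)
registry = {
    "snapshot-old": now - timedelta(hours=2),     # stamped long ago, gets removed
    "snapshot-new": now - timedelta(minutes=5),   # recent, is kept
}
registry = purge_expired(registry, now)
print(sorted(registry))  # ['snapshot-new']
```

The point is simply that once every entry carries a timestamp, stale ones can be told apart from live ones and dropped, which was impossible before the hotfix.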
So well done to Microsoft for creating a smart solution to a critical problem. It took a massive amount of effort to get our case brought to the attention of the right person. Before that I had spent weeks working with a Microsoft support team in India by phone and remote desktop, trying to explain what the problem was (and no, the problem will not be resolved by re-installing Windows, thank you!). Only when a Premier Support case was opened, and a Microsoft account manager from the UK got involved and contacted the right technical person, did we start to make rapid progress. Microsoft had been aware of the technical issue for a while but our case seems to have given them the incentive to fully investigate it. We are grateful that we were finally listened to and that some very expensive new servers can finally be put into service.
Cleaning up the registry
The only remaining problem is for people who are already experiencing this. You need to clean up your registry before applying the hotfix, otherwise the freezing symptom will persist. To do this you need a tool from Microsoft called devnodeclean. Sadly, based on my Google and Bing searches, it is not available to download anywhere on the Internet. Microsoft should be able to email you a copy if you open a support case with them and refer them to KB982210, and to this blog entry for good measure. Run devnodeclean without any switches at first to see what it makes of your registry, then use the /r switch to force it to remove the unwanted devices from the registry.
Next you may need to compact your registry if it has become huge; anything over 50MB, I would say. Ours peaked at over 450MB. The “system” hive (found in C:\Windows\System32\config\) can be compacted using chkreg with the switches /l /c /r /v. Like devnodeclean, chkreg is only available by request from Microsoft; it was in fact developed to repair the registry in Windows 2000 but amazingly still works on 2008 R2. Please note that chkreg cannot compact a live “system” file. You need to back up your system settings first and run chkreg on a restored copy of the “system” file (restore it to a new folder somewhere else). Then boot from the Windows setup DVD and enter the recovery console. Rename “system” to “system.old” and copy the compacted system file into the config directory. Then reboot into Windows.