+1.916.577.1977 | Downloads | Buy | Register | Login
 Search  
Tuesday, July 08, 2008
Search Blogs
 

Available Blogs
 

Previous Blogs
 

Technorati
 
More blogs about coversant.

About Coversant
 

Production Debugging with WinDbg, Son Of Strike, and ADPlus
 
Location: BlogsMullin' with Mullins    
Posted by: Chris Mullins 11/26/2006

Tracking down a production bug

Production Applications have bugs. It doesn't seem to matter how well tested, or under what conditions that testing happens under, the bugs still persist. These bugs typically are hard or impossible to reproduce in our test lab, so having cooperation from the customer in tracking it down is invaluable.

A client running the SoapBox Server 2005 SR2 recently reported major memory issues under load conditions that should not have caused any troubles at all. After scratching our heads for a bit, the need to get to the bottom of the problem was serious enough to take a break out of daily routine and find the issue. Fortunately the customer has been exceptionally helpful, and was able to collect dump files for us.

Dump files, you ask? Well. This blog highlights the methodology taken to get to the bottom of a critical problem in a production .Net application. This mostly means obtaining a dump file, and using WinDbg to track down the issues.

For those of you used to developing in Visual Studio, the WinDbg debugging experience will come as something of a shock. Frankly, it stinks. It's like going back 20+ years and using GDB or some other command line driven debugger. Unfortunately WinDbg is the only option for debugging MiniDump files that contain Managed Code.

The Problem

With relatively few users logged in, memory on the server was far higher than it should have been. As more users log in, memory continues to grow until an OutOfMemoryException happens, and the process restarts.

Environment

Windows 2003 Server (x86), .Net 1.1 SP1, SoapBox Server 2005 SR2, Dual Core Processor.

Obtaining a Dump File

The first step is work with the customer so that we can get a MiniDump file. This file is a snapshot of the memory used by the process, the running threads and their call stacks, and just about everything else you can think of. With this we can poke and prod to determine the application state when the dump was generated.

The easiest way to generate MiniDumps is to use with ADPlus. To get this setup:

  1. On the server, download and install the Debugging Tools For Window for your platform.
  2. Copy the DNExitProcess.xml file that I shameless stole from John Robbin's Bugslayer column to your local machine.

With the process up and running (but not yet doing anything), setup AD Plus to capture a dump when the process exits:

  • C:\Program Files\Debugging Tools for Windows"\adplus -c c:\Temp\CrashDump\DNExitProcess.xml -o C:\Temp\CrashDump\ -pn SoapBoxServer.exe

At this point, when the SoapBoxServer.exe process terminates, ADPlus will create a MiniDump in the C:\Temp\CrashDump directory.

The name MiniDump is deceiving. These files will be huge. 2+ GB per file, and there will typically be two of them. Fortunately, they compress very well.

Now That your customer has generated dumps, compressed them, and gotten them to you via a FTP or WebDAV folder, the real fun begins.

Setting up for Debugging

On your local workstation:

  1. Download and install the x86 version of the Debugging tools for Windows.
  2. Track down the actual binary and symbol files (.pdb files) that match the version of software your customer is running. If you didn't save these when you did your build, you're in real trouble. I bet you won't make THAT mistake again! :)
  3. Put the binary and symbol files into a local directory like: C:\Temp\SoapBoxBuildSymbols\
  4. Copy the dump files into a local directory like: C:\Temp\Debugging\
  5. Create an empty symbol directory C:\temp\symbols\, that will contain symbols downloaded from the Microsoft Symbol Server.

Setting up WinDbg

Run WinDbg. If you installed the Debugging Tools, you can find this in your start menu. If you didn't install the Debugging Tools, go do it now. They're at Debugging Tools For Window.

With WinDbg running, you need to set your Symbol Path, which is found under the File Menu. We want to set WinDbg to use our local symbols (for our custom application), as well as the Microsoft Symbol Server for all .Net framework and Windows Symbols. To do this, set the symbol path to:

  • C:\temp\SoapBoxBuildSymbols\; SRV*c:\temp\symbols*http://msdl.microsoft.com/download/symbols

Now we're ready to open the dump file and start debugging. Select Open Crash Dump, and select the dump file to debug. If there are two dumps, I generally debug the 1st Chance Exception and/or Process Shutdown dump.

In the dump file window type, .load clr10/sos.dll. This loads Son of Strike which is the .Net extension to WinDbg.

Without Son of Strike you're not going to be able to do any significant debugging. Unfortunately the only version of this I've been able to find is for x86 - I'm not sure yet what the answer is for x64 or IA64. If anyone knows the answer to this, please let me know.

Starting the Debug Process

Based on the fact that we're running out of memory, its pretty likely that we're suffering some sort of memory leak. Let's take a quick look at the threads that are running on the system and the state they're in:

0:000> !threads
ThreadCount: 180
UnstartedThread: 0
BackgroundThread: 60
PendingThread: 0
DeadThread: 119
                                  PreEmptive   GC Alloc                     Lock     
        ID  ThreadOBJ       State     GC       Context           Domain     Count APT Exception
XXX  0x98c 0x00150070      0xa020 Enabled  0x00000000:0x00000000 0x0014b0a8     0 Ukn
XXX  0xbb4 0x00151790      0xb220 Enabled  0x00000000:0x00000000 0x0014b0a8     0 Ukn (Finalizer)
XXX      0 0x00197458   0x1800820 Enabled  0x00000000:0x00000000 0x0014b0a8     0 Ukn (Threadpool Worker)
XXX      0 0x001bb030      0x1820 Enabled  0x00000000:0x00000000 0x0014b0a8     0 Ukn System.OutOfMemoryException
...
XXX      0 0x128eebc8   0x1800820 Enabled  0x00000000:0x00000000 0x0014b0a8     0 Ukn (Threadpool Worker) System.OutOfMemoryException
XXX      0 0x12828628    0x800820 Enabled  0x00000000:0x00000000 0x0014b0a8     0 MTA (Threadpool Completion Port) System.OutOfMemoryException
...
XXX  0xd74 0x0d156680    0x800220 Enabled  0x00000000:0x00000000 0x0014b0a8     0 Ukn (Threadpool Completion Port) System.OutOfMemoryException
...
XXX  0xff0 0x12906490   0x1800220 Enabled  0x00000000:0x00000000 0x0014b0a8     0 Ukn (Threadpool Worker) System.OutOfMemoryException
...
  0  0xbc8 0x0b39da48   0x1800220 Enabled  0x99dc8528:0x99dca49c 0x0014b0a8     0 MTA (Threadpool Worker) System.OutOfMemoryException

All those OutOfMemory Exceptions prove, beyond any doubt, that we've got a memory issue. Let's pull statistics on the Managed Heap and see if we can figure out a bit more.

0:000> !DumpHeap -Stat
------------------------------
Heap 0
total 552,517 objects
------------------------------
Heap 1
total 306,058 objects
------------------------------
total 858,575 objects
Statistics:
        MT      Count     TotalSize Class Name
0x7bee8da4          1            12 System.InvariantComparer
0x7becc074          1            12 System.Xml.XmlDownloadManager
0x7b58ed44          1            12 System.Drawing.Printing.PrintingPermission
.
.
.
0x006c2c3c    437,647 1,797,282,964 System.Byte[]

Just looking at these heap stats, I can see that I have 437K Byte Arrays allocated, consuming a total of 1.8 Gigabytes. In terms of figuring out what the problem is, I just found it. We're leaking byte arrays.

I'm actually somewhat surprised by this, as I was 95% certain this would end up being a heap fragmentation issue. Finding the cause of this should be pretty easy - certainly easier than finding the root cause of fragmentation or some other obscure issue. Ironically, the place where these byte arrays are allocated exists primarily to help us minimize heap fragmentation.

Using DumpHeap we can look at the byte arrays. With 400K+ of these, I want to limit the number I return. Let's take a look at some of these byte arrays:

0:000> !DumpHeap -Type Byte[] -l 5 -min 4000
------------------------------
Heap 0
   Address         MT     Size  Gen
0x00e45430 0x006c2c3c    8,204    2 System.Byte[] 
0x00e4dc58 0x006c2c3c    4,108    2 System.Byte[] 
0x00e4ec64 0x006c2c3c    4,108    2 System.Byte[] 
0x00e4fc70 0x006c2c3c    4,108    2 System.Byte[] 
0x00e50c7c 0x006c2c3c    4,108    2 System.Byte[] 
total 5 objects
------------------------------
Heap 1
   Address         MT     Size  Gen
0x04de7518 0x006c2c3c   65,548    2 System.Byte[] 
0x04e1e7c8 0x006c2c3c    8,204    2 System.Byte[] 
0x04e208c0 0x006c2c3c    8,204    2 System.Byte[] 
0x04e22a30 0x006c2c3c    8,204    2 System.Byte[] 
0x04e24aec 0x006c2c3c    8,204    2 System.Byte[] 
total 5 objects
------------------------------
total 10 objects

Note: For those of you wondering why there are two heaps here's the answer: Initially I though Heap 0 was the standard Managed Heap (complete with the expected Gen0, Gen1, Gen2), and Heap 1 was the Large Object Heap.

Then I remembered we are running on a dual-core machine, and using the Server Version of the CLR.. The key here is the phrase (from that link): We create a GC thread and a separated heap for each CPU. GC happens on these threads instead of on the allocating thread. So there ya have it, we have 2 heaps - one for each CPU.

If I run this command without the -l 5 option, I see all 400K+ byte arrays. This takes about an hour to scroll by on the screen. However, it's important to pick a "newer" array, so that we can be sure of getting one that I would consider a "leaked" array. If we grabbed one from the top of the list, it's likley to be a legit, pre-allocated array (which we do quite a bit of, in order to prevent heap fragmentation.

So I run the command, get the FULL list of arrays, then pick one from towards the end, and record the address. The one I pick happens to have an address of 0x9c559b60.

Now that I have a byte array isolated, and know it's address, it's time to find out more about it:

0:000> !DumpObj 0x9c559b60
Name: System.Byte[]
MethodTable 0x006c2c3c
EEClass 0x006c2bc4
Size 4108(0x100c) bytes
GC Generation: 2
Array: Rank 1, Type System.Byte
Element Type: System.Byte
Content: 4,096 items

So this byte array is 4108 bytes, and lives in GC Generation 2. This is nice to know, but doesn't tell us enough. 'Who owns this byte array?' is really the question we need answered. Fortunately that's pretty easy to answer:

0:000> !GCRoot 0x9c559b60
Scan Thread 0 (0xbc8)
Scan HandleTable 0x14a430
Scan HandleTable 0x14d348
HANDLE(Strong):6b13c8:Root:0x1338944(System.Net.Sockets.AcceptAsyncResult)->
0x13388d0(System.AsyncCallback)->
0x1318118(Winfessor.Proxy.Server.Socks.XMPPBytestreamsSocksListener)->
0x13180fc(Winfessor.Proxy.Server.DestroyDelegate)->
0x1276d7c(Winfessor.SoapBox.Server.ServerCore.Gateways.FileTransferProxy.FileTransferProxyComponent)->
0x4e0e078(Winfessor.SoapBox.Base.XMPPSocketManager)->
0xe4d5e0(Winfessor.SoapBox.Base.BufferPool)->
0xe4d648(System.Collections.Queue)->
0x912c9e0(System.Object[])->
0x9c559b60(System.Byte[])

We now have the owning call stack for the array. Now we're getting somewhere! Knowing our server as I do, the thing that jumps out at me is the BufferPool (0xe4d5e0). The BufferPool is a class designed to allocate a number of byte arrays at a time. These byte arrays are then passed into methods that cause them to be pinned in memory (such as Socket.BeginReceive, Socket.BeginSend, and a number of LDAP and Active Directory methods). These pinned byte arrays tend to be pinned for a long time, and if they're not carefully managed create all sorts of holes in the heap, which is very problematic.

Let's look a bit closer:

0:000> !DumpObj 0xe4d5e0
Name: Winfessor.SoapBox.Base.BufferPool
MethodTable 0x0c4c0074
EEClass 0x0c4136b8
Size 28(0x1c) bytes
GC Generation: 2
mdToken: 0x02000061  (c:\program files\coversant\soapbox server
 2005\winfessor.soapbox.base.dll)
FieldDesc*: 0x0c38ff7c
        MT      Field     Offset                 Type       Attr      Value Name
0x0c38fe74 0x400013c      0x8         System.Int32   instance 100 _initialCount
0x0c38fe74 0x400013d      0xc         System.Int32   instance 100 _growthCount
0x0c38fe74 0x400013e     0x10       System.Boolean   instance 1 _gc
0x0c38fe74 0x400013f      0x4                CLASS   instance 0x00e4d648 _queue
0x0c4c0074 0x4000109     0x14         System.Int32   instance 4096 _bufferSize

This doesn't tell me anything. But it does give me the address of the Queue. This queue is (by design) full of byte arrays that are continually pinned and unpinned in memory by Win32 calls. Its purpose is to help minimize heap fragmentation, which it does very well. Let's poke at the queue:

0:000> !dumpobj 0x00e4d648
Name: System.Collections.Queue
MethodTable 0x79bca304
EEClass 0x79bca3a8
Size 32(0x20) bytes
GC Generation: 2
mdToken: 0x0200011f  (c:\windows\microsoft.net\framework\v1.1.4322\mscorlib.dll)
FieldDesc*: 0x79bca40c
        MT      Field     Offset                 Type       Attr      Value Name
0x79bca304 0x40003bb      0x4                CLASS   instance 0x0912c9e0 _array
0x79bca304 0x40003bc      0x8         System.Int32   instance 128646 _head
0x79bca304 0x40003bd      0xc         System.Int32   instance 565323 _tail
0x79bca304 0x40003be     0x10         System.Int32   instance 435315 _size
0x79bca304 0x40003bf     0x14         System.Int32   instance 200 _growFactor
0x79bca304 0x40003c0     0x18         System.Int32   instance 15536770 _version

Its a bit weird that Version is 15536770. Version on a CLR Queue is updated for each call into the queue. This means there have been 15 million queue operations so far. This seems a bit high.
Let's take a look at the array (0x0912c9e0)that is used by the queue:

0:000> !DumpObj 0x0912c9e0 
Name: System.Object[]
MethodTable 0x006c209c
EEClass 0x006c2018
Size 3276816(0x320010) bytes
GC Generation: 3
Array: Rank 1, Type CLASS
Element Type: System.Object
Content: 819,200 items

Well, here's our problem. This array has 819,200 items in it. Each of these items is (I just happen to know) an array of bytes. Clearly something is very, very wrong with the algorithm used to grow the queue! At this point I could start looking at the individual array items, but they wouldn't tell me much. For details on how to do this see Arrays and SOS on MSDN.
So now we know problem: We have way, way too many of these byte arrays around. Now that we have the culprit, it's time to figure out why.

Why are we growing Byte Arrays?

Being a process oriented and very methodical company, we have all the source code used for the exact build our customers are running. The gateway into the Queue of object is the ObjectPoolBase class that we have. This offers two methods: CheckIn, and CheckOut. One of these two (or both!) methods must be broken and causing the "Grow" algorithm to fire too often.

Note: Our SoapoBox 2005 product line was written primarily in Visual Basic.Net. Our SoapBox 2007 product line has been converted to C#. This doesn't really have any impact on anything, as the two langauges are pretty much identical in all ways that matter.

Protected Function CheckOutObject() As Object
        Dim emptyQueue As Boolean = False

        SyncLock _queue.SyncRoot
            emptyQueue = (_queue.Count = 0)
        End SyncLock

        If emptyQueue Then
            AddObjects(GrowthCount)
        End If

        Return _queue.Dequeue()
    End Function

There's a race condition there. How on earth did that happen? This code has been tested, retested, and tested again. Yikes!
Even scarier than the race condition around the checking of queue size, is that the _queue.Dequeue method isn't protected inside a lock. I seriously doubt the CLR Queue class is thread safe, so this is probably causing a problem.
Let's take a quick look at the Grow Algorithm to see how that looks:

 Private Sub AddObjects(ByVal count As Integer)
        'force garbage collection to make sure 
        'these new long lived objects cause as 
        'little fragmentation as possible
        If _gc Then
            System.GC.Collect()
            System.GC.WaitForPendingFinalizers()
        End If

        For i As Integer = 1 To count
            Dim obj As Object = GetObjectInstance()
            SyncLock _queue.SyncRoot
                _queue.Enqueue(obj)
            End SyncLock
        Next

        'force garbage collection to make sure these new 
        'long lived objects cause as little fragmentation 
        'as possible
        If _gc Then
            System.GC.Collect()
            System.GC.WaitForPendingFinalizers()
        End If
    End Sub

That's also... far from ideal in terms of threading and locking. When did this happen? Which versions of our server have this code? Does the 2007 version still have this?

Before we figure that out, let's see the details on that call to _queue.DeQueue(). Using Reflector I can pull up the source easily enough:

public virtual object Dequeue()
{
      if (this._size == 0)
      {
            throw new InvalidOperationException(
               Environment.GetResourceString(
                 "InvalidOperation_EmptyQueue"));
      }
      object obj1 = this._array[this._head];
      this._array[this._head] = null;
      this._head = (this._head + 1) % this._array.Length;
      this._size--;
      this._version++;
      return obj1;
} 

Found deep inside the CLR, this code is absolutely, positively, not thread safe. Queue.Dequeue needs to be called from inside a Lock. This is a bug on our part.

Which versions are vulnerable?

Using Source Code control, it's easy to pull up the history for this file. Doing so, I find that it didn't always look like this. In fact, in all versions PRIOR to the SoapBox Server 2005 SR2, it looked like:

Protected Function CheckOutObject() As Object
        Monitor.Enter(_queue.SyncRoot)
        Try
            If _queue.Count = 0 Then
                AddObjects(GrowthCount)
            End If

            Return _queue.Dequeue()
        Finally
            Monitor.Exit(_queue.SyncRoot)
        End Try
    End Function

This code is fine, and does not have the race conditions of the code in question. How about the SoapBox Server 2007?

protected object CheckOutObject()
{
    if (_disposed)
        throw new ObjectDisposedException("ObjectPoolBase");

    object obj;
    bool added = false;
    lock (_queue.SyncRoot)
    {
        if (0 == _queue.Count)
        {
            AddObjects(GrowthCount);
            added = true;
        }

        obj = _queue.Dequeue();
    }

    if (added && _gc)
    {
        //Have to do this outside of any 
        //synchronization or things may hang.
        GC.WaitForPendingFinalizers();
    }

    return obj;
}

Whew! Our current release is clean, and all releases other than the SR2 release of SoapBox 2005 are also fine.

Conculusions

So in summary, there is (was!) a threading bug in the server. Why this bug didn't show up under testing, is somewhat of a mystery.

My best guess is that the action of actually having hundreds of client machines all hitting the server causes this to happen far, far more than having hundreds of clients on a single machine hitting the server does.

In general we're very diligent about testing for multi-threaded bugs. All of our workstation are multi-processor machines, we have dozens of explicitly multi-threaded tests, and our current high-stress test infrastructure includes a heavily multi-threaded XMPP BotNet that can run on any number of machines.

In addition to that, the people working on our server are all very knowledgeable threading experts. In fact the developer who introduced that particular bug shown took one look at it and instantly said, "Who wrote that? There's a race condition there.".

I guess the real answer is a more formal code review process. Unfortunately this is hard to enforce at the process level, as none of the source-code control tools that we use can do this.

Technorati Tags: , , , , ,
Permalink |  Trackback

©2008 Coversant, Inc. | Privacy Policy | About Coversant | Contact Info