I was flipping through Jeff Richter’s book Applied Microsoft .NET Framework Programming and came across the section on string interning. I hadn’t thought about this topic in a while, and as I read through the relevant pages, it occurred to me that we have a great case for explicit string interning in the SoapBox Server.
The SoapBox Server makes extensive use of JabberIDs. These IDs look a lot like email addresses, although they often have an XMPP resource appended to the end. They tend to look like “user@server/resource”.
Like any good XMPP development company, the first thing we did, years back, was to make a JabberID class. This class has three primary member variables:
private string _userName = string.Empty;
private string _resource = string.Empty;
private string _server = string.Empty;
We create many, many instances of this class. With 250k simultaneous users connected to our server we have at least 250k live instances of this class, and at any moment may have closer to 750k live instances. At 15 characters or so per string, that’s a minimum of 45 bytes per JID at one byte per character (CLR strings are stored as UTF-16, so the real figure is at least double that, before object overhead), or around 33 megabytes just in the JIDs’ string components at 750k instances. With the constant creation and destruction of JIDs this also causes quite a bit of memory thrash.
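To make the back-of-the-envelope math explicit, here is a small sketch. The class and method names are mine, chosen for illustration, and the calculation counts character data only; it ignores per-object headers and the JID object itself.

```csharp
using System;

public static class JidMemoryEstimate
{
    // Bytes of raw character data held by `instances` JIDs, where each JID
    // holds `stringsPerJid` strings of `charsPerString` characters.
    public static long CharacterBytes(long instances, int stringsPerJid,
                                      int charsPerString, int bytesPerChar)
    {
        return instances * stringsPerJid * charsPerString * bytesPerChar;
    }

    public static void Main()
    {
        // One byte per character: the lower bound quoted in the text.
        Console.WriteLine(CharacterBytes(750000, 3, 15, 1)); // 33750000

        // Two bytes per character, which is how the CLR stores strings (UTF-16).
        Console.WriteLine(CharacterBytes(750000, 3, 15, 2)); // 67500000
    }
}
```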
Now, from a use-case perspective, User Names tend to be pretty unique. Resources also tend to be somewhat unique, although a case can be made that they’re not. In any case, we see very few unique server names, and end up with a huge number of duplicated “coversant.net” strings floating around the system at any one time.
With all of these “coversant.net” strings floating around, they become ideal candidates for string interning. Interning is the process by which the CLR maintains a huge hashtable of strings, and rather than having duplicate strings floating around, uses a single instance of the string. The string interning hashtable is process wide (not app-domain wide, which seems a bit strange), and can never have items removed from it.
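A minimal sketch of what interning buys us: two strings with identical characters are normally two separate objects, but string.Intern hands back one shared instance for both. The class and helper names here are made up for the example.

```csharp
using System;

public static class InternDemo
{
    // Builds "coversant.net" at run time, so each call returns a brand-new object.
    public static string BuildServerName()
    {
        return new string("coversant.net".ToCharArray());
    }

    public static void Main()
    {
        string a = BuildServerName();
        string b = BuildServerName();

        Console.WriteLine(a == b);                        // True: same characters
        Console.WriteLine(object.ReferenceEquals(a, b));  // False: two objects on the heap

        // After interning, both variables refer to the single pooled instance.
        string ia = string.Intern(a);
        string ib = string.Intern(b);
        Console.WriteLine(object.ReferenceEquals(ia, ib)); // True
    }
}
```

The originals `a` and `b` become garbage as soon as nothing references them; only the pooled instance lives on.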
To enable string interning in our JabberID class, I changed the following line:
_server = JabberID.NormalizeDomain(value);
To:
_server = string.Intern(JabberID.NormalizeDomain(value));
As a side note, the NormalizeDomain method runs the string through Nameprep (RFC 3491), which provides case folding, Unicode Normalization Form KC, bidirectional checks and all the other fun Unicode stuff.
This change means all the strings put through our JID class as domain names (“coversant.net”, “xmpp.org”, etc) will be stored in the intern table, rather than as individual copies.
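Here is a stripped-down sketch of the idea. JidSketch is a hypothetical stand-in, not our real JabberID class, and its NormalizeDomain is only a placeholder (trim plus lower-casing); real Nameprep does far more than that.

```csharp
using System;

// Hypothetical, minimal stand-in for the JabberID class discussed above.
public sealed class JidSketch
{
    private string _server = string.Empty;

    // Placeholder only: NOT a Nameprep implementation.
    private static string NormalizeDomain(string value)
    {
        return value.Trim().ToLowerInvariant();
    }

    public string Server
    {
        get { return _server; }
        // Normalize first, then store the pooled instance rather than the new copy.
        set { _server = string.Intern(NormalizeDomain(value)); }
    }

    public static void Main()
    {
        JidSketch first = new JidSketch();
        JidSketch second = new JidSketch();
        first.Server = "Coversant.NET";
        second.Server = " coversant.net ";

        // Both fields now point at the one interned "coversant.net".
        Console.WriteLine(object.ReferenceEquals(first.Server, second.Server)); // True
    }
}
```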
Testing
Making a change for performance or memory tuning is of no value unless there are metrics around the improvement. Naturally, then, I created some test code.
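The tables below report used memory; how you capture that number is up to you. As one possible approach, and purely an assumption on my part for this sketch, the managed heap can be sampled with GC.GetTotalMemory before and after the allocations. HeapMeasure and BytesRetainedBy are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class HeapMeasure
{
    // Returns the approximate growth of the managed heap, in bytes, caused by
    // running `allocate`. The returned object is kept alive until after the
    // second reading so the collector cannot reclaim it early.
    public static long BytesRetainedBy(Func<object> allocate)
    {
        long before = GC.GetTotalMemory(true);  // true = collect before measuring
        object keepAlive = allocate();
        long after = GC.GetTotalMemory(true);
        GC.KeepAlive(keepAlive);
        return after - before;
    }

    public static void Main()
    {
        long bytes = BytesRetainedBy(() =>
        {
            List<string> list = new List<string>();
            for (int i = 0; i < 100000; i++)
            {
                list.Add(new string('x', 13)); // 100,000 separate 13-character strings
            }
            return list;
        });
        Console.WriteLine("Retained bytes: " + bytes);
    }
}
```

GC.GetTotalMemory only sees the managed heap, so the figure will not match what Task Manager shows for the whole process.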
Buggy Test 1
private void BuggyTest1()
{
string serverName = "coversant.net";
int node = 0, resource = 0;
_jids.Clear();
for (int i = 0; i < 10000000; i++)
{
JabberID j = new JabberID(node.ToString(), serverName, resource.ToString());
_jids.Add(j);
node++;
resource++;
}
MessageBox.Show("done");
}
With this code in place, I tested the Jabber ID Class in both a before & after mode. The results I got were not quite what I expected:
Without Interning:
| JIDs Created | Platform | Used Memory |
| 10,000,000 | X86 | 1.09 Gigabytes |
| 10,000,000 | X64 | 1.84 Gigabytes |
With Interning:
| JIDs Created | Platform | Used Memory |
| 10,000,000 | X86 | 1.09 Gigabytes |
| 10,000,000 | X64 | 1.84 Gigabytes |
These results are exactly the same! Clearly my code is bug-free, so there’s obviously a bug in the way .NET interns strings (… or not!).
It turns out that every string literal the C# compiler sees is automatically interned. This means my line string serverName = "coversant.net"; is actually creating and using an interned string.
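A quick way to check this for yourself is string.IsInterned, which returns the pooled instance or null if the string isn’t in the pool. This is a sketch with made-up class and method names; I’d expect both methods to return true on the standard JIT-compiled runtime.

```csharp
using System;

public static class LiteralInternDemo
{
    // A literal is loaded from the intern pool, so IsInterned hands back
    // the very same object.
    public static bool LiteralIsInterned()
    {
        string literal = "coversant.net";
        return object.ReferenceEquals(string.IsInterned(literal), literal);
    }

    // A string assembled at run time is a separate object, even though the
    // pool already holds a string with the same characters.
    public static bool RuntimeStringIsSeparateObject()
    {
        string literal = "coversant.net";
        string built = new string(literal.ToCharArray());
        return !object.ReferenceEquals(built, literal);
    }

    public static void Main()
    {
        Console.WriteLine(LiteralIsInterned());
        Console.WriteLine(RuntimeStringIsSeparateObject());
    }
}
```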
I can work around that – I’m smarter than this darn compiler!
Buggy Test 2
Knowing that the C# compiler will automatically intern my static strings, I created a string dynamically, and used this as my server name.
private void BuggyTest2()
{
string temp = "coversant";
string serverName = temp + ".net";
int node = 0, resource = 0;
_jids.Clear();
for (int i = 0; i < 10000000; i++)
{
_jids.Add( new JabberID(node.ToString(), serverName, resource.ToString()));
node++; resource++;
}
MessageBox.Show("done");
}
I ran this, and got the results:
Without Interning:
| JIDs Created | Platform | Used Memory |
| 10,000,000 | X86 | 1.09 Gigabytes |
| 10,000,000 | X64 | 1.84 Gigabytes |
With Interning:
| JIDs Created | Platform | Used Memory |
| 10,000,000 | X86 | 1.09 Gigabytes |
| 10,000,000 | X64 | 1.84 Gigabytes |
Clearly, the CLR bug surrounding string interning is still present (… or not).
The problem with this test is that I’m only ever creating a single serverName string. This instance passes all the Unicode normalization rules, and is used “as-is”. This means I end up with 10,000,000 JIDs that all reference the same server string object. Unfortunately this isn’t a real test case. Back to the drawing board!
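To see the sharing directly, the sketch below counts distinct string objects (by reference, not by content) in two lists: one filled the way Buggy Test 2 does it, and one filled with a fresh copy per iteration. The helper names are hypothetical, and new string(char[]) is just one way I’ve picked here to force a new object.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

public static class SharedReferenceDemo
{
    // Compares strings by object identity instead of by content.
    private sealed class ByReference : IEqualityComparer<string>
    {
        public bool Equals(string x, string y) { return object.ReferenceEquals(x, y); }
        public int GetHashCode(string s) { return RuntimeHelpers.GetHashCode(s); }
    }

    public static int CountDistinctInstances(IEnumerable<string> strings)
    {
        return new HashSet<string>(strings, new ByReference()).Count;
    }

    public static void Main()
    {
        string temp = "coversant";
        string serverName = temp + ".net";

        List<string> shared = new List<string>();
        List<string> copied = new List<string>();
        for (int i = 0; i < 1000; i++)
        {
            shared.Add(serverName);                            // same reference every time
            copied.Add(new string(serverName.ToCharArray()));  // a new object every time
        }

        Console.WriteLine(CountDistinctInstances(shared)); // 1
        Console.WriteLine(CountDistinctInstances(copied)); // 1000
    }
}
```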
The Working Test
In order to actually simulate strings coming in and being parsed, I need to make sure I don’t share a single instance of the serverName string across all the JIDs. Fortunately, string.Copy(serverName) takes care of this for me, and I don’t need to manually construct all these strings.
private void WorkingTest()
{
string serverName = "coversant";
serverName += ".net";
int node = 0, resource = 0;
_jids.Clear();
for (int i = 0; i < 10000000; i++)
{
string newServer = string.Copy(serverName);
_jids.Add(new JabberID(node.ToString(), newServer, resource.ToString()));
node++;
resource++;
}
MessageBox.Show("done");
}
With this test, finally, I get some real results:
Without Interning:
| JIDs Created | Platform | Used Memory |
| 10,000,000 | X86 | 1.55 Gigabytes |
| 10,000,000 | X64 | 2.31 Gigabytes |
With Interning:
| JIDs Created | Platform | Used Memory |
| 10,000,000 | X86 | 1.09 Gigabytes |
| 10,000,000 | X64 | 1.84 Gigabytes |
This is a pretty big win, memory-wise.
In terms of raw numbers, the savings are:
| Platform | Gain |
| X86 | .46 Gigabytes (460 megabytes) |
| X64 | .47 Gigabytes (470 megabytes) |
Granted, it’s with 10,000,000 instances of these classes, but at the rate we create and destroy these things, that number isn’t nearly as far-fetched as it seems.
Interning Resources
Clearly, the next area to look at is the resource portion of the JabberID. My test is easily updated to cycle between a few resources.
private void WorkingTest()
{
string serverName = "coversant";
serverName += ".net";
int node = 0, resourceCounter = 0;
string[] resources = new string[] {
"home", "work", "laptop", "mobile", "soapbox", "psi" };
_jids.Clear();
for (int i = 0; i < 10000000; i++)
{
string newServer = string.Copy(serverName);
string resource = string.Copy(resources[resourceCounter]);
_jids.Add(new JabberID(node.ToString(), newServer, resource));
node++;
resourceCounter = (resourceCounter == resources.Length - 1) ? 0 : resourceCounter + 1;
}
MessageBox.Show("done");
}
This is a fairly realistic case, as the vast majority of resources come from a very small pool. In the not-too-distant future the SoapBox Server will be issuing server-generated resources, so this case will become the default.
These test results are even more dramatic.
No Interning of strings:
| JIDs Created | Platform | Used Memory |
| 10,000,000 | X86 | 1.55 Gigabytes |
| 10,000,000 | X64 | 2.31 Gigabytes |
With Domain & Resource Interning:
| JIDs Created | Platform | Used Memory |
| 10,000,000 | X86 | 0.78 Gigabytes |
| 10,000,000 | X64 | 1.53 Gigabytes |
These wins are very significant in terms of raw memory usage:
| Platform | Gain |
| X86 | .77 Gigabytes (770 megabytes) |
| X64 | .78 Gigabytes (780 megabytes) |
Conclusions
The overall answer becomes pretty clear. String interning can offer a big win for memory utilization when used in conjunction with commonly occurring strings. Before I make this change to our core products though, I’ve got to test a few more things. I’m greatly worried that if we go with interned strings, every call that interns a string will be synchronized by the CLR.
We simply can’t afford to have the CLR grabbing a lock on a hashtable each and every time string.Intern is called. The SoapBox Server is very heavily multithreaded, and a lock like this would cripple performance.
A quick look using Reflector doesn’t really reveal the answer. String.Intern calls:
Thread.GetDomain().GetOrInternString(str);
Which is an internal call. Further research is clearly necessary…
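One way to start that research without reading CLR internals is simply to time it. The sketch below is a rough probe, not a proper benchmark: it splits a fixed number of string.Intern calls across a varying number of parallel workers and prints the wall-clock time for each. The names are hypothetical, and Parallel.For decides how many threads actually run at once, so treat the worker count as a request rather than a guarantee.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class InternContentionProbe
{
    // Runs `callsPerWorker` string.Intern calls on each of `workers` parallel
    // loop bodies and returns the elapsed wall-clock time.
    public static TimeSpan TimeInternCalls(int workers, int callsPerWorker)
    {
        Stopwatch watch = Stopwatch.StartNew();
        Parallel.For(0, workers, worker =>
        {
            for (int i = 0; i < callsPerWorker; i++)
            {
                // A fresh, non-literal string each time, as a parser would produce.
                string parsed = new string("coversant.net".ToCharArray());
                string.Intern(parsed);
            }
        });
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        const int totalCalls = 4000000;
        foreach (int workers in new int[] { 1, 2, 4, 8 })
        {
            TimeSpan elapsed = TimeInternCalls(workers, totalCalls / workers);
            Console.WriteLine(workers + " worker(s): " + elapsed.TotalMilliseconds + " ms");
        }
    }
}
```

If the total time stays flat or climbs as workers are added, that points toward contention inside the intern table; if it drops roughly in proportion, the lock (if any) is probably not the bottleneck at this call rate.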