The Case of the Interned String
 
Blog: Mullin' with Mullins
Posted by: Chris Mullins 11/2/2006

I was flipping through Jeff Richter’s book Applied Microsoft .NET Framework Programming and came across the section on string interning. I hadn’t thought about this topic in a while, and as I read through the relevant pages, it occurred to me that we have a great case for explicit string interning in the SoapBox Server.

The SoapBox Server makes extensive use of JabberIDs. These IDs look a lot like email addresses, although they often have an XMPP resource appended to the end. They tend to look like “user@server/resource”.

Like any good XMPP development company, the first thing we did, years back, was to make a JabberID class. This class has three primary member variables:

private string _userName = string.Empty;
private string _resource = string.Empty;
private string _server = string.Empty;

We create many, many instances of this class. With 250k simultaneous users connected to our server we have at least 250k live instances of this class, and at any moment may have closer to 750k. At 15 characters or so per string, the three strings come to about 45 characters per JID. Even at one byte per character that’s 45 bytes each, or around 33 megabytes just in JID string components; .NET strings are UTF-16, so the character data alone is at least double that, before per-object overhead. With the constant creation and destruction of JIDs this also causes quite a bit of memory thrash.

Now, from a use-case perspective, User Names tend to be pretty unique. Resources also tend to be somewhat unique, although a case can be made that they’re not. In any case, we see very few unique server names, and end up with a huge number of duplicated “coversant.net” strings floating around the system at any one time.

All of these “coversant.net” strings floating around are ideal candidates for string interning. Interning is the process by which the CLR maintains a huge hashtable of strings and, rather than keeping duplicate strings around, hands back a single shared instance of each one. The string interning hashtable is process wide (not app-domain wide, which seems a bit strange), and items can never be removed from it.
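As a minimal sketch of what interning means in practice (the class and method names here are illustrative, not taken from the SoapBox code): two equal strings built at run time are separate objects, but passing both through string.Intern yields the same instance.

```csharp
using System;

public static class InternDemo
{
    // Builds "coversant.net" at run time so the result is a fresh heap
    // object rather than the compiler-emitted literal itself.
    public static string BuildServerName()
    {
        return new string("coversant.net".ToCharArray());
    }

    // Two separately built strings are equal by value but not by reference.
    public static bool SharesInstanceWithoutIntern()
    {
        string a = BuildServerName();
        string b = BuildServerName();
        return object.ReferenceEquals(a, b);
    }

    // After interning, both variables point at the single pooled instance.
    public static bool SharesInstanceAfterIntern()
    {
        string a = string.Intern(BuildServerName());
        string b = string.Intern(BuildServerName());
        return object.ReferenceEquals(a, b);
    }

    public static void Main()
    {
        Console.WriteLine(SharesInstanceWithoutIntern()); // False
        Console.WriteLine(SharesInstanceAfterIntern());   // True
    }
}
```

The memory saving comes from the second case: however many times the same text arrives, only one copy stays alive.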

To enable string interning in our JabberID class, I changed the following line:

_server = JabberID.NormalizeDomain(value);
To:
_server = string.Intern(JabberID.NormalizeDomain(value));

As a side note, the NormalizeDomain method runs the string through Nameprep (RFC 3491), which provides case folding, Unicode Normalization Form KC, bidirectional checks and all the other fun Unicode stuff.

This change means all the strings put through our JID class as domain names (“coversant.net”, “xmpp.org”, etc) will be stored in the intern table, rather than as individual copies.
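To make the shape of the change concrete, here is a cut-down sketch of such a class. JabberIdSketch and its members are hypothetical stand-ins, and the NormalizeDomain below only lower-cases its input; real Nameprep processing does far more than that.

```csharp
using System;

public sealed class JabberIdSketch
{
    private string _userName = string.Empty;
    private string _resource = string.Empty;
    private string _server = string.Empty;

    public JabberIdSketch(string userName, string server, string resource)
    {
        _userName = userName ?? string.Empty;
        Server = server;
        _resource = resource ?? string.Empty;
    }

    public string Server
    {
        get { return _server; }
        // Normalize first, then intern, so every spelling of the same
        // domain ends up referencing one shared string instance.
        set { _server = string.Intern(NormalizeDomain(value)); }
    }

    // Stand-in only (assumption): lower-casing is NOT a Nameprep implementation.
    private static string NormalizeDomain(string domain)
    {
        return (domain ?? string.Empty).ToLowerInvariant();
    }

    public override string ToString()
    {
        return _userName + "@" + _server + "/" + _resource;
    }

    public static void Main()
    {
        // Build the inputs at run time so they start out as separate objects.
        string a = new string("Coversant.NET".ToCharArray());
        string b = new string("coversant.net".ToCharArray());

        JabberIdSketch first = new JabberIdSketch("alice", a, "home");
        JabberIdSketch second = new JabberIdSketch("bob", b, "work");

        Console.WriteLine(object.ReferenceEquals(first.Server, second.Server)); // True
    }
}
```

Interning after normalization matters: interning the raw input would pool “Coversant.NET” and “coversant.net” as two different strings.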

Testing

Making a change for performance or memory tuning is of no value unless there are metrics around the improvement. Naturally, then, I created some test code.

Buggy Test 1

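private void BuggyTest1()
{
    string serverName = "coversant.net";
    int node = 0, resource = 0;

    // _jids is a collection field on the test form (its declaration is not shown).
    _jids.Clear();

    for (int i = 0; i < 10000000; i++)
    {
        // Unique node and resource per JID, same server name every time.
        JabberID j = new JabberID(node.ToString(), serverName, resource.ToString());
        _jids.Add(j);

        node++;
        resource++;
    }

    MessageBox.Show("done");
}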

With this code in place, I tested the JabberID class both before and after the change. The results I got were not quite what I expected:

Without Interning:

JIDs Created    Platform    Used Memory
10,000,000      X86         1.09 Gigabytes
10,000,000      X64         1.84 Gigabytes

With Interning:

JIDs Created    Platform    Used Memory
10,000,000      X86         1.09 Gigabytes
10,000,000      X64         1.84 Gigabytes

These results are exactly the same! Clearly my code is bug-free, so there’s obviously a bug in the way .Net interns strings (… or not!).

It turns out that all literal strings the C# compiler sees automatically get interned. This means my serverName line is actually creating and using an interned string.
I can work around that. I’m smarter than this darn compiler!
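A small sketch of that compiler behavior, with illustrative names that are not from the original test code: two identical literals in the same assembly refer to one object, while a string assembled at run time is a separate object even when its text matches a literal.

```csharp
using System;

public static class LiteralInternDemo
{
    // Both variables are initialized from the same literal, so they
    // reference the same string object.
    public static bool LiteralsShareInstance()
    {
        string first = "coversant.net";
        string second = "coversant.net";
        return object.ReferenceEquals(first, second);
    }

    // Concatenating at run time allocates a new string; it has the same
    // characters as the literal but is a different object.
    public static bool BuiltAtRunTimeIsSeparate(string prefix)
    {
        string literal = "coversant.net";
        string built = string.Concat(prefix, ".net");
        return built == literal && !object.ReferenceEquals(built, literal);
    }

    public static void Main()
    {
        Console.WriteLine(LiteralsShareInstance());              // True
        Console.WriteLine(BuiltAtRunTimeIsSeparate("coversant")); // True
    }
}
```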

Buggy Test 2

Knowing that the C# compiler will automatically intern my static strings, I created a string dynamically, and used this as my server name.

private void BuggyTest2()
{
    string temp = "coversant";
    string serverName = temp + ".net";
    int node = 0, resource = 0;

    _jids.Clear();

    for (int i = 0; i < 10000000; i++)
    {
        _jids.Add(new JabberID(node.ToString(), serverName, resource.ToString()));

        node++;
        resource++;
    }

    MessageBox.Show("done");
}




I run this, and get the results:

Without Interning:

JIDs Created    Platform    Used Memory
10,000,000      X86         1.09 Gigabytes
10,000,000      X64         1.84 Gigabytes

With Interning:

JIDs Created    Platform    Used Memory
10,000,000      X86         1.09 Gigabytes
10,000,000      X64         1.84 Gigabytes

Clearly, the CLR bug surrounding string interning is still present (… or not).

The problem with this test is that I’m only ever creating a single serverName string. This instance passes all the Unicode normalization rules and is used “as-is”. This means I end up with 10,000,000 JIDs that all reference the same string instance, whether or not it is interned. Unfortunately, this isn’t a realistic test case. Back to the drawing board!

The Working Test

In order to actually simulate strings coming in and being parsed, I need to make sure I don’t share a single instance of the serverName string across all the JIDs. Fortunately, string.Copy(serverName) takes care of this for me, and I don’t need to manually construct all these strings.

private void WorkingTest()
{
    string serverName = "coversant";
    serverName += ".net";

    int node = 0, resource = 0;
    _jids.Clear();

    for (int i = 0; i < 10000000; i++)
    {
        // A fresh copy per JID simulates a server name parsed off the wire.
        string newServer = string.Copy(serverName);
        _jids.Add(new JabberID(node.ToString(), newServer, resource.ToString()));

        node++;
        resource++;
    }

    MessageBox.Show("done");
}

With this test, finally, I get some real results:


Without Interning:

JIDs Created    Platform    Used Memory
10,000,000      X86         1.55 Gigabytes
10,000,000      X64         2.31 Gigabytes

With Interning:

JIDs Created    Platform    Used Memory
10,000,000      X86         1.09 Gigabytes
10,000,000      X64         1.84 Gigabytes

This is a pretty big win, memory wise.

In terms of raw numbers the gain is:

Platform    Gain
X86         0.46 Gigabytes (460 megabytes)
X64         0.47 Gigabytes (470 megabytes)

Granted, it’s with 10,000,000 instances of these classes, but at the rate we create and destroy these things, that number isn’t nearly as far-fetched as it seems.
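The post doesn’t say how the Used Memory figures were collected. One way to reproduce the comparison at a smaller scale, offered only as a rough sketch, is to compare GC.GetTotalMemory readings around the allocation loop. The class name, method name and counts below are assumptions for illustration, and the numbers it prints will vary by runtime and platform.

```csharp
using System;
using System.Collections.Generic;

public static class InternMemoryDemo
{
    // Allocates `count` server-name strings, optionally interning each one,
    // and returns the approximate growth of the managed heap in bytes.
    public static long MeasureBytes(int count, bool intern)
    {
        char[] chars = "coversant.net".ToCharArray();

        // Pre-size the list before the first reading so its backing array
        // is not counted as part of the string cost.
        List<string> names = new List<string>(count);

        long before = GC.GetTotalMemory(true);
        for (int i = 0; i < count; i++)
        {
            // A fresh object each time, standing in for a newly parsed string.
            string fresh = new string(chars);
            names.Add(intern ? string.Intern(fresh) : fresh);
        }
        long after = GC.GetTotalMemory(true);

        // Keep the list reachable until after the second reading.
        GC.KeepAlive(names);
        return after - before;
    }

    public static void Main()
    {
        const int count = 1000000;
        Console.WriteLine("Without interning: {0:N0} bytes", MeasureBytes(count, false));
        Console.WriteLine("With interning:    {0:N0} bytes", MeasureBytes(count, true));
    }
}
```

This uses new string(chars) rather than string.Copy to produce the fresh instances, since string.Copy is marked obsolete in newer .NET versions; the effect for this purpose is the same.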

Interning Resources

Clearly, the next area to look at is the resource portion of the JabberID. My test is easily updated to cycle between a few resources.

private void WorkingTest()
{
    string serverName = "coversant";
    serverName += ".net";

    int node = 0, resourceCounter = 0;
    string[] resources = new string[] {
        "home", "work", "laptop", "mobile", "soapbox", "psi" };

    _jids.Clear();

    for (int i = 0; i < 10000000; i++)
    {
        // Fresh copies simulate strings parsed off the wire.
        string newServer = string.Copy(serverName);
        string resource = string.Copy(resources[resourceCounter]);
        _jids.Add(new JabberID(node.ToString(), newServer, resource));

        node++;
        // Cycle through the small pool of resource names.
        resourceCounter = (resourceCounter == resources.Length - 1) ? 0 : resourceCounter + 1;
    }

    MessageBox.Show("done");
}

This is a fairly realistic case, as the vast majority of resources come from a very small pool. In the not-too-distant future the SoapBox Server will be issuing server-generated resources, so this case will become the default.

These test results are even more dramatic.

Without Interning:

JIDs Created    Platform    Used Memory
10,000,000      X86         1.55 Gigabytes
10,000,000      X64         2.31 Gigabytes

With Domain & Resource Interning:

JIDs Created    Platform    Used Memory
10,000,000      X86         0.78 Gigabytes
10,000,000      X64         1.53 Gigabytes

These wins are very significant in terms of raw memory usage:

Platform    Gain
X86         0.77 Gigabytes (770 megabytes)
X64         0.78 Gigabytes (780 megabytes)

Conclusions

The overall answer becomes pretty clear. String interning can offer a big win for memory utilization when used in conjunction with commonly occurring strings. Before I make this change to our core products though, I’ve got to test a few more things. I’m greatly worried that if we go with interned strings, all access to the intern table will be synchronized by the CLR.

We simply can’t afford to have the CLR grabbing locks on a hashtable each and every time the string.Intern method is called. The SoapBox Server is very heavily multithreaded, and a lock like this would cripple performance.

A quick look using Reflector doesn’t really reveal the answer. String.Intern calls:

Thread.GetDomain().GetOrInternString(str);

Which is an internal call. Further research is clearly necessary…

Comments (2)  
Re: The Case of the Interned String    By Peter Ritchie on 12/29/2006
The shared source CLI 2.0 shows that the intern hash table is synchronized with a critical section.

Re: The Case of the Interned String    By cmullins on 12/29/2006
After looking into this more, the synchronization around the string intern table keeps me from really wanting to use it.

Even more so, the fact that strings are never cleaned out of the intern table worries me. Given that our server should run for a long time between reboots, and the number of usernames that we see during that time may be astronomical, I'm hesitant to use the built-in string interning.

I've just about decided to roll my own version of string interning which would be reference counted and leverage weak references. Because it's only our JID class that would be using this, it's a practical thing to do in this case.

