A .Net Unicode Puzzle (Revised)
 
Location: Blogs > Mullin' with Mullins
Posted by: Chris Mullins 2/24/2007

Someone posted an interesting problem the other day: turn 'áäåãòä' into 'AAAAOA'. It doesn't seem too complicated, but nothing that I could think of really solved the problem in an elegant way.

Now, I know Unicode pretty well - I've implemented Stringprep as part of the SoapBox Platform's XMPP Implementation. Stringprep does all sorts of interesting things such as bidirectional checks, IDN, Punycode, Case Folding, and Form KC Normalization. Still though, nothing came to mind as a solution other than using a crazy table-based lookup scheme.

Then I saw a post by JR that suggested using Unicode Normalization Form KD (NFKD) to decompose the string and the light bulbs went off! (He gets all the credit for this idea - I just implemented it in .Net and wrote it up here).

Normalization Form KD (commonly written as NFKD) will decompose composite characters into their component forms. For example, the character 'ẛ' is actually Unicode codepoint 1E9B (written as: U+1E9B) and is named "LATIN SMALL LETTER LONG S WITH DOT ABOVE". When we apply NFKD, this codepoint decomposes into two codepoints: U+0073 and U+0307. You may recognize U+0073 - it's actually the same as the ASCII code 0x73, which is 's'. The other one, U+0307, is "COMBINING DOT ABOVE".
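A minimal sketch of that decomposition, for anyone who wants to see it happen. The class and method names (`NfkdDemo`, `ToCodeUnits`) are made up for this illustration; only `String.Normalize` and `NormalizationForm.FormKD` come from the framework.

```csharp
using System;
using System.Text;

static class NfkdDemo
{
    // Formats each UTF-16 code unit of a string as U+XXXX, separated by spaces.
    // For characters in the Basic Multilingual Plane the code unit equals the codepoint.
    public static string ToCodeUnits(string text)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0)
                sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    static void Main()
    {
        string longSWithDot = "\u1E9B"; // LATIN SMALL LETTER LONG S WITH DOT ABOVE
        string decomposed = longSWithDot.Normalize(NormalizationForm.FormKD);

        Console.WriteLine(ToCodeUnits(longSWithDot)); // U+1E9B
        Console.WriteLine(ToCodeUnits(decomposed));   // U+0073 U+0307
    }
}
```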

Now that we know Normalization Form KD will decompose our string into something more basic, we just need to figure out how to do this in .Net. In .Net 1.0 & 1.1, Microsoft didn't include any deep Unicode support. Fortunately much of this has been corrected in .Net 2.0, and now normalizing a string is a breeze:

string s = "áäåãòä";
string normalized = s.Normalize(NormalizationForm.FormKD);

Now we have a Form KD Normalized string. If we display the string, say, in a MessageBox, it looks the same, but looks are deceiving. Under the hood, it's fully decomposed. A fuller discussion of this would get into topics such as Combining Characters and Text Elements. More information for .Net developers can be found under String Indexing.
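One way to convince yourself that the two strings really differ is to compare their lengths. This is only a sketch: the class name is invented, and the string is written with `\u` escapes so the source file's encoding can't change which form we start from.

```csharp
using System;
using System.Globalization;
using System.Text;

static class DecomposedLengthDemo
{
    static void Main()
    {
        // The precomposed form of the puzzle string: á ä å ã ò ä
        string s = "\u00E1\u00E4\u00E5\u00E3\u00F2\u00E4";
        string normalized = s.Normalize(NormalizationForm.FormKD);

        // Six UTF-16 code units before, twelve after: each letter is now
        // a base letter followed by one combining mark.
        Console.WriteLine(s.Length);          // 6
        Console.WriteLine(normalized.Length); // 12

        // A text element groups a base character with its combining marks,
        // so the user-visible count is still six.
        Console.WriteLine(new StringInfo(normalized).LengthInTextElements); // 6

        // An ordinal comparison sees two different strings.
        Console.WriteLine(string.Equals(s, normalized, StringComparison.Ordinal)); // False
    }
}
```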

.Net programmers know that all .Net strings are UTF-16 encoded. This means our nice decomposed string is sitting in a format that (at this point) isn't really what we want - each base letter is followed by its combining characters, all encoded in UTF-16.

At this point, my original blog entry on this topic used the ASCII encoder to rip out the unwanted marks in what was a pretty clever way. Clever as it was, Michael Kaplan pointed out what a fool I was being for taking this approach. He was polite about it, and while I hate being called a fool, he was sure right! Pivoting through the ASCII encoder was just silly. Ah well. Mistakes & public humiliation are how we learn...

The correct way to clean things up from here is shown by Michael Kaplan's blog entry entitled Stripping diacritics.

As Michael points out, the proper way to remove the marks is to iterate over the characters in the string, and use the CharUnicodeInfo class to determine if they're Non-Spacing Marks. If they are, they're skipped - if they are not, they're appended to our new string. The resulting string has the right results in it - unlike my original solution, which only worked for ASCII characters that were marked.

StringBuilder sb = new StringBuilder();
for (int i = 0; i < normalized.Length; i++)
{
    char c = normalized[i];

    UnicodeCategory uc =
        CharUnicodeInfo.GetUnicodeCategory(c);

    if (uc != UnicodeCategory.NonSpacingMark)
        sb.Append(c);
}
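To make the difference concrete, here is a rough comparison of the two approaches. The original ASCII code isn't reproduced in this entry, so `StripViaAscii` is only my guess at what an ASCII pivot looks like (an encoder whose replacement fallback is the empty string); both method names are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

static class AsciiPivotDemo
{
    // Hypothetical ASCII pivot: every character outside ASCII is replaced with nothing.
    public static string StripViaAscii(string text)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(string.Empty),
            new DecoderReplacementFallback(string.Empty));
        string normalized = text.Normalize(NormalizationForm.FormKD);
        return ascii.GetString(ascii.GetBytes(normalized));
    }

    // Category filter: only the non-spacing marks are dropped.
    public static string StripViaCategory(string text)
    {
        string normalized = text.Normalize(NormalizationForm.FormKD);
        StringBuilder sb = new StringBuilder();
        foreach (char c in normalized)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString();
    }

    static void Main()
    {
        string latin = "\u00E1\u00E4"; // a with acute, a with diaeresis
        string greek = "\u03AC";       // Greek alpha with tonos

        Console.WriteLine(StripViaAscii(latin));           // aa
        Console.WriteLine(StripViaCategory(latin));        // aa
        Console.WriteLine(StripViaAscii(greek).Length);    // 0 (the alpha itself is lost)
        Console.WriteLine(StripViaCategory(greek).Length); // 1 (the bare alpha survives)
    }
}
```

Both give the same answer for accented Latin letters, which is why the ASCII trick looked fine on the puzzle string. The Greek letter shows where it breaks down: the base letter isn't ASCII either, so it disappears along with its mark.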

The complete source code looks like:

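using System.Globalization;   // CharUnicodeInfo, UnicodeCategory
using System.Text;            // NormalizationForm, StringBuilder
using System.Windows.Forms;   // MessageBox

string s = "áäåãòä";

// Decompose each accented letter into a base letter plus combining marks.
string normalized = s.Normalize(NormalizationForm.FormKD);

StringBuilder sb = new StringBuilder();
for (int i = 0; i < normalized.Length; i++)
{
    char c = normalized[i];

    UnicodeCategory uc =
        CharUnicodeInfo.GetUnicodeCategory(c);

    // Keep everything except the non-spacing (combining) marks.
    if (uc != UnicodeCategory.NonSpacingMark)
        sb.Append(c);
}

// Displays "aaaaoa".
MessageBox.Show(sb.ToString());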
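The original puzzle asked for upper case ('AAAAOA'), and the code above stops at the lower-case letters. A small sketch of how the whole thing might be packaged as a reusable helper follows. The names `DiacriticPuzzle`, `RemoveDiacritics` and `SolvePuzzle` are mine, and using `ToUpperInvariant` for the casing step is an assumption on my part, not something taken from the posts mentioned above.

```csharp
using System;
using System.Globalization;
using System.Text;

static class DiacriticPuzzle
{
    // Decomposes with NFKD, then drops every non-spacing mark.
    public static string RemoveDiacritics(string text)
    {
        string normalized = text.Normalize(NormalizationForm.FormKD);
        StringBuilder sb = new StringBuilder(normalized.Length);
        foreach (char c in normalized)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Assumed final step: culture-independent upper-casing of the stripped string.
    public static string SolvePuzzle(string text)
    {
        return RemoveDiacritics(text).ToUpperInvariant();
    }

    static void Main()
    {
        string s = "\u00E1\u00E4\u00E5\u00E3\u00F2\u00E4"; // the puzzle string
        Console.WriteLine(RemoveDiacritics(s)); // aaaaoa
        Console.WriteLine(SolvePuzzle(s));      // AAAAOA
    }
}
```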

Whenever I drop into doing Unicode-related tasks, I'm always amazed at the sheer breadth of the Unicode standard. There is so much information in there, and so many powerful features, that it's easy to quickly become overwhelmed.

It's also easy to forget that everything we do these days on a computer is leveraging Unicode. Pretty much everything is encoded in either UTF-8 or UTF-16 - web pages, XML documents, text files stored on your hard disk. Unicode is at the heart of Windows, Linux, .Net & Java. Despite this, very few developers have any real understanding of what Unicode is, or how it works. I've been asking "What does that UTF-8 or UTF-16 you've typed in a zillion times actually mean?" during interviews for years now, and have yet to get back the right answer (although I've sure had some creative responses!).
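For what it's worth, the short version of the answer is that UTF-8 and UTF-16 are two different ways of turning the same Unicode codepoints into bytes. A quick sketch (the class and method names are invented for the example) that counts the bytes each encoding needs:

```csharp
using System;
using System.Text;

static class EncodingSizeDemo
{
    // Prints how many bytes a string needs in UTF-8 and in UTF-16.
    static void Report(string label, string text)
    {
        Console.WriteLine(
            "{0}: UTF-8 = {1} bytes, UTF-16 = {2} bytes",
            label,
            Encoding.UTF8.GetByteCount(text),
            Encoding.Unicode.GetByteCount(text)); // Encoding.Unicode is UTF-16 little endian
    }

    static void Main()
    {
        Report("U+0061", "a");           // UTF-8 = 1, UTF-16 = 2
        Report("U+00E1", "\u00E1");      // UTF-8 = 2, UTF-16 = 2
        Report("U+1E9B", "\u1E9B");      // UTF-8 = 3, UTF-16 = 2
        Report("U+1F600", "\U0001F600"); // UTF-8 = 4, UTF-16 = 4 (a surrogate pair)
    }
}
```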

I guess the old adage 'The best way to get the right answer on Usenet is to post the wrong one.' is still alive & true and applies equally well to the blog world...

Comments (2)  
Re: A .Net Unicode Puzzle    By michka on 3/4/2007
Maybe there is a better answer that doesn't pivot through ASCII? :-)

(ref: http://blogs.msdn.com/michkap/archive/2007/03/04/1802729.aspx)

Re: A .Net Unicode Puzzle (Revised)    By cmullins on 3/5/2007
Thanks Michael for pointing out a much better way to solve this problem. While the underlying "ah-ha" of using NFKD was right, my use of the ASCII encoder was just silly. I've revised this blog entry to use the method you highlight in your blog.

