The Encoding saga for English and non-English languages

There are several approaches to understanding how encoding works. The approach we will take today is to look at data conversion and storage in terms of bytes. Since most of our communication happens in text (strings), we will look at string -> byte[] conversion. The resulting byte array can later be transferred over a network or stored in a data store.

Converting a string to a byte (binary) array has always been challenging. There is no easy solution to this problem if you are working on an application that uses a locale other than English. It becomes even more complex when you save the byte array to a database that uses a different character set.

So this article deals with handling these encodings.

Do you need encoding for conversion of English text?

Let’s answer this question with the help of an example.

private static void Main(string[] args)
{
    // Verbatim string (@): \t stays as two literal characters, a backslash and a t.
    string sample = @"this is a \t string in unicode format";

    byte[] bytes = GetBytes(sample);
    string convertedBack = GetString(bytes);
    Console.WriteLine(bytes.Length + " >> " + convertedBack);
}

// Copies the string's in-memory chars (2 bytes each) straight into a byte array.
static byte[] GetBytes(string str)
{
    var bytes = new byte[str.Length * sizeof(char)];
    Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

// Reverses GetBytes: reinterprets every 2 bytes as one char.
static string GetString(byte[] bytes)
{
    var chars = new char[bytes.Length / sizeof(char)];
    Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}

When you execute this code, it produces the following output:

74 >> this is a \t string in unicode format

So this code converts a 37-character string into a 74-byte array, because each .NET char occupies two bytes. For small hobby applications, this method works fine, but it is not the optimal way of doing it. This is where the right encoding comes into the picture.
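The raw copy above is, in effect, a UTF-16 encoding. As a minimal sketch (the class and method names here are illustrative, not from the original code), the following checks that the BlockCopy result matches Encoding.Unicode, assuming a little-endian machine:

```csharp
using System;
using System.Linq;
using System.Text;

public static class RawCopyVersusUnicode
{
    // Same raw-memory copy as GetBytes above: two bytes per UTF-16 code unit.
    public static byte[] RawCopy(string str)
    {
        var bytes = new byte[str.Length * sizeof(char)];
        Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    // True when the raw copy equals what Encoding.Unicode (UTF-16 little-endian) produces.
    public static bool MatchesUnicodeEncoding(string str)
    {
        return RawCopy(str).SequenceEqual(Encoding.Unicode.GetBytes(str));
    }

    public static void Main()
    {
        string sample = @"this is a \t string in unicode format";
        Console.WriteLine(RawCopy(sample).Length);          // 74
        Console.WriteLine(MatchesUnicodeEncoding(sample));  // expected to be True on little-endian hardware
    }
}
```

If the two agree, the hand-written copy adds nothing over the built-in encoding, and it also ties the byte layout to the machine's byte order.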

Now let’s write another pair of methods that use an Encoding object to convert the string.

public static byte[] GetBytesWithEncoding(Encoding encoding, string str)
{
    return encoding.GetBytes(str);
}

public static string GetStringWithEncoding(Encoding encoding, byte[] bytes)
{
    return encoding.GetString(bytes);
}

When we pass different encoding objects to these methods, we get the following results:

Unicode (UTF-7) >> 41 >> this is a \t string in unicode format
Unicode (UTF-8) >> 37 >> this is a \t string in unicode format
US-ASCII >> 37 >> this is a \t string in unicode format
Unicode >> 74 >> this is a \t string in unicode format
Unicode (UTF-32) >> 148 >> this is a \t string in unicode format

So the size of your byte array varies with the encoding you select. UTF-8 and ASCII need only 37 bytes here because every character in the sample falls in the ASCII range. When building enterprise applications, it is important to keep your memory footprint small, and such optimizations definitely help.
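A table like the one above can be produced with a small driver loop. This is only a sketch: the class name and the ByteCount helper are illustrative, and UTF-7 is left out because Encoding.UTF7 is marked obsolete in newer .NET versions.

```csharp
using System;
using System.Text;

public static class EncodingSizeComparison
{
    // Number of bytes the given encoding needs for the string.
    public static int ByteCount(Encoding encoding, string str)
    {
        return encoding.GetBytes(str).Length;
    }

    public static void Main()
    {
        string sample = @"this is a \t string in unicode format";
        Encoding[] encodings = { Encoding.UTF8, Encoding.ASCII, Encoding.Unicode, Encoding.UTF32 };

        foreach (Encoding encoding in encodings)
        {
            // Encode, decode, and print name >> byte count >> round-tripped text.
            byte[] bytes = encoding.GetBytes(sample);
            string convertedBack = encoding.GetString(bytes);
            Console.WriteLine(encoding.EncodingName + " >> " + bytes.Length + " >> " + convertedBack);
        }
    }
}
```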

Encoding for non-English text

If you are using a non-English locale on your application machine or server, the default encodings will not be helpful. You may face several issues converting surrogate characters or language-specific characters when applying a default encoding.
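One way to see the problem is to push non-English text through an encoding that cannot represent it. The sketch below is illustrative only (the class and method names are made up for this example): it shows that Encoding.ASCII silently replaces Hindi characters, and that requesting an encoding with exception fallbacks turns the silent loss into an error you can catch.

```csharp
using System;
using System.Text;

public static class LossyEncodingCheck
{
    // Encodes and decodes with the given encoding; true if nothing was lost.
    public static bool SurvivesRoundTrip(Encoding encoding, string text)
    {
        return encoding.GetString(encoding.GetBytes(text)) == text;
    }

    // True if an ASCII encoder configured with exception fallbacks rejects the text.
    public static bool StrictAsciiRejects(string text)
    {
        Encoding strictAscii = Encoding.GetEncoding(
            "us-ascii", new EncoderExceptionFallback(), new DecoderExceptionFallback());
        try
        {
            strictAscii.GetBytes(text);
            return false;
        }
        catch (EncoderFallbackException)
        {
            return true;
        }
    }

    public static void Main()
    {
        string hindi = "स्वागत";
        Console.WriteLine(SurvivesRoundTrip(Encoding.ASCII, hindi)); // characters are replaced with '?'
        Console.WriteLine(SurvivesRoundTrip(Encoding.UTF8, hindi));
        Console.WriteLine(StrictAsciiRejects(hindi));
    }
}
```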

The best solution requires you to find out the encoding that your data store supports. If your data store is:

  • Database – find out the character set it supports.
  • In-memory – you do not need to worry.
  • Flat file – find out your system locale.
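For the flat-file case, a hedged sketch: assuming you control both the writer and the reader, you can pass the encoding explicitly so that the file's contents do not depend on whichever default the machine picks. The class name, method name, and file name below are hypothetical, and UTF-8 is chosen only for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class FlatFileEncoding
{
    // Writes the text with an explicit encoding and reads it back with the same one.
    public static string WriteAndReadBack(string path, string text, Encoding encoding)
    {
        File.WriteAllText(path, text, encoding);
        return File.ReadAllText(path, encoding);
    }

    public static void Main()
    {
        // Hypothetical file name in the temp directory.
        string path = Path.Combine(Path.GetTempPath(), "encoding-sample.txt");
        string text = "エンコード";

        string readBack = WriteAndReadBack(path, text, Encoding.UTF8);
        Console.WriteLine(readBack == text);
        File.Delete(path);
    }
}
```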

Once you have found the character set or locale, you need to map it to an encoding. .NET supports around 140 encodings, and you can get the list with the small piece of code below or from MSDN.

var encodings = Encoding.GetEncodings();
foreach (var encodingInfo in encodings)
{
    Debug.WriteLine(encodingInfo.DisplayName + " > " + encodingInfo.Name
        + "(" + encodingInfo.CodePage + ")");
}

The next step is to create an encoding object with the right code page. Below is the code that uses Devanagari (Hindi) and Japanese encodings. I used Google Translate to convert the text ‘Welcome to Encoding saga’, so please pardon me if the translations are not correct.

// Encoding - Hindi
encoding = Encoding.GetEncoding(57002);
bytes = GetBytesWithEncoding(encoding, @"एनकोडिंग सागा में आपका स्वागत है");
convertedBack = GetStringWithEncoding(encoding, bytes);
Console.WriteLine(encoding.EncodingName + " >> " + bytes.Length + " >> " + convertedBack);

// Encoding - Japanese
encoding = Encoding.GetEncoding(932);
bytes = GetBytesWithEncoding(encoding, @"エンコード佐賀へようこそ");
convertedBack = GetStringWithEncoding(encoding, bytes);
Console.WriteLine(encoding.EncodingName + " >> " + bytes.Length + " >> " + convertedBack);
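A note for newer runtimes, offered as a sketch rather than as part of the original code: on .NET Core and .NET 5+, code pages such as 932 are not available until the code-pages provider is registered, and Encoding.GetEncoding typically throws without it. The class and method names below are illustrative; on .NET Framework the registration step is normally not needed.

```csharp
using System;
using System.Text;

public static class CodePageLookup
{
    // Registers the code-pages provider, then looks up the requested code page.
    public static Encoding GetCodePageEncoding(int codePage)
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage);
    }

    public static void Main()
    {
        Encoding shiftJis = GetCodePageEncoding(932);
        string text = "エンコード佐賀へようこそ";

        byte[] bytes = shiftJis.GetBytes(text);
        string convertedBack = shiftJis.GetString(bytes);
        Console.WriteLine(shiftJis.EncodingName + " >> " + bytes.Length + " >> " + (convertedBack == text));
    }
}
```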

If you run this on a system with an English locale, the convertedBack values shown in the console will typically appear as ‘???’, because the console's code page cannot display these characters.
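To tell a display problem apart from a real conversion loss, one option is to compare the strings in code and, optionally, switch the console to UTF-8. This is a minimal sketch with illustrative names; it uses UTF-8 rather than the code pages above so that it runs without extra setup, and whether the glyphs actually render still depends on the console font.

```csharp
using System;
using System.Text;

public static class ConsoleDisplayCheck
{
    // Compares the strings in code instead of trusting what the console shows.
    public static bool RoundTripSucceeded(string original, string convertedBack)
    {
        return string.Equals(original, convertedBack, StringComparison.Ordinal);
    }

    public static void Main()
    {
        // Ask the console to emit UTF-8 instead of its default code page.
        Console.OutputEncoding = Encoding.UTF8;

        string original = "एनकोडिंग सागा में आपका स्वागत है";
        byte[] bytes = Encoding.UTF8.GetBytes(original);
        string convertedBack = Encoding.UTF8.GetString(bytes);

        Console.WriteLine(convertedBack);
        Console.WriteLine(RoundTripSucceeded(original, convertedBack));
    }
}
```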

I hope this helps you understand the importance of choosing the right encoding for data conversion and storage.

Punit Ganshani

Punit Ganshani, based out of Singapore, is a Microsoft C# MVP and specializes in the Microsoft technology stack and performance engineering. He is an open-source contributor at Codeplex, CodeProject, and DZone MVB, has several apps on the Windows Store, and is the author of a book, a gadget fan, and an evangelist.
