Sunday, November 4, 2007

Marshal UTF8 Strings in .NET

Wow, what a pain in the butt. .NET strings are stored internally as UTF16, not UTF8, so if you're marshaling strings to and from a library that wants strings as UTF8, you have to manually marshal them yourself.
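To make the encoding mismatch concrete, here is a minimal sketch that compares the byte counts of the same string in both encodings. The class and method names (EncodingDemo, Utf16ByteCount, Utf8ByteCount) are made up for illustration; only Encoding.Unicode and Encoding.UTF8 come from the framework.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the marshaler itself.
public static class EncodingDemo
{
    // Encoding.Unicode is UTF-16 (little-endian), the in-memory layout of .NET strings.
    public static int Utf16ByteCount(string s)
    {
        return Encoding.Unicode.GetBytes(s).Length;
    }

    // The layout a UTF-8 native library expects (without the trailing NUL).
    public static int Utf8ByteCount(string s)
    {
        return Encoding.UTF8.GetBytes(s).Length;
    }

    public static void Main()
    {
        string s = "h\u00E9llo"; // "héllo": five characters, one of them non-ASCII
        Console.WriteLine(Utf16ByteCount(s)); // 10 (two bytes per character here)
        Console.WriteLine(Utf8ByteCount(s));  // 6 (the accented letter takes two bytes)
    }
}
```

Since the two byte sequences differ, handing the string's internal buffer straight to a UTF-8 library can't work.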

It took me a whole day to figure out why my .NET wrapper library wasn't working, and another whole day to figure out how to work around it and debug the code. If this code saves even one person the time I lost, then I'm satisfied.


using System;
using System.Runtime.InteropServices;
using System.Text;

// Custom marshaler that converts between .NET strings (UTF-16) and
// NUL-terminated UTF-8 buffers in unmanaged memory.
public class MarshalPtrToUtf8 : ICustomMarshaler
{
    static MarshalPtrToUtf8 marshaler = new MarshalPtrToUtf8();

    public void CleanUpManagedData(object ManagedObj)
    {
        // Nothing to do: the managed string is garbage collected.
    }

    public void CleanUpNativeData(IntPtr pNativeData)
    {
        // This assumes the buffer was allocated by MarshalManagedToNative
        // (Marshal.AllocHGlobal), so it is freed with FreeHGlobal.
        // Marshal.Release is only for COM interface pointers.
        Marshal.FreeHGlobal(pNativeData);
    }

    public int GetNativeDataSize()
    {
        return Marshal.SizeOf(typeof(byte));
    }

    // Length in bytes of the NUL-terminated string at ptr,
    // not counting the terminator.
    public int GetNativeDataSize(IntPtr ptr)
    {
        int size = 0;
        while (Marshal.ReadByte(ptr, size) != 0)
            size++;
        return size;
    }

    public IntPtr MarshalManagedToNative(object ManagedObj)
    {
        if (ManagedObj == null)
            return IntPtr.Zero;
        if (ManagedObj.GetType() != typeof(string))
            throw new ArgumentException("Can only marshal type of System.String", "ManagedObj");
        byte[] array = Encoding.UTF8.GetBytes((string)ManagedObj);
        // One extra byte for the NUL terminator. Not indexing array[0]
        // keeps this working for the empty string.
        int size = array.Length + 1;
        IntPtr ptr = Marshal.AllocHGlobal(size);
        Marshal.Copy(array, 0, ptr, array.Length);
        Marshal.WriteByte(ptr, size - 1, 0);
        return ptr;
    }

    public object MarshalNativeToManaged(IntPtr pNativeData)
    {
        if (pNativeData == IntPtr.Zero)
            return null;
        // size already excludes the terminator, so copy all of it.
        int size = GetNativeDataSize(pNativeData);
        byte[] array = new byte[size];
        Marshal.Copy(pNativeData, array, 0, size);
        return Encoding.UTF8.GetString(array);
    }

    public static ICustomMarshaler GetInstance(string cookie)
    {
        return marshaler;
    }
}


You'll notice that there's a lot of data copying going on and that a few copies of the string get made. Yep, that's because the .NET framework can't just pin the array that stores the string in memory (remember, strings are stored as UTF16 in the .NET framework), so you have to make the conversion yourself.
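The copies are easier to see when the conversion is pulled out of the marshaler and run on its own. The following is a rough standalone sketch of the same round trip, with no native library involved; the names Utf8RoundTrip, ToNativeUtf8 and FromNativeUtf8 are invented for this example.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper class that mirrors what the marshaler does.
public static class Utf8RoundTrip
{
    public static IntPtr ToNativeUtf8(string s)
    {
        // Copy 1: UTF-16 string -> managed UTF-8 byte array.
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        // Copy 2: managed array -> unmanaged buffer, plus a NUL terminator.
        IntPtr ptr = Marshal.AllocHGlobal(bytes.Length + 1);
        Marshal.Copy(bytes, 0, ptr, bytes.Length);
        Marshal.WriteByte(ptr, bytes.Length, 0);
        return ptr;
    }

    public static string FromNativeUtf8(IntPtr ptr)
    {
        // Scan for the NUL terminator to find the length in bytes.
        int length = 0;
        while (Marshal.ReadByte(ptr, length) != 0)
            length++;
        // Copy 3: unmanaged buffer -> managed byte array.
        byte[] bytes = new byte[length];
        Marshal.Copy(ptr, bytes, 0, length);
        // Copy 4: UTF-8 bytes -> new UTF-16 string.
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        IntPtr ptr = ToNativeUtf8("h\u00E9llo");
        try
        {
            Console.WriteLine(FromNativeUtf8(ptr) == "h\u00E9llo");
        }
        finally
        {
            // The caller owns the buffer and must free it.
            Marshal.FreeHGlobal(ptr);
        }
    }
}
```

That's four copies for one round trip, which is the price of the two encodings not matching.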

4 comments:

Philip said...

This was quite helpful to me. Thanks!

There's a small bug in your MarshalNativeToManaged function. I don't think the size should have 1 subtracted from it when creating the array.

Also, MarshalManagedToNative does not account for the empty string correctly.

Thanks again for this work!

Juho Vähä-Herttua said...

Thanks for the tip. I agree with Philip that the size shouldn't be subtracted when creating the array. Also it's silly to use array[0] when typeof(byte) would do just fine. I am using a modified version that can be found from:

http://code.google.com/p/tapcfg/source/browse/trunk/src/bindings/UTF8Marshaler.cs?spec=svn69&r=68

Feel free to use it how you wish. It has also a dirty hack included that made the marshaller crash when running on mono 1.2.6 on linux.

Rob said...

Yep! A PAIN in the BUTT! But, thanks for your solution. "Marshal.Copy" was what I had missed.

Giancarlo Villanueva said...

After translation required switching our character encoding from ASCII to UTF-8, we were rescued by this custom marshaler. We were able to decorate our imported method with the following attribute:

[return: MarshalAs(UnmanagedType.CustomMarshaler, MarshalType = "namespace.MarshalPtrToUtf8")]

With that, our code "just worked"® again.

Thanks!