Sunday, November 4, 2007

Marshal UTF8 Strings in .NET

Wow, what a pain in the butt. .NET strings are stored internally as UTF-16, not UTF-8, so if you're marshaling strings to and from a library that wants them as UTF-8, you have to marshal them yourself.
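To see why the two encodings can't be swapped for each other, here is a small sketch. It is in Python rather than C#, purely as a scratchpad for looking at bytes, and the helper names encoded_sizes and as_seen_by_c are made up for this illustration. It assumes the native library reads a zero-terminated char* buffer.

```python
def encoded_sizes(text):
    """Return (utf16_byte_count, utf8_byte_count) for text, without terminators."""
    return len(text.encode("utf-16-le")), len(text.encode("utf-8"))


def as_seen_by_c(buffer):
    """Return what a C function reading char* would see:
    everything before the first zero byte."""
    return buffer.split(b"\x00", 1)[0]


if __name__ == "__main__":
    print(encoded_sizes("abc"))
    print(encoded_sizes("héllo"))
    # UTF-16 puts a zero byte after every ASCII character, so a C string
    # reader stops after the first one.
    print(as_seen_by_c("abc".encode("utf-16-le")))
    print(as_seen_by_c("abc".encode("utf-8") + b"\x00"))
```

Handing the UTF-16 bytes straight to a char* API would therefore look like a one-character string on the native side, which is why the conversion step is needed.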

It took me a whole day to figure out why my .NET wrapper library wasn't working, and a whole other day to figure out how to work around it and debug the code. If this code saves at least one person the time I lost, then I'm satisfied.


using System;
using System.Runtime.InteropServices;
using System.Text;

public class MarshalPtrToUtf8 : ICustomMarshaler
{
    static MarshalPtrToUtf8 marshaler = new MarshalPtrToUtf8();

    public void CleanUpManagedData(object ManagedObj)
    {
        // Nothing to do: the managed string is garbage collected.
    }

    public void CleanUpNativeData(IntPtr pNativeData)
    {
        // Pairs with the AllocHGlobal call in MarshalManagedToNative.
        // (Marshal.Release is for COM interface pointers, not raw buffers.)
        Marshal.FreeHGlobal(pNativeData);
    }

    public int GetNativeDataSize()
    {
        // The data is variable length, so there is no fixed size to report.
        return -1;
    }

    // Length in bytes of the zero-terminated string at ptr, excluding the terminator.
    public int GetNativeDataSize(IntPtr ptr)
    {
        int size = 0;
        while (Marshal.ReadByte(ptr, size) != 0)
            size++;
        return size;
    }

    public IntPtr MarshalManagedToNative(object ManagedObj)
    {
        if (ManagedObj == null)
            return IntPtr.Zero;
        if (!(ManagedObj is string))
            throw new ArgumentException("Can only marshal type of System.String", "ManagedObj");
        byte[] array = Encoding.UTF8.GetBytes((string)ManagedObj);
        // One extra byte for the zero terminator; this also works for an empty string.
        int size = array.Length + 1;
        IntPtr ptr = Marshal.AllocHGlobal(size);
        Marshal.Copy(array, 0, ptr, array.Length);
        Marshal.WriteByte(ptr, size - 1, 0);
        return ptr;
    }

    public object MarshalNativeToManaged(IntPtr pNativeData)
    {
        if (pNativeData == IntPtr.Zero)
            return null;
        // size already excludes the terminator, so copy exactly size bytes.
        int size = GetNativeDataSize(pNativeData);
        byte[] array = new byte[size];
        Marshal.Copy(pNativeData, array, 0, size);
        return Encoding.UTF8.GetString(array);
    }

    public static ICustomMarshaler GetInstance(string cookie)
    {
        return marshaler;
    }
}


You'll notice that there's a lot of data copying going on and that a few copies of the string get made. Yep, that's because the .NET Framework can't just pin the array that stores the string in memory (remember, strings are stored as UTF-16 in the .NET Framework), so you have to make the conversion yourself.
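The copies are easier to count in a stripped-down round trip. This is a rough sketch in Python with ctypes, not the .NET marshaler itself. The names to_native_utf8 and from_native_utf8 are invented for the example, and they only loosely mirror MarshalManagedToNative and MarshalNativeToManaged.

```python
import ctypes


def to_native_utf8(text):
    """Copy text into a zero-terminated UTF-8 buffer, or return None for None."""
    if text is None:
        return None
    encoded = text.encode("utf-8")                # copy 1: re-encode as UTF-8
    return ctypes.create_string_buffer(encoded)   # copy 2: native buffer, plus terminator


def from_native_utf8(buffer):
    """Read a zero-terminated UTF-8 buffer back into a str, or return None for None."""
    if buffer is None:
        return None
    raw = ctypes.string_at(ctypes.addressof(buffer))  # copy 3: bytes up to the terminator
    return raw.decode("utf-8")                        # copy 4: decode into a new string


if __name__ == "__main__":
    buf = to_native_utf8("héllo")
    print(ctypes.sizeof(buf))
    print(from_native_utf8(buf))
```

Each direction costs two copies here, one for the encoding change and one to cross between managed and native memory, which matches the pattern in the marshaler code.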

Friday, September 14, 2007

Grab a gzipped webpage in Erlang

First post! Instead of the typical blogging mainstay of a stupid first post that is really of no value to anyone, I'm going to post some code showing how to grab a web page that's been compressed server-side with gzip (HTTP header "Content-Encoding: gzip").

There are a few tutorials out there showing you how to grab a web page in Erlang, but unfortunately they're only good for very light webpage grabbing. If you're like me and you need to grab a lot of web pages and you pay a hosted provider for your bandwidth, you'll quickly go broke fetching them uncompressed.

I'm using the built-in inets module rather than the ibrowse module, which it seems everyone uses for historical reasons. I suppose the inets HTTP client was pretty lousy in the past, but it seems OK to me now. I chose it because I can't get the raw binary data from ibrowse like I can from inets. If you need to use ibrowse for your project, you can convert its response to binary and adapt this code; that should work fine, but the inets way should be a little more performant.

I found the parse and parse_http functions in some code that Joe Armstrong wrote some time ago. The code only accepts gzip encoding. It would be easy to make it accept deflate as well, but what web server nowadays actually serves a slower and less efficient compression when it can offer something much better? :)


-module(gzip_fetch).  %% placeholder module name; rename it to match your file

-define(USER_AGENT, "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3").
-define(ACCEPT_ENCODING, "gzip").
-define(ACCEPT_CHARSET, "utf-8").
-define(ACCEPT, "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8").
-define(ACCEPT_LANGUAGE, "en-us,en;q=0.5").

-export([get_url/1]).

%% Fetch Url and return {utf8, Body}, with any gzip encoding removed.
%% inets must be running first (call inets:start()).
get_url(Url) ->
    {_Http, Host, _Port, _File} = parse(Url),
    %% httpc is the inets HTTP client; older OTP releases called it http.
    {ok, {_StatusLine, Headers, Body}} =
        httpc:request(get,
                      {Url,
                       [{"Host", Host},
                        {"User-Agent", ?USER_AGENT},
                        {"Accept-Encoding", ?ACCEPT_ENCODING},
                        {"Accept-Charset", ?ACCEPT_CHARSET},
                        {"Accept", ?ACCEPT},
                        {"Accept-Language", ?ACCEPT_LANGUAGE}]},
                      [],
                      [{body_format, binary}]),
    {utf8, get_body(Headers, Body)}.


%% Gunzip the body only when the server says it is gzip encoded.
get_body(Headers, Body) ->
    case lists:keysearch("content-encoding", 1, Headers) of
        {value, {_Key, "gzip"}} -> zlib:gunzip(Body);
        _ -> Body
    end.
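The text above says deflate would be easy to add. The sketch below shows roughly what that branch could look like, written in Python for brevity rather than Erlang. The function name decode_body and the plain dict of lowercased header names are assumptions made for this example. The fallback to raw deflate is a defensive guess for servers that leave out the zlib wrapper.

```python
import gzip
import zlib


def decode_body(headers, body):
    """Undo the Content-Encoding on body; headers maps lowercased names to values."""
    encoding = headers.get("content-encoding", "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            # HTTP "deflate" is normally a zlib-wrapped stream.
            return zlib.decompress(body)
        except zlib.error:
            # Fall back to a raw deflate stream with no zlib header.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No encoding, or one this sketch does not handle: return the body unchanged.
    return body


if __name__ == "__main__":
    page = b"<html>hello</html>" * 50
    packed = gzip.compress(page)
    print(len(page), len(packed))
    print(decode_body({"content-encoding": "gzip"}, packed) == page)
```

In the Erlang version, the matching change would be one more clause in get_body plus "deflate" added to the ACCEPT_ENCODING macro.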

%%----------------------------------------------------------------------
%% parse(URL) -> {http, Site, Port, File} |
%%               {file, File} | {error, Why}
%% (primitive)

parse([$h,$t,$t,$p,$:,$/,$/|T]) -> parse_http(T);
parse([$f,$t,$p,$:,$/,$/|_T]) -> {error, no_ftp};
parse([$f,$i,$l,$e,$:,$/,$/|F]) -> {file, F};
parse(_X) -> {error, unknown_url_type}.

parse_http(X) ->
    case string:chr(X, $/) of
        0 ->
            %% not terminated by "/" (sigh)
            %% try again
            parse_http(X ++ "/");
        N ->
            %% The Host is up to the first "/"
            %% The file is everything else
            Host = string:substr(X, 1, N-1),
            File = string:substr(X, N, length(X)),
            %% Now check to see if the host name contains a colon
            %% i.e. there is an explicit port address in the hostname
            case string:chr(Host, $:) of
                0 ->
                    %% no colon
                    Port = 80,
                    {http, Host, Port, File};
                M ->
                    Site = string:substr(Host, 1, M-1),
                    case (catch list_to_integer(
                                  string:substr(Host, M+1, length(Host)))) of
                        {'EXIT', _} ->
                            {http, Site, 80, File};
                        Port ->
                            {http, Site, Port, File}
                    end
            end
    end.
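To show what parse returns for a few inputs, here is an approximate equivalent in Python built on the standard urllib.parse module. It is a sketch for comparison, not a port of the original code. Tuples of strings stand in for the Erlang atoms, and one known difference is that urlsplit lowercases the host name.

```python
from urllib.parse import urlsplit


def parse(url):
    """Return ("http", site, port, file), ("file", path) or ("error", reason)."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        try:
            port = parts.port or 80
        except ValueError:
            # A non-numeric port falls back to 80, like the Erlang code.
            port = 80
        file = parts.path or "/"
        if parts.query:
            # The Erlang version keeps the query string as part of File.
            file += "?" + parts.query
        return ("http", parts.hostname or "", port, file)
    if parts.scheme == "ftp":
        return ("error", "no_ftp")
    if parts.scheme == "file":
        return ("file", parts.netloc + parts.path)
    return ("error", "unknown_url_type")


if __name__ == "__main__":
    print(parse("http://example.com:8080/a/b?x=1"))
    print(parse("http://example.com"))
    print(parse("ftp://example.com/"))
```

Note that neither version handles https URLs; both report them as an unknown URL type.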