EncoderFallbackException when passing UTF8 characters

Posted on January 11, 2011

Symptom

A while ago, in one of our systems which sends text messages over WCF, we encountered a mysterious exception:

System.Text.EncoderFallbackException: Unable to translate Unicode character \uD8B8 at index XX to specified code page.

The exception’s stack trace is as follows:

   at System.Text.EncoderExceptionFallbackBuffer.Fallback(Char charUnknown, Int32 index)
   at System.Text.EncoderFallbackBuffer.InternalFallback(Char ch, Char*& chars)
   at System.Text.UTF8Encoding.GetByteCount(Char* chars, Int32 count, EncoderNLS baseEncoder)
   at System.Xml.XmlStreamNodeWriter.UnsafeGetUTF8Length(Char* chars, Int32 charCount)
   at System.Xml.XmlBinaryNodeWriter.UnsafeWriteText(Char* chars, Int32 charCount)
   at System.Xml.XmlBinaryNodeWriter.WriteText(String value)
   at System.Xml.XmlBaseWriter.WriteString(String value)
   at System.Xml.XmlBaseWriter.WriteValue(String value)
   at System.Runtime.Serialization.XmlObjectSerializerWriteContext.WriteString(XmlWriterDelegator xmlWriter, String value, XmlDictionaryString name, XmlDictionaryString ns)
   at WriteTextMessageBodyToXml(XmlWriterDelegator , Object , XmlObjectSerializerWriteContext , ClassDataContract )
   at System.Runtime.Serialization.ClassDataContract.WriteXmlValue(XmlWriterDelegator xmlWriter, Object obj, XmlObjectSerializerWriteContext context)
   at System.Runtime.Serialization.XmlObjectSerializerWriteContext.SerializeAndVerifyType(DataContract dataContract, XmlWriterDelegator xmlWriter, Object obj, Boolean verifyKnownType, RuntimeTypeHandle declaredTypeHandle)
   at System.Runtime.Serialization.XmlObjectSerializerWriteContext.SerializeWithXsiType(XmlWriterDelegator xmlWriter, Object obj, RuntimeTypeHandle objectTypeHandle, Type objectType, Int32 declaredTypeID, RuntimeTypeHandle declaredTypeHandle, Type declaredType)
   at System.Runtime.Serialization.XmlObjectSerializerWriteContext.InternalSerialize(XmlWriterDelegator xmlWriter, Object obj, Boolean isDeclaredType, Boolean writeXsiType, Int32 declaredTypeID, RuntimeTypeHandle declaredTypeHandle)
   at System.Runtime.Serialization.XmlObjectSerializerWriteContext.InternalSerializeReference(XmlWriterDelegator xmlWriter, Object obj, Boolean isDeclaredType, Boolean writeXsiType, Int32 declaredTypeID, RuntimeTypeHandle declaredTypeHandle)
   at WriteSystemMessageToXml(XmlWriterDelegator , Object , XmlObjectSerializerWriteContext , ClassDataContract )
   at System.Runtime.Serialization.ClassDataContract.WriteXmlValue(XmlWriterDelegator xmlWriter, Object obj, XmlObjectSerializerWriteContext context)
   at System.Runtime.Serialization.XmlObjectSerializerWriteContext.SerializeAndVerifyType(DataContract dataContract, XmlWriterDelegator xmlWriter, Object obj, Boolean verifyKnownType, RuntimeTypeHandle declaredTypeHandle)
   at System.Runtime.Serialization.XmlObjectSerializerWriteContext.SerializeWithXsiTyp... 

After some digging, we found that the problem was caused by Unicode surrogates (http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Surrogates).

A surrogate pair is two 16-bit code units (a high surrogate in the range 0xD800-0xDBFF followed by a low surrogate in the range 0xDC00-0xDFFF) which together represent a single Unicode character outside the Basic Multilingual Plane. Neither half is a valid character on its own.

In our system, we received a string containing a surrogate pair and accidentally split it in the middle (we were chunking the string into fixed-length chunks). This left us with a string containing only the first half of the pair, a high surrogate in the range [0xD800-0xDBFF], which according to the XML specification (http://www.w3.org/TR/2006/REC-xml-20060816/#charsets) is not a valid XML character:

Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

As a result, the UTF-8 encoder used by WCF cannot encode the character, and its default behavior is to throw an EncoderFallbackException.
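The failure can be reproduced outside WCF. The following is a minimal sketch, not our production code: the names LoneSurrogateDemo, NaiveChunk and FailsStrictUtf8 are made up for illustration, and a UTF8Encoding constructed to throw on invalid input stands in for the encoder that WCF uses internally.

```csharp
using System;
using System.Text;

static class LoneSurrogateDemo
{
    // Fixed-length chunking by UTF-16 code unit, with no regard for surrogate pairs.
    // This mimics the kind of splitting that caused the problem (hypothetical helper).
    public static string NaiveChunk(string text, int start, int length)
    {
        return text.Substring(start, Math.Min(length, text.Length - start));
    }

    // Returns true if strict UTF-8 encoding of the text throws EncoderFallbackException.
    public static bool FailsStrictUtf8(string text)
    {
        // Second argument (throwOnInvalidBytes: true) makes the encoding throw
        // instead of silently replacing characters it cannot encode.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetByteCount(text);
            return false;
        }
        catch (EncoderFallbackException)
        {
            return true;
        }
    }
}
```

For a string such as "ab\U0001F600cd", where the emoji occupies indexes 2 and 3 as a surrogate pair, NaiveChunk(text, 0, 3) ends in a lone high surrogate and FailsStrictUtf8 returns true for that chunk, while the whole string encodes without error.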

Solution

The fix for this problem is composed of two parts:

  1. Installing the hotfix for the problem from http://support.microsoft.com/kb/967090.
  2. Writing a custom fallback implementation, which is called whenever an input character cannot be encoded. The custom implementation substitutes the offending character with a valid XML character. This is done by setting the WriteEncoding property on the binding element to an Encoding that has your custom fallback implemented.

Note: Without the hotfix, an exception is always thrown whenever an invalid character (such as a lone surrogate) is encountered, regardless of the custom fallback implementation.

A sample for setting the EncoderFallback can be found at http://technet.microsoft.com/en-us/library/tt6z1500(d=lightweight,v=VS.90).aspx.
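As a rough illustration of the second part, the sketch below builds a UTF-8 Encoding whose encoder substitutes a replacement string instead of throwing. It uses the built-in EncoderReplacementFallback rather than a hand-written fallback class. The class name LenientUtf8 and the choice of "?" as the replacement are assumptions made for this example, and assigning the encoding to the binding element's WriteEncoding property is deliberately left out.

```csharp
using System.Text;

static class LenientUtf8
{
    // Creates a UTF-8 encoding that replaces characters it cannot encode
    // (for example a lone surrogate) with "?" instead of throwing.
    // The replacement string is an arbitrary choice for this sketch.
    public static Encoding Create()
    {
        return Encoding.GetEncoding(
            "utf-8",
            new EncoderReplacementFallback("?"),
            new DecoderExceptionFallback());
    }
}
```

With this encoding, LenientUtf8.Create().GetBytes("ab\uD8B8") yields the three bytes of "ab?" rather than throwing an EncoderFallbackException.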

Two other, less recommended, solutions are:

  1. Go over the string before it is passed on to WCF and remove invalid characters, or make sure you are not splitting surrogate pairs when chunking (needless to say, this solution might introduce performance issues).
  2. Be aware on your client side that the EncoderFallbackException can be thrown, and react appropriately. For example, you can swallow the exception: a string containing those characters won't reach your server side, but it also won't crash your process. This was our solution, since we could not install the hotfix on our production machines.
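For the option of not splitting surrogate pairs in the first place, a chunking routine can check the cut point before each split. This is a sketch under assumptions, not the code we ran: SurrogateSafeChunker and its Split method are hypothetical names, chunk sizes are measured in UTF-16 code units, and lone surrogates that are already present in the input are passed through untouched.

```csharp
using System;
using System.Collections.Generic;

static class SurrogateSafeChunker
{
    // Splits text into chunks of at most maxChars UTF-16 code units.
    // If a cut would fall between a high and a low surrogate, the chunk
    // is ended one code unit early so the pair stays together.
    public static List<string> Split(string text, int maxChars)
    {
        if (maxChars < 2)
            throw new ArgumentOutOfRangeException("maxChars", "Must be at least 2 so a surrogate pair fits in one chunk.");

        var chunks = new List<string>();
        int start = 0;
        while (start < text.Length)
        {
            int length = Math.Min(maxChars, text.Length - start);
            int end = start + length;

            // Back off by one if the cut would separate a surrogate pair.
            if (end < text.Length
                && char.IsHighSurrogate(text[end - 1])
                && char.IsLowSurrogate(text[end]))
            {
                length--;
            }

            chunks.Add(text.Substring(start, length));
            start += length;
        }
        return chunks;
    }
}
```

The extra work is one or two character checks per chunk, so the cost is small compared with scanning every character of every string.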

Useful links

This problem is also described in this post:

http://blogs.msdn.com/b/drnick/archive/2010/02/04/fix-to-allow-an-encoder-fallback-with-utf8.aspx

Posted in: Pitfalls