Published: March 25 2014, 9:28:00 AM | Updated: July 20 2022, 11:23:12 AM

Working with UTF-8 -- 'Did you think of this' checklist

 

eBay expects UTF-8 from your application

 

Characters sent from your application to eBay must be in the UTF-8 character set (charset).

If your request is not correctly encoded in the UTF-8 character set, then you may get errors such as the 'Invalid request encoding' error with error code 20400.

 

If you are a SOAP programmer and use a toolkit, the toolkit generates an XML stream that declares the character encoding to be UTF-8. If you are an XML programmer, you must set the XML header to specify UTF-8 as the encoding (for example, <?xml version="1.0" encoding="UTF-8"?>). However, whether you construct the XML manually or use a SOAP toolkit, declaring UTF-8 does not guarantee that the content is actually in UTF-8. The characters provided by users (e.g. item descriptions, names of persons) may not have originally been in UTF-8. Did you convert them to UTF-8 before storing them in your database? Do the characters need conversion to UTF-8 when retrieved from the database and before you send them to eBay?
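One way to catch content that is declared as UTF-8 but is not is to run the outgoing bytes through a strict decoder before sending them. The following is a minimal sketch using the standard java.nio.charset API; the class and method names (Utf8Check, isWellFormedUtf8) are illustrative assumptions, not part of any eBay SDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    /** Returns true only if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isWellFormedUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "Köln" encoded two ways; \u00F6 is the lowercase o-umlaut.
        byte[] utf8 = "K\u00F6ln".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "K\u00F6ln".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isWellFormedUtf8(utf8));   // true
        System.out.println(isWellFormedUtf8(latin1)); // false: a lone 0xF6 byte is not valid UTF-8
    }
}
```

A passing check only shows that the bytes are well-formed UTF-8. Text that was converted twice, or converted from the wrong source charset, can still pass.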

 

Consider the German O-umlaut (Ö). Its numerical representation depends on the charset. In the ISO-8859-1 and CP-1252 charsets it is a single byte (0xD6). In UTF-8 it has a completely different representation that is two bytes long (0xC3 0x96). You cannot simply copy and paste from one document to another if the documents are in different charsets.
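The difference is easy to see by printing the bytes that each charset produces for the same character. This is a small illustrative sketch; the UmlautBytes class and its hex helper are hypothetical names chosen for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class UmlautBytes {
    /** Encodes the text in the given charset and returns the bytes as space-separated hex. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String oUmlaut = "\u00D6"; // capital O-umlaut
        System.out.println(hex(oUmlaut, StandardCharsets.ISO_8859_1));      // D6
        System.out.println(hex(oUmlaut, Charset.forName("windows-1252")));  // D6
        System.out.println(hex(oUmlaut, StandardCharsets.UTF_8));           // C3 96
    }
}
```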

 

A key to proper conversion to UTF-8 is knowing the charset of the input data, or the charset used when the information was saved to a file or database. Where are you getting characters from? What charset was used when the characters were created? Transmitted? Saved in a file or database? What created the characters: an editor or a web browser? Do your file system and database properly accept, store and display UTF-8 characters? How well can you control the charsets of characters sent to your applications?

 

Web browsers are a key source of data. If you serve up a web page to gather information and intend for the user's browser to encode the data in UTF-8 before your application processes the data, the safest approach is to set the charset to UTF-8 in both the HTTP header and the HTML <meta> tag:

    Content-Type: text/html; charset=UTF-8

    <meta http-equiv='Content-Type' content='text/html; charset=UTF-8'>

(Certainly, the customer can complicate things by overriding the encoding you have set for the browser.)

 

Converting from one charset to another and testing results

 

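Java provides new String(byte[], String charset) for converting bytes of a particular charset into UTF-16 and str.getBytes("UTF-8") for converting from Java's UTF-16 to UTF-8. In Perl, the Encode module (decode() and encode()) and the :encoding(UTF-8) I/O layer convert data when reading and writing. Note that the utf8 pragma ('use utf8') only declares that the Perl source file itself is written in UTF-8; it does not convert input or output.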
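The two Java steps described above can be combined into one small helper. This sketch assumes you already know the source charset; Transcoder and toUtf8 are illustrative names. It uses the Charset-typed overloads of the String constructor and getBytes, which behave like the string-named ones but do not throw a checked UnsupportedEncodingException.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Transcoder {
    /** Decodes bytes from the source charset into a String (UTF-16 internally), then encodes it as UTF-8. */
    static byte[] toUtf8(byte[] input, Charset sourceCharset) {
        String text = new String(input, sourceCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Köln" as stored in ISO-8859-1: one byte per character.
        byte[] latin1 = {(byte) 0x4B, (byte) 0xF6, (byte) 0x6C, (byte) 0x6E};
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length); // 5: the umlaut now takes two bytes
    }
}
```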

 

How can you know if your ISO-to-UTF-8 conversion is working properly?

 

One simple spot-check is to write non-ASCII characters to a file with an .html suffix, view the results in Internet Explorer with encoding set to UTF-8, and see if the characters display correctly.

 

A more programmatic approach is to convert from ISO-8859-1 to UTF-8 and back to ISO-8859-1, and then do a binary diff on the original and final ISO data to ensure they match. With Java, you could similarly test your UTF-8 data by reading it into UTF-16 (new String(byte[], "UTF-8")) and writing it back out to UTF-8 (str.getBytes("UTF-8")), then doing a binary diff on the two versions of the UTF-8 data to ensure they match.
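Both round-trip checks can be sketched in a few lines of Java, with Arrays.equals standing in for the binary diff. The class and method names below are hypothetical and chosen only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripCheck {
    /** Source charset -> UTF-8 -> source charset; true if the final bytes equal the original. */
    static boolean survivesViaUtf8(byte[] original, Charset source) {
        byte[] utf8 = new String(original, source).getBytes(StandardCharsets.UTF_8);
        byte[] back = new String(utf8, StandardCharsets.UTF_8).getBytes(source);
        return Arrays.equals(original, back);
    }

    /** UTF-8 -> UTF-16 (String) -> UTF-8; true if the bytes come back unchanged. */
    static boolean utf8IsStable(byte[] utf8) {
        byte[] again = new String(utf8, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(utf8, again);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xD6, (byte) 0x41};            // O-umlaut followed by 'A' in ISO-8859-1
        System.out.println(survivesViaUtf8(latin1, StandardCharsets.ISO_8859_1)); // true

        byte[] notUtf8 = {(byte) 0xF6};                        // a lone ISO-8859-1 byte, malformed as UTF-8
        System.out.println(utf8IsStable(notUtf8));             // false: it is replaced by U+FFFD on decoding
    }
}
```

The second check works because the String constructor substitutes the replacement character U+FFFD for malformed input, so invalid UTF-8 does not come back byte-for-byte identical.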

 

A word of caution: if the original data was in the CP-1252 charset (a superset of ISO-8859-1) but is converted as if it were ISO-8859-1, the process of converting to UTF-8, then to Java's UTF-16, and back to UTF-8 will 'lose' the characters that are unique to CP-1252 (those not in ISO-8859-1). (An invalid character may appear, for example, as a '?' or a square.) eBay advises that you perform CP-1252 to UTF-8 conversion (even for ISO-8859-1 data) unless you are confident that your tool implements ISO support as CP-1252 or you know for sure that you will only have ISO-range characters. Otherwise, if you perform only ISO-8859-1 to UTF-8 conversion and your data includes a CP-1252 character, such as the Euro symbol, the character will be converted to something other than the Euro in UTF-8.
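The Euro case can be reproduced directly. In CP-1252 the Euro symbol is the single byte 0x80; the sketch below (EuroPitfall and utf8HexAssuming are illustrative names) converts that byte to UTF-8 twice, once under each assumption about the source charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EuroPitfall {
    /** Decodes the input using the assumed source charset and returns the resulting UTF-8 bytes as hex. */
    static String utf8HexAssuming(byte[] input, Charset assumedSource) {
        byte[] utf8 = new String(input, assumedSource).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] euroInCp1252 = {(byte) 0x80};
        // Correct assumption: the byte becomes U+20AC, the Euro sign.
        System.out.println(utf8HexAssuming(euroInCp1252, Charset.forName("windows-1252"))); // E2 82 AC
        // Wrong assumption: the byte becomes U+0080, an invisible control character.
        System.out.println(utf8HexAssuming(euroInCp1252, StandardCharsets.ISO_8859_1));     // C2 80
    }
}
```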

 

Tools and languages

 

Note that the exact spelling to use for ISO-8859-1 and CP-1252 depends on the tool or programming language. You might, for example, be required to specify “Windows-1252” or to use a space instead of hyphen in “ISO 8859-1”.
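In Java, for instance, you can ask the runtime whether it recognizes a given spelling before relying on it. The following is a sketch using java.nio.charset.Charset; the CharsetNames class and resolve helper are hypothetical, and which aliases are accepted depends on the Java runtime in use.

```java
import java.nio.charset.Charset;
import java.util.Optional;

final class CharsetNames {
    /** Returns the charset for the given spelling, or empty if the runtime does not recognize it. */
    static Optional<Charset> resolve(String name) {
        try {
            return Optional.of(Charset.forName(name));
        } catch (IllegalArgumentException e) {
            // Covers both illegal names (e.g. containing a space) and unsupported charsets.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String[] spellings = {"ISO-8859-1", "windows-1252", "Cp1252", "ISO 8859-1"};
        for (String spelling : spellings) {
            String result = resolve(spelling).map(Charset::name).orElse("not recognized");
            System.out.println(spelling + " -> " + result);
        }
    }
}
```

Charset.name() returns the canonical name, so the output also shows when two spellings refer to the same charset.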

Databases: Is your database UTF-8 compliant? Try saving and retrieving a two-byte character.

Oracle 8.1.7: use DBI 1.45 and DBD 1.15. Other version combinations could result in 'DBI ERROR: ORA-06553: PLS-561: character set mismatch on value for parameter'.

Perl: For processing UTF-8 textual data, use Perl 5.8.1 to 5.8.4. (There are bugs in UTF-8 processing in Perl versions before 5.8.1 and in version 5.8.5.) If your request string is generated in WINDOWS-1252 or some other ISO flavor, you can also install the Text::Iconv module from CPAN and use its convert() method to convert the request string from WINDOWS-1252 to UTF-8 before making the call.
