May 19, 2008

Understanding Database Character Set and Encoding Techniques

It looks like, there are thousands of pages documentation on oracle database character set, database encoding, NLS_LANG and Unicode stuffs. At least, in last two days - I realized it. For a long time, about 2 years probably, we have been thinking to change the character set of our databases. The need came in front when our users started writing long comments by copying from MS word.

What’s wrong with this?

Windows uses a different character set and encoding techniques than we do in our applications. We can not convert some special characters available in windows and other character sets. For example, if users write a note using single quote, double quote, currency symbol and things like these, we can not convert those things and store it properly as it is in the database for our current encoding techniques. And next time when users are provided this data - they see some strange symbol/garbage instead.

Stop! If you are thinking what this encoding and character sets are - I would highly recommend you to read this nice article http://www.joelonsoftware.com/articles/Unicode.html

Anyway, currently we are using West European character set in the database. We need to change it to Unicode in order to support special characters from users’ input.There are ways to change character set of database - it’s not a very hard job for a DBA. But there are catches. What will happen for those special characters/symbols those already inserted?

To do the changes in database character set, oracle has some procedures like scanning the current data (with the Oracle CSSCAN tool) to make sure that the data is stored properly in the current character set. When we did this step to check the compatibilities, we had been reported that the we would be loosing some characters during the character set conversion in database. Oracle calls this - Loosy Conversion!!

I was trying to understand the actual behavior of these character encoding/conversion. For this, I went through hundreds of pages to understand the encoding mechanisms and the explanations appropriate for our applications. The findings below -

* ASCII character set are defined in the decimal range from 0 - 127
* WE8ISO8859P1 character set which we are using as database character set, the special characters are defined in the decimal range from 160 to 255
* WE8MSWIN1252 character set are being used by windows users, defines the special characters in the decimal range 128 to 255
* As we do not set any NLS_LANG parameter in jdbc template, it is default UTF (UTF-8/UTF-16, depends on version)
* As we do not set NLS_LANG environment variable, it is default “AMERICAN_AMERICA.US7ASCII” (LANGUAGE_TERRITORY.CHARSET). However, it is only applicable if we interact with sqlplus or when other client’s NLS is not set.

** Case-1 **

Now, for example, user is inserting the copyright character (decimal value 169) from ms word. There will be a series of conversions like -

It will first converted from windows character set WE8MSWIN1252 to UTF8, and finally when it will be inserted to database using jdbc, will be converted to database character set WE8ISO8859P1.
I simulated this behavior setting NLS_LANG to “AMERICAN_AMERICA.UTF8″ (because this is the default settings for jdbc)

SQL> select dump(convert(convert(’©’,'UTF8′,’WE8MSWIN1252′),’WE8ISO8859P1′,’UTF8′)) from dual;

DUMP(CONVERT(CONVERT(’©’,'UTF8′,’WE8MSWIN1252′),
————————————————
Typ=1 Len=1: 169

The value 169 which is the decimal code for copyright symbol will be inserted into database as it is. When selecting the character from database, the original character is retrieved in the following fashion.

SQL> select convert(convert(convert(’©’,'UTF8′,’WE8MSWIN1252′),’WE8ISO8859P1′,’UTF8′),’WE8MSWIN1252′,’WE8ISO8859P1′) from dual;
CON

©

** Case-2 **

If we do the same simulation for Euro Symbol which is represented decimal value 128 in windows character set, the actual inserted value in database will be decimal value 191 instead of 128. This is because, the database character set does not have any representation for value 128 (WE8ISO8859P1 only understands special char value in the range 160-255) and for this it replace this value with a replacement character - inverted question mark, decimal value 191.

SQL> select dump(convert(convert(’€’,'UTF8′,’WE8MSWIN1252′),’WE8ISO8859P1′,’UTF8′)) from dual;

DUMP(CONVERT(CONVERT(’¿’,'UTF8′,’WE8MSWIN1252′),
————————————————
Typ=1 Len=1: 191

Eventually when we retrieve this value from database, we see the inverted question mark from UI.

SQL> select convert(convert(convert(’€’,'UTF8′,’WE8MSWIN1252′), ’WE8ISO8859P1′,’UTF8′),’WE8MSWIN1252′,’WE8ISO8859P1′) from dual;

CON

¿

So, my understanding is, the special characters within the range 128 to 159, inserted from windows clients are not recognizable by the database character set WE8ISO8859P1 and thus replaced by replacement character (inverted question mark) in database. And these characters are NOT recoverable also. That’s why, we would have loosy conversion from character set WE8ISO8859P1 to AL32UTF8.

No comments: