May 19, 2008

Understanding Database Character Set and Encoding Techniques

It looks like, there are thousands of pages documentation on oracle database character set, database encoding, NLS_LANG and Unicode stuffs. At least, in last two days - I realized it. For a long time, about 2 years probably, we have been thinking to change the character set of our databases. The need came in front when our users started writing long comments by copying from MS word.

What’s wrong with this?

Windows uses a different character set and encoding techniques than we do in our applications. We can not convert some special characters available in windows and other character sets. For example, if users write a note using single quote, double quote, currency symbol and things like these, we can not convert those things and store it properly as it is in the database for our current encoding techniques. And next time when users are provided this data - they see some strange symbol/garbage instead.

Stop! If you are thinking what this encoding and character sets are - I would highly recommend you to read this nice article http://www.joelonsoftware.com/articles/Unicode.html

Anyway, currently we are using West European character set in the database. We need to change it to Unicode in order to support special characters from users’ input.There are ways to change character set of database - it’s not a very hard job for a DBA. But there are catches. What will happen for those special characters/symbols those already inserted?

To do the changes in database character set, oracle has some procedures like scanning the current data (with the Oracle CSSCAN tool) to make sure that the data is stored properly in the current character set. When we did this step to check the compatibilities, we had been reported that the we would be loosing some characters during the character set conversion in database. Oracle calls this - Loosy Conversion!!

I was trying to understand the actual behavior of these character encoding/conversion. For this, I went through hundreds of pages to understand the encoding mechanisms and the explanations appropriate for our applications. The findings below -

* ASCII character set are defined in the decimal range from 0 - 127
* WE8ISO8859P1 character set which we are using as database character set, the special characters are defined in the decimal range from 160 to 255
* WE8MSWIN1252 character set are being used by windows users, defines the special characters in the decimal range 128 to 255
* As we do not set any NLS_LANG parameter in jdbc template, it is default UTF (UTF-8/UTF-16, depends on version)
* As we do not set NLS_LANG environment variable, it is default “AMERICAN_AMERICA.US7ASCII” (LANGUAGE_TERRITORY.CHARSET). However, it is only applicable if we interact with sqlplus or when other client’s NLS is not set.

** Case-1 **

Now, for example, user is inserting the copyright character (decimal value 169) from ms word. There will be a series of conversions like -

It will first converted from windows character set WE8MSWIN1252 to UTF8, and finally when it will be inserted to database using jdbc, will be converted to database character set WE8ISO8859P1.
I simulated this behavior setting NLS_LANG to “AMERICAN_AMERICA.UTF8″ (because this is the default settings for jdbc)

SQL> select dump(convert(convert(’©’,'UTF8′,’WE8MSWIN1252′),’WE8ISO8859P1′,’UTF8′)) from dual;

DUMP(CONVERT(CONVERT(’©’,'UTF8′,’WE8MSWIN1252′),
————————————————
Typ=1 Len=1: 169

The value 169 which is the decimal code for copyright symbol will be inserted into database as it is. When selecting the character from database, the original character is retrieved in the following fashion.

SQL> select convert(convert(convert(’©’,'UTF8′,’WE8MSWIN1252′),’WE8ISO8859P1′,’UTF8′),’WE8MSWIN1252′,’WE8ISO8859P1′) from dual;
CON

©

** Case-2 **

If we do the same simulation for Euro Symbol which is represented decimal value 128 in windows character set, the actual inserted value in database will be decimal value 191 instead of 128. This is because, the database character set does not have any representation for value 128 (WE8ISO8859P1 only understands special char value in the range 160-255) and for this it replace this value with a replacement character - inverted question mark, decimal value 191.

SQL> select dump(convert(convert(’€’,'UTF8′,’WE8MSWIN1252′),’WE8ISO8859P1′,’UTF8′)) from dual;

DUMP(CONVERT(CONVERT(’¿’,'UTF8′,’WE8MSWIN1252′),
————————————————
Typ=1 Len=1: 191

Eventually when we retrieve this value from database, we see the inverted question mark from UI.

SQL> select convert(convert(convert(’€’,'UTF8′,’WE8MSWIN1252′), ’WE8ISO8859P1′,’UTF8′),’WE8MSWIN1252′,’WE8ISO8859P1′) from dual;

CON

¿

So, my understanding is, the special characters within the range 128 to 159, inserted from windows clients are not recognizable by the database character set WE8ISO8859P1 and thus replaced by replacement character (inverted question mark) in database. And these characters are NOT recoverable also. That’s why, we would have loosy conversion from character set WE8ISO8859P1 to AL32UTF8.

May 11, 2008

Database Replication with GoldenGate software

For a number of reasons - people replicate data to remote destinations. We do have our own needs. We have been running SharePlex for data replication purpose for the last 3 years. By this time, our system usages increased more than twice in just last year. Initially SharePlex was good for us but now a days we need more flexible configurations and different replication architecture for the system.

We are looking for an alternative database replication mechanism that would suit best for us. We are exploring new technologies available to serve replication purpose.

Now, we found two alternatives to be considered as we need more control for an "Active-Active" mode -

1. Oracle Streams
2. GoldenGate Software

I did several extensive experiments with Oracle Streams for about 1.5 months. You can guess, there are lots of outcomes. This post is for telling a bit about my GoldenGate experience!

GoldenGate Software (GGS) is a popular one used for data replication and keep several database in sync. They are supporting a number of database including Oracle, Teradata, DB2, Sybase, SQL Server. It seems that GGS are in market for long time, probably before Streams introduced by Oracle, now they are running version 9.5.

The basic concepts of GGS are - GGS has EXTRACT process which captures changes in source database. The changes are kept in files called TRAIL. The changes can be directly sent to the target/destination database's REPLICAT process to apply the changes in destination database. GGS EXTRACT reads data from oracle online redo log file - so it's independent of oracle process. I forgot to mention, GGS has several processes, MANAGER is one of the important process which managers all other stuffs.

My first "one way" replication test configuration (obvious these are not all for production db) with GGS were as follows -


Source Neptune: Create EXTRACT which will capture changes and send the changes to remote host's TRAIL files

EDIT PARAMS extprod
EXTRACT extprod
USERID ggs, PASSWORD ggs
RMTHOST saturn, MGRPORT 7820
RMTTRAIL /export/home/oracle/ggs/dirdat/rt
TABLE DEV.LOGIN;
**
ADD EXTRACT extprod, TRANLOG, BEGIN NOW
**
ADD RMTTRAIL /export/home/oracle/ggs/dirdat/rt, EXTRACT extprod, MEGABYTES 50

Destination Saturn: Create REPLICAT which will read changes from TRAIL files and apply the changes in current database

EDIT PARAMS repprod
REPLICAT repprod
ASSUMETARGETDEFS
DISCARDFILE ./dirrpt/REPSOE_DSC.rpt, PURGE
USERID ggs, PASSWORD ggs
MAP DEV.*, TARGET DEV.*;
**
ADD REPLICAT repprod, EXTTRAIL /export/home/oracle/ggs/dirdat/rt, NODBCHECKPOINT


So, these were the basic stuffs that I started with GoldenGate Software (GGS). They have thousand page documents and there are a lot more to test. Earlier, we created spreadsheet putting about 30 evaluation criteria against which we compared SharePlex and Oracle Streams. After experimenting with GGS, I will put the outcomes for GGS and would take the ultimate product decision.

Expecting comments if you have any good/bad experience with the product!

May 4, 2008

Oracle: Index auto dropped while importing. Where my index gone?

One day, someone was telling me that he had been lost in blue, one index was vanished!

We have multiple production servers. So, taking a dump from one server and import it to another is quite common for initial synchronization. Like this, one day, one of my colleague found that schema on p01 database had one less index than s01 database. One index was not created in p01 while imported table "TITLE". I investigated and found that it happened for an interesting feature of oracle!

The "TITLE" table has primary key ID. You know, when primary key is created, one system generated index is created with the primary key column to enforce the constraint.

During the release 7.2, we created a composite index named "TITLE_ID_TITLE" on table "TITLE" in database server s01 using two columns including the primary key column "ID" and another column, "TITLE" (TITLE table has a column named TITLE).

Now, when oracle was importing indexes for table TITLE on p01, it found that the composite index "TITLE_ID_TITLE" is enough to enforce the primary key - so it automatically discarded the system generated index which was created with the primary key.

That's the ghost story behind this!

May 2, 2008

SpringSource launches new Application Platform

SpringSource, the company behind Spring framework, launches new Application Platform without Java EE standerds! The Application Platform has been built on Spring, OSGi, and Apache Tomcat. The new appserver departs from the Java2 EE standards, exposing the Spring programming model natively, along with a new deployment and packaging system.

I was reading the overview of it and was amazed by reading the following development and production benefits. It is our dream to modularize the big applications that we are developing and do the updates in production server in real time without shutting down the system for long time.

I am just pasting the benefits SpringSource are talking about -

Production Benefits
  • Real time application and server updates
  • Better resource utilization
  • Side by side resource versioning
  • Longer server uptime
  • Run new and old web application
Development Benefits
  • Faster iterative development
  • Small server footprint
  • More manageable applications
  • Greater code reuse
  • Resolve transitive dependencies
Nice blogs have been written by the development team about the architecture.

http://blog.springsource.com/main/2008/04/30/introducing-the-springsource-application-platform/

The overall introduction is given on the official site of SpringSource -

http://www.springsource.com/web/guest/products/suite/applicationplatform

I have the strong desire to explore more on this!