Nov 5, 2007

Migration stuffs on whole October

Lots of things were going on for the whole October. We just gave a production release version-7.1 of our web application. I passed another stressful day on that 7.1 judgment day :P

Let me keep some note here about the release. In the point release, we usually do new features and existing features improvement. These stuffs require database changes and also data migration. For data migration, we run DDL, DML and some java programs which basically run DML in database.

configuration due to some bad experiences with We have multiple sites geographically located on multiple places in United States. We need to migrate those databases as well. We use SharePlex (hoping that it will be replaced by GoldenGate replication software or by Oracle Streams soon!) for data replication. We turned off the DDL replication features in our SplexSplex replication process. What we do in each migration-

1. Turn off the Web Application
2. Run DDL statements on each database machines (since DDL replication if turned off from Splex)
3. Activate the new Splex configuration file for that release (which includes new tables, sequences and stuffs like that)
4. Run DMLs in primary site which replicates data on other db sites.
5. Run java program in primary site which are also replicated in the above way.

The time estimation for database migration were 7 hours and total application down time were estimated about 8 hours.

Everything were going well on that day. But suddenly found a new problem, one java migration were taking too much time than estimated. It supposed to be completed within 30 min but I found 50% done after 4 hours! It was very alarming.

Usually, the java migration were done by the respective module developer and it is locally tested and again tested in Beta release. It was running without a delay in both test cases due to having the db on same network. But in production, the scenario was different - we have machines on different networks.

The developer was using extra connection and unnecessary query to get the sequence values from database to perform insert operations which were taking a considerable amount of time in each value fetching. Anyways, I had to look into it at the middle of our migration process (with hotted head) and had to rewrite the code to make it faster. Finally I managed to finish it within 30 min and met the downtime deadline.

Now for last 3 days, production is running fine :)