Dec 10, 2015

Database 12c hiccups!

It's human nature to forget bad things easily if they weren't bad enough. I thought I'd put this in writing so that it may help others and remind me of the lessons next time. I wanted to write a couple of pages, but I have a feeling people these days don't have time to spend on reading, so I'm trying to be "to the point".

About

This is all about the trouble we faced after we upgraded the database from 11gR2 to 12cR1.

Platform

Solaris 10, x86-64
From database version: 11.2.0.3
To database version: 12.1.0.2
System: Highly transactional (305,000+ logical reads/sec, 1,800+ executes/sec), 14TB OLTP database for n-tier JEE web applications serving 10,000+ concurrent online users (out of 350,000+ total users).

When 

The upgrade took place in March-April 2015.

Why 12c

1. The next big database move for us is migrating to RAC architecture sometime in 2016. Logically, that should be on the 12c version; it has been out for a long time, with many improvements in all layers. Before we migrate to RAC, why not run the new database version (12c) in production for some time, to figure out the odds earlier (before RAC) and individually, so that we don't introduce multiple new things together and don't have to fight too many odds at the same time (12c and RAC).

2. There was a bug in memory management and/or the RMAN backup process that sometimes caused it to terminate abruptly. It was triggered when we grew the SGA over 28GB. We needed to fix this by upgrading to a higher version.

3. The 12c optimizer got smarter; many cool features are there, including adaptive plans, adaptive statistics, and enhancements to existing features. Our tests also showed most of the queries got about 20% faster!

4. People had been saying good things about 12c since it came out, and it was already 1.5 years old. The 12c multitenant architecture would be the ultimate fit for the cloud, and Oracle has been pushing in that direction.


The problems

1. The DBUA upgrade process failed halfway through and could not be rolled back, so we were facing the possibility of losing the database. Restoring the 13TB database from backups was not a good option, so we had to go ahead with the manual upgrade process, which was tedious, time consuming, and a bit risky too. Support could not say clearly what caused the failure; they pointed at invalid objects in "sys_inv_objs", which we knew could not be the cause - the 3 or 4 invalid objects we had were owned by our own user schema (created by us) and could never trigger the problem, and we had tested the upgrade process with those same invalid objects in our test environments many times anyway. Still, it's good to have all invalid objects eliminated before any upgrade so that nobody can point fingers in the wrong direction.
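
For anyone preparing a similar upgrade, taking stock of invalid objects beforehand is a simple query; a minimal sketch:

-- list invalid objects before (and again after) the upgrade
SELECT owner, object_name, object_type
FROM dba_objects
WHERE status = 'INVALID'
ORDER BY owner, object_name;

-- recompile whatever turns up
@?/rdbms/admin/utlrp.sql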

Tip: When upgrading to 12c, it's better to do a manual upgrade using the SQL scripts and utilities rather than using DBUA, even if DBUA works fine in test environments. This is especially true if the environment has a previous upgrade history - for example, when the database has already been upgraded from earlier versions, such as from 11.1 to 11.2, or from 11.2.0.2 to 11.2.0.3.
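
For reference, the manual route on 12.1 looks roughly like this (a simplified sketch only - home paths and the parallelism degree are illustrative; always follow the upgrade guide and the pre-upgrade tool output for your release):

-- pre-upgrade checks, run against the database (preupgrd.sql ships with the 12.1 home)
SQL> @/<new_12.1_home>/rdbms/admin/preupgrd.sql

-- from the new 12.1 home: open the database in upgrade mode
SQL> STARTUP UPGRADE

# run the parallel upgrade utility from the new home
$ cd $ORACLE_HOME/rdbms/admin
$ $ORACLE_HOME/perl/bin/perl catctl.pl -n 4 catupgrd.sql

-- recompile and check the post-upgrade status
SQL> @?/rdbms/admin/utlrp.sql
SQL> @?/rdbms/admin/utlu121s.sql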


2. LogMiner Bug 19931709. When we brought the database up after the upgrade and tested a little, we hit this problem. It caused the GoldenGate extract process to fail with the following errors -

WARNING OGG-00869 Oracle GoldenGate Capture for Oracle, extdf01.prm: ORA-26914: Unable to communicate with GoldenGate capture process "OGG$CAP_EXTDF01" from outbound server "OGG$EXTDF01".

ERROR OGG-02077 Oracle GoldenGate Capture for Oracle, extdf01.prm: Extract encountered a read error in the asynchronous reader thread and is abending: ORA-01343: LogMiner encountered corruption in the log stream


The patch was not available for our platform; Oracle development immediately worked on it and made the patch available. The strange thing was, we did not have this problem in any of the test environments!

3. GoldenGate process failure after fixing Bug 19931709. Even after we applied the fix for the bug, we were still receiving errors from the GoldenGate process. The thing was, once the corruption had entered the redo stream, it could not be reversed. The fix for Bug 19931709 only stops new corruption; it does not repair the corruption that already exists. So we had to reconfigure the OGG extract process, as finally confirmed by one of the support engineers. Earlier, we had also gotten a ridiculous suggestion from support along the lines of "The only solution is to re-instantiate the target database from the source database. So, you will need to re-do the Initial Load, so that GoldenGate starts from a new SCN." Could not believe the suggestion!
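
Reconfiguring an integrated extract essentially means dropping and re-registering the capture and repositioning it past the corrupted stretch of redo. In GGSCI it looks roughly like the sketch below (the group name is ours from the errors above; repositioning with BEGIN NOW means any gap has to be reconciled separately):

GGSCI> STOP EXTRACT extdf01
GGSCI> UNREGISTER EXTRACT extdf01 DATABASE
GGSCI> REGISTER EXTRACT extdf01 DATABASE
GGSCI> ALTER EXTRACT extdf01, BEGIN NOW
GGSCI> START EXTRACT extdf01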


4. CLOB data corruption for not setting a GoldenGate parameter. We had to upgrade the GoldenGate version at the destination database from 11.2.1.0.3 to 12.1.2.1.0 since we upgraded the database version. We did not have the SOURCECHARSET parameter set in the earlier versions - we did not need it, since the source and destination database character sets were the same and NLS_LANG was set properly. The earlier version worked fine, but after the upgrade, because we did not set SOURCECHARSET, the CLOB data got corrupted at the destination. Fortunately we noticed immediately, set the SOURCECHARSET parameter, and re-synced the corrupted data from the source.

We were confused reading one of the Oracle docs, which says -

"Use the SOURCECHARSET parameter to control the conversion of data from the source character set to the target character set by Replicat. Replicat converts character sets by default for versions 11.2.1 and later" 

"Extract versions 11.2.1 and later write information about the source character set to the trail for use by Replicat, and any SOURCECHARSET specification is ignored."

Unfortunately, this was not mentioned clearly in the GoldenGate upgrade document. After consulting Oracle support and spending a lot of time, we realized that the character set features have been changing in almost every GoldenGate release, and there are inconsistencies in the documentation. Anyway, our bad - we should have caught this in our test environments. This is our only fault, I think (especially my fault - I relied on others for this part of the testing), in the whole upgrade job; the other problems we did test for and were not able to reproduce in the test environments.
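
For anyone in the same spot, the change itself is a single Replicat parameter; a minimal sketch of a parameter file is below (group, alias and schema names are illustrative, and the character set must be whatever your source database actually uses - the accepted SOURCECHARSET variants differ between GoldenGate releases, so check the Reference guide for your exact build):

REPLICAT repdf01
USERIDALIAS ogg_dst
-- tell Replicat which character set the source/trail data is in
SOURCECHARSET AL32UTF8
MAP appowner.*, TARGET appowner.*;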

[Up to this point, these were all back-to-back problems - 90 hours had already passed since we started upgrading to 12c. During those 90 hours, I had to work for about 80 hours (with MOS on multiple SRs, on the phone, on Hangouts/GoToMeeting with global teams, and experimenting in test environments) - that was too much!]

5. Error ORA-600 [qksanGetTextStr:1]. We hit this 12c optimizer bug when we went live in production - probably the internal Bug 21680192 - ORA-00600: INTERNAL ERROR CODE, ARGUMENTS: [QKSANGETTEXTSTR:1] - and applied the patch. Our luck - even after applying the patch, the errors kept coming. Fortunately this was only happening for a few queries where SQL Plan Baselines were being used, out of the 2000+ distinct SQLs we have in the system. We identified those few problem SQLs and removed the plan baselines for those only, which solved the problem. Since the issue could not be reproduced, Oracle support could not tell why we still received errors even after applying the patch. I did a separate post for this part earlier -
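
Removing baselines is done through DBMS_SPM; a sketch along these lines (the handle and the text fragment are illustrative):

-- find the baselines behind the problem statements
SELECT sql_handle, plan_name, enabled, accepted
FROM dba_sql_plan_baselines
WHERE sql_text LIKE '%<fragment of the problem SQL>%';

-- drop all plans under a given handle
DECLARE
  l_dropped PLS_INTEGER;
BEGIN
  l_dropped := DBMS_SPM.DROP_SQL_PLAN_BASELINE(sql_handle => 'SQL_1234abcd5678ef90');
  DBMS_OUTPUT.PUT_LINE('plans dropped: ' || l_dropped);
END;
/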


6. Error ORA-04031: unable to allocate bytes of shared memory ("large pool","unknown object","large pool","CTWR dba buffer"). This one showed up after the database had been running for a few days; we hit it during the maintenance window. The parallel jobs kicked off by the weekend maintenance window were far beyond the capacity of the large pool. We had to adjust the values of the parameters large_pool_size, parallel_min_servers, and parallel_servers_target after consulting MOS, because the values the database carried over from the 11g version behaved abnormally with the 12c logic and took down the server.
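
The changes themselves are plain ALTER SYSTEM statements; the sketch below only shows the shape of it - the numbers are illustrative, ours came out of the MOS discussion for our specific workload and hardware:

-- give the large pool an explicit floor (ASMM treats it as a minimum)
ALTER SYSTEM SET large_pool_size = 2G SCOPE=BOTH;
-- rein in the parallel server demand from the maintenance jobs
ALTER SYSTEM SET parallel_min_servers = 32 SCOPE=BOTH;
ALTER SYSTEM SET parallel_servers_target = 128 SCOPE=BOTH;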

7. ORA-22275: invalid LOB locator specified. We hit this next, after a few days. It was due to the internal Bug 14044260. Again, the patch was not available for our platform; since we needed it badly, the patch was built for us and we applied it in production to fix the issue. The issue was happening in a random fashion and was not reproducible in any of the test environments.

8. Highly parsed system SQL. After running for a few days and analyzing performance, we discovered that this system query was at the top of parses and executions and was causing shared pool concurrency locks when child cursors were invalidated -

select default$ from col$ where rowid=:1 

This was contributing significantly to the overall negative performance. It is new in 12c and fires whenever a hard parse happens (we never had a high hard parse rate - always less than 2%). It was due to Bug 20907061: HIGH # OF EXECUTIONS FOR RECURSIVE CALL ON COL$ - we fixed it by applying the respective patch.
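
If you want to check whether your system is affected, the recursive statement is easy to spot in the shared pool with something like:

SELECT sql_id, parse_calls, executions, loads, invalidations
FROM v$sqlarea
WHERE sql_text LIKE 'select default$ from col$%';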

9. System query on top of sharable memory. We have been keeping an eye on shared pool growth, since the shared pool has grown unexpectedly (about 4 times what it was on 11.2; we use ASMM). There are definitely memory leaks - there are far more cursors and invalidations than we had in 11.2. Right now, the following SQL (shown in short form) has been consuming significant system resources and is at the top of sharable memory (about 3% of the total) -

SELECT /*+ OPT_PARAM('_fix_control' '16391176:1') */ ...
FROM TABLE(SYS.GV$(CURSOR( SELECT INST_ID, GROUP
WHERE ((SESSION_ID = :B2 AND SESSION_SERIAL# = :B1 AND USERENV('INSTANCE') = :B3 )..


Then we discovered Bug 20476175: CURSOR IS NOT SHARED BY A DIFFERENT SESSION IF _FIX_CONTROL IS USED IN OPT_PARAM. The fix is in the October PSU 21359755. There are many other fixes in that patch as well, and we have been planning to apply the PSU in production soon (God knows what's coming next); for this we will probably need to roll back a couple of the patches already applied.
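
In the meantime, the watch itself is nothing fancy - a periodic look at the top sharable-memory consumers in the shared pool, something like:

SELECT *
FROM (SELECT sql_id, sharable_mem, version_count, executions,
             SUBSTR(sql_text, 1, 60) AS sql_text_head
      FROM v$sqlarea
      ORDER BY sharable_mem DESC)
WHERE ROWNUM <= 10;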


Lessons: Quite a few lessons for us; they will help us choose and shape the next direction for the database and its architecture. Among the decisions, one concerns the database operating system: we have decided to go with "Oracle Linux", since that is Oracle's first and foremost platform for testing new features and new releases, and in most cases the first set of patches is released for that platform - and I believe this will remain the case in the future.


***

2 comments:

Anonymous said...

Thanks for sharing this!

Noons said...

Funny, I'm hitting exactly the same problem with "dbua" and support is telling me it's a 3113 "eof on comms channel" error.
Which is bleeding obvious and helps nothing!
This thing has been out for how long and dbua still fails with a 3113?
Who's doing QA on this release? Beer drinkers?