Dec 19, 2015

RAC storage option: Why Oracle ASM?

Long story short - we have been planning for RAC. The database team is in favor of ASM for RAC, while the storage people are trying to push NFS/dNFS. NFS is the filesystem we currently use for the single-instance database; our storage system is NetApp. The DBA team understood the benefits and differentiating factors of ASM for RAC.


I compiled the list of reasons/benefits I had in mind.


Why Oracle ASM?


1. Recommended & predictable: It is the Oracle-recommended and most commonly used storage architecture, especially for RAC environments. It is the most exercised option, so the chance of hitting a bug or other issue is lower down this path. Oracle works continuously on improvements, and every release comes with new ASM features. Training materials, community discussions, conference sessions, Oracle customer support and Oracle testing are all most familiar with this option.

2. Understands database file types: ASM is designed specifically with database files in mind. It understands the database file types and stripes them by type or as instructed. For example, online redo logs can be striped at 128KB while datafiles can be striped at 1MB or as needed for performance (see the sketch at the end of this post). The redo striping can be vital for concurrency on 'log file sync' in a RAC environment, and the coarser stripe size for datafiles can reduce database I/O and improve overall performance.

3. Wide range of I/O balancing: ASM rebalances and stripes I/O not only across multiple disks but also across multiple volumes/aggregates, removing potential I/O hotspots and easing space management for large databases. There is no need to think about I/O balancing and managing storage per volume/filesystem - just create a single ASM disk group for data/index, combining LUNs of the same type from different aggregates and storage controllers (the sketch at the end of this post shows the idea). In the present volume-based system, we need to worry about which datafiles should go to which volume and controller for better space and I/O utilization, which gets harder day by day as the database grows, and it will not be possible to do this correctly even if I overlook the datafile management overhead. ASM is the solution for this.

4. Convenient for big databases: Oracle Managed Files (OMF) and bigfile tablespaces are best suited for big databases, and they work very well with ASM.

5. No I/O overhead: ASM does not perform any I/O by itself, so there is no translation layer converting Oracle I/O against datafiles into disk block offsets. I/O from the database is applied directly to the disk volumes without modification. This reduces overhead and improves performance (point collected from a doc) - something like dNFS, I believe.

6. Not tied to the underlying storage vendor: If we want to change storage technology (say, to ZS3) in the future, having the database on ASM will make things easy and feasible; changing the underlying storage technology would be transparent, and no visible database migration would be required. In three years the database will be 25TB+, and migrating a 25TB+ database to a new storage technology would be a mammoth task, requiring tremendous effort and possibly several weeks of downtime (unless we built another database in parallel). Using ASM won't lock us into any storage technology.

7. Optimized by default: When we use ASM for RAC, a lot of things are optimized by default and work well without further time and resource investment thinking about I/O performance and other config tweaking. We found I/O throughput gains of about 4 times over NFS on the same storage!

8. Common references: We have not seen a mentionable RAC production system without ASM, not among the few dozen people we know. I have never seen a Database Architect or Database Administrator discussing RAC who did not have ASM on the storage side. Having ASM in the storage layer with RAC is the conventional, proven path.

From the studies of RAC customers (from publicly available sources), it’s clear that most people selected ASM for RAC implementations.

http://www.oracle.com/technetwork/database/features/ha-casestudies-098033.html
http://www.oracle.com/technetwork/documentation/ha-casestudies-083953.html
http://www.slideshare.net/MarkusMichalewicz/oracle-rac-customer-proven-scalalbility
http://www.oracle.com/us/products/middleware/data-integration/odi-customer-reference-booklet-2184014.pdf
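
To make points 2 and 3 above concrete, here is a minimal sketch of what this looks like on the ASM side. The disk group name and LUN paths are made up for illustration; the exact paths depend on how the LUNs are presented to the host.

-- hypothetical LUN paths; one disk group spanning multiple LUNs
CREATE DISKGROUP data EXTERNAL REDUNDANCY
  DISK '/dev/rdsk/lun01', '/dev/rdsk/lun02', '/dev/rdsk/lun03';

-- fine (128KB) striping for online redo; datafiles keep the default coarse striping
ALTER DISKGROUP data ALTER TEMPLATE onlinelog ATTRIBUTES (FINE);

-- check how each file type will be striped (assuming the new disk group is group 1)
SELECT name, stripe, redundancy FROM v$asm_template WHERE group_number = 1;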


***


Dec 10, 2015

Database 12c hiccups!

Human nature - we forget bad things easily if they are not bad enough. I thought I would put this in writing so that it may help others and remind me of the lessons for next time. I wanted to write a couple of pages, but I have a feeling that nowadays people don't have time to spend on reading, so I am trying to be "to the point".

About

This is all about the trouble we faced after we upgraded the database from 11gR2 to 12cR1.

Platform

Solaris 10, x86-64
From database version: 11.2.0.3
To database version: 12.1.0.2
System: Highly transactional (305,000+ logical reads/sec, 1,800+ executes/sec), 14TB OLTP database for n-tier JEE web applications with 10,000+ concurrent online users (out of 350,000+ total users).

When 

The upgrade took place in March-April 2015.

Why 12c

1. The next big database move for us is migrating to RAC architecture sometime in 2016. Logically, that should be on the 12c version; it has been out for a long time, with many improvements in all layers. Before we migrate to RAC, why not run the new database version (12c) in production for some time, to figure out the odds earlier (before RAC) and individually, so that we won't have to introduce multiple new things together and fight too many odds at the same time (12c and RAC).

2. There was a bug in the memory and/or RMAN backup process that sometimes caused backups to terminate abruptly. It was triggered when we grew the SGA over 28GB. We needed to fix this by upgrading to a higher version.

3. The 12c optimizer got smarter, with many cool features including adaptive plans, adaptive statistics and enhancements to existing features - our tests also showed that most queries got about 20% faster!

4. People had been saying good things about 12c since it came out, and it was already 1.5 years old. The 12c multitenant architecture would be the ultimate fit for the cloud - and Oracle has been pushing in that direction.


The problems

1. The DBUA upgrade process failed halfway through, and the process could not be rolled back. We were in a situation where we could lose the database, and restoring the 13TB database from backups was not a good option. So we had to go ahead with the manual upgrade process, which was tedious, time consuming, and a bit risky too. Support could not say clearly what caused the failure; they pointed to invalid objects in "sys_inv_objs", which we knew could not be the cause - the 3-4 invalid objects we had were owned by our own user schema (created by us) and could never trigger the problem; we had tested the upgrade process with the same invalid objects in our test environments many times, anyway. Still, it is good to have all invalid objects eliminated before any upgrade so that people can't point fingers in the wrong direction.

Tip: When upgrading to 12c, it's better to do a "manual upgrade using SQL scripts and utilities" rather than using DBUA, even if DBUA works fine in the test environments. This is especially true if the environment has a previous upgrade history - for example, a database already upgraded from earlier versions, such as 11.1 to 11.2, or 11.2.0.2 to 11.2.0.3.


2. LogMiner Bug 19931709. When we brought up the database after the upgrade and tested a little, we hit this problem. This caused the GoldenGate extract process to fail with the following errors - 

WARNING OGG-00869 Oracle GoldenGate Capture for Oracle, extdf01.prm: ORA-26914: Unable to communicate with GoldenGate capture process "OGG$CAP_EXTDF01" from outbound server "OGG$EXTDF01".

ERROR OGG-02077 Oracle GoldenGate Capture for Oracle, extdf01.prm: Extract encountered a read error in the asynchronous reader thread and is abending: ORA-01343: LogMiner encountered corruption in the log stream


The patch was not available for our platform; Oracle development immediately worked on it and made the patch available. The strange thing was, we did not have any problem in the test environments!

3. GoldenGate process failure after fixing Bug 19931709. Even after we applied the fix for the bug, we were still receiving errors from the GoldenGate process. The thing was - once the corruption had entered the redo stream, it could not be reversed. The fix for Bug 19931709 only stops the corruption; it does not repair the corruption that already exists. So we had to reconfigure the OGG extract process - confirmed at last by one of the support engineers. Earlier we also got a ridiculous suggestion from support: "The only solution is to re-instantiate the target database from the source database. So, you will need to re-do the Initial Load, so that GoldenGate starts from a new SCN." We could not believe that suggestion!


4. CLOB data corruption from a missing GoldenGate parameter. We had to upgrade the GoldenGate version at the destination database from 11.2.1.0.3 to 12.1.2.1.0 since we upgraded the database version. We did not have the SOURCECHARSET parameter set in the earlier version - we did not need it, since the source and destination database character sets are the same and we also had NLS_LANG set properly. The earlier version worked fine, but after the upgrade, because we did not set SOURCECHARSET, the CLOB data got corrupted at the destination. Fortunately we noticed immediately, set the SOURCECHARSET parameter, and re-synced the corrupted data from the source.

We were confused reading one of the Oracle docs, which says -

"Use the SOURCECHARSET parameter to control the conversion of data from the source character set to the target character set by Replicat. Replicat converts character sets by default for versions 11.2.1 and later" 

"Extract versions 11.2.1 and later write information about the source character set to the trail for use by Replicat, and any SOURCECHARSET specification is ignored."

Unfortunately, this was not mentioned clearly in the GoldenGate upgrade document. After consulting Oracle support and spending a lot of time, we realized that the character set features change in almost every GoldenGate release, and that there are inconsistencies in the documentation. Anyway, our bad - we should have caught this in our test environments. I think this was our only fault (especially my fault - I relied on others for this part of the testing) in the whole upgrade job; the other problems we had tested for and were not able to reproduce in the test environments.

[Up to this point, these were all back-to-back problems - 90 hours had already passed since we started the upgrade to 12c. During those 90 hours we had to work for about 80 hours (with MOS on multiple SRs, on the phone, on Hangouts/GoToMeeting with global teams, and on experiments in test environments) - that was too much!]

5. Error ORA-600 [qksanGetTextStr:1]. We hit this 12c optimizer bug when we went live with production - probably internal Bug 21680192 - ORA-00600: INTERNAL ERROR CODE, ARGUMENTS: [QKSANGETTEXTSTR:1] - and then applied the patch. Our luck - even after we applied the patch, the errors kept coming. Fortunately this was only happening for a few queries where SQL Plan Baselines were being used, out of the 2000+ distinct SQLs we have in the system. We identified those few problem SQLs and removed the plan baselines for those only, which solved the problem. The problem could not be reproduced, so Oracle support could not tell why we still received errors even after applying the patch. I did a separate post for this part earlier (the Sep 11 post below).


6. Error ORA-04031: unable to allocate bytes of shared memory ("large pool","unknown object","large pool","CTWR dba buffer"). This one happened after running the database for a few days; we hit it during the maintenance window. The parallel jobs kicked off by the weekend maintenance window demanded far more than the large pool could handle. We then had to adjust the values of the large_pool_size, parallel_min_servers and parallel_servers_target parameters in consultation with MOS - because the values the database carried over from 11g behaved abnormally with the 12c logic and took down the server.
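
The adjustments were along these lines; the values below are placeholders only (not our actual settings) and need to be sized against your own workload -

-- placeholder values; a restart (or a dynamic change, where the parameter allows it) applies them
ALTER SYSTEM SET large_pool_size = 2G SCOPE = SPFILE;
ALTER SYSTEM SET parallel_min_servers = 32 SCOPE = SPFILE;
ALTER SYSTEM SET parallel_servers_target = 128 SCOPE = SPFILE;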

7. ORA-22275: invalid lob locator specified. We hit this next, after a few days. This was due to internal Bug 14044260. Again, the patch was not available for our platform; as we needed it badly, the patch was built and we applied it in production to fix the issue. The issue was happening in a random fashion and was not reproducible in any of the test environments.

8. Highly parsed system SQL. After running for a few days and analyzing the performance, we discovered that this system query was at the top of parse and execution counts and was causing shared pool concurrency locks when child cursors were invalidated -

select default$ from col$ where rowid=:1 

This was contributing significantly to the overall negative performance. The query is new in 12c and is issued when a hard parse happens (our hard parse rate was never high, always less than 2%). It was due to Bug 20907061: HIGH # OF EXECUTIONS FOR RECURSIVE CALL ON COL$ - we fixed it by applying the respective patch.
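
A quick way to spot this kind of statement is to sort the cursor cache by parse and execution counts; a simple sketch against v$sqlarea -

-- top statements by parse calls (uses the 12c row-limiting clause)
SELECT sql_id, parse_calls, executions, SUBSTR(sql_text, 1, 60) sql_text
FROM   v$sqlarea
ORDER  BY parse_calls DESC
FETCH FIRST 10 ROWS ONLY;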

9. System query at the top of sharable memory: We have been keeping an eye on shared pool growth, since the shared pool has grown unexpectedly (about 4 times larger than on 11.2; we use ASMM). There is definitely a memory leak - there are far more cursors and invalidations than we had in 11.2. Right now, the following SQL (written in short form) has been consuming significant system resources and is at the top of sharable memory (about 3% of the total) -

SELECT /*+ OPT_PARAM('_fix_control' '16391176:1') */ ...
FROM TABLE(SYS.GV$(CURSOR( SELECT INST_ID, GROUP
WHERE ((SESSION_ID = :B2 AND SESSION_SERIAL# = :B1 AND USERENV('INSTANCE') = :B3 )..


Then we discovered Bug 20476175 : CURSOR IS NOT SHARED BY A DIFFERENT SESSION IF _FIX_CONTROL IS USED IN OPT_PARAM. The fix is in the October PSU 21359755. There are many other fixes in that patch as well, and we have been planning to apply the PSU in production soon (God knows what's coming next); for this we will probably need to revert a couple of patches that were already applied.
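
The same kind of check, ordered by sharable memory, shows which cursors are eating the shared pool -

-- top cursors by sharable memory in the shared pool
SELECT sql_id, sharable_mem, version_count, SUBSTR(sql_text, 1, 60) sql_text
FROM   v$sqlarea
ORDER  BY sharable_mem DESC
FETCH FIRST 10 ROWS ONLY;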


Lessons: Quite a few lessons for us; these will help us choose and shape the next direction for the database and architecture. Among several decisions, one concerns the database operating system: we have decided to go with "Oracle Linux", since that is Oracle's first and foremost platform for testing any new feature and new release, and in most cases the first set of patches is released for that platform - and I believe this will remain the case in the future.


***

Sep 11, 2015

The "ORA-600 [qksanGetTextStr:1]" in 12.1.0.2 and workaround

One of the situations we had after we upgraded to 12c - we had been receiving ORA-600 [qksanGetTextStr:1] errors. After looking at MOS, we found that this was a bug and that the patch for it is 19141838. It is due to a glitch in SQL Plan Baselines. One workaround is disabling them completely (ALTER SYSTEM SET OPTIMIZER_USE_SQL_PLAN_BASELINES=FALSE). However, that made no sense for us; we could not do it, because we have been relying heavily on SQL Plan Baselines since 11gR2.

After we applied the patch (19141838), the problem was gone for one database, but we kept hitting errors on the other database that we had also upgraded to 12c. Fortunately this mostly happened for two SQLs out of thousands. While support was trying to understand why the patch did not work for the other database, my workaround (other than completely disabling SQL Plan Baselines) was as follows -

1. Find the error in the alert* file (the Oracle alert log file, which should be under $ORACLE_BASE/diag/rdbms/.... )
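
If you are not sure where the alert log and trace files live, v$diag_info points to them -

-- locations of the alert log, trace directory and the current session's trace file
SELECT name, value
FROM   v$diag_info
WHERE  name IN ('Diag Alert', 'Diag Trace', 'Default Trace File');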

2. Locate the Incident details trace file from alert log

3. Check the trace file and find the Current SQL for the session (the problem SQL)

4. Find the SQL Handle for the problem SQL - if you can't find the SQL by SQL ID, you will have to match by text most of the time (be careful to pick the right SQL). For example -

SELECT signature, sql_handle, plan_name, sql_text
FROM dba_sql_plan_baselines
WHERE signature IN (SELECT exact_matching_signature FROM v$sql WHERE sql_id = '&sql_id' );

OR, (if no output for the above)

SELECT signature, sql_handle, plan_name, sql_text
FROM dba_sql_plan_baselines
WHERE sql_text LIKE 'SELECT rec.id, rec.mar_id, rec.client, rec.client_display_name%'
ORDER BY signature, sql_handle;



5. Drop all plans (using SQL Handle) from the baselines associated with the SQL -


SET SERVEROUTPUT ON;

DECLARE
  l_plans_dropped  PLS_INTEGER;
BEGIN
  l_plans_dropped := DBMS_SPM.drop_sql_plan_baseline (
    sql_handle => '&sql_handle',
    plan_name  => NULL);
  DBMS_OUTPUT.put_line(l_plans_dropped);
END;
/



***

Jul 24, 2015

When SQL suddenly ends up with wrong execution plan and Tuning Advisor fails to help

If you suddenly notice that a SQL runs badly although it did not cause any problem earlier, and Tuning Advisor can't suggest any fix, it can be addressed as follows -

1. Look at some statistics to check how bad the query is, using values like the ones below -


SELECT sql_text, executions,
       DECODE(executions, 0, 0, ROUND(((elapsed_time/1000000)/executions), 2)) "Elapsed/Exec",
       version_count, invalidations, parse_calls
FROM   v$sqlarea
WHERE  sql_id = '&sql_id'
AND    last_active_time BETWEEN (sysdate - 15/1440) AND sysdate;


2. Look at the current plan to see if it looks bad/wrong -

SELECT * FROM TABLE(dbms_xplan.display_cursor('&sql_id'));


3. Flush the cursor for the bad plan and let it reload -


SET SERVEROUT ON
DECLARE
  stmt VARCHAR2(1000);
BEGIN
  -- purge every child cursor of the SQL from the shared pool so the next execution reparses it
  FOR dr IN (SELECT sql_text, address || ',' || hash_value ah
             FROM   v$sqlarea
             WHERE  sql_id = '&sql_id')
  LOOP
    stmt := 'dbms_shared_pool.purge(' || dr.ah || ',''c'')';
    dbms_shared_pool.purge(dr.ah, 'c');
    dbms_output.put_line(stmt);
  END LOOP;
END;
/

4. Check the current execution plan -


SELECT * FROM TABLE(dbms_xplan.display_cursor('&sql_id'));


5. If the plan is still the same and wrong, look at the old execution plans from AWR snapshots for the query and see what the difference is -


@$ORACLE_HOME/rdbms/admin/awrsqrpt.sql


Or,

Look at all the plans for the SQL in AWR snapshots -


SELECT plan_table_output FROM table (dbms_xplan.display_awr('&sql_id'));

6. Collect statistics for all the tables associated with the problem query (if you know the statistics are not recent) - for example, something like the sketch below.
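
A minimal sketch (the &owner and &table_name substitution variables are placeholders; repeat for each table the query touches) -

-- no_invalidate => FALSE invalidates dependent cursors so the fresh statistics take effect quickly
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => '&owner', tabname => '&table_name', cascade => TRUE, no_invalidate => FALSE);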


7. Flush the cursor for the bad plan and let it reload (step 3 above)


* Note that a permanent solution could be setting up a SQL Plan Baseline for the SQL, if a wrong execution plan was the problem (a sketch follows below).
* There may be high spikes/locks in shared memory while the plan is being purged - it depends on the system type and activity.
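
A minimal sketch for loading the current good plan from the cursor cache into a baseline (assuming the SQL and its good plan are still in the shared pool) -

SET SERVEROUTPUT ON;

DECLARE
  l_plans_loaded  PLS_INTEGER;
BEGIN
  -- load the cached plan(s) for this SQL_ID as accepted baseline plans
  l_plans_loaded := DBMS_SPM.load_plans_from_cursor_cache(
    sql_id => '&sql_id');
  DBMS_OUTPUT.put_line(l_plans_loaded);
END;
/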

***