Dec 19, 2015

RAC storage option: Why Oracle ASM?

Long story in short - we have been planning for RAC. The database team is in favor of doing ASM for RAC where storage people are trying to push NFS/dNFS. NFS is the filesystem we have for single instance database at the moment, our storage system is NetApp. It is very unfortunate that people are saying 'ASM is not necessary' without understanding the benefits and differentiating factors of ASM for RAC.

Anyway, I compiled the list of reasons I had in mind, I've been kind of playing the same record for quite some times now.


Why Oracle ASM?


1. Recommended & predictable: It’s Oracle recommended and most commonly used storage architecture specially for RAC environment. This is mostly exercised option, possibility to hit a bug or any issue is less down this path. Oracle continuously working on improvements, every release comes with new ASM features. Training materials, community discussions, conference sessions, Oracle customer support and Oracle testing most familiar with this option.

2. Understands database file types: ASM is specially designed having database filesystem in mind. It understands the database file types and does striping by types or as instructed. For example, online redo could be striped 128KB where different datafiles could be striped as 1MB or as needed for performance. The redo striping could be vital for concurrency on 'log file sync' for a RAC environment. The other strip size could potentially decrease the database I/O and improve overall performance.

3. Wide range of I/O balancing: ASM rebalances and stripes I/O not only multiple disks but also multiple volume/aggregates removing potential I/O hotspots and ease of space management for large databases. Thus no need to think about I/O balancing and managing storage by volume/file system. Just create a single ASM disk group for data/index combining LUNs from different disk groups and storage controller of same type. Presently in volume based system, we need to worry about which datafiles would need to go to which volume and controller for better space and I/O utilization, which is getting difficult day by day as the database size increases, and will not be possible to do this correctly, even if I overlook the datafiles management overhead. ASM is the solution for this.

4. Convenient for big databases: Oracle Managed Files (OMF) and bigfile tablespaces are best suited for big databases and with ASM.

5. No I/O overhead: ASM does not perform any I/O by itself, so there is no translation layer for Oracle I/O to datafiles into disk block offsets. I/O from databases is directly applied to disk volumes without modification. This reduces overhead and improves performance (point collected from a doc). Something like dNFS I believe.

6. Not tied to underlying storage vendor: If we want to change storage technology (say ZS3) in future, having database on ASM will make things easy and feasible, changing underlying storage technology would be transparent, no visible database migration will be required. After 3 years, the database size would be 25TB+, migrating 25TB+ database to a new storage technology would be a mammoth task, will require tremendous effort and may require several weeks downtime for migration (unless we built another db in parallel). Using ASM won’t lock us down to any storage technology.

7. Common references: We have not seen mentionable RAC production system without ASM, not from the few dozen people we know. Never seen any Database Architect or Database Administrator discussing RAC and they don’t have ASM on the storage side. It’s very conventional having ASM in storage layer with RAC - that’s the proven path and inlined with RAC.

From the studies of RAC customers (from publicly available sources), it’s clear that most people selected ASM for RAC implementations.

http://www.oracle.com/technetwork/database/features/ha-casestudies-098033.html
http://www.oracle.com/technetwork/documentation/ha-casestudies-083953.html
http://www.slideshare.net/MarkusMichalewicz/oracle-rac-customer-proven-scalalbility
http://www.oracle.com/us/products/middleware/data-integration/odi-customer-reference-booklet-2184014.pdf


***


Dec 10, 2015

Database 12c hiccups!

Human nature - forgets bad things easily if that is not bad enough. Thought to put this in writing so that it may help others and reminds the lessons for the next time. Wanted to write a couple of pages but have a feeling - now a days people doesn't have time to spend on reading. Trying to be "to the point".

About

This is all about the trouble we faced after we upgraded database from 11gR2 to 12cR1.

Platform

Solaris 10, x86-64
From database version: 11.2.0.3
To database version: 12.1.0.2
System: High transactional (305,000+ logical read/sec, 1800+ executes/sec), 14TB OLTP database for n-tier JEE web applications having 10,000+ concurrent online users (out of 350,000+ total users).

When 

The upgrade took place in March-April 2015.

Why 12c

1. The next big database move for us is migrating to RAC architecture sometimes in 2016. Logically, it should be on 12c version. It's out there for long time with so many improvements in all layers. Before we migrate to RAC architecture, how about working on the new database version (12c) in production for sometime to figure out the odds earlier (before RAC) and individually, so that we won't have to introduce multiple new things together, and won't have to fight with too many odds at the same time (12c and RAC).

2. There was a bug in memory and/or RMAN backup process and sometimes terminates abruptly. This triggered when we grew SGA over 28GB. Needed to fix this upgrading to higher version.

3. 12c optimizer got smarter, many cool features are there including adaptive plan, adaptive statistics, enhancements in existing features - our tests also showed most of the queries got faster about 20%!

4. People had been telling good things about 12c since it's out there. It was already 1.5 years old. The 12c multi-tanent architecture would be the ultimate fit for the cloud - and Oracle have been persuading towards that.


The problems

1. The DBUA upgrade process failed half way. The process could not be rolled back. Had a situation to lose the database. It was not a good option to restore the 13TB size database from backups. So, had to go ahead with manual upgrade process which was tedious, time consuming, and a bit risky too. The support could not say clearly what was the cause of the failure, indicated invalid objects in "sys_inv_objs" which can't be the cause we knew- there were 3/4 invalid objects we had which were owned by our user schema (created by us) - those could never trigger the problem - we tested the upgrade process having the same invalid objects in our test environments many times, anyway. But it's good to have all invalid objects eliminated before any upgrade so that people can't put wrong pointers.

Tip: When upgrading to 12c, it's better to do "manual upgrade using SQL scripts and utilities" rather than using DBUA, even if that works fine in test environments. This is specially true if the environment had previous upgrade history. For example, when database already upgraded from some previous versions like- upgraded from 11.1 to 11.2, from 11.2.0.2 to 11.2.0.3 etc.


2. LogMiner Bug 19931709. When we brought up the database after the upgrade and tested a little, we hit this problem. This caused the GoldenGate extract process to fail with the following errors - 

WARNING OGG-00869 Oracle GoldenGate Capture for Oracle, extdf01.prm: ORA-26914: Unable to communicate with GoldenGate capture process "OGG$CAP_EXTDF01" from outbound server "OGG$EXTDF01".

ERROR OGG-02077 Oracle GoldenGate Capture for Oracle, extdf01.prm: Extract encountered a read error in the asynchronous reader thread and is abending: ORA-01343: LogMiner encountered corruption in the log stream


The patch was not available for our platform, Oracle dev immediately worked on it and made the patch available. Strange thing was, we did not have any problem in test environments!

3. GoldenGate process failure after fixing Bug 19931709. Even if we applied the fix for the bug, still receiving errors from GoldenGate process. The thing was - once the corruption had been entered into the Redo stream, it cannot be reversed. The fix for Bug 19931709 only stops the corruption, it does not repair the corruption that already exists. So, we had to reconfigure OGG extract process - confirmed by the one of support engineers at last. Earlier, also got ridiculous suggestion from support saying like "The only solution is to re-instantiate the target database form the source database. So, you will need to re-do the Initial Load, so that GoldenGate starts from a new SCN." Could not believe the suggestion!


4. CLOB data corruption for not setting GoldenGate parameter. We had to upgrade GoldenGate version at destination database from 11.2.1.0.3 to 12.1.2.1.0 since we upgraded the database version. We did not have SOURCECHARSET parameter set in earlier versions - we did not need to set this since source and destination database character set are same, and also we had NLS_LANG set properly. The earlier version worked good but after the upgrade, as we did not set SOURCECHARSET, at the destination, the CLOB data got corrupted. Fortunately we immediately noticed and set the param SOURCECHARSET, and synced corrupted data from the source. 

We were confused reading one of the Oracle docs saying  - 

"Use the SOURCECHARSET parameter to control the conversion of data from the source character set to the target character set by Replicat. Replicat converts character sets by default for versions 11.2.1 and later" 

"Extract versions 11.2.1 and later write information about the source character set to the trail for use by Replicat, and any SOURCECHARSET specification is ignored."

Unfortunately, this was not mentioned clearly the GoldenGate upgrade document. After consulting Oracle support and spending lot of time, we realized that character set features are being changed in almost every GoldenGate release, and there are inconsistencies in the documentations. Anyway, our bad - we should have caught this in our test environments. This is our only fault I think (specially my fault, I relied on others for this part of testing) in the whole upgrade job, the other problems we tested and were not able to reproduce in the test environments.

[Up to this, all are back to back problems - already 90 hours passed since we started upgrading to 12c. During those 90 hours, had to work for about 80 hours (with MOS on multiple SR, on phone, on hangout/go to meeting with global teams, experiments on test environments) - that was too much! ]

5. Error ORA-600 [qksanGetTextStr:1]. Then we hit this 12c optimizer bug when went live with the production. Probably this internal Bug 21680192 - ORA-00600: INTERNAL ERROR CODE, ARGUMENTS: [QKSANGETTEXTSTR:1] - then applied the patch. Our luck - even if we applied the patch, still errors were coming. Fortunately this was only happening for few queries where SQL Plan Baselines were being used, out of our 2000+ total distinct SQLs we have in the system. Then identified those few problem SQLs and removed the plan baselines for those only. That solved the problem. The problem could not be reproduced - so Oracle support could not tell why we still received errors even after applying the patch. I did a separate post for this part earlier -  


6. Error ORA-04031: unable to allocate bytes of shared memory ("large pool","unknown object","large pool","CTWR dba buffer"). This time it happened after running the database for few days. We hit this during the maintenance window. The parallel jobs kicked by weekend maintenance window were way high than the capacity of Large Pool. Then we had to adjust the values of the parameters large_pool_size, parallel_min_servers and parallel_servers_target consulting MOS - because the values the database carried from 11g version was behaving abnormally with 12c logic and took down the server. 

7. ORA-22275: invalid lob locator specified. Next we hit this after few days. This was due to the internal Bug 14044260. Again, the patch was not available for our platform, as we needed it badly, then the patch was built and we applied that in production to fix the issue. The issue was happening at random fashion and was not reproducible in any of the test environments. 

8. High parsed system SQL. After running few days and analyzing the performance, discovered that this system query was on top of parse and execution and causing shared pool concurrency locks when child cursor invalidates - 

select default$ from col$ where rowid=:1 

This was contributing much for negative performance overall. This is new in 12c when a hard parse happens (we never had high hard parses, less than 2% always). This was due to the Bug 20907061: HIGH # OF EXECUTIONS FOR RECURSIVE CALL ON COL$ - then fixed that applying the respective patch. 

9. System query on top of sharable memory: Have been keeping eyes on the shared pool growth since the shared pool has grown unexpectedly (about 4 times than 11.2, we use ASMM). There is definitely memory leaks, there are too many cursors and invalidations than we had in 11.2. Right now, the following SQL (written in short form) has been consuming signifiant system resources and was on top of sharable memory (about 3% of total) - 

SELECT /*+ OPT_PARAM('_fix_control' '16391176:1') */ ...
FROM TABLE(SYS.GV$(CURSOR( SELECT INST_ID, GROUP
WHERE ((SESSION_ID = :B2 AND SESSION_SERIAL# = :B1 AND USERENV('INSTANCE') = :B3 )..


Then discovered the Bug 20476175 : CURSOR IS NOT SHARED BY A DIFFERENT SESSION IF _FIX_CONTROL IS USED IN OPT_PARAM. The fix is in October PSU 21359755. There are many other fixes in that patch also, have been planning to apply the PSU soon in production (God knows what's coming next), for this we probably need to revert a couple of patches those already applied. 


Lessons: Quite a few lessons which would be helping us to choose and shape next sort of direction with database/architecture. Out of few decisions, one would be for database operating system, we have decided to go with "Oracle Linux" since that is Oracle's first and foremost testing platform to test any feature, new releases - also first set of patches are released for that platform, and this will be the same in future for sure. So why not us?


***

Sep 11, 2015

The "ORA-600 [qksanGetTextStr:1]" in 12.1.0.2 and workaround

One of the situations we had after we upgraded to 12c - we had been receiving ORA 600 [qksanGetTextStr:1] errors. After looking at MOS, found that this was a bug and the patch for that is 19141838. This is due to some glitch in SQL Plan Baselines. One workaround is disabling that completely (ALTER SYSTEM SET OPTIMIZER_USE_SQL_PLAN_BASELINES=false).  However,  it made no sense for us, we could not do that, we have been relying heavily on SQL Plan Baselines since 11gR2.

After we applied the patch (19141838), the problem gone for one database but had been hitting errors from the other database which we also upgraded to 12c. This was mostly happened for two SQL fortunately out of thousands. While support trying to understand why patch did not work for the other database, my workaround for this (other than completely disabling SQL Plan Baselines) was below -

1. Find the error in alert* file (Oracle alert log file, which should be in $ORACLE_BASE/diag/rdbms/.... )

2. Locate the Incident details trace file from alert log

3. Check the trace file and find the Current SQL for the session (the problem SQL)

4. Find the SQL Handle for the problem SQL - have to match by text most of the time (be careful about picking the right SQL) - if you don't find the SQL by SQL ID. For example -

SELECT signature, sql_handle, plan_name, sql_text
FROM dba_sql_plan_baselines
WHERE signature IN (SELECT exact_matching_signature FROM v$sql WHERE sql_id = '&sql_id' );

OR, (if no output for the above)

SELECT signature, sql_handle, plan_name, sql_text
FROM dba_sql_plan_baselines
WHERE sql_text LIKE 'SELECT rec.id, rec.mar_id, rec.client, rec.client_display_name%'
ORDER BY signature, sql_handle;



5. Drop all plans (using SQL Handle) from the baselines associated with the SQL -


SET SERVEROUTPUT ON;

DECLARE
l_plans_dropped  PLS_INTEGER;
BEGIN
l_plans_dropped := DBMS_SPM.drop_sql_plan_baseline (
sql_handle => '&sql_handle',
plan_name  => NULL);
DBMS_OUTPUT.put_line(l_plans_dropped);
END;
/



***

Jul 24, 2015

When SQL suddenly ends up with wrong execution plan and Tuning Advisor fails to help

If you suddenly notice that a SQL runs bad but it did not cause any problem earlier, this could be addressed as follows when Tuning Advisor can't suggest any fix -

1. Look at some statistics to check how bad is the query looking at various values like below -


SELECT sql_text, executions, DECODE(executions,0,0,round(((elapsed_time/1000000)/executions),2)) "Elapsed/Exec", version_count, invalidations, parse_calls
FROM v$sqlarea where sql_id='&sql_id'
AND last_active_time BETWEEN (sysdate - 15/1440) AND sysdate;


2. Look at the current plan if it looks bad/wrong -

SELECT * FROM TABLE(dbms_xplan.display_cursor('&sql_id'));


3. Flush the cursor for the bad plan and let it reload -


SET SERVEROUT ON
DECLARE
stmt varchar2(1000);
BEGIN
FOR dr IN (SELECT sql_text, address||','||hash_value ah from v$sqlarea where sql_id='&sql_id')
LOOP
stmt := 'dbms_shared_pool.purge('|| dr.ah||',''c'')';
dbms_shared_pool.purge(dr.ah,'c');
dbms_output.put_line(stmt);
END LOOP;
END;
/

4. Check the current execution plan -


SELECT * FROM TABLE(dbms_xplan.display_cursor('&sql_id'));


5. If the plan is still same and wrong, look at the old execution plan from AWR snapshots for the query and see what's the difference -


@$ORACLE_HOME/rdbms/admin/awrsqrpt.sql


Or,

Look at all the plans for the SQL in AWR snapshots -


SELECT plan_table_output FROM table (dbms_xplan.display_awr('&sql_id'));

6. Collect statistics for all the associated tables for the problem query (if you know statistics is not recent)


7. Flush the cursor for the bad plan and let it reload (step 3 above)


* Note that a permanent solution could be setting up SQL Plan Baselines for the SQL if wrong execution plan was the problem.
* There may be high spikes/lock in shared memory when the plan is being purged - depends on system type and activities.

***

May 4, 2014

Scaling application with paginated query using rownum

There was a case where one of the application modules getting slower day by day as the usage and data increased. The module is a mail like system where we show 20 items per page. The initial design was not that good, we fetched all the items from database and then paginated in application layer - 20 list item per page. This caused slower response for loading the inbox, specially for those people who had a large number of items.

Thought this can be improved if we fetch selective number of records from database. We show 20 records per page, so we should only try to bring 20 records, not all at the very beginning. For the first page, we will bring records from 1 to 20, for the next page (2nd page), we will bring records from 21 to 40 and so on. This helped a lot. The initial response time decreased from several minutes to just few seconds! I loved the response time graphs (see below) after we made the changes.


/** Old List query **/
SELECT mbox.id mbox_id, mbox.mailbox_type mbox_mailbox_type, mbox.read_date mbox_read_date, mbox.status mbox_status,...
FROM mailbox mbox
LEFT JOIN message msg on (mbox.message_id = msg.id)
LEFT JOIN user u ON (u.id = msg.sender_id )
LEFT JOIN recipient r on (r.message_id = msg.id AND r.idx=0)
LEFT JOIN user ur ON (ur.id = r.target_user_id)
WHERE mbox.user_id= :user
AND mbox.folder_id= :folder
AND msg.test = :test
AND msg.updated >= :dt
ORDER BY msg_updated DESC;



/** New paginated List query **/
SELECT *
  FROM (select /*+ FIRST_ROWS(20) */
  a.*, ROWNUM rnum
      from ( SELECT mbox.id mbox_id, mbox.mailbox_type mbox_mailbox_type, mbox.read_date mbox_read_date, mbox.status mbox_status,...
              FROM mailbox mbox
              LEFT JOIN message msg on (mbox.message_id = msg.id)
              LEFT JOIN user u ON (u.id = msg.sender_id )
              LEFT JOIN recipient r on (r.message_id = msg.id AND r.idx=0)
              LEFT JOIN user ur ON (ur.id = r.target_user_id)
              WHERE mbox.user_id= :user
              AND mbox.folder_id= :folder
              AND msg.test = :test
              AND msg.updated >= :dt
              ORDER BY msg_updated DESC ) a
      where ROWNUM <= 20 )
WHERE rnum  >= 1;


Similarly, the Count queries were improved fixing the count to 100. In that case, if there is more than 100 new items in inbox, we say 100+.

/** Old Count query **/
SELECT COUNT(id) inbox_lo
FROM mailbox mbx 
JOIN message msg ON (mbx.message_id = msg.id)
LEFT JOIN user l ON (l.id = msg.sender_id)
WHERE mbx.login_id = :login
AND mbx.folder_id = :folder
AND mbx.mailbox_type = :mailbox
AND mbx.status= :status
AND msg.test = :test
AND mbx.created >= :dt;


/** New Count query **/
SELECT /*+ FIRST_ROWS(100) */ COUNT(id) inbox_lo
FROM mailbox mbx 
JOIN message msg ON (mbx.message_id = msg.id)
LEFT JOIN user l ON (l.id = msg.sender_id)
WHERE mbx.login_id = :login
AND mbx.folder_id = :folder
AND mbx.mailbox_type = :mailbox
AND mbx.status= :status
AND msg.test = :test
AND mbx.created >= :dt
AND ROWNUM <= 100;


The response time improved amazingly - users now feel the difference!

***

Apr 28, 2014

TNS-12599: TNS:cryptographic checksum mismatch

After upgrading to 11.2.0.3.0, we had been receiving this error "TNS-12599: TNS:cryptographic checksum mismatch" quite frequently.

Found related information in Oracle support. The thing is - when Enterprise Manager logs in or out from a 11.1 (11gR1) or 11.2 (11gR2) target database or as the OMS database repository, it causes several TNS-12599 errors in the alert.log of the respective database.

This is for the bug 9953045.

Workaround:

1. On the target database side, set SQLNET.ENCRYPTION_TYPES_SERVER= (3DES168)  in sqlnet.ora

2. Patches for bug 9953045 will fix the problem.



Apr 2, 2014

Replacing SQL literals with Global Temporary Table backfired at the beginning

I always emphasize on understanding data and business domain for query optimization. One must have technical skills in combination with the domain knowledge. Other than the blend, who could tune at this level!

We planned to tune some of the queries where we use literals and can't use bind variables as those are inside 'IN clause', and we don't exactly know how many values would be there (dynamically added by the application). We decided to replace the literals in 'IN clause' by Global Temporary Table (GTT). That change made every other queries perform better or at least same as before other than the SQL I am going to talk below. Changing literals in 'IN clause' caused to SQL to come up with very wrong execution plan and blundered our system's stability - the frequency of execution was very high for the SQL, we could not afford the bad plan.

After analyzing the plan, I found that the SQL with Global Temporary Table (GTT) doing unexpected Nested Loops at the very early state of execution order which was causing more data to process in other steps for the result. We know that, the join with GTT should be done at the later or last stage of execution order. And the best way (for the best selectivity) is to start execution with the underlying table mv_caseload_priv (through the index MV_PGM_CASE_I) we have in the client_access_role view. I drew the plan by hand and attached below for a clear understanding what was going on. The Nested Loop (NL, marked as red circle) was very unexpected, also the execution start point (marked as red cross) was not optimum.


The original SQL and the plan, only problem was not having bind variables, sometimes this 
ended up with terrible execution plan for some bind values in the SQL, was uncontrollable.

SQL > SELECT DISTINCT login_id
FROM client_access_role car
WHERE car.pgm_id = :pgm
AND car.role IN (21,22,112,......);


Elapsed: 00:00:00.11
------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                         | 11900 |   429K|   678   (1)| 00:00:01 |       |       |
|   1 |  HASH UNIQUE                             |                         | 11900 |   429K|   678   (1)| 00:00:01 |       |       |
|*  2 |   HASH JOIN                              |                         | 11900 |   429K|   676   (1)| 00:00:01 |       |       |
|   3 |    INLIST ITERATOR                       |                         |       |       |            |          |       |       |
|*  4 |     INDEX RANGE SCAN                     | AR_ROLE_ACL_I           | 16724 |   130K|    31   (0)| 00:00:01 |       |       |
|   5 |    NESTED LOOPS                          |                         |  9589 |   271K|   643   (1)| 00:00:01 |       |       |
|   6 |     MAT_VIEW ACCESS BY GLOBAL INDEX ROWID| MV_CASELOAD_PRIV        |   279 |  3348 |    83   (0)| 00:00:01 | ROWID | ROWID |
|*  7 |      INDEX RANGE SCAN                    | MV_CASE_PRIV_PGM_I      |   279 |       |     3   (0)| 00:00:01 |       |       |
|*  8 |     INDEX RANGE SCAN                     | CA_CASELOAD_LOGIN_ACL_I |    34 |   578 |     2   (0)| 00:00:01 |       |       |
------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("CA"."ACL_ID"="R"."ACL_ID")
   4 - access("R"."ROLE"=21 OR "R"."ROLE"=22 OR "R"."ROLE"=112)
   7 - access("CP"."PGM_ID"=TO_NUMBER(:PGM))
   8 - access("CP"."CASELOAD_ID"="CA"."CASELOAD_ID")
Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
         97  consistent gets


Replaced 'IN clause' with a Global Temporary Table for generating fixed sort of plan, which 
backfired, caused severe performance problem for the unexpected terrible plan.

SQL> SELECT DISTINCT login_id
FROM client_access_role car
WHERE car.pgm_id = :pgm
AND car.role IN (SELECT value FROM gtt_in_list_2);


Elapsed: 00:00:18.98
Execution Plan
----------------------------------------------------------
Plan hash value: 1525097291
-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name            | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                 |     1 |    50 |    34   (3)| 00:00:01 |       |       |
|   1 |  HASH UNIQUE                    |                 |     1 |    50 |    34   (3)| 00:00:01 |       |       |
|   2 |   NESTED LOOPS                  |                 |     1 |    50 |    33   (0)| 00:00:01 |       |       |
|   3 |    NESTED LOOPS                 |                 |     1 |    38 |    32   (0)| 00:00:01 |       |       |
|   4 |     NESTED LOOPS                |                 |     1 |    21 |     1   (0)| 00:00:01 |       |       |
|   5 |      INDEX FULL SCAN            | GTT2_VAL_I      |     1 |    13 |     0   (0)| 00:00:01 |       |       |
|   6 |      TABLE ACCESS BY INDEX ROWID| ACL_ROLE        |     1 |     8 |     1   (0)| 00:00:01 |       |       |
|*  7 |       INDEX RANGE SCAN          | AR_ROLE         |  3787 |       |     1   (0)| 00:00:01 |       |       |
|   8 |     TABLE ACCESS BY INDEX ROWID | CASELOAD_ACCESS |   137 |  2329 |    31   (0)| 00:00:01 |       |       |
|*  9 |      INDEX RANGE SCAN           | CASE_ACC_ACL    |   137 |       |     2   (0)| 00:00:01 |       |       |
|  10 |    PARTITION HASH ITERATOR      |                 |     1 |    12 |     1   (0)| 00:00:01 |   KEY |   KEY |
|* 11 |     INDEX RANGE SCAN            | MV_PGM_CASE_I   |     1 |    12 |     1   (0)| 00:00:01 |   KEY |   KEY |
-------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   7 - access("R"."ROLE"="VALUE")
   9 - access("CA"."ACL_ID"="R"."ACL_ID")
  11 - access("CP"."PGM_ID"=TO_NUMBER(:PGM) AND "CP"."CASELOAD_ID"="CA"."CASELOAD_ID")
Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
    2817221  consistent gets


The tuned SQL by 'ORDERED' hint to guide the optimizer doing correct join order as we know best.

SQL> SELECT /*+ ORDERED */ DISTINCT login_id
FROM client_access_role c
JOIN gtt_in_list_2 x ON (c.role = x.value)
WHERE c.pgm_id = :pgm;



Elapsed: 00:00:00.08
Execution Plan
----------------------------------------------------------
Plan hash value: 3611759149
-------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name                    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |                         |     1 |    50 |   771   (7)| 00:00:01 |       |       |
|   1 |  HASH UNIQUE                              |                         |     1 |    50 |   771   (7)| 00:00:01 |       |       |
|*  2 |   HASH JOIN                               |                         |     1 |    50 |   770   (7)| 00:00:01 |       |       |
|   3 |    INDEX FULL SCAN                        | GTT2_VAL_I              |     1 |    13 |     0   (0)| 00:00:01 |       |       |
|*  4 |    HASH JOIN                              |                         |   311K|    10M|   762   (6)| 00:00:01 |       |       |
|   5 |     NESTED LOOPS                          |                         |  9589 |   271K|   643   (1)| 00:00:01 |       |       |
|   6 |      MAT_VIEW ACCESS BY GLOBAL INDEX ROWID| MV_CASELOAD_PRIV        |   279 |  3348 |    83   (0)| 00:00:01 | ROWID | ROWID |
|*  7 |       INDEX RANGE SCAN                    | MV_CASE_PRIV_PGM_I      |   279 |       |     3   (0)| 00:00:01 |       |       |
|*  8 |      INDEX RANGE SCAN                     | CA_CASELOAD_LOGIN_ACL_I |    34 |   578 |     2   (0)| 00:00:01 |       |       |
|   9 |     TABLE ACCESS FULL                     | ACL_ROLE                |   571K|  4467K|   105  (23)| 00:00:01 |       |       |
-------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("R"."ROLE"="X"."VALUE")
   4 - access("CA"."ACL_ID"="R"."ACL_ID")
   7 - access("CP"."PGM_ID"=TO_NUMBER(:PGM))
   8 - access("CP"."CASELOAD_ID"="CA"."CASELOAD_ID")
Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
        705  consistent gets






And we are back with the super performance!
***

Mar 31, 2014

What happened when we received ORA-04030: out of process memory - trying to allocate 824504 bytes (pga heap,kco buffer)


Here goes the story in short -

We had to bump up Oracle memory to address buffer cache growth. So, in our Solaris database box, we increased Oracle project memory in /etc/project from 32 GB to 64 GB - we had enough physical memory in the server - 128 GB.

Then we increased SGA_MAX_SIZE from 28 GB to 36 GB and MEMORY_MAX_SIZE 28 GB to 42 GB. We did not set the target parameters as we had not been not using Automatic Memory Management (AMM) or Automatic Shared Memory Management (ASMM).

Then from the following day, we started receiving errors while running the RMAN backup jobs -


ORA-00603: ORACLE server session terminated by fatal error
ORA-27103: internal error
Solaris-AMD64 Error: 9: Bad file number
Additional information: 108
Additional information: 4456448

Also, receiving ORA-04030 error in alert_log file.

ORA-04030: out of process memory when trying to allocate 824504 bytes (pga heap,kco buffer)
ORA-27103: internal error
Solaris-AMD64 Error: 9: Bad file number

It was clear to me that some memory allocation issue we hit as auto allocation of PGA memory was failing. It's production box, and internal error, so interpretation should come from Oracle Support. We opened SR. Sent several trace  and alert logs. At the beginning, they thought that it's OS resource allocation problem, then support thought it's a RMAN backup problem and possibly a bug, and suggested several alternatives. That did not help at all. Finally they forwarded the SR to Solaris OS Memory Engineers. They found something interesting for us.

After 3 weeks back and forth information with Oracle Support, we have been told that, by setting SGA_MAX_SIZE & MEMORY_MAX_SIZE we had enabled Dynamic Intimate Shared Memory (DISM) feature of Oracle Solaris. And this may not work in Solaris 10 x86 based system, also not recommended. So, we need to disable DISM by NOT setting SGA_MAX_SIZE, SGA_TARGET, MEMORY_MAX_SIZE and MEMORY_MAX_TARGET.

The steps we followed to disable DISM -

1. Setting individual SGA components (like shared pool, buffer cache, large pool etc.)

2. Create a PFILE from the SPFILE

3. Removed all the four parameters SGA_MAX_SIZE, SGA_TARGET, MEMORY_MAX_SIZE and MEMORY_MAX_TARGET from PFILE and let Oracle compute the SGA_MAX_SIZE aggregating all individual SGA components

4. Startup using the edited PFILE

5. Creating SPFILE from the edited PFILE

6. Shutdown & restart again so that SPFILE takes effect


Details of DISM could be found in the following doc - Oracle support referred this.

http://www.oracle.com/technetwork/articles/systems-hardware-architecture/using-dynamic-intimate-memory-sparc-168402.pdf


***