Mar 31, 2014

What happened when we received ORA-04030: out of process memory - trying to allocate 824504 bytes (pga heap,kco buffer)


Here goes the story in short -

We had to bump up Oracle memory to address buffer cache growth. So, in our Solaris database box, we increased Oracle project memory in /etc/project from 32 GB to 64 GB - we had enough physical memory in the server - 128 GB.

Then we increased SGA_MAX_SIZE from 28 GB to 36 GB and MEMORY_MAX_SIZE 28 GB to 42 GB. We did not set the target parameters as we had not been not using Automatic Memory Management (AMM) or Automatic Shared Memory Management (ASMM).

Then from the following day, we started receiving errors while running the RMAN backup jobs -


ORA-00603: ORACLE server session terminated by fatal error
ORA-27103: internal error
Solaris-AMD64 Error: 9: Bad file number
Additional information: 108
Additional information: 4456448

Also, receiving ORA-04030 error in alert_log file.

ORA-04030: out of process memory when trying to allocate 824504 bytes (pga heap,kco buffer)
ORA-27103: internal error
Solaris-AMD64 Error: 9: Bad file number

It was clear to me that some memory allocation issue we hit as auto allocation of PGA memory was failing. It's production box, and internal error, so interpretation should come from Oracle Support. We opened SR. Sent several trace  and alert logs. At the beginning, they thought that it's OS resource allocation problem, then support thought it's a RMAN backup problem and possibly a bug, and suggested several alternatives. That did not help at all. Finally they forwarded the SR to Solaris OS Memory Engineers. They found something interesting for us.

After 3 weeks back and forth information with Oracle Support, we have been told that, by setting SGA_MAX_SIZE & MEMORY_MAX_SIZE we had enabled Dynamic Intimate Shared Memory (DISM) feature of Oracle Solaris. And this may not work in Solaris 10 x86 based system, also not recommended. So, we need to disable DISM by NOT setting SGA_MAX_SIZE, SGA_TARGET, MEMORY_MAX_SIZE and MEMORY_MAX_TARGET.

The steps we followed to disable DISM -

1. Setting individual SGA components (like shared pool, buffer cache, large pool etc.)

2. Create a PFILE from the SPFILE

3. Removed all the four parameters SGA_MAX_SIZE, SGA_TARGET, MEMORY_MAX_SIZE and MEMORY_MAX_TARGET from PFILE and let Oracle compute the SGA_MAX_SIZE aggregating all individual SGA components

4. Startup using the edited PFILE

5. Creating SPFILE from the edited PFILE

6. Shutdown & restart again so that SPFILE takes effect


Details of DISM could be found in the following doc - Oracle support referred this.

http://www.oracle.com/technetwork/articles/systems-hardware-architecture/using-dynamic-intimate-memory-sparc-168402.pdf


***

Mar 22, 2014

Database query runs slow in second execution - why?

When a query returns result quickly in the first run but the same query runs slow in second execution or successive executions, it is probably for the cardinality feedback the query used for the other executions. We had a scenario recently.

The first execution returns in less than a second - 78 records. When the query goes for successive executions, it takes over 5 minutes to return the same result. This slow execution of the query slowed down one of the application servers and caused huge number of thread stuck and bog down the server at the end. I thought, this happened for the wrong cardinality feedback from the first execution of the query.

To become very sure about this, I did the following tests - 

SQL> alter session set statistics_level=all;
SQL> set autotrace traceon;

SQL> SELECT l.id, l.login_name, l.first_name, l.last_name, l.title_id, l.title 
FROM table_x l, view_y pr 
WHERE l.id = pr.login_id
AND pr.pgm_id IN (68894) 
AND l.enroll = 0 
AND l.status = 1 
ORDER BY l.last_name, l.first_name, l.login_name, l.title;

Elapsed: 00:00:00.78

Statistics
----------------------------------------------------------
          8  recursive calls
          3  db block gets
      16074  consistent gets
        629  physical reads
          0  redo size
       5783  bytes sent via SQL*Net to client
        575  bytes received via SQL*Net from client
          7  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
         79  rows processed

SQL> /

Elapsed: 00:05:49.29

Statistics
----------------------------------------------------------
       4069  recursive calls
          8  db block gets
      47259  consistent gets
     518490  physical reads
        272  redo size
       5783  bytes sent via SQL*Net to client
        575  bytes received via SQL*Net from client
          7  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
         79  rows processed

Note
-----
   - cardinality feedback used for this statement


To get rid of the cardinality feedback for this test case, I used hint with the query to tell the optimizer not to use the feedback. FYI - cardinality feedback could be turned off in session level and instance level by using ALTER SESSION SET "_OPTIMIZER_USE_FEEDBACK"=FALSE;


SQL> SELECT /*+ OPT_PARAM('_OPTIMIZER_USE_FEEDBACK' 'FALSE') */ l.id, l.login_name, l.first_name, l.last_name, l.title_id, l.title 
FROM table_x l, view_y pr 
WHERE l.id = pr.login_id
AND pr.pgm_id IN (68894) 
AND l.enroll = 0 
AND l.status = 1 
ORDER BY l.last_name, l.first_name, l.login_name, l.title;


Elapsed: 00:00:00.65

Statistics
----------------------------------------------------------
          0  recursive calls
          3  db block gets
      16069  consistent gets
          0  physical reads
         64  redo size
       5783  bytes sent via SQL*Net to client
        575  bytes received via SQL*Net from client
          7  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
         79  rows processed


A word about cardinality feedback

Cardinality feedback is introduced in Oracle 11g to make the optimizer working better. To describe the feature in short - when a query is being executed repetitively, it sends cardinality feedback if the estimated and actual rows have been revealed different in earlier executions. Anyways, we had a scenario related to cardinality feedback. We had complex SQLs which had been running in 1 to 2 seconds for the first run then suddenly it started taking several minutes!

Why cardinality feedback was being used?

* We have very complex underlying predicates in the view - that's could be one reason for not estimating correct cardinality in the execution phase. For the complex view, when many rows returns in intermediate steps in the executions, that makes the cardinality estimation even worse in our case. 

* Stale statistics could be another reason for wrong cardinality estimation. But that was not in our case, I checked that, there was no stale statistics for the tables used in the query.


So, for the pr.pgm_id 68894, the first execution luckily ends up with best execution plan. Oracle optimizer is intelligent - it reports a mismatch in cardinality estimate in the first run and gives feedback for successive runs. The cardinality feedback drove the query to go for full table scans on the next run on several tables for the underlying complex view "view_y". For the pr.pgm_id 68894, the query had to process about 10 GB of data (detailed could be found in SQL Monitoring in OEM) when it went for full table scans. As it had to process that amount of data, it did not use regular buffer cache, instead it used available memory in PGA and temporary tablespace to hold and process those huge amount of data. That aggravated the situation slowing down for doing excessive physical reads. 

Now we'll fix this in the application code - we don't want to turn the cardinality feedback feature off at instance level for such discrete instance for a single query. 


***