Jun 1, 2012

Concurrency waits analysis - a case of our production system

We have been experiencing shared pool concurrency issue (mostly "library cache: mutex x") now a days. Ones again everything slowed down for such shared pool concurrency waits. There were no single/particular SQLs which caused this. I see a number of things which contributing to this concurrency problems from time to time. Here is the descriptive presentation -

1. There is an insert SQL on a table. Most of the times (almost every time I look at production during peak hours), we see a shared pool concurrency waits ("library cache: mutex x") on this statement. I did a lot of research on this to understand whether this is because of us for doing something wrong. Oracle made the mutex almost a black box for outsiders. Using web resources, I managed to dig to some extent from some internal details (mutex sleep history, mutex ASH history etc). I found, sometimes it shows the waits are on the sequence we use in this SQL, and sometimes it is not understandable - the mutex sleep locations are basically Oracle code internals. Probably we hit the bug on this - "Bug ID 12819613: HIGH CURSOR: MUTEX S AND LIBRARY CACHE LOCK WAITS" or something similar - there are tons of mutex bug reported to Oracle support.

Related article: http://andreynikolaev.wordpress.com/2011/05/01/divide-and-conquer-the-true-mutex-contention/#comment-343 - I also shared our problem with the writer (who is an expert on mutex problems) in comment section.

Action Items:


- We could contact Oracle Support for their opinion.
- We could drop the ID column which we don't need now and get rid of the waits on sequence.
- We batch the inserts so that it does a bulk of inserts and does not require soft parse again and again.

2. Sometimes, there are random types concurrency waits pops up. There are no fixed pattern for this or no fixed set of SQLs which contributes to this particularly. Here, the important thing is, during this time, hard parse increases in database (around 2%, normal figure is around 0.5%). This stress out shared pool memory and resources, finally slows down everything a little bit as the waits are inside shared pool. These are all years old SQLs.

Action Items:


- We need to reduce hard parse for those SQLs. We have identified some SQLs we need to look again and find a way to use bind values. Sometimes it's not possible to use bind values inside "IN clause" as the number of parameters varies for user inputs.

3. We do a lot of soft parse. The "execute to parse" ratio is very poor. This might happen for a typical web application where we can't always use statement caching. Too many soft parses are also expensive and stress out shared pool for big applications.

Action Items:
- I don't know exactly what we could do in short term. May be in future we can do it in a better way when we use some better programming model (with JSF and JBoss Seam) - we are on that direction.


- We had problems few years back when we tried to use statement caching in our application. So far we remember, it consumed huge memory and bogged down the system. However, we can do some testing again and see what is the behavior for a small number of statement caching.

4. We have another insert SQL. This statement consumes the highest amount of shared pool memory. This is because, here we are not able to reuse cursors even though we use bind variables and thus produces a huge number of child cursors. During yesterday problem, the number of child cursors were 1351. My understanding is, the main problem with this SQL - there are bind mismatch in most cases when the SQL comes into database for an execution with all its 42 bind variables. Producing so many child cursors (and searching within the list) causes shared pool mutex waits. This is an issue with 11g where database does not automatically obsolete a parent cursor "anymore" when it reaches 1024 child cursors.

Related article: http://www.usn-it.de/index.php/2010/08/04/oracle112-mutex-s-too-many-child-cursors

Action Items:


- As I saw the reason for generating so many child cursors was for bind mismatch, I think, we need to change the jdbc prepared statement coding and need to use "java.sql.Types" to set nulll values so that it understands the bind type and doesn't create a child cursor for type mismatch caused by setting 'null' explicitly.


For example, for a column we use - preparedStatement.setObject(
index++, record.getAdministeredBy() != null ? record.getAdministeredBy().getId() : null);


We need to program like "preparedStatement.setNull(index++, java.sql.Types.INTEGER);" when the object is null so that the cursor understand the datatype for the bind variable.


I have talked to the programmers to make the changes. There are few other queries like this producing high number of child cursors - but those are not that severe, anyways.


- Another thing we could do now for the time being using a scheduled job periodically, we could purge the particular cursor when it produces many child cursors (say more than 100). This will keep the child cursor list short and reduce the waits on that sql when it searches through so many child cursors for bind matches. 

These were my assessments for the database.

***

No comments: