Sep 19, 2009

Database performance problems - some common areas to look at!

Where there is a head, there is a headache - similarly, where there is a database, there are performance issues! I accepted this truth long ago :-)

I have experienced performance problems several times with our ever-growing systems and their busy, constantly evolving database.

Over the years, I have seen that a few areas cause performance problems most of the time. I thought I would write a few things down, mostly for my own reference. Those common areas are probably -

1. Problem queries: Most of the time, I have seen that a handful of queries were eating up the system, taking a significant amount of database time and resources. Tuning those top three or four queries often solves the problem (see the sketch after this list for one way to find them).

2. Outdated/missing statistics: Up-to-date statistics are vital for the Oracle cost based optimizer. Stale statistics might cause the database to crawl and lead to death!

3. Wait events: Waits are a big threat to database health. There could be row lock waits, network waits, I/O waits, concurrency waits and so on that hold up resources and can eventually bring the system to a standstill. We need to figure out where the database is spending most of its time (a quick wait-class summary is sketched after this list).

4. Full scans on big tables/missing indexes: There could be cases where queries fetch too many rows through full table scans, producing huge numbers of "consistent gets" and burning extra CPU cycles. It might be that we missed an index or need to repartition tables.

5. Hard parsing: Developers are not from this planet (especially freshers); most of the time they don't think about the database at all. They can easily forget to use bind variables, and that is why the database ends up doing hard parses (look at the hard parse %). This area should not be overlooked (a small bind-variable sketch follows the list).

6. Bad data model: A rigidly normalized model and too many 'OR' clauses in query logic might also slow down systems. We should be careful about what we are doing and why.

7. Poor application/process design: The application design and flows could be bad, causing unnecessary query executions and database access. Reviewing those areas and correcting things helps a lot.

For example, due to poor process flows, there might be cases where users are forced to run heavy operations - say, landing on a dashboard (which fires a lot of queries) after every save - which could be eliminated. Also, we may execute fairly static queries (queries that return the same result most of the time) quite frequently; those are good candidates for caching, and things like that.

8. Poor connection management: Frequent connection establishment has a negative impact on the database, since it is a resource-intensive process. Efficient and correct use of connection pooling is very important for a stateless application system.

9. No data pruning: If we don't have partitioning and/or a data pruning mechanism in the system and have to deal with a huge number of records accumulated over the years, that is not a good place to be. Dealing with big tables without partitioning has always been a bad technique and causes performance problems for select queries, index updates, unique constraints and maintenance.
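To make item 1 concrete, here is a minimal sketch (assuming a 10g-style v$sql view; on a busy system an AWR report for the problem interval is the better source) of how to spot the handful of statements taking the most database time:

-- Top 5 statements by total elapsed time since instance startup
SELECT *
  FROM (SELECT sql_id,
               ROUND(elapsed_time / 1e6) AS elapsed_sec,
               ROUND(cpu_time / 1e6)     AS cpu_sec,
               executions,
               buffer_gets,
               SUBSTR(sql_text, 1, 60)   AS sql_text
          FROM v$sql
         ORDER BY elapsed_time DESC)
 WHERE ROWNUM <= 5;

For item 3, a rough wait-class summary could look like the query below - again only a sketch, since the cumulative numbers from v$system_event cover the whole instance lifetime, not just the slow period:

-- Cumulative non-idle wait time by wait class since instance startup
SELECT wait_class,
       SUM(total_waits)              AS total_waits,
       ROUND(SUM(time_waited) / 100) AS time_waited_sec
  FROM v$system_event
 WHERE wait_class <> 'Idle'
 GROUP BY wait_class
 ORDER BY time_waited_sec DESC;

And for item 5, the classic bind-variable point - the APP_USERS table and its columns below are made up purely for illustration. The hard parse rate itself shows up in v$sysstat as 'parse count (hard)' versus 'parse count (total)'.

DECLARE
  v_name VARCHAR2(100);
BEGIN
  -- Bad: every distinct literal is a new SQL text, hence a hard parse
  EXECUTE IMMEDIATE
    'SELECT name FROM app_users WHERE id = ' || 42 INTO v_name;

  -- Good: one shared cursor, soft parses after the first execution
  EXECUTE IMMEDIATE
    'SELECT name FROM app_users WHERE id = :id' INTO v_name USING 42;
END;
/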

Anything else? I can't remember right at the moment!

Sep 14, 2009

Wait event "resmgr:cpu quantum" :: resource manager settings and performance impact on database

On Monday morning, the first working day after the cutover from the primary site to the secondary site, we were experiencing unexpected problems on a database server which had been sitting idle for a long time.

The load average went over 10. We run on an 8-core CPU box, so this amount of load is totally unexpected; our regular load average is 2-3 on a busy day.

I looked at the top queries and top events. There were waits associated with every query. Queries were waiting on some scheduler activity... and the top queries were not the "actual" top ones I knew - the top list kept changing based on users' current activity. The DB server was crawling!

I found "resmgr:cpu quantum" listed as the top wait event, and running ADDM confirmed the suspicion. It was becoming clear that some resource management settings were wrong or different on this database server.

Top User Events

Event                   Event Class    % Activity    Avg Active Sessions
resmgr:cpu quantum      Scheduler      92.26         149.47
CPU + Wait for CPU      CPU            6.63          10.73
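For reference, a rough way to reproduce this kind of "top events" view straight from ASH (a sketch, assuming 10gR2's v$active_session_history with its wait_class column) is:

SELECT NVL(event, 'CPU + Wait for CPU') AS event,
       NVL(wait_class, 'CPU')           AS event_class,
       COUNT(*)                         AS samples
  FROM v$active_session_history
 WHERE sample_time > SYSDATE - 30/1440   -- last 30 minutes
 GROUP BY NVL(event, 'CPU + Wait for CPU'), NVL(wait_class, 'CPU')
 ORDER BY samples DESC;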


ADDM analysis revealed - "Wait class Scheduler was consuming significant database time".

Now it was time to check those parameters. Yes!! I guessed correctly - the two servers' resource management parameters were not identical! Shocking!! The maintenance window was still running, which had activated the resource manager plan - this was not the correct setting.

On the well-performing server (where the application was running before the cutover), the values were -

SQL> select name, value
from v$parameter
where name in ('resource_manager_plan', 'resource_limit');

NAME                                   VALUE
-------------------------------------- ----------
resource_limit                         TRUE
resource_manager_plan

And on the "problem server", the settings were like -

SQL> select name, value
from v$parameter
where name in ('resource_manager_plan', 'resource_limit');

NAME                                   VALUE
-------------------------------------- ----------
resource_limit                         FALSE
resource_manager_plan                  SCHEDULER[0x22F0]:SYSTEM_PLAN

Ah - completely different. The resource manager had been activated here: "resource_manager_plan" had been set by the maintenance window, which was configured to run for long hours. Root cause identified. After changing the value (no resource manager plan), all those scheduler wait events disappeared and the database behaved as expected - the load average dropped!

SQL> alter system set resource_manager_plan='' scope=both;

System altered.

This SYSTEM_PLAN was not appropriate for the application schema. The maintenance window should have finished long before, and we should not be running on this resource manager plan after 8 AM under any circumstances.

Let's take a look at what those parameters are all about -

RESOURCE_LIMIT: This parameter must be set to TRUE to enforce the resource limits assigned to users through profiles.

RESOURCE_MANAGER_PLAN: Setting this parameter to an existing resource plan activates the Resource Manager. Note: when enabled, the DBMS Scheduler can automatically change the Resource Manager plan at scheduler window boundaries, so keeping the window open only for a specific (idle) period of time is a good idea.
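To see which scheduler windows are defined, which resource plan each one activates, and how long they stay open, something like the sketch below helps. WEEKNIGHT_WINDOW is just the 10g default window name - adjust for your own setup before changing anything.

SELECT window_name, resource_plan, duration, enabled, active
  FROM dba_scheduler_windows;

-- Shrink the weeknight window so its resource plan is only in effect
-- during genuinely idle hours
BEGIN
  DBMS_SCHEDULER.SET_ATTRIBUTE(
    name      => 'WEEKNIGHT_WINDOW',
    attribute => 'duration',
    value     => NUMTODSINTERVAL(4, 'hour'));
END;
/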


Lucky me! I always get volunteered to investigate such critical problems outside office hours (and very late at night, thanks to the timezone difference), which is not always appreciated by my family. I knew these settings were supposed to be handled by the other database guy. Communication has always been a big issue for companies, and being a global company, we are no exception!

Sep 11, 2009

How database response time could slow down web application responses

Ours is a large-scale n-tier web application suite with a very busy database behind it. Recently, my tech boss gave a nice lecture to another guy about the relationship between database response times and HTTP request/response handling in a typical web application - I loved the writeup!

It explains how slow database responses could cause increased load on web servers.

Look at the whole system as a black box. A user sends an HTTP
request to it, and then receives a response. Each request has to
generate a response. Users are independent of each other, i.e. if one
user sees his browser not responding, the others will not be affected.

Before the response is generated, the application server has to do
some work and then send the response back. Not all requests will take
the same amount of time to process. But here is the crucial thing:
over a given time period, the number of requests (input) has to be
equal to the number of responses (output). Otherwise, you will see a
build-up of requests.

Now, during the peak hours of a high usage day, we have at least 1200
active users. They send 50 requests per second. So we need to
process 50 requests per second. On average, we are taking 20ms to
process each request. Now let's see what happens when the system
slows down. Assume the system becomes 5x slower. We process 10
requests per second.

After 1 second, we will have 40 in-process requests. After 5 seconds
we will have 200 in-process requests. Users who sent those requests
will see their browser "hung up". For simplicity, suppose they just
sit tight and do nothing. Remember, we have 1200 users. 200 of them
are now stuck. But the rest of the users are not aware of it and they
will continue their work.

After 10 seconds, we will end up with 400 in-process requests.

Now let's dig a little deeper into the black box. Here is the app
server and the DB server. For each request, the app server needs a
connection to the DB server. But there is an upper limit on the
number of connections between the app and DB servers. Let's assume
the upper limit is 300. We have accumulated 400 requests so far. 300
of them are being worked on in the DB. The remaining 100 are "held up" by
the app server, waiting for a free ticket (connection) to the DB.

After 20 seconds, we have 800 pending requests. Of them, 300 are being
worked on in the DB and 500 are held up by the app server.

After 30 seconds, we have 1200 pending requests, one from each user.
The app server is holding up 900 requests.

How long can the app server hold up requests? Not for too long.
Within about 2 minutes, the browsers will start reporting errors to
the user saying something like "The server is not responding ...".

You see, the situation is already not good, even if users don't do
anything at all except their normal work.

In reality, users will not just sit tight. After 10-20 seconds of
unresponsiveness, they will start hitting reload. Each reload will
send yet another request.

We enjoyed reading it :-)

Sep 10, 2009

Madness for Stale Statistics

Yesterday evening (my time zone), which was Wednesday morning for the application users in a different time zone, the database was behaving strangely - literally crawling! As it was outside my office hours, I got an unofficial phone call from the monitoring guy. I took a quick look at the email threads and found that others were already on the issue, looking into and discussing things. Seeing that, I offloaded myself from the issue and went to bed - it was late at night in the holy month of Ramadan. I hoped that we would survive and that I would look at things during my next office hours.

Probably because of Murphy's law - I got a phone call from my tech boss at around 2:20 AM :-)


Things were running slower than on other days. For the last few days, the database had been consuming about 50% CPU on average, but today it was at 70-80% and overall response time had increased. Yet we had not made any changes to the database recently.

I looked at AWR and ASH reports and ran my custom queries... everything was very confusing. Queries were taking longer than before. I was looking here and there... in the middle of the night, the brain functions slowly.

Finally, after 15-20 minutes, to verify whether our statistics were up to date, I ran the following script -

SET SERVEROUT ON;

DECLARE
  ObjList DBMS_STATS.ObjectTab;
BEGIN
  -- 'LIST STALE' only lists the stale objects; it does not gather anything
  DBMS_STATS.GATHER_SCHEMA_STATS(ownname => 'PROD',
                                 objlist => ObjList,
                                 options => 'LIST STALE');
  IF ObjList.COUNT > 0 THEN
    FOR i IN ObjList.FIRST .. ObjList.LAST
    LOOP
      DBMS_OUTPUT.PUT_LINE(ObjList(i).ObjName || ' - ' ||
                           ObjList(i).ObjType || ' ' ||
                           ObjList(i).PartName);
    END LOOP;
  END IF;
END;
/

What I saw was not what I expected: there were about 75 tables with outdated statistics!

I started collecting statistics table by table - the important ones first - and the server load gradually decreased. To do it quickly, I collected stats with no histograms and a 50% sample size; for us, a big sample size works better.

SQL> exec dbms_stats.gather_table_stats(ownname=>'PROD', tabname=>'USER', cascade=>true, degree=>2, estimate_percent=>50, method_opt=>'FOR ALL COLUMNS SIZE 1');


It took about 2.5 hours to collect stats for about 150 tables. Another sleepless night!


What happened?

Statistics had been collected by the default DBMS Scheduler job, running once a day in the weekday maintenance window. The sample size was also low - only 5%. That was no longer appropriate for us; tables were becoming stale quite frequently.
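A possible follow-up, sketched here for a 10g-style setup: check what defaults the automatic gather job is using and change them, so the nightly run collects stats the way that works for us. DBMS_STATS.SET_PARAM changes the defaults globally, so it needs some thought first.

SELECT DBMS_STATS.GET_PARAM('ESTIMATE_PERCENT') AS estimate_percent,
       DBMS_STATS.GET_PARAM('METHOD_OPT')       AS method_opt
  FROM dual;

BEGIN
  DBMS_STATS.SET_PARAM('ESTIMATE_PERCENT', '50');
  DBMS_STATS.SET_PARAM('METHOD_OPT', 'FOR ALL COLUMNS SIZE 1');
END;
/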