Sep 10, 2009

Madness for Stale Statistics

Yesterday evening (my time zone), which is Wednesday morning for the application users on a different time zone, database was behaving strangely - literally crawling! As it was not my office hour, got an unofficial phone call from the monitoring guy. I gave a quick look at email threads and found that others were on the issue and they were looking and discussing things. Seeing that, I offloaded myself from the issue and went to bed - it was late at night on the month of holy "Ramadan". I hoped that we would survive and I would look at things on my next office hour.

Because of Murphy's law probably - I got phone call from my tech boss at around 2:20 AM :-)


Things were running slower than the other days. For last few days, database was consuming about 50% CPU on avg. But today, it was 70-80% and overall response time had been increased. But we did not do any changes in database recently.

I looked at AWR, ASH reports - ran my custom queries... everything were very confusing. Queries were taking longer time than previous. I was looking here and there... in the middle of the night, brain was functioning slowly.

Finally, after 15-20 min, to verify if we had statistics correct, I ran the following script -

SET SERVEROUT ON;


DECLARE
ObjList DBMS_STATS.ObjectTab;
BEGIN
DBMS_STATS.GATHER_SCHEMA_STATS(ownname=>'PROD', objlist=>ObjList, options=>'LIST STALE');
FOR i IN ObjList.FIRST..ObjList.LAST
LOOP
DBMS_OUTPUT.PUT_LINE(ObjList(i).ObjName || ' - ' || ObjList(i).ObjType || ' ' || ObjList(i).partname);
END LOOP;
END;
/

What I saw, was not very expected, there were about 75 tables with outdated statistics!

I started collecting statistics table by table - the important ones first. Server load gradually decreased. Quickly, I collected stats with no histogram and 50% sample size - for us, big sample size works better.

SQL> exec dbms_stats.gather_table_stats(ownname=>'PROD',tabname=>'USER', cascade=>true, degree=>2, estimate_percent=>50, method_opt=>'FOR  ALL COLUMNS SIZE 1');


It took about 2.5 hours to collect stats for about 150 tables. Another sleepless night!


What happened?

Statistics had been collected with the default dbms scheduler jobs with 1 day interval by weekday maintenance window. The sample size was also low - 5% only. Those were not very appropriate for us in these days. Tables were becoming stale quite frequently.

No comments: