Oct 8, 2009

Sample size (estimate_percent) affects correctness of database statistics

There are many arguments we can pass in dbms_stats.gather_table_stats procedure - "method_opt" and "estimate_percent" are two very important ones.

The default values (in 10g) for "method_opt" and "estimate_percent" are  AUTO - meaning that Oracle will decide if Histogram Stats or No Histogram is appropriate for columns.

Sometimes, we customizes sample size to make stats collection job faster. This customization should be perfect otherwise a low sample size could result in non-representative statistics.

For example, when I collect statistics with 5% sample size, it does not create histograms for the two following columns.


SQL> exec DBMS_STATS.GATHER_TABLE_STATS(ownname=>'TEST', tabname=>'TAB1', degree=>2, estimate_percent=>5);


SQL> SELECT column_name, num_buckets, num_distinct, sample_size, histogram
FROM user_tab_col_statistics
WHERE table_name='TAB1';


COLUMN_NAME                    NUM_BUCKETS NUM_DISTINCT SAMPLE_SIZE HISTOGRAM
------------------------------ ----------- ------------ ----------- ---------------
UPDATED                                  1      6382137      633875 NONE
CREATED                                  1      7858623      633875 NONE

VERSION                                 15           15      633875 FREQUENCY
ID                                               1     12677500      633875 NONE
TRASH_DATE                           1       294401      221149 NONE



Now, when I increased the sample size to 50%, the procedure collected Height Balanced Histograms for the two columns. See below -

SQL> exec DBMS_STATS.GATHER_TABLE_STATS(ownname=>'TEST', tabname=>'TAB1', degree=>2, estimate_percent=>50);

SQL> SELECT column_name, num_buckets, num_distinct, sample_size, histogram
FROM user_tab_col_statistics
WHERE table_name='TAB1';

COLUMN_NAME                    NUM_BUCKETS NUM_DISTINCT SAMPLE_SIZE HISTOGRAM
------------------------------ ----------- ------------ ----------- ---------------
UPDATED                                254      8929376     6347812 HEIGHT BALANCED
CREATED                                254      9793629     6345628 HEIGHT BALANCED

VERSION                                 21           21     6341908 FREQUENCY
ID                                              1     12692852     6346426 NONE
TRASH_DATE                          1      1217543     2208858 NONE


The different statistics could affect the execution plan for the queries where CREATED/UPDATED columns are used. So, these things should be carefully tested and sample size should not be underestimated. In fact, in my experience I have seen, larger sample size works better for big tables.

P.S. I planned to write a post on this probably in January this year, when I did the experiments on database statistics with histogram and no histogram options. Later became busy with other stuffs and  forgot. Few days ago, I worked again to write custom scripts for collecting database stats. The sample size problem stroked - I set the sample size 50% in my scripts - working perfectly!


***

No comments: