Aug 3, 2009

Understanding Oracle Histograms

Before I forget, want to write a few words as I did some experiments recently with the histogram statistics and our complex queries.

Let me do it as a question-answer type approach.

What is Histogram in Oracle?

Histogram is an option for database statistics collection (introduced in 10g). It contains information about how the data of a table column is distributed. Histogram helps Oracle optimizer to determine whether certain values occur frequently, rarely or not at all. And based on those information, it can select an optimum access path for the query. For a big database, I found histogram stats are really really important.

If there are no histograms, it is assumed that the column values are equally distributed over the tables.


What does it do? What's the use?

I am aware of the two direct benefits -

* Histogram helps the optimizer to improve table join order: For example, if we have several table joins will million of rows whose final result set will be only few rows, Oracle tries to join tables together in such an order to make the result set cardinality (rows returned) of the first joins as small as possible.This has a huge performance impact on performing the rest of the operations since will it have to deal with less amount of rows in subsequent operations.


* Histogram helps the optimizer to find optimum access path - full table scan vs accessing via index: For example, there is an index on "user_type" (assume that we have 5% admins, 15% managers, 80% support type users). If we are looking for most of the support type users and the query returns 60% of the table rows, it will be more efficient to do a full table scan rather than accessing the 'support' users via "user_type_i" index. Histogram keeps those statistics to make a choice.


What are different types of Histogram?

There are three - two basic types + None.

* Height-balanced histograms: In a height-balanced histogram, the column values are divided into bands so that each band contains approximately the same number of rows. The useful information that the histogram provides is where in the range of values the endpoints fall.

If there are more different values in the column than the amount of buckets, a height balanced histogram is created. The column values are divided into intervals of the same size. In this case, an interval is assigned to each bucket.

The maximum number of histogram buckets is 254.

* Frequency histograms: For this type, each value of the column corresponds to a single bucket of the histogram. Each bucket contains the number of occurrences of that single value. Frequency histograms are when the number of distinct values is less than or equal to the number of histogram buckets specified.

If there are fewer different values in the column than the amount of buckets, this type of histogram is created. Each bucket contains one column value and the frequency of this column value.

* None, meaning no histogram statistics for the column


Two important views to find information about the histograms in database -

- USER_TAB_COL_STATISTICS
- USER_TAB_HISTOGRAMS


In our big system, for over 2,000 queries, I have found that oracle automatically collected histogram stats for some columns -

SQL> select count(column_name)
from user_tab_col_statistics
where histogram <> 'NONE';

COUNT(COLUMN_NAME)
------------------
1247

Some queries run much faster without any histogram information. For those, I collected statistics using the following options -

SQL> exec dbms_stats.gather_table_stats(ownname=>'PROD'
,tabname=> 'USER_INFO'
,method_opt=> 'FOR ALL COLUMN SIZE 1'
,estimate_percent=>50, cascade=>true);


Also, there are good number of queries those performs better if histogram stats available. For those kind of queries, I collected stats letting oracle decide the bucket size specifying 'AUTO' -


SQL> exec dbms_stats.gather_table_stats(ownname=>'PROD'
,tabname=> 'USER_INFO'
,method_opt=> 'FOR ALL COLUMN SIZE AUTO'
,estimate_percent=>50, cascade=>true);


Statistics are vital for database performance. If there are problem queries, one should analyze the execution plans throughly and tryout the statistics options suitable for those - case by case.

1 comment:

Anonymous said...

veeery informative ... waste of space.