Tuesday 26 May 2015

Selectivity Calculation in SQL query



This article explains how the CBO determines the selectivity for various
predicates.



Selectivity
~~~~~~~~~~~
Selectivity is a measure of the proportion of a row source retrieved by
application of a particular predicate or combination of predicates.

Within the Oracle kernel it is expressed as a value between 0 and 1.
The closer the value is to 0 the more selective the predicate is.
Selectivity is only used by the CBO.

Basic Selectivity formula:
~~~~~~~~~~~~~~~~~~~~~~~~~~

                     Number of records satisfying a condition
Selectivity = -----------------------------------------
                     Total Number of records

In the optimizer, selectivity is used to compare the usefulness of various
predicates in combination with base object costs.

Knowing the proportion of the total data set that a column predicate defines
is very helpful in defining actual access costs.

By default, column selectivity is based on the high and low values and the
number of values in the column with an assumption of even distribution of
data between these two points.

Histogram data can give better selectivity estimates for unevenly distributed
column data. There is more discussion regarding Histograms later.

Selectivity is also used to define the cardinality of a particular row source
once predicates have been applied. Cardinality is the expected number of rows
that will be retrieved from a row source. Cardinality is useful in determining
nested loop join and sort costs. Application of selectivity to the original
cardinality of the row source will produce the expected (computed) cardinality
for the row source.


Glossary of Terms:
~~~~~~~~~~~~~~~~~~

NDV Number of Distinct Values
Cardinality Number of rows
Selectivity Proportion of a dataset returned by a particular predicate(or
group of predicates)

In the following illustrations there are 2 tables (T1 & T2) with columns (c1) and (c2) respectively.

Selectivities:
~~~~~~~~~~~~~~
Without histograms
~~~~~~~~~~~~~~~~~~
c1 = '4076'              1/NDV
c1 > '4076'              1 - (High - Value / High - Low)
c1 >= '4076'             1 - (High - Value / High - Low) + 1/NDV
c1 like '4076'           1/NDV

Join selectivity
~~~~~~~~~~~~~~~~

The selectivity of a join is defined as the selectivity of the most selective
join column adjusted by the proportion of not null values in each join column.


 Sel = 1/max[NDV(t1.c1),NDV(t2.c2)] *
( (Card t1 - # t1.c1 NULLs) / Card t1) *
( (Card t2 - # t2.c2 NULLs) / Card t2)

Bind Variable selectivity
~~~~~~~~~~~~~~~~~~~~~~~~~

Bind variables present a special case because the optimizer has no idea what
the bind variable value is prior to query optimization. This does not present
a problem with equality predicates since a uniform distribution of data is
assumed and the selectivity is taken as 1/NDV for the column. However for
range predicates it presents a major issue because the optimizer does not
know where the range starts or stops. Because of this the optimizer has to
make some assumptions as follows:

c1 =    :bind1           1/NDV
c1 >    :bind1           Default of 5%
c1 >=   :bind1           Default of 5%
c1 like :bind1           Default of 25%

For more information on bind variables see Note:70075.1

Selectivity With Histograms
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Histograms provide additional information about column selectivity for
columns whose distribution is non uniform. Histograms store information about
column data value ranges. Each range is stored in a single row and is often
called a 'bucket'. There are 2 different methods for storing histograms in
Oracle. If there are a small number of distinct
column values (i.e. less than the number of buckets), the column value
and the count of that value is stored. If not then a series of endpoints
are stored to enable more accurate selectivity to be determined.

The first method allows the accurate figures to be used. However with
inexact histograms the terms popular and non-popular value are introduced
and are used to help determine selectivity. A popular value is a value that
spans multiple endpoints whereas a non-popular value does not.
See Note:72539.1 for more information on histograms.

Exact histograms
~~~~~~~~~~~~~~~~
c1 = '4706'         count of value '4076' / Total Number of Rows
c1 > value          count of values > '4076' / Total Number of Rows

InExact Histograms
~~~~~~~~~~~~~~~~~~
col = pop value         # popular buckets / # buckets
col = non pop           (Density)
col > value             # buckets > value / # buckets


Rules for combining selectivity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let P1 and P2 be 2 distinct predicates of query Q

 P1 AND P2
       S(P1&P2) = S(P1) * S(P2)
 P1 OR P2
       S(P1|P2) = S(P1) + S(P2) -[S(P1) * S(P2)]

Index Selectivity for concatenated indexes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Starting with 10.2, when a concatenated index, with all its columns having
equality predicates, is used as an access path, the optimizer uses 1/NDK as
the selectivity (where NDK is the number of distinct keys in the index).

On 9.2.0.7 and 9.2.0.8 this may be enabled with event 38060 level 1.
On 10.1.0.4 and 10.1.0.5 this may be enabled with event 38059 level 1.
On 10.2 adjustments will be made to the selectivity to account for nulls
in the index keys. This also occurs on 10.1.0.5 (with event 38059.)


Join cardinality
~~~~~~~~~~~~~~~~

 Card(Pj) = Card(T1) * Card(T2) * Sel(Pj)




Reference:  https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=334064325845090&parent=DOCUMENT&sourceId=740052.1&id=68992.1&_afrWindowMode=0&_adf.ctrl-state=rjetmzhiv_144