Home All Groups Group Topic Archive Search About
Author
28 Aug 2006 1:49 PM
Arjen
Hi,

In my stored procedures I make selections based on a 'combined' key.
I.e.: WHERE [Postcode] = @PostCode AND [No] = @No


1) When there are less [No]'s is it then smart to select first the [No] and
then [PostCode]?
I.e.: WHERE [No] = @No AND [Postcode] = @PostCode

2) How do I make the index for this?
I.e.: for both one? Or one and then first select [No] and then [Postcode]?

3) Do I need to 'index' uniqueidentifier columns?

Thanks!

Author
28 Aug 2006 2:01 PM
Tom Moreau
You would create an index on (PostCode, No) or (No, PostCode).  Make the
more-selective column the first in the index.  The order in which you
reference the columns in the WHERE clause is immaterial.

You *may* need to index a uniqueidentifier column if you are going to be
making retrievals based on that column.

--
   Tom

----------------------------------------------------
Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
SQL Server MVP
Toronto, ON   Canada
..
"Arjen" <boah***@hotmail.com> wrote in message
news:ecus9m$rio$1@news5.zwoll1.ov.home.nl...
Hi,

In my stored procedures I make selections based on a 'combined' key.
I.e.: WHERE [Postcode] = @PostCode AND [No] = @No


1) When there are less [No]'s is it then smart to select first the [No] and
then [PostCode]?
I.e.: WHERE [No] = @No AND [Postcode] = @PostCode

2) How do I make the index for this?
I.e.: for both one? Or one and then first select [No] and then [Postcode]?

3) Do I need to 'index' uniqueidentifier columns?

Thanks!
Author
28 Aug 2006 8:10 PM
boah123
Okay, thanks all.

If SQL server is only using one index then I will create one index for
(a part of) the where statement. Sounds like this is the best.

@Tom: what do you mean exactly with *more-selective column*?
(My selection is always based on the same columns, so in this case it
does not matter?)

Thanks,
Arjen


Tom Moreau wrote:
Show quote
> You would create an index on (PostCode, No) or (No, PostCode).  Make the
> more-selective column the first in the index.  The order in which you
> reference the columns in the WHERE clause is immaterial.
>
> You *may* need to index a uniqueidentifier column if you are going to be
> making retrievals based on that column.
>
> --
>    Tom
>
> ----------------------------------------------------
> Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
> SQL Server MVP
> Toronto, ON   Canada
> .
> "Arjen" <boah***@hotmail.com> wrote in message
> news:ecus9m$rio$1@news5.zwoll1.ov.home.nl...
> Hi,
>
> In my stored procedures I make selections based on a 'combined' key.
> I.e.: WHERE [Postcode] = @PostCode AND [No] = @No
>
>
> 1) When there are less [No]'s is it then smart to select first the [No] and
> then [PostCode]?
> I.e.: WHERE [No] = @No AND [Postcode] = @PostCode
>
> 2) How do I make the index for this?
> I.e.: for both one? Or one and then first select [No] and then [Postcode]?
>
> 3) Do I need to 'index' uniqueidentifier columns?
>
> Thanks!
Author
28 Aug 2006 8:57 PM
Tom Moreau
Check out p 885 of Inside SQL Server 2000 by Kalen Delaney for all the
details.  www.insidesqlserver.com.

--
   Tom

----------------------------------------------------
Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
SQL Server MVP
Toronto, ON   Canada
..
<boah***@hotmail.com> wrote in message
news:1156795808.230974.251760@i3g2000cwc.googlegroups.com...
Okay, thanks all.

If SQL server is only using one index then I will create one index for
(a part of) the where statement. Sounds like this is the best.

@Tom: what do you mean exactly with *more-selective column*?
(My selection is always based on the same columns, so in this case it
does not matter?)

Thanks,
Arjen


Tom Moreau wrote:
Show quote
> You would create an index on (PostCode, No) or (No, PostCode).  Make the
> more-selective column the first in the index.  The order in which you
> reference the columns in the WHERE clause is immaterial.
>
> You *may* need to index a uniqueidentifier column if you are going to be
> making retrievals based on that column.
>
> --
>    Tom
>
> ----------------------------------------------------
> Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
> SQL Server MVP
> Toronto, ON   Canada
> .
> "Arjen" <boah***@hotmail.com> wrote in message
> news:ecus9m$rio$1@news5.zwoll1.ov.home.nl...
> Hi,
>
> In my stored procedures I make selections based on a 'combined' key.
> I.e.: WHERE [Postcode] = @PostCode AND [No] = @No
>
>
> 1) When there are less [No]'s is it then smart to select first the [No]
> and
> then [PostCode]?
> I.e.: WHERE [No] = @No AND [Postcode] = @PostCode
>
> 2) How do I make the index for this?
> I.e.: for both one? Or one and then first select [No] and then [Postcode]?
>
> 3) Do I need to 'index' uniqueidentifier columns?
>
> Thanks!
Author
28 Aug 2006 10:29 PM
Gert-Jan Strik
Hi Tom,

But Kalen doesn't explain why defining the most selective column first
would be beneficial for queries where all indexed columns are filtered
in the WHERE clause, like in the example query of the OP (and in the
example below).

To test things, I created a 10M row table (2.2 GB) with a clustered
index on a unique integer column.
Next I created two nonclustered indexes. One on (low, high) and one on
(high, low). Low is a tinyint column with 103 distinct values. High is
an int column with some 8.3M distinct values. There is no relation
between the two column's values.

Next I ran queries like this on SQL Server 2000 EE:

  SELECT *
  FROM MyTestTable (index=my_index)
  WHERE low  = 46
  AND   high = 376607327

I used values for low in "the middle" range, near "the end" and near
"the beginning". I did the same for high, and in combination. Each query
was run against both indexes.

In all cases, the statistics io showed the same amount of logical reads
for the two indexes. In other words, they had the same performance. The
index on (high, low) did not outperform the index on (low, high).


Basically, all that this experiment proves is that index pages do not
contain incomplete compound index keys. This is also proved by the fact
that both indexes use the exact same amount of disk space (otherwise the
index on (low, high) should have been smaller). Because of that, some
smart mathematician should be able to prove that there can be no
performance difference between the two indexes (in the above scenario).


From personal experience, I must admit that I know of one situation in
SQL Server 7 where putting the most selective column first is more
efficient (an outer join on a compound primary key), but that is
basically because of a flaw, which has been fixed in SQL Server 2000.


Arjen, my advice would be to put the column that you are least likely to
filter on at the end of the index definition. In your case you don't
really have to choose based on which theory you believe, because column
No is probably the least selective column as well as the column you are
least likely to filter on. So you should make Postcode the first column
in the index's definition.

Gert-Jan



Tom Moreau wrote:
Show quote
>
> Check out p 885 of Inside SQL Server 2000 by Kalen Delaney for all the
> details.  www.insidesqlserver.com.
>
> --
>    Tom
>
> ----------------------------------------------------
> Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
> SQL Server MVP
> Toronto, ON   Canada
> .
> <boah***@hotmail.com> wrote in message
> news:1156795808.230974.251760@i3g2000cwc.googlegroups.com...
> Okay, thanks all.
>
> If SQL server is only using one index then I will create one index for
> (a part of) the where statement. Sounds like this is the best.
>
> @Tom: what do you mean exactly with *more-selective column*?
> (My selection is always based on the same columns, so in this case it
> does not matter?)
>
> Thanks,
> Arjen
>
> Tom Moreau wrote:
> > You would create an index on (PostCode, No) or (No, PostCode).  Make the
> > more-selective column the first in the index.  The order in which you
> > reference the columns in the WHERE clause is immaterial.
> >
> > You *may* need to index a uniqueidentifier column if you are going to be
> > making retrievals based on that column.
> >
> > --
> >    Tom
> >
> > ----------------------------------------------------
> > Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
> > SQL Server MVP
> > Toronto, ON   Canada
> > .
> > "Arjen" <boah***@hotmail.com> wrote in message
> > news:ecus9m$rio$1@news5.zwoll1.ov.home.nl...
> > Hi,
> >
> > In my stored procedures I make selections based on a 'combined' key.
> > I.e.: WHERE [Postcode] = @PostCode AND [No] = @No
> >
> >
> > 1) When there are less [No]'s is it then smart to select first the [No]
> > and
> > then [PostCode]?
> > I.e.: WHERE [No] = @No AND [Postcode] = @PostCode
> >
> > 2) How do I make the index for this?
> > I.e.: for both one? Or one and then first select [No] and then [Postcode]?
> >
> > 3) Do I need to 'index' uniqueidentifier columns?
> >
> > Thanks!
Author
29 Aug 2006 4:00 PM
Alexander Kuznetsov
Gert-Jan,

I would agree. More to the point, in some cases placing a less
selective column first is actually a better choice. Suppose
ProvinceCode is less selective than Postal. Suppose you have a
frequently run query

SELECT ProvinceCode, Postal FROM yourTable WHERE ProvinceCode =
@ProvinceCode

An index on (ProvinceCode, Postal) will be more useful *for this
particular query* than the one on (Postal, ProvinceCode), right? So, if
this query is a significant part of your workload, you *might* be
better off breaking the rule of thumb and having an index on
(ProvinceCode, Postal).
Author
29 Aug 2006 4:22 PM
Conor Cunningham [MS]
For some cases, you will find that if you have a (very) non-selective
leading column (for example, gender), then you can end up negating much of
the benefit of an index for seeking since you have to seek and then scan
many more pages.  If you want to do equality searching for multiple columns
(as in this example), putting the most selective field first will cause the
seek to have to scan fewer pages when the second column is a range
predicate.

Another side effect that is not as easy to see is that SQL Server only has
histograms for the first column of an index (or any multi-column statistic,
which can also be created manually).  A non-selective leading key will cause
the histogram to not be very useful to the optimizer, and the cardinality
estimations may be further off as a result.

All of this assumes that you are filtering on those columns - as you point
out, if you don't ever filter on a column, putting it as a leading column in
an index makes less sense (it still could be useful, but not as much)

You should also check the query plans for this case.  The "*" may be causing
bookmark lookups if your index is not covering.  (I can't tell from what you
posted).

Conor

Show quote
"Gert-Jan Strik" <sorry@toomuchspamalready.nl> wrote in message
news:44F36E4A.27DD83AA@toomuchspamalready.nl...
> Hi Tom,
>
> But Kalen doesn't explain why defining the most selective column first
> would be beneficial for queries where all indexed columns are filtered
> in the WHERE clause, like in the example query of the OP (and in the
> example below).
>
> To test things, I created a 10M row table (2.2 GB) with a clustered
> index on a unique integer column.
> Next I created two nonclustered indexes. One on (low, high) and one on
> (high, low). Low is a tinyint column with 103 distinct values. High is
> an int column with some 8.3M distinct values. There is no relation
> between the two column's values.
>
> Next I ran queries like this on SQL Server 2000 EE:
>
>  SELECT *
>  FROM MyTestTable (index=my_index)
>  WHERE low  = 46
>  AND   high = 376607327
>
> I used values for low in "the middle" range, near "the end" and near
> "the beginning". I did the same for high, and in combination. Each query
> was run against both indexes.
>
> In all cases, the statistics io showed the same amount of logical reads
> for the two indexes. In other words, they had the same performance. The
> index on (high, low) did not outperform the index on (low, high).
>
>
> Basically, all that this experiment proves is that index pages do not
> contain incomplete compound index keys. This is also proved by the fact
> that both indexes use the exact same amount of disk space (otherwise the
> index on (low, high) should have been smaller). Because of that, some
> smart mathematician should be able to prove that there can be no
> performance difference between the two indexes (in the above scenario).
>
>
> From personal experience, I must admit that I know of one situation in
> SQL Server 7 where putting the most selective column first is more
> efficient (an outer join on a compound primary key), but that is
> basically because of a flaw, which has been fixed in SQL Server 2000.
>
>
> Arjen, my advice would be to put the column that you are least likely to
> filter on at the end of the index definition. In your case you don't
> really have to choose based on which theory you believe, because column
> No is probably the least selective column as well as the column you are
> least likely to filter on. So you should make Postcode the first column
> in the index's definition.
>
> Gert-Jan
>
>
>
> Tom Moreau wrote:
>>
>> Check out p 885 of Inside SQL Server 2000 by Kalen Delaney for all the
>> details.  www.insidesqlserver.com.
>>
>> --
>>    Tom
>>
>> ----------------------------------------------------
>> Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
>> SQL Server MVP
>> Toronto, ON   Canada
>> .
>> <boah***@hotmail.com> wrote in message
>> news:1156795808.230974.251760@i3g2000cwc.googlegroups.com...
>> Okay, thanks all.
>>
>> If SQL server is only using one index then I will create one index for
>> (a part of) the where statement. Sounds like this is the best.
>>
>> @Tom: what do you mean exactly with *more-selective column*?
>> (My selection is always based on the same columns, so in this case it
>> does not matter?)
>>
>> Thanks,
>> Arjen
>>
>> Tom Moreau wrote:
>> > You would create an index on (PostCode, No) or (No, PostCode).  Make
>> > the
>> > more-selective column the first in the index.  The order in which you
>> > reference the columns in the WHERE clause is immaterial.
>> >
>> > You *may* need to index a uniqueidentifier column if you are going to
>> > be
>> > making retrievals based on that column.
>> >
>> > --
>> >    Tom
>> >
>> > ----------------------------------------------------
>> > Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
>> > SQL Server MVP
>> > Toronto, ON   Canada
>> > .
>> > "Arjen" <boah***@hotmail.com> wrote in message
>> > news:ecus9m$rio$1@news5.zwoll1.ov.home.nl...
>> > Hi,
>> >
>> > In my stored procedures I make selections based on a 'combined' key.
>> > I.e.: WHERE [Postcode] = @PostCode AND [No] = @No
>> >
>> >
>> > 1) When there are less [No]'s is it then smart to select first the [No]
>> > and
>> > then [PostCode]?
>> > I.e.: WHERE [No] = @No AND [Postcode] = @PostCode
>> >
>> > 2) How do I make the index for this?
>> > I.e.: for both one? Or one and then first select [No] and then
>> > [Postcode]?
>> >
>> > 3) Do I need to 'index' uniqueidentifier columns?
>> >
>> > Thanks!
Author
29 Aug 2006 9:36 PM
Gert-Jan Strik
"Conor Cunningham [MS]" wrote:
>
> For some cases, you will find that if you have a (very) non-selective
> leading column (for example, gender), then you can end up negating much of
> the benefit of an index for seeking since you have to seek and then scan
> many more pages.

If all columns in the index are filtered (for equality), for example
"WHERE low = 8 AND high = 376607327" on a table with index on (low,
high) then there is no scanning involved. Even if the low column is the
low selectivy column with many index pages for one particular low value.
The seek will be on both columns. Each index page will know exactly
which (low,high) ranges are covered. So an exact match seek will
navigate down the index tree and only read one page of each level. It
involves no scanning. A seek based on an index on (high, low) will do
exactly the same, the index will have the same index depth, for each
level one index page will be read, and so the result will be the same
number of reads.

>  If you want to do equality searching for multiple columns
> (as in this example), putting the most selective field first will cause the
> seek to have to scan fewer pages when the second column is a range
> predicate.

Yes, if the second column is a range predicate, then it is an entirely
different story, because then it would involve (partial) index scanning.

> Another side effect that is not as easy to see is that SQL Server only has
> histograms for the first column of an index (or any multi-column statistic,
> which can also be created manually).  A non-selective leading key will cause
> the histogram to not be very useful to the optimizer, and the cardinality
> estimations may be further off as a result.

That is a good point.
When the optimizer is determining whether or not to use the index, then
*in general* the estimation based on the index that starts with the most
selective column will be more accurate.

But a bit more in-depth, it depends.

In the experiment I mentioned, the column "low" had 103 distinct values.
In a 10 million row table, that means very poor selectivity (especially
when compared to the 8.3 million distinct values in column "high"). But
the statistics histogram of index (high, low) has no more than 124
groups. That is just 20 percent more than the 103 for the (low, high)
index. Now the AVG_RANGE_ROWS value adds a lot of value to the "high"
histogram (because it gives an estimate of the density within the
group), but will only be accurate if the data distribution in that group
is uniform.

So the (high, low) index would still give better estimates ...

.... unless the low column has a non-uniform data distribution and the
high column a uniform data distribution. In that case an index on (low,
high) should do better. The histogram of the low column should take care
of any distribution oddities and the density estimate for the related
high values would be accurate.
If the (high, low) index were used in this example, the high value would
probably only be found in a range group in the histogram and the use of
the density / AVG_RANGE_ROWS information would result in bad estimates
for many values because it does not (cannot) take the skewed data
distribution of the low column into account.

Gert-Jan

Show quote
> All of this assumes that you are filtering on those columns - as you point
> out, if you don't ever filter on a column, putting it as a leading column in
> an index makes less sense (it still could be useful, but not as much)
>
> You should also check the query plans for this case.  The "*" may be causing
> bookmark lookups if your index is not covering.  (I can't tell from what you
> posted).
>
> Conor
[snip]
Author
28 Aug 2006 2:02 PM
ML
1 and 2: the order of conditions in the WHERE clause is irrelevant to whether
the index will be used. Whether an index is going to be used depends on the
selectivity of the data (among other things).

3. If you need to query the data by specifying a particular column in a
condition (WHERE, ON) an index may improve performance. Again, this depends
on other factors.

Indexes are explained in more detail in Books Online.


ML

---
http://milambda.blogspot.com/
Author
28 Aug 2006 4:37 PM
mEmENT0m0RI
My guess is that the question was not whether the index is going to be
used or not, but more like if there is going to be a performance
difference by switching the order of the parameters in the where clause
assuming that the index is (or is not) used in both scenarios.
And also what would produce better results: defining a composite index
on both columns or defining two separate ones on each column?


*** Sent via Developersdex http://www.developersdex.com ***
Author
28 Aug 2006 4:45 PM
ML
The simplest way to find out would be to test it.


ML

---
http://milambda.blogspot.com/
Author
28 Aug 2006 5:06 PM
Jim Underwood
How the index is or is not used is dependent on a lot of things, so testing
is the only way to be certain of what works and what does not, but in
general...

If your criteria is on two columns, and both criteria are equalities, then
using one index on both columns would be the most efficient way to retrieve
the results.  SQL Server will determine on its own whether or not to use the
index, but it will use only one index when searching the table, not two.
If you create two indexes, one on each column, then SQL Server may use one
or the other, or neither, but never both.

Of course, if you change the query to use ORs or unions, SQL Server could
decide to query the table twice, once with each index, but this has nothing
to do with the proposed scenario.

Show quote
"mEmENT0m0RI" <nospam@devdex.com> wrote in message
news:%23GBsTBsyGHA.4844@TK2MSFTNGP04.phx.gbl...
> My guess is that the question was not whether the index is going to be
> used or not, but more like if there is going to be a performance
> difference by switching the order of the parameters in the where clause
> assuming that the index is (or is not) used in both scenarios.
> And also what would produce better results: defining a composite index
> on both columns or defining two separate ones on each column?
>
>
> *** Sent via Developersdex http://www.developersdex.com ***
Author
28 Aug 2006 7:04 PM
Tom Cooper
SQL Server will use both indexes if it decides it is more efficient.  For
example, if I run:
Use Northwind
go
Set ShowPlan_Text On
go
Select CustomerID, PostalCode, Region
From Customers
Where Region = 'SP'
Or PostalCode = '1010'
go
Set ShowPlan_Text Off
go

The query plan generated is:

StmtText
--------------------------------------------------------------------------------------------------------------------------------------------
  |--Bookmark Lookup(BOOKMARK:([Bmk1000]),
OBJECT:([Northwind].[dbo].[Customers]))
       |--Sort(DISTINCT ORDER BY:([Bmk1000] ASC))
            |--Concatenation
                 |--Index
Seek(OBJECT:([Northwind].[dbo].[Customers].[Region]),
SEEK:([Customers].[Region]='SP') ORDERED FORWARD)
                 |--Index
Seek(OBJECT:([Northwind].[dbo].[Customers].[PostalCode]),
SEEK:([Customers].[PostalCode]='1010') ORDERED FORWARD)

which is using both the Region and PostalCode indexes on the Customers
table.

Tom

Show quote
"Jim Underwood" <james.underwoodATfallonclinic.com> wrote in message
news:uCRPbRsyGHA.2036@TK2MSFTNGP05.phx.gbl...
> How the index is or is not used is dependent on a lot of things, so
> testing
> is the only way to be certain of what works and what does not, but in
> general...
>
> If your criteria is on two columns, and both criteria are equalities, then
> using one index on both columns would be the most efficient way to
> retrieve
> the results.  SQL Server will determine on its own whether or not to use
> the
> index, but it will use only one index when searching the table, not two.
> If you create two indexes, one on each column, then SQL Server may use one
> or the other, or neither, but never both.
>
> Of course, if you change the query to use ORs or unions, SQL Server could
> decide to query the table twice, once with each index, but this has
> nothing
> to do with the proposed scenario.
>
> "mEmENT0m0RI" <nospam@devdex.com> wrote in message
> news:%23GBsTBsyGHA.4844@TK2MSFTNGP04.phx.gbl...
>> My guess is that the question was not whether the index is going to be
>> used or not, but more like if there is going to be a performance
>> difference by switching the order of the parameters in the where clause
>> assuming that the index is (or is not) used in both scenarios.
>> And also what would produce better results: defining a composite index
>> on both columns or defining two separate ones on each column?
>>
>>
>> *** Sent via Developersdex http://www.developersdex.com ***
>
>
Author
28 Aug 2006 7:34 PM
Jim Underwood
yup, that falls into the category of using ORs/unions which allows SQL
server to break the query into two queries then concatenate the results.
When using two criteria with AND, however, I can't think of a case where
this would be possible.

Show quote
"Tom Cooper" <tomcooper@comcast.no.spam.please.net> wrote in message
news:IMednXOqvvKlo27ZnZ2dnUVZ_sudnZ2d@comcast.com...
> SQL Server will use both indexes if it decides it is more efficient.  For
> example, if I run:
> Use Northwind
> go
> Set ShowPlan_Text On
> go
> Select CustomerID, PostalCode, Region
> From Customers
> Where Region = 'SP'
> Or PostalCode = '1010'
> go
> Set ShowPlan_Text Off
> go
>
> The query plan generated is:
>
> StmtText
> --------------------------------------------------------------------------
------------------------------------------------------------------
>   |--Bookmark Lookup(BOOKMARK:([Bmk1000]),
> OBJECT:([Northwind].[dbo].[Customers]))
>        |--Sort(DISTINCT ORDER BY:([Bmk1000] ASC))
>             |--Concatenation
>                  |--Index
> Seek(OBJECT:([Northwind].[dbo].[Customers].[Region]),
> SEEK:([Customers].[Region]='SP') ORDERED FORWARD)
>                  |--Index
> Seek(OBJECT:([Northwind].[dbo].[Customers].[PostalCode]),
> SEEK:([Customers].[PostalCode]='1010') ORDERED FORWARD)
>
> which is using both the Region and PostalCode indexes on the Customers
> table.
>
> Tom
>
> "Jim Underwood" <james.underwoodATfallonclinic.com> wrote in message
> news:uCRPbRsyGHA.2036@TK2MSFTNGP05.phx.gbl...
> > How the index is or is not used is dependent on a lot of things, so
> > testing
> > is the only way to be certain of what works and what does not, but in
> > general...
> >
> > If your criteria is on two columns, and both criteria are equalities,
then
> > using one index on both columns would be the most efficient way to
> > retrieve
> > the results.  SQL Server will determine on its own whether or not to use
> > the
> > index, but it will use only one index when searching the table, not two.
> > If you create two indexes, one on each column, then SQL Server may use
one
> > or the other, or neither, but never both.
> >
> > Of course, if you change the query to use ORs or unions, SQL Server
could
> > decide to query the table twice, once with each index, but this has
> > nothing
> > to do with the proposed scenario.
> >
> > "mEmENT0m0RI" <nospam@devdex.com> wrote in message
> > news:%23GBsTBsyGHA.4844@TK2MSFTNGP04.phx.gbl...
> >> My guess is that the question was not whether the index is going to be
> >> used or not, but more like if there is going to be a performance
> >> difference by switching the order of the parameters in the where clause
> >> assuming that the index is (or is not) used in both scenarios.
> >> And also what would produce better results: defining a composite index
> >> on both columns or defining two separate ones on each column?
> >>
> >>
> >> *** Sent via Developersdex http://www.developersdex.com ***
> >
> >
>
>
Author
28 Aug 2006 7:42 PM
Alexander Kuznetsov
Jim Underwood wrote:
> yup, that falls into the category of using ORs/unions which allows SQL
> server to break the query into two queries then concatenate the results.
> When using two criteria with AND, however, I can't think of a case where
> this would be possible.

Not exactly - I think it merges 2 sets of bookmarks, then does bookmark
lookup just once per row. It is called index ORing. Index ANDing is
also quite possible - it joins 2 sets of bookmarks first, then uses the
result set to perform bookmark lookups.
Author
28 Aug 2006 8:02 PM
Jim Underwood
Are you certain about the index ANDing?  It seems to me that it would work
something like this...

Select column1, column2 --(each column has its own index)
from SomeTable
where column1 = 10
and column2 = 'blue'

Lets say the result is 50 rows of data, but 100 rows match the 1st criteria
and 100 match the second criteria.

The optimizer would join 200 rows of bookmarks, rather than filter only 100
from one index or the other?

It seems that this would always be slower than using only one index.

Or am I simply missing another situation in which this would be efficient?

Show quote
"Alexander Kuznetsov" <AK_TIREDOFSPAM@hotmail.COM> wrote in message
news:1156794157.300651.223530@74g2000cwt.googlegroups.com...
> Jim Underwood wrote:
> > yup, that falls into the category of using ORs/unions which allows SQL
> > server to break the query into two queries then concatenate the results.
> > When using two criteria with AND, however, I can't think of a case where
> > this would be possible.
>
> Not exactly - I think it merges 2 sets of bookmarks, then does bookmark
> lookup just once per row. It is called index ORing. Index ANDing is
> also quite possible - it joins 2 sets of bookmarks first, then uses the
> result set to perform bookmark lookups.
>
Author
28 Aug 2006 8:16 PM
Alexander Kuznetsov
Jim,

the optimizer will join bookmarks from one index (selectivity S1)
against bookmarks from the other (selectivity S2). It will expect to
get S1*S2 selectivity

SET SHOWPLAN_TEXT ON
go
SELECT Make, Color, SalePrice FROM Vehicle WHERE Color = 'Red' AND Make
= 'Honda'

------------------

StmtText

--------------------------------------------------------------------------------------------------------------------------------

  |--Bookmark Lookup(BOOKMARK:([Bmk1000]),
OBJECT:([Sandbox].[dbo].[Vehicle]))
       |--Hash Match(Inner Join, HASH:([Bmk1000])=([Bmk1000]),
RESIDUAL:([Bmk1000]=[Bmk1000]))
            |--Index
Seek(OBJECT:([Sandbox].[dbo].[Vehicle].[VehicleColor]),
SEEK:([Vehicle].[Color]='Red') ORDERED FORWARD)
            |--Index
Seek(OBJECT:([Sandbox].[dbo].[Vehicle].[VehicleMake]),
SEEK:([Vehicle].[Make]='Honda') ORDERED FORWARD)

In this particular case the selectivity of Color = 'Red' alone is low,

SELECT Make, Color, SalePrice FROM Vehicle WHERE Color = 'Red'

executes as a clustered index scan. Similarly, the selectivity of Make
= 'Honda' alone is low,

SELECT Make, Color, SalePrice FROM Vehicle WHERE Make = 'Honda'

also executes as a clustered index scan.
Author
28 Aug 2006 8:35 PM
Jim Underwood
Thanks for the example, I think I just figured out why this is faster...

The problem was I couldn't figure out how joining two sets of bookmarks
would be more efficient than filtering a single set, but I was not
considering the timing of each action and the resources involved.

If I understand correctly, it can efficiently check two indexes and join the
bookmarks together without ever reading a row of the actual table.  This
could result in a lot less IO than having to read every row in order to
apply the filter.

Do I have that right?

Show quote
"Alexander Kuznetsov" <AK_TIREDOFSPAM@hotmail.COM> wrote in message
news:1156796206.350067.294390@i3g2000cwc.googlegroups.com...
> Jim,
>
> the optimizer will join bookmarks from one index (selectivity S1)
> against bookmarks from the other (selectivity S2). It will expect to
> get S1*S2 selectivity
>
> SET SHOWPLAN_TEXT ON
> go
> SELECT Make, Color, SalePrice FROM Vehicle WHERE Color = 'Red' AND Make
> = 'Honda'
>
> ------------------
>
> StmtText
>
> --------------------------------------------------------------------------
------------------------------------------------------
>
>   |--Bookmark Lookup(BOOKMARK:([Bmk1000]),
> OBJECT:([Sandbox].[dbo].[Vehicle]))
>        |--Hash Match(Inner Join, HASH:([Bmk1000])=([Bmk1000]),
> RESIDUAL:([Bmk1000]=[Bmk1000]))
>             |--Index
> Seek(OBJECT:([Sandbox].[dbo].[Vehicle].[VehicleColor]),
> SEEK:([Vehicle].[Color]='Red') ORDERED FORWARD)
>             |--Index
> Seek(OBJECT:([Sandbox].[dbo].[Vehicle].[VehicleMake]),
> SEEK:([Vehicle].[Make]='Honda') ORDERED FORWARD)
>
> In this particular case the selectivity of Color = 'Red' alone is low,
>
> SELECT Make, Color, SalePrice FROM Vehicle WHERE Color = 'Red'
>
> executes as a clustered index scan. Similarly, the selectivity of Make
> = 'Honda' alone is low,
>
> SELECT Make, Color, SalePrice FROM Vehicle WHERE Make = 'Honda'
>
> also executes as a clustered index scan.
>
Author
28 Aug 2006 8:43 PM
Tibor Karaszi
> Do I have that right?

Yep.

Show quote
"Jim Underwood" <james.underwoodATfallonclinic.com> wrote in message
news:eOBdzFuyGHA.1256@TK2MSFTNGP04.phx.gbl...
> Thanks for the example, I think I just figured out why this is faster...
>
> The problem was I couldn't figure out how joining two sets of bookmarks
> would be more efficient than filtering a single set, but I was not
> considering the timing of each action and the resources involved.
>
> If I understand correctly, it can efficiently check two indexes and join the
> bookmarks together without ever reading a row of the actual table.  This
> could result in a lot less IO than having to read every row in order to
> apply the filter.
>
> Do I have that right?
>
> "Alexander Kuznetsov" <AK_TIREDOFSPAM@hotmail.COM> wrote in message
> news:1156796206.350067.294390@i3g2000cwc.googlegroups.com...
>> Jim,
>>
>> the optimizer will join bookmarks from one index (selectivity S1)
>> against bookmarks from the other (selectivity S2). It will expect to
>> get S1*S2 selectivity
>>
>> SET SHOWPLAN_TEXT ON
>> go
>> SELECT Make, Color, SalePrice FROM Vehicle WHERE Color = 'Red' AND Make
>> = 'Honda'
>>
>> ------------------
>>
>> StmtText
>>
>> --------------------------------------------------------------------------
> ------------------------------------------------------
>>
>>   |--Bookmark Lookup(BOOKMARK:([Bmk1000]),
>> OBJECT:([Sandbox].[dbo].[Vehicle]))
>>        |--Hash Match(Inner Join, HASH:([Bmk1000])=([Bmk1000]),
>> RESIDUAL:([Bmk1000]=[Bmk1000]))
>>             |--Index
>> Seek(OBJECT:([Sandbox].[dbo].[Vehicle].[VehicleColor]),
>> SEEK:([Vehicle].[Color]='Red') ORDERED FORWARD)
>>             |--Index
>> Seek(OBJECT:([Sandbox].[dbo].[Vehicle].[VehicleMake]),
>> SEEK:([Vehicle].[Make]='Honda') ORDERED FORWARD)
>>
>> In this particular case the selectivity of Color = 'Red' alone is low,
>>
>> SELECT Make, Color, SalePrice FROM Vehicle WHERE Color = 'Red'
>>
>> executes as a clustered index scan. Similarly, the selectivity of Make
>> = 'Honda' alone is low,
>>
>> SELECT Make, Color, SalePrice FROM Vehicle WHERE Make = 'Honda'
>>
>> also executes as a clustered index scan.
>>
>
>
Author
28 Aug 2006 8:50 PM
Jim Underwood
As always, thank you for the insight.  I would never learn half as much in
any course as I do in this newsgroup.

Show quote
"Tibor Karaszi" <tibor_please.no.email_kara***@hotmail.nomail.com> wrote in
message news:euQGtKuyGHA.4816@TK2MSFTNGP06.phx.gbl...
> > Do I have that right?
>
> Yep.
>
> --
> Tibor Karaszi, SQL Server MVP
> http://www.karaszi.com/sqlserver/default.asp
> http://www.solidqualitylearning.com/
>
>
> "Jim Underwood" <james.underwoodATfallonclinic.com> wrote in message
> news:eOBdzFuyGHA.1256@TK2MSFTNGP04.phx.gbl...
> > Thanks for the example, I think I just figured out why this is faster...
> >
> > The problem was I couldn't figure out how joining two sets of bookmarks
> > would be more efficient than filtering a single set, but I was not
> > considering the timing of each action and the resources involved.
> >
> > If I understand correctly, it can efficiently check two indexes and join
the
> > bookmarks together without ever reading a row of the actual table.  This
> > could result in a lot less IO than having to read every row in order to
> > apply the filter.
> >
> > Do I have that right?
> >
> > "Alexander Kuznetsov" <AK_TIREDOFSPAM@hotmail.COM> wrote in message
> > news:1156796206.350067.294390@i3g2000cwc.googlegroups.com...
> >> Jim,
> >>
> >> the optimizer will join bookmarks from one index (selectivity S1)
> >> against bookmarks from the other (selectivity S2). It will expect to
> >> get S1*S2 selectivity
> >>
> >> SET SHOWPLAN_TEXT ON
> >> go
> >> SELECT Make, Color, SalePrice FROM Vehicle WHERE Color = 'Red' AND Make
> >> = 'Honda'
> >>
> >> ------------------
> >>
> >> StmtText
> >>
>
>> -------------------------------------------------------------------------
-
> > ------------------------------------------------------
> >>
> >>   |--Bookmark Lookup(BOOKMARK:([Bmk1000]),
> >> OBJECT:([Sandbox].[dbo].[Vehicle]))
> >>        |--Hash Match(Inner Join, HASH:([Bmk1000])=([Bmk1000]),
> >> RESIDUAL:([Bmk1000]=[Bmk1000]))
> >>             |--Index
> >> Seek(OBJECT:([Sandbox].[dbo].[Vehicle].[VehicleColor]),
> >> SEEK:([Vehicle].[Color]='Red') ORDERED FORWARD)
> >>             |--Index
> >> Seek(OBJECT:([Sandbox].[dbo].[Vehicle].[VehicleMake]),
> >> SEEK:([Vehicle].[Make]='Honda') ORDERED FORWARD)
> >>
> >> In this particular case the selectivity of Color = 'Red' alone is low,
> >>
> >> SELECT Make, Color, SalePrice FROM Vehicle WHERE Color = 'Red'
> >>
> >> executes as a clustered index scan. Similarly, the selectivity of Make
> >> = 'Honda' alone is low,
> >>
> >> SELECT Make, Color, SalePrice FROM Vehicle WHERE Make = 'Honda'
> >>
> >> also executes as a clustered index scan.
> >>
> >
> >
>
Author
28 Aug 2006 8:56 PM
Tibor Karaszi
> I would never learn half as much in
> any course as I do in this newsgroup.

As a trainer, how should I respond to that? <g>

I hope all learn as much as I do in these newsgroups, with the possible exception of Joe Celko, who
pick up material for his books.

Show quote
"Jim Underwood" <james.underwoodATfallonclinic.com> wrote in message
news:OfgfdOuyGHA.4596@TK2MSFTNGP02.phx.gbl...
> As always, thank you for the insight.  I would never learn half as much in
> any course as I do in this newsgroup.
>
> "Tibor Karaszi" <tibor_please.no.email_kara***@hotmail.nomail.com> wrote in
> message news:euQGtKuyGHA.4816@TK2MSFTNGP06.phx.gbl...
>> > Do I have that right?
>>
>> Yep.
>>
>> --
>> Tibor Karaszi, SQL Server MVP
>> http://www.karaszi.com/sqlserver/default.asp
>> http://www.solidqualitylearning.com/
>>
>>
>> "Jim Underwood" <james.underwoodATfallonclinic.com> wrote in message
>> news:eOBdzFuyGHA.1256@TK2MSFTNGP04.phx.gbl...
>> > Thanks for the example, I think I just figured out why this is faster...
>> >
>> > The problem was I couldn't figure out how joining two sets of bookmarks
>> > would be more efficient than filtering a single set, but I was not
>> > considering the timing of each action and the resources involved.
>> >
>> > If I understand correctly, it can efficiently check two indexes and join
> the
>> > bookmarks together without ever reading a row of the actual table.  This
>> > could result in a lot less IO than having to read every row in order to
>> > apply the filter.
>> >
>> > Do I have that right?
>> >
>> > "Alexander Kuznetsov" <AK_TIREDOFSPAM@hotmail.COM> wrote in message
>> > news:1156796206.350067.294390@i3g2000cwc.googlegroups.com...
>> >> Jim,
>> >>
>> >> the optimizer will join bookmarks from one index (selectivity S1)
>> >> against bookmarks from the other (selectivity S2). It will expect to
>> >> get S1*S2 selectivity
>> >>
>> >> SET SHOWPLAN_TEXT ON
>> >> go
>> >> SELECT Make, Color, SalePrice FROM Vehicle WHERE Color = 'Red' AND Make
>> >> = 'Honda'
>> >>
>> >> ------------------
>> >>
>> >> StmtText
>> >>
>>
>>> -------------------------------------------------------------------------
> -
>> > ------------------------------------------------------
>> >>
>> >>   |--Bookmark Lookup(BOOKMARK:([Bmk1000]),
>> >> OBJECT:([Sandbox].[dbo].[Vehicle]))
>> >>        |--Hash Match(Inner Join, HASH:([Bmk1000])=([Bmk1000]),
>> >> RESIDUAL:([Bmk1000]=[Bmk1000]))
>> >>             |--Index
>> >> Seek(OBJECT:([Sandbox].[dbo].[Vehicle].[VehicleColor]),
>> >> SEEK:([Vehicle].[Color]='Red') ORDERED FORWARD)
>> >>             |--Index
>> >> Seek(OBJECT:([Sandbox].[dbo].[Vehicle].[VehicleMake]),
>> >> SEEK:([Vehicle].[Make]='Honda') ORDERED FORWARD)
>> >>
>> >> In this particular case the selectivity of Color = 'Red' alone is low,
>> >>
>> >> SELECT Make, Color, SalePrice FROM Vehicle WHERE Color = 'Red'
>> >>
>> >> executes as a clustered index scan. Similarly, the selectivity of Make
>> >> = 'Honda' alone is low,
>> >>
>> >> SELECT Make, Color, SalePrice FROM Vehicle WHERE Make = 'Honda'
>> >>
>> >> also executes as a clustered index scan.
>> >>
>> >
>> >
>>
>
>
Author
28 Aug 2006 9:17 PM
Tom Cooper
Right, the optimizer AND's the bookmarks retrieved from each indexes and
only goes to get the actual row if the rows appears in the result from both
indexes.  In some cases, it can even do better than that.  If, between the
two indexes (and the columns in the clustered index), the query is covered,
it nevered needs to go to the table at all.  So, if in Alexander's example,
the clustered index was VIN (the vehicle id number) and the query was
SELECT VIN FROM Vehicle WHERE Color = 'Red AND Make = 'Honda'
than that query is covered by the two indexes and access to the table is not
needed.

For example, try running the following script (WARNING, it creates a
database named Foo, loads data into it and then drops that database, so be
careful where you run it - if for example, you already have a database named
Foo, this script would drop it).  Running this script takes about 3 minutes
on my server.

It creates a database, loads a table in that database with 1,000,000 or so
rows and a  clustered index on the primary key (PK), sets up two columns
(Status and Region) each with evenly distributed values of about 0-1000,
then creates nonclustered indexes on those two columns.  It then shows the
query play that the optimizer would pick if you did

Select PK, Status, Region
From FooTable
Where Region = 54
And Status = 123

It then actually does this query using index hints - once using both
indexes, once using the clustered index, and once using only one of the
indexes.  As you will see, using both indexes is the fastest.

Set NoCount On
Create Database Foo
go
Use Foo
go
Create Table FooTable
  (PK int Primary Key Identity,
  Status int Null,
  Region int Null,
  CharColumn char(100))
go
Declare @Loop int
Insert FooTable (Status, Region, CharColumn) Values (Null, Null, '')
Set @Loop = 0
While @Loop < 20
Begin
  Insert FooTable (Status, Region, CharColumn)
    Select Region, Status, CharColumn From FooTable
  Set @Loop = @Loop + 1
End
Select Count(*) From FooTable
Update FooTable
  Set Status = PK % 1000,
  Region = PK % 1001
Create Index Status On FooTable (Status)
Create Index Region On FooTable (Region)
go
Set ShowPlan_Text On
go
Select PK, Status, Region
From FooTable
Where Region = 54
And Status = 123
go
Set ShowPlan_Text Off
go
Declare @Time1 datetime
Declare @Time2 datetime
Declare @Time3 datetime
Declare @Time4 datetime
Select @Time1 = GetDate()
Select PK, Status, Region
From FooTable (index (Region, Status))
Where Region = 54
And Status = 123
Select @Time2 = GetDate()
Select PK, Status, Region
From FooTable (index=1)
Where Region = 54
And Status = 123
Select @Time3 = GetDate()
Select PK, Status, Region
From FooTable (index (Region))
Where Region = 54
And Status = 123
Select @Time4 = GetDate()
Select DateDiff(ms, @Time1, @Time2) As BothIndexes,
       DateDiff(ms, @Time2, @Time3) As ClusteredIndex,
       DateDiff(ms, @Time3, @Time4) As OneIndex
go
Use master
go
Drop Database Foo

Tom

Show quote
"Jim Underwood" <james.underwoodATfallonclinic.com> wrote in message
news:eOBdzFuyGHA.1256@TK2MSFTNGP04.phx.gbl...
> Thanks for the example, I think I just figured out why this is faster...
>
> The problem was I couldn't figure out how joining two sets of bookmarks
> would be more efficient than filtering a single set, but I was not
> considering the timing of each action and the resources involved.
>
> If I understand correctly, it can efficiently check two indexes and join
> the
> bookmarks together without ever reading a row of the actual table.  This
> could result in a lot less IO than having to read every row in order to
> apply the filter.
>
> Do I have that right?
>
> "Alexander Kuznetsov" <AK_TIREDOFSPAM@hotmail.COM> wrote in message
> news:1156796206.350067.294390@i3g2000cwc.googlegroups.com...
>> Jim,
>>
>> the optimizer will join bookmarks from one index (selectivity S1)
>> against bookmarks from the other (selectivity S2). It will expect to
>> get S1*S2 selectivity
>>
>> SET SHOWPLAN_TEXT ON
>> go
>> SELECT Make, Color, SalePrice FROM Vehicle WHERE Color = 'Red' AND Make
>> = 'Honda'
>>
>> ------------------
>>
>> StmtText
>>
>> --------------------------------------------------------------------------
> ------------------------------------------------------
>>
>>   |--Bookmark Lookup(BOOKMARK:([Bmk1000]),
>> OBJECT:([Sandbox].[dbo].[Vehicle]))
>>        |--Hash Match(Inner Join, HASH:([Bmk1000])=([Bmk1000]),
>> RESIDUAL:([Bmk1000]=[Bmk1000]))
>>             |--Index
>> Seek(OBJECT:([Sandbox].[dbo].[Vehicle].[VehicleColor]),
>> SEEK:([Vehicle].[Color]='Red') ORDERED FORWARD)
>>             |--Index
>> Seek(OBJECT:([Sandbox].[dbo].[Vehicle].[VehicleMake]),
>> SEEK:([Vehicle].[Make]='Honda') ORDERED FORWARD)
>>
>> In this particular case the selectivity of Color = 'Red' alone is low,
>>
>> SELECT Make, Color, SalePrice FROM Vehicle WHERE Color = 'Red'
>>
>> executes as a clustered index scan. Similarly, the selectivity of Make
>> = 'Honda' alone is low,
>>
>> SELECT Make, Color, SalePrice FROM Vehicle WHERE Make = 'Honda'
>>
>> also executes as a clustered index scan.
>>
>
>
Author
28 Aug 2006 9:53 PM
Alexander Kuznetsov
> If I understand correctly, it can efficiently check two indexes and join the
> bookmarks together without ever reading a row of the actual table.  This
> could result in a lot less IO than having to read every row in order to
> apply the filter.
>
> Do I have that right?

Jim,

Yes, you got it. Using your own words, "it *could* result in a lot less
IO", with a very strong emphasis on could.
Suppose the selectivities for both conditions are 0.1. The optimizer
will assume that the selectivity for both conditions together
is 0.01 (derived as 0.1*0.1). Suppose Make and COlor are not
correlated, then the estimate of selectivity is more or less correct
and the plan involving index ANDing is good. Very frequently that's
just the case.

But. SOmetimes such optimizer's logic just does not work right.

For instance, Make and Model are strongly correlated. Suppose that
selectivity of condition (Model = 'Accord') is 0.05. The optimizer will
assume that the selectivity for
(Make = 'Honda' AND Model = 'Accord') is 0.1 times 0.05 is 0.005 when
in fact it is 0.05. (We humans know that all Accords are Hondas,
the optimizer does not know it). The consequenses are dire: selecting
5% rows using index ANDing is much slower as compared to
selecting the same set of rows by just scanning the whole clustered
index.
The optimizer's estimate of selectivity was a magnitude of order wrong
- 0.005 instead of 0.05.

Optimizing queries with condition on correlated columns is a very
interesting challenge.
If both columns are in one table, sometimes I use an index on a
computed column MakeModel AS (Make + ' '+ Model)
In other cases sometimes saving intermediate results in a temporary
table is an easy way out.
Author
28 Aug 2006 10:52 PM
Gert-Jan Strik
Alexander Kuznetsov wrote:
[snip]
> Optimizing queries with condition on correlated columns is a very
> interesting challenge.
> If both columns are in one table, sometimes I use an index on a
> computed column MakeModel AS (Make + ' '+ Model)
> In other cases sometimes saving intermediate results in a temporary
> table is an easy way out.

If you know you will be querying this combination frequently, then why
not create a compound index? A compound index will outperform single
index AND'ing any day of the week.

In fact, for index AND'ing to be worthwile, each index needs to be in a
specific selectivity range (quite selective, but not too selective). I
see use in it if you have many columns indexed, and you cannot predict
which combinations will be queries on.

Just my 5 cents,
Gert-Jan
Author
29 Aug 2006 4:13 PM
Alexander Kuznetsov
Gert-Jan,

I oversimplified my example. In fact I was selecting from several
tables, for instance:

SELECT Make, Color, SalePrice, RepairType FROM Vehicle JOIN Repair ON
Vehicle.VIN = Repair.VIN
WHERE Model = 'Accord' AND Make = 'Honda'

It should be a merge join or a hash join, but the optimizer was
expecting just a handful of rows and went for nested loops.
Selecting a lot of rows with a nested loops plan was very slow. I was
not able to have the optimizer use a compound index and
correctly estimate number of rows matching both conditions (Model =
'Accord' AND Make = 'Honda').
I handled such situations several times. In some cases an index on
computed column 'Honda Accord' helped.
In other cases the solution was to save intermediate results in a
temporary table and build an index on it - that means statistics,
therefore a very good estimate of cardinality.

SELECT VIN, Make, Color, SalePrice
INTO #t
FROM Vehicle WHERE Mileage BETWEEN 30000 AND 50000 AND Price BETWEEN
7000 AND 15000

Unlike the previous example with Make and Model, Mileage and Price are
also strongly correlated, but they have a lot of distinct values. In
this case the only approach which I know that works is to save
intermediate results in a temporary table and build an index on it.

Makes sense?
Author
29 Aug 2006 4:29 PM
Gert-Jan Strik
See inline

Alexander Kuznetsov wrote:
Show quote
>
> Gert-Jan,
>
> I oversimplified my example. In fact I was selecting from several
> tables, for instance:
>
> SELECT Make, Color, SalePrice, RepairType FROM Vehicle JOIN Repair ON
> Vehicle.VIN = Repair.VIN
> WHERE Model = 'Accord' AND Make = 'Honda'
>
> It should be a merge join or a hash join, but the optimizer was
> expecting just a handful of rows and went for nested loops.
> Selecting a lot of rows with a nested loops plan was very slow. I was
> not able to have the optimizer use a compound index and
> correctly estimate number of rows matching both conditions (Model =
> 'Accord' AND Make = 'Honda').
> I handled such situations several times. In some cases an index on
> computed column 'Honda Accord' helped.
> In other cases the solution was to save intermediate results in a
> temporary table and build an index on it - that means statistics,
> therefore a very good estimate of cardinality.

If the table has a skewed distribution this can happen. Especially with
stored procedures in combination with parameter sniffing. In those
cases, creating the stored procedure WITH RECOMPILE can help.

Also, the use of the compound index requires that its statistics are up
to date.


> SELECT VIN, Make, Color, SalePrice
> INTO #t
> FROM Vehicle WHERE Mileage BETWEEN 30000 AND 50000 AND Price BETWEEN
> 7000 AND 15000
>
> Unlike the previous example with Make and Model, Mileage and Price are
> also strongly correlated, but they have a lot of distinct values. In
> this case the only approach which I know that works is to save
> intermediate results in a temporary table and build an index on it.
>
> Makes sense?

Yes. Although you if the values are fixed, then you should consider an
indexed view instead of a temporary table. This way the data will always
be up to date, and saves you a lot of coding and debugging.

Gert-Jan
Author
29 Aug 2006 5:56 PM
Alexander Kuznetsov
> If the table has a skewed distribution this can happen. Especially with
> stored procedures in combination with parameter sniffing. In those
> cases, creating the stored procedure WITH RECOMPILE can help.
>
> Also, the use of the compound index requires that its statistics are up
> to date.

No! The reason is entirely different. Suppose the threshold between
non-clustered index seek and clustered index scan is 0.015 selectivity.
Suppose you have 10 different (Make, Model) pairs, 10% rows each, and
different models have different makes.
There is no data skew at all - each Make has selectivity 0.1, and each
Model also has selectivity 0.1.
The problem is that the optimizer does not know that there are only 10
different (Make, Model) pairs -
it assumes that there are 100 different (Make, Model) pairs. It assumes
that 1% rows are (Honda, Accord),
and 1% rows are (Toyota, Accord). Let me copy and paste from Conor
Cunningham's post in the same thread:

"SQL Server only has histograms for the first column of an index (or
any multi-column statistic,
which can also be created manually)"

This means that the optimizer cannot derive from statistics on an index
on (Model, Make) that there are 10% Honda Accords.
This is why I am using a computed column on Model+' '+Make.

>
> Yes. Although you if the values are fixed, then you should consider an
> indexed view instead of a temporary table. This way the data will always
> be up to date, and saves you a lot of coding and debugging.
>

I am always considering indexed views, but in OLTP encironments I am
usually deciding against them, because they introduce lock contention.
Suppose Vehicle.Model and Repair.Type are correlated. Without an
indexed view containing both columns, 2 connections can simultaneously
modify

Connection 1: Vehicle with VIN=123456
Connection 2: Repairs for VIN=123456

Once you have created an indexed view containing both Vehicle.Model and
Repair.Type, Connection 1 will acquire an update lock on the
corresponding row in
the indexed view, adn Connection 2 will have to wait. Also in many
cases the query I need to optimize is not run too frequently.
As such, it might be overall cheaper to not maintain an indexed view
necessary to speed up one query only.
Author
29 Aug 2006 9:53 PM
Gert-Jan Strik
Alexander Kuznetsov wrote:
Show quote
>
> > If the table has a skewed distribution this can happen. Especially with
> > stored procedures in combination with parameter sniffing. In those
> > cases, creating the stored procedure WITH RECOMPILE can help.
> >
> > Also, the use of the compound index requires that its statistics are up
> > to date.
>
> No! The reason is entirely different. Suppose the threshold between
> non-clustered index seek and clustered index scan is 0.015 selectivity.
> Suppose you have 10 different (Make, Model) pairs, 10% rows each, and
> different models have different makes.
> There is no data skew at all - each Make has selectivity 0.1, and each
> Model also has selectivity 0.1.
> The problem is that the optimizer does not know that there are only 10
> different (Make, Model) pairs -
> it assumes that there are 100 different (Make, Model) pairs. It assumes
> that 1% rows are (Honda, Accord),
> and 1% rows are (Toyota, Accord). Let me copy and paste from Conor
> Cunningham's post in the same thread:
>
> "SQL Server only has histograms for the first column of an index (or
> any multi-column statistic,
> which can also be created manually)"
>
> This means that the optimizer cannot derive from statistics on an index
> on (Model, Make) that there are 10% Honda Accords.
> This is why I am using a computed column on Model+' '+Make.
[snip]

There is some truth to this, but it is not as clear cut as you put it.
See my response to Conor.

But your example is not a good one, because the implementation script of
it (below) shows a 0.1 density estimation for both the "Make" and the
"Make, Model" combination. So the optimizer could (and should) know that
there is perfect correlation between Make and Model.

Gert-Jan


--drop table Vehicle
create table Vehicle
(Make char(10) not null
,Model char(10) not null
,constraint PK_Vehicle primary key clustered (Make, Model)
)
insert into Vehicle values ('Honda', 'Accord')
insert into Vehicle values ('Toyota','Corolla')
insert into Vehicle values ('c','c')
insert into Vehicle values ('d','d')
insert into Vehicle values ('e','e')
insert into Vehicle values ('f','f')
insert into Vehicle values ('g','g')
insert into Vehicle values ('h','h')
insert into Vehicle values ('i','i')
insert into Vehicle values ('j','j')
go
update statistics Vehicle
go
dbcc show_statistics (Vehicle,PK_Vehicle)
Author
30 Aug 2006 1:08 PM
Alexander Kuznetsov
> But your example is not a good one, because the implementation script of
> it (below) shows a 0.1 density estimation for both the "Make" and the
> "Make, Model" combination. So the optimizer could (and should) know that
> there is perfect correlation between Make and Model.
Gert-Jan,

you are right. I inflated Vehicle table to 100K rows, and the optimizer
did recognize the 100% correlation.

Thanks for the correction.
Author
28 Aug 2006 8:03 PM
boah123
mEmENT0m0RI wrote:
> My guess is that the question was not whether the index is going to be
> used or not, but more like if there is going to be a performance
> difference by switching the order of the parameters in the where clause
> assuming that the index is (or is not) used in both scenarios.
> And also what would produce better results: defining a composite index
> on both columns or defining two separate ones on each column?
>
>
> *** Sent via Developersdex http://www.developersdex.com ***

Thats correct!
Arjen

AddThis Social Bookmark Button