Home All Groups Group Topic Archive Search About

How does engine decide whether to scan or seek index

Author
26 Jan 2006 5:21 PM
Neil Waldie
Hi,

I'm working on an application that uses SQL Server 7 as a back end. Nearly
everything in the database references one table with the following columns.

generation_id int (primary key)
entity_id int
effective_from datetime
effective_to datetime
active tinyint
deleted tinyint
other text fields.....(e.g. notes)

Querying this table can be pretty slow. But in most cases I can squeeze
reasonable performance out of my queries. By trial and error I've found I get
best performance with a non-clustered index on all the columns above bar the
'other text cols.' However, in a few cases with what seems to be an almost
identical query, the execution plan shows the index is being scanned, and
this causes the query to run slowly.

Can anyone tell me how (in general) the engine decides whether to scan or
seek on the index? Is it related to the properties of any other tables you
join to? I haven't been able to find a description of this process anywhere -
all you seem to get is a description of what the various types of index are,
and how things work is relegated to the 'as if by magic' category!

TIA

Neil

Author
26 Jan 2006 5:50 PM
Mark Williams
From my experience, you will see the optimizer choose a scan of an index
(rather than a seek) whenever the filter in the WHERE condition specifies a
range of values using operators like <,>, BETWEEN or LIKE.

If your table has a clustered index, and you specify an exact-match filter
in your WHERE clause using a column or columns that have a nonclustered
index, you will typically see an index seek with a bookmark lookup to the
clustered index.

If you have two identical queries that execute with different plans,
investigate how often statistics are updated on your tables and indexes.

--

Show quote
"Neil Waldie" wrote:

> Hi,
>
> I'm working on an application that uses SQL Server 7 as a back end. Nearly
> everything in the database references one table with the following columns.
>
> generation_id int (primary key)
> entity_id int
> effective_from datetime
> effective_to datetime
> active tinyint
> deleted tinyint
> other text fields.....(e.g. notes)
>
> Querying this table can be pretty slow. But in most cases I can squeeze
> reasonable performance out of my queries. By trial and error I've found I get
> best performance with a non-clustered index on all the columns above bar the
> 'other text cols.' However, in a few cases with what seems to be an almost
> identical query, the execution plan shows the index is being scanned, and
> this causes the query to run slowly.
>
> Can anyone tell me how (in general) the engine decides whether to scan or
> seek on the index? Is it related to the properties of any other tables you
> join to? I haven't been able to find a description of this process anywhere -
> all you seem to get is a description of what the various types of index are,
> and how things work is relegated to the 'as if by magic' category!
>
> TIA
>
> Neil
Author
27 Jan 2006 8:17 AM
Robert Klemme
Mark Williams wrote:
> From my experience, you will see the optimizer choose a scan of an
> index (rather than a seek) whenever the filter in the WHERE condition
> specifies a range of values using operators like <,>, BETWEEN or LIKE.
>
> If your table has a clustered index, and you specify an exact-match
> filter in your WHERE clause using a column or columns that have a
> nonclustered index, you will typically see an index seek with a
> bookmark lookup to the clustered index.
>
> If you have two identical queries that execute with different plans,
> investigate how often statistics are updated on your tables and
> indexes.

I'm not sure what you're at here.  Two identical queries is actually one
query.  If it executes with different plans over time then statistics
could be the reason or change in data distribution.

If two queries are just similar (meaning, same conditions but differnt
values for conditions) the optimizer will often choose a scan if
statistics indicate that a majority of rows will be returned for this
criterium.

Kind regards

    robert
Author
27 Jan 2006 9:01 AM
Neil Waldie
Robert,

I thought that might trip me up! You can never shortcut these things....

What I meant by 'almost identical' was that the database consists of several
tables that are linked to my problem table by the generation id. Most of the
queries I'm using are of the same form (TableX INNER JOIN Generation on
TableX.g_id = Generation.g_id) but using different tables (TableX). The
Generation table they ALL join too was someone else's bright idea which I
have to implement and it is this sort of query that causes me problems. 99%
of them use an index seek in execution, 1 % use a scan.

This is why I raised the possibility that the decision to do that may be
related to something about the table you join to. You've pointed out that
'the optimizer will often choose a scan if statistics indicate that a
majority of rows will be returned for this criterium.'  The criteria for
searching will always be the same - is the search date between the effective
dates and is the row active and not deleted. But maybe the number of rows
returned by the join on the primary key affects this decision too? I'll have
a look and see if this only happens where TableX contains a large number of
rows that join to the Generation table.

As Mark pointed out, I didn't specify my WHERE clause, so it makes it hard
for you to know anything about the criteria for searching!

Thanks for your help - you've given me something to think about. I'll post a
note if I find anything worth reporting.

Neil

Show quote
"Robert Klemme" wrote:

> Mark Williams wrote:
> > From my experience, you will see the optimizer choose a scan of an
> > index (rather than a seek) whenever the filter in the WHERE condition
> > specifies a range of values using operators like <,>, BETWEEN or LIKE.
> >
> > If your table has a clustered index, and you specify an exact-match
> > filter in your WHERE clause using a column or columns that have a
> > nonclustered index, you will typically see an index seek with a
> > bookmark lookup to the clustered index.
> >
> > If you have two identical queries that execute with different plans,
> > investigate how often statistics are updated on your tables and
> > indexes.
>
> I'm not sure what you're at here.  Two identical queries is actually one
> query.  If it executes with different plans over time then statistics
> could be the reason or change in data distribution.
>
> If two queries are just similar (meaning, same conditions but differnt
> values for conditions) the optimizer will often choose a scan if
> statistics indicate that a majority of rows will be returned for this
> criterium.
>
> Kind regards
>
>     robert
>
>
Author
27 Jan 2006 10:56 AM
Robert Klemme
Neil Waldie wrote:
Show quote
> Robert,
>
> I thought that might trip me up! You can never shortcut these
> things....
>
> What I meant by 'almost identical' was that the database consists of
> several tables that are linked to my problem table by the generation
> id. Most of the queries I'm using are of the same form (TableX INNER
> JOIN Generation on TableX.g_id = Generation.g_id) but using different
> tables (TableX). The Generation table they ALL join too was someone
> else's bright idea which I have to implement and it is this sort of
> query that causes me problems. 99% of them use an index seek in
> execution, 1 % use a scan.
>
> This is why I raised the possibility that the decision to do that may
> be related to something about the table you join to. You've pointed
> out that 'the optimizer will often choose a scan if statistics
> indicate that a majority of rows will be returned for this
> criterium.'  The criteria for searching will always be the same - is
> the search date between the effective dates and is the row active and
> not deleted. But maybe the number of rows returned by the join on the
> primary key affects this decision too? I'll have a look and see if
> this only happens where TableX contains a large number of rows that
> join to the Generation table.

That's exactly what I meant.  If the optimizer figures that 90% of rows
from you fact table are returned by the join he might as well go about
scanning.

Then again, if tables are joint the decision situation becomes more
complex: two tables with statistics, join condition etc.

> As Mark pointed out, I didn't specify my WHERE clause, so it makes it
> hard for you to know anything about the criteria for searching!

True. :-)

> Thanks for your help - you've given me something to think about. I'll
> post a note if I find anything worth reporting.

I'm glad I could help.  Good luck!

    robert
Author
30 Jan 2006 4:01 AM
awny100
Neil,

Could you provide the order of they columns in your index key?
Also it seems from your description above that there is no predicates
on entity_id. Is this the case?

Regards
Awny Al-Omari
Author
30 Jan 2006 8:39 AM
Neil Waldie
Awny,

You're correct - in this type of query entity_id is unneccessary as the
generation id is unique and the join will be between the entity table and the
generation table, ensuring I only get the entities I want. The entity_id is,
in truth, rarely of any relevance. I probably should have lumped it in the
'other columns' section.

The index is currently defined as
generation_id,
effective_from,
effective_to,
active,
deleted

but I have tried several other combinations, often having several indices
defined at the same time in the hope that one of them would provide more joy
than the others. I can't say I've tried every possible combination, but
certainly the first three columns have been re-arranged to have a bash at
being first!

Neil



Show quote
"awny***@gmail.com" wrote:

> Neil,
>
> Could you provide the order of they columns in your index key?
> Also it seems from your description above that there is no predicates
> on entity_id. Is this the case?
>
> Regards
> Awny Al-Omari
>
>
Author
30 Jan 2006 2:35 PM
awny100
Neil,

It could be that the optimizer is choosing a Hash/Merge Join with index
scan against a Nested Loop Join with index seek on generation as inner.
There are several factors in optimizer decision. Most important are

- The number of rows retrieved from TableX i.e. the table size assuming
no local predicates on the table.
- The size of the generation table
- The constants used in your queries and in particular the date value
used against 'effective_from'. This determine their predicates
selectivities.

It would help if you can post a sample query with showplan. Also what
are roughly the table sizes?

Regards
Awny  Al-Omari
Author
27 Jan 2006 8:38 AM
Neil Waldie
Mmm. All my queries have to do the same thing - Join another entity table to
the generation data in the table to get the currently active version. So
you're correct in guessing that I use BETWEEN in the where clause - I should
have given more info. Unfortunately, they all use BETWEEN, so that is not the
key to my problem - In the vast majority of cases I still get a seek, not a
scan.

Re: statistics, I've been forcing an update in my testing using UPDATE
STATISTICS table WITH FULLSCAN. I've been doing this between tests and
changing index definitions as I thought this would ensure a level playing
field. I'd be interested to know what you think about that - is it the best
thing to do when testing? It only takes a second, so if it does no harm, I'll
keep doing it!

Thanks for your comments.

Neil

Show quote
"Mark Williams" wrote:

> From my experience, you will see the optimizer choose a scan of an index
> (rather than a seek) whenever the filter in the WHERE condition specifies a
> range of values using operators like <,>, BETWEEN or LIKE.
>
> If your table has a clustered index, and you specify an exact-match filter
> in your WHERE clause using a column or columns that have a nonclustered
> index, you will typically see an index seek with a bookmark lookup to the
> clustered index.
>
> If you have two identical queries that execute with different plans,
> investigate how often statistics are updated on your tables and indexes.
>
> --
>
> "Neil Waldie" wrote:
>
> > Hi,
> >
> > I'm working on an application that uses SQL Server 7 as a back end. Nearly
> > everything in the database references one table with the following columns.
> >
> > generation_id int (primary key)
> > entity_id int
> > effective_from datetime
> > effective_to datetime
> > active tinyint
> > deleted tinyint
> > other text fields.....(e.g. notes)
> >
> > Querying this table can be pretty slow. But in most cases I can squeeze
> > reasonable performance out of my queries. By trial and error I've found I get
> > best performance with a non-clustered index on all the columns above bar the
> > 'other text cols.' However, in a few cases with what seems to be an almost
> > identical query, the execution plan shows the index is being scanned, and
> > this causes the query to run slowly.
> >
> > Can anyone tell me how (in general) the engine decides whether to scan or
> > seek on the index? Is it related to the properties of any other tables you
> > join to? I haven't been able to find a description of this process anywhere -
> > all you seem to get is a description of what the various types of index are,
> > and how things work is relegated to the 'as if by magic' category!
> >
> > TIA
> >
> > Neil

AddThis Social Bookmark Button