Home All Groups Group Topic Archive Search About

clustered index on a identity field

Author
8 Sep 2005 6:44 PM
Pradeep Kutty
Hi All,

I have heard that if you create a clustered index on a identity field and
its a PK, it reduces the page splits...
But how?

Thanks,
Prad

Author
8 Sep 2005 2:42 PM
jason
if you mean that the number of pages required to define a table will
grow more slowly, that is correct. but there are many reasons not to do
this. in my experience, clustering any primary key is a pretty bad
idea, especially an identity column. the reasons are twofold:

1) clustering ensures that all like values will be ordered together. if
you apply this to a value that is gauranteed to be unique, all you are
accomplishing is gauranteeing the physical order of the rows on disk,
which would serve no real purpose (i.e. none of your apps should care
what order the rows are on disk).

in my opinion, you should reserve a clustered index for a column that
has non-unique values, and is often used as a search criteria by value.
that way, all of the rows with the same value are physically grouped
together, greatly improving the performance of the disk read to collect
all the data.

2) creating a clustered index on an identity column is very likely to
create insert hotspots, because it is gauranteed that almost every new
row will be written to the same datapage. if you have a large number of
inserts occurring, this can create lock issues, if the current
insertion page is locked for any reason from insertion. it's far
healthier if your database can insert rows into a nicely dispersed set
of data pages, to prevent issues of lock escalation impeding inserts.

anyway, those are just my two cents. i'm sure other people have other
opinions.
Author
8 Sep 2005 6:07 PM
Brian Selzer
I disagree.

The physical location on disk of index rows in a clustered index is
undefined.  There is no guarantee that the index pages of a clustered index
reside in any specified order on the physical hardware. What is defined is
that the rows within a specific leaf page are in order, and each clustered
index leaf page contains next and previous pointers to locate the next and
previous leaf pages.  One of the main benefits of using a clustered index is
that once the first row in an ordered scan is found, SQL Server doesn't need
to walk the b-tree to find the next row.  It can use the next pointer to
continue the ordered scan.  This can improve query performance considerably.

You should avoid using a non-unique key in a clustered index.  If a table
has a clustered index, then every nonclustered index uses the clustered
index key to locate the row in the table.  If the key of a clustered index
is not unique, then SQL Server adds a hidden 4-byte uniquifier to the key to
make it unique.  The entire key along with the uniqifier is then stored in
every nonclustered index row.

In addition, there is a significant degradation of join performance.  If you
put a nonclustered index on the primary key, then every join with related
tables requires an additional lookup to satisfy the query.  For each row,
the index for the primary key constraint is consulted yielding the clustered
index key.  That key is then used to lookup the row in the clustered index.
Therefore, if you have related tables, then the clustered index should live
on the primary key.

Your assertion about insert hotspots fails to mention the additional index
maintenance (page splits and updates) that is required if your clustered
index isn't on the IDENTITY column.  If the clustered index is on an
IDENTITY column, then there will never be any page splits, because inserts
become appends, both in the b-tree index nodes and in the leaf nodes.  You
can also use a 100% fillfactor, because there will never be any inserts,
only appends.  The tree grows up: at each level of the b-tree, a new page is
appended if it is needed, and if a new page is needed at the root level,
then a new root page is also added.  That's why inserts into a table with
only a clustered index on the IDENTITY column perform almost as well as
inserts into a table without any indexes at all.  Deletes and updates in
this case require minimal index maintenance, because SQL Server doesn't join
index pages, it only splits them when an insert exceeds the fillfactor
threshold.  Deletes just leave holes, so if you expect a lot of deletes,
then you should schedule a periodic DBCC INDEXDEFRAG or DBCC DBREINDEX.

SQL Server uses lazy spooling, which means that changes to the database are
cached and then periodically flushed to disk (checkpointed).  (SQL Server
only ensures that writes to the transaction log are flushed to disk before
returning from a commit.)  This means that more often than not, the last
page is still resident in memory if there are a large number of inserts,
thereby turning the insert hotspot into a performance improvement.

Finally, unless otherwise forced, the exclusive locks applied by an insert
should not escalate unless the page is full and all of the rows in the page
were just inserted by the current transaction.  Therefore, there is little
if any lock contention due to inserts.


Show quote
"jason" <iae***@yahoo.com> wrote in message
news:1126190553.831020.325290@f14g2000cwb.googlegroups.com...
> if you mean that the number of pages required to define a table will
> grow more slowly, that is correct. but there are many reasons not to do
> this. in my experience, clustering any primary key is a pretty bad
> idea, especially an identity column. the reasons are twofold:
>
> 1) clustering ensures that all like values will be ordered together. if
> you apply this to a value that is gauranteed to be unique, all you are
> accomplishing is gauranteeing the physical order of the rows on disk,
> which would serve no real purpose (i.e. none of your apps should care
> what order the rows are on disk).
>
> in my opinion, you should reserve a clustered index for a column that
> has non-unique values, and is often used as a search criteria by value.
> that way, all of the rows with the same value are physically grouped
> together, greatly improving the performance of the disk read to collect
> all the data.
>
> 2) creating a clustered index on an identity column is very likely to
> create insert hotspots, because it is gauranteed that almost every new
> row will be written to the same datapage. if you have a large number of
> inserts occurring, this can create lock issues, if the current
> insertion page is locked for any reason from insertion. it's far
> healthier if your database can insert rows into a nicely dispersed set
> of data pages, to prevent issues of lock escalation impeding inserts.
>
> anyway, those are just my two cents. i'm sure other people have other
> opinions.
>
Author
8 Sep 2005 7:04 PM
jason
interesting, you clearly have some well informed opinions. they just
happen to disagree with some of the sources i've used to come to my
understanding :) these things happen. i admit, my understanding of
clustered indexing, and indexing in general, is incomplete, but here
are some points that i thought i'd share, to see if perhaps there's an
absolute answer:

* from every text at my disposal, the primary advantage to a clustered
index is the fact that the data is guaranteed to be in the same order
as the index.

* the leaf nodes of a clustered index don't actually point to the next
and previous rows, because they don't have to. the next and previous
values are implied by the next and previous positions on disk.

* nonclustered indexes do point to next and previous data rows, because
the index order may not match the physical order, so a pointer is
required

* the greatest advantages to using clustered indexes, therefore, are
cases where the physical order of the data can increase your
performance. chief among these are (1) columns commonly used in group
by and order by clauses (2) columns on the many side of a one-to-many
relationship. because there is a tremendous increase to the performance
of the reads for these operations (3) columns whose values are
frequently evaluated between a range

* by the same reasoning, clustered indexes are wasted on unique value
columns, as they serve no tangible purpose (except as you mention,
perhaps ameliorating problems using identity columns without a
clustered index)

* using a clustered index on a unique value is wasted. the only reason
is the default for new tables created in enterprise manager is because
it grants advantages to some queries above a table that has zero
clustered indexes (better one than none)

here are my sources for this information:

microsoft sql server 2000 bible, paul nielsen, pp768-771

sql server 2000 for experienced dba's, brian knight, pp252-255

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql7/html/msdn_sql7perftune.asp

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql7/html/sql7sapr3.asp

these sources cite specific examples of where and why you would want to
use clustered indexes, and they are all on data that is grouped by,
sorted by, or evaluated in a range. especially date fields (which are
rarely unique, yes?)

one caveat that is mentioned is that you shouldn't apply a clustered
index to a column that is updated often, for the reason you stated,
that the clustered index is appended to every nonclustered index in the
table. but you cite this as a performance hindrance? my undrestanding
is that the presence of the clustered index is actually increasing
performance (if at a small cost of space) by skipping the pointer to
data page step of a lookup.

what are your sources, that we might compare / contrast?

jason
Author
8 Sep 2005 8:35 PM
David Browne
"jason" <iae***@yahoo.com> wrote in message
news:1126206276.049357.153630@g43g2000cwa.googlegroups.com...
> interesting, you clearly have some well informed opinions. they just
> happen to disagree with some of the sources i've used to come to my
> understanding :) these things happen. i admit, my understanding of
> clustered indexing, and indexing in general, is incomplete, but here
> are some points that i thought i'd share, to see if perhaps there's an
> absolute answer:
>
> * from every text at my disposal, the primary advantage to a clustered
> index is the fact that the data is guaranteed to be in the same order
> as the index.
>

And the clustered index requires one fewer IO per seek since the leaf is the
data page; plus seek performance of all non-clustered indexes requires 1-4
extra IO's.

> * the leaf nodes of a clustered index don't actually point to the next
> and previous rows, because they don't have to. the next and previous
> values are implied by the next and previous positions on disk.
>

No. The leaf nodes are data pages.  All data pages in a table are arranged
in a doubly linked list: each page has a next page and previous page
pointer.  A table scan traverses the table by following the next/previous
pointers from page to page.  For a table with a clustered index the order of
rows on a page and the order of pages determined by the next/previous
pointers follow the order of the index.  So after you traverse the index
down to the leaf level once you can then follow the next/previous pointers
to follow the index order.  So you don't have to go back to the index pages
to satisfy a range query: just seek to the first value and the scan to the
end.

A common misconception is that data pages in a clustered index are
guaranteed to be physically ordered.  They are not.  Only the logical order
of the doubly-linked list of data pages is guaranteed.  Difference between
the logical ordering of the pages and the physical order can arise over time
due to index fragmentation.


Show quote
> * nonclustered indexes do point to next and previous data rows, because
> the index order may not match the physical order, so a pointer is
> required
>
> * the greatest advantages to using clustered indexes, therefore, are
> cases where the physical order of the data can increase your
> performance. chief among these are (1) columns commonly used in group
> by and order by clauses (2) columns on the many side of a one-to-many
> relationship. because there is a tremendous increase to the performance
> of the reads for these operations (3) columns whose values are
> frequently evaluated between a range
>
> * by the same reasoning, clustered indexes are wasted on unique value
> columns, as they serve no tangible purpose (except as you mention,
> perhaps ameliorating problems using identity columns without a
> clustered index)

A row in a table with a clustered index can only be accessed through the
clustered index.  There's simply no other way to find the row.  If you have
a non-clustered unique index, and a non-unique clustered index then
accessing the table through the unique index will be more expensive since
you must traverse the unique index only to get, at the leaf level, a
non-unique clustered index key (plus 4-byte uniqifier), and then you must
traverse the non-unique index down to the leaf level (data page) on which
the row is located.  There are situation where this is a net improvement in
performance, but for applications which do primarily single-row
insert/update/delete clustering the primary key optimizes the most important
transaction.

>
> * using a clustered index on a unique value is wasted. the only reason
> is the default for new tables created in enterprise manager is because
> it grants advantages to some queries above a table that has zero
> clustered indexes (better one than none)

This is not generally the case since choosing the clustered index always
optimizes some access paths and degrades others.  You must analyze your
workload and measure the impace of clustering a non-unique tuple.

There are no valid sources for recieved wisdom or rules of thumb for this.
Only knoledge of how SQL Server is implemented and testing.

David
Author
9 Sep 2005 1:33 PM
jason
> There are no valid sources for recieved wisdom or rules of thumb for this.
> Only knoledge of how SQL Server is implemented and testing.

I'm getting that impression, which is why discussions like these are
extremely useful. They can supplement experiences you haven't had yet,
and as you say, give a more complete picture of how SQL Server is
implemented. So thank you very much!

Jason
Author
9 Sep 2005 12:21 AM
Brian Selzer
Perhaps you should bin those books and open up BOL.

SQL Server Architecture/Database Architecture/Physical Database
Architecture/Table and Index Architecture
This section contains information about the physical structure of clustered
indexes, nonclustered indexes, and heaps (tables without clustered indexes).


Creating and Maintaining Databases/Indexes
This section contains information about when to use clustered indexes, why
you should keep the key small, etc.



Show quote
"jason" <iae***@yahoo.com> wrote in message
news:1126206276.049357.153630@g43g2000cwa.googlegroups.com...
> interesting, you clearly have some well informed opinions. they just
> happen to disagree with some of the sources i've used to come to my
> understanding :) these things happen. i admit, my understanding of
> clustered indexing, and indexing in general, is incomplete, but here
> are some points that i thought i'd share, to see if perhaps there's an
> absolute answer:
>
> * from every text at my disposal, the primary advantage to a clustered
> index is the fact that the data is guaranteed to be in the same order
> as the index.
>
> * the leaf nodes of a clustered index don't actually point to the next
> and previous rows, because they don't have to. the next and previous
> values are implied by the next and previous positions on disk.
>
> * nonclustered indexes do point to next and previous data rows, because
> the index order may not match the physical order, so a pointer is
> required
>
> * the greatest advantages to using clustered indexes, therefore, are
> cases where the physical order of the data can increase your
> performance. chief among these are (1) columns commonly used in group
> by and order by clauses (2) columns on the many side of a one-to-many
> relationship. because there is a tremendous increase to the performance
> of the reads for these operations (3) columns whose values are
> frequently evaluated between a range
>
> * by the same reasoning, clustered indexes are wasted on unique value
> columns, as they serve no tangible purpose (except as you mention,
> perhaps ameliorating problems using identity columns without a
> clustered index)
>
> * using a clustered index on a unique value is wasted. the only reason
> is the default for new tables created in enterprise manager is because
> it grants advantages to some queries above a table that has zero
> clustered indexes (better one than none)
>
> here are my sources for this information:
>
> microsoft sql server 2000 bible, paul nielsen, pp768-771
>
> sql server 2000 for experienced dba's, brian knight, pp252-255
>
>
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql7/html/msdn_sql7perftune.asp
>
>
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql7/html/sql7sapr3.asp
Show quote
>
> these sources cite specific examples of where and why you would want to
> use clustered indexes, and they are all on data that is grouped by,
> sorted by, or evaluated in a range. especially date fields (which are
> rarely unique, yes?)
>
> one caveat that is mentioned is that you shouldn't apply a clustered
> index to a column that is updated often, for the reason you stated,
> that the clustered index is appended to every nonclustered index in the
> table. but you cite this as a performance hindrance? my undrestanding
> is that the presence of the clustered index is actually increasing
> performance (if at a small cost of space) by skipping the pointer to
> data page step of a lookup.
>
> what are your sources, that we might compare / contrast?
>
> jason
>
Author
9 Sep 2005 1:37 PM
jason
I've just added that to my library. Though it seems shocking that four
resources (including one that cites BOL as one of its resources, and
two direct microsoft articles) would be completely wrong. I suspect
that, as David states in his versions of the thread, there are a fuzzy
set of virtues to be implemented, each more or less advantageous in
certain situations. Still, good to know as many of those implentations
and situations as possible, so thanks for directing me to your sources!

Jason
Author
8 Sep 2005 6:19 PM
David Gugick
jason wrote:
Show quote
> if you mean that the number of pages required to define a table will
> grow more slowly, that is correct. but there are many reasons not to
> do this. in my experience, clustering any primary key is a pretty bad
> idea, especially an identity column. the reasons are twofold:
>
> 1) clustering ensures that all like values will be ordered together.
> if you apply this to a value that is gauranteed to be unique, all you
> are accomplishing is gauranteeing the physical order of the rows on
> disk, which would serve no real purpose (i.e. none of your apps
> should care what order the rows are on disk).
>
> in my opinion, you should reserve a clustered index for a column that
> has non-unique values, and is often used as a search criteria by
> value. that way, all of the rows with the same value are physically
> grouped together, greatly improving the performance of the disk read
> to collect all the data.
>
> 2) creating a clustered index on an identity column is very likely to
> create insert hotspots, because it is gauranteed that almost every new
> row will be written to the same datapage. if you have a large number
> of inserts occurring, this can create lock issues, if the current
> insertion page is locked for any reason from insertion. it's far
> healthier if your database can insert rows into a nicely dispersed set
> of data pages, to prevent issues of lock escalation impeding inserts.
>
> anyway, those are just my two cents. i'm sure other people have other
> opinions.

I would like to comment on some of what you wrote. Clustering on a
non-unique columns has a few implications.

- SQL Server cannot handle non-unique clustered keys internally and will
automatically add a UNIQUEIDENTIFIER to the rows when necessary. This
increases the key length by 4-bytes for each non-unique value. Using an
IDENTITY value, which is by definition unique, does not have this added
overhead.
- The clustered index key is the pointer into the actual data.
Therefore, the clustered index key is a part of each non-clustered index
key. This can make the non-clustered indexes grow considerably. Given
that using a non-unique set of columns is likely to include at least one
non-integer based column, you are looking at large keys all around. This
will slow down inserting and updating because of the need to hit each
non-clustered index. Clustered IDENTITY columns add the least amount of
overhead to non-clustered indexes.
- A non-unique clustered index key is likely to be more susceptible to
updating. Updating a clustered index key means physically moving the row
to a new location and this will likely cause page splitting, which is a
slow process that causes internal fragmentation. This will slow down
inserting and updating because of the need to locate/relocate the row
and the need to update each non-clustered index key. IDENTITY columns
cannot be updated and are immune to this problem.
- Insert hotspots aren't really a problem any longer like they were in
the old day of page locking. SQL Server uses row locking for inserts.
Keeping the disk heads from moving around can actually increase
performance because of the slow nature of the hard drive. Using an
IDENTITY keeps the disk heads in the same location and prevents page
splitting as new rows are added.
- A non-unique clustered index may require additional maintenance to
avoid page splitting. That is, you may have to implement index rebuilds
using a FILLFACTOR that leaves a certain amount of space available for
inserts on each page. The maintenance is ongoing and is best done during
off-hours; assuming your business has off-hours. This increases the
table size and number of pages and will hurt overall performance.
- Page splitting causes external fragmentation. That is, even if you use
a clustered index on a set of non-unique columns, there's no reason to
assume that pages that contain the same key will be contiguous. So a
read of two pages of a specific clustered index key can cause the disk
heads to move from one end of the table to another. This can be avoided
with some index maintenance. Clustered IDENTITY indexes do not have this
problem.

Having said that, there are reasons to use non-unique clustered indexes.
As you stated, they can, in fact, help the fetching of a number of rows
of a specific key. If your database is used primarily for this type of
operation, a non-unique clustered index may be the way to go. I've
certainly used them this way on many occasions.



--
David Gugick
Quest Software
www.imceda.com
www.quest.com
Author
9 Sep 2005 1:46 PM
jason
thanks for these caveats! very useful information. i think i will have
to see both of these implementations in practice before i can
intelligently choose which is best for the database model i'm
responsible for. it seems like there are virtues to each
implementation, based primarily on, unsurprisingly, the manner in which
the database will be primarily used.

other than waiting for performance hindrances in the applications which
consume this data, are there tools / metrics that i can use to monitor
some of the possible degredations you mention due to non-unique
clustered indexes? perhaps if i can watch the pro's and the con's in
action by way of quantifiable metrics, i'll make the most informed
decisions.

thanks again!

jason
Author
12 Sep 2005 3:24 PM
David Gugick
jason wrote:
Show quote
> thanks for these caveats! very useful information. i think i will have
> to see both of these implementations in practice before i can
> intelligently choose which is best for the database model i'm
> responsible for. it seems like there are virtues to each
> implementation, based primarily on, unsurprisingly, the manner in
> which the database will be primarily used.
>
> other than waiting for performance hindrances in the applications
> which consume this data, are there tools / metrics that i can use to
> monitor some of the possible degredations you mention due to
> non-unique clustered indexes? perhaps if i can watch the pro's and
> the con's in action by way of quantifiable metrics, i'll make the
> most informed decisions.
>
> thanks again!
>
> jason

You'll have to document the types of access your tables experience. For
example, what percentage of operations are inserts and updates vs.
selects? Are the PK key values ever updated? For the selects, do you
normally select a single row or select a number of rows? If you select a
number of rows, what columns are you filtering on? Once you have this
information about each table, you should be better able to determine the
best physical implementation for a PK as well as what other indexes are
required for best performance. I would encourage you to stay away from
any implementation that causes a good amount of page splitting as it's a
very slow operation on the already lowest performing appendage of the
database; the disk subsystem.

--
David Gugick
Quest Software
www.imceda.com
www.quest.com
Author
11 Sep 2005 6:56 PM
JXStern
On 8 Sep 2005 07:42:33 -0700, "jason" <iae***@yahoo.com> wrote:
>1) clustering ensures that all like values will be ordered together. if
>you apply this to a value that is gauranteed to be unique, all you are
>accomplishing is gauranteeing the physical order of the rows on disk,
>which would serve no real purpose (i.e. none of your apps should care
>what order the rows are on disk).

The app doesn't care but in some circumstances performance will be
better when records are physically adjacent.

>in my opinion, you should reserve a clustered index for a column that
>has non-unique values, and is often used as a search criteria by value.
>that way, all of the rows with the same value are physically grouped
>together, greatly improving the performance of the disk read to collect
>all the data.

As others have said, due to the internals of SQLServer, using
clustered indexes for nonunique values is inefficient.

>2) creating a clustered index on an identity column is very likely to
>create insert hotspots, because it is gauranteed that almost every new
>row will be written to the same datapage. if you have a large number of
>inserts occurring, this can create lock issues, if the current
>insertion page is locked for any reason from insertion. it's far
>healthier if your database can insert rows into a nicely dispersed set
>of data pages, to prevent issues of lock escalation impeding inserts.

Hotspots were very bad in SQL 6.0, still bad in 6.5, and much less of
a problem since 7.0 when row-locking was introduced.  Better an insert
"hotspot" then splitting a page every couple of inserts, which is one
of the more popular "alternatives".

Remember, Microsoft strongly recommends the use of clustered indexes
and clustered PKs particularly, and (for better or worse) SQLServer
has a number of implementation details that motivate this
recommendation.  Probably something like 50% of new databases are
designed with identity PKs - again, for better or worse.  Ask Joe
Celko his opinion!  And another, oh, maybe 49%, use GUIDs as clustered
PKs, which turns out not to be quite as absolutely horrible as it
looks at first glance, but again, it's not quite a Microsoft
requirement anymore for doing merge replication, but again, it's a
technique Microsoft recommends very strongly and which IMHO should be
used much, much more carefully than that.

J.

(ok, maybe it's not quite 50% and 49% for these two designs, but it
sure seems that way.  natural PKs?  Hah!)
Author
12 Sep 2005 2:14 PM
jason
hehe, funny you should mention CELKO, i just had an interesting debate
with him the other day about natural and artificial relational keys :)

thanks for the info of course. i still find it really odd that so many
(recent) microsoft sources recommend using clustered indexes on columns
that are grouped by frequently, or frequently part of a ranged query.
i'm hoping to implement both in a variety of circumstances so that i
can measure the differences more scientifically.

thanks again,

jason
Author
8 Sep 2005 2:58 PM
Alejandro Mesa
See if this helps:

Tips on Optimizing SQL Server Clustered Indexes
http://www.sql-server-performance.com/clustered_indexes.asp


AMB

Show quote
"Pradeep Kutty" wrote:

> Hi All,
>
> I have heard that if you create a clustered index on a identity field and
> its a PK, it reduces the page splits...
> But how?
>
> Thanks,
> Prad
>
>
>
Author
8 Sep 2005 5:27 PM
Aaron Bertrand [SQL Server MVP]
Can you fix your clock?


Show quote
"Pradeep Kutty" <prade***@healthasyst.com> wrote in message
news:eCh33%23HtFHA.616@TK2MSFTNGP11.phx.gbl...
> Hi All,
>
> I have heard that if you create a clustered index on a identity field and
> its a PK, it reduces the page splits...
> But how?
>
> Thanks,
> Prad
>

AddThis Social Bookmark Button