|
database
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
clustered index on a identity fieldHi All,
I have heard that if you create a clustered index on a identity field and its a PK, it reduces the page splits... But how? Thanks, Prad if you mean that the number of pages required to define a table will
grow more slowly, that is correct. but there are many reasons not to do this. in my experience, clustering any primary key is a pretty bad idea, especially an identity column. the reasons are twofold: 1) clustering ensures that all like values will be ordered together. if you apply this to a value that is gauranteed to be unique, all you are accomplishing is gauranteeing the physical order of the rows on disk, which would serve no real purpose (i.e. none of your apps should care what order the rows are on disk). in my opinion, you should reserve a clustered index for a column that has non-unique values, and is often used as a search criteria by value. that way, all of the rows with the same value are physically grouped together, greatly improving the performance of the disk read to collect all the data. 2) creating a clustered index on an identity column is very likely to create insert hotspots, because it is gauranteed that almost every new row will be written to the same datapage. if you have a large number of inserts occurring, this can create lock issues, if the current insertion page is locked for any reason from insertion. it's far healthier if your database can insert rows into a nicely dispersed set of data pages, to prevent issues of lock escalation impeding inserts. anyway, those are just my two cents. i'm sure other people have other opinions. I disagree.
The physical location on disk of index rows in a clustered index is undefined. There is no guarantee that the index pages of a clustered index reside in any specified order on the physical hardware. What is defined is that the rows within a specific leaf page are in order, and each clustered index leaf page contains next and previous pointers to locate the next and previous leaf pages. One of the main benefits of using a clustered index is that once the first row in an ordered scan is found, SQL Server doesn't need to walk the b-tree to find the next row. It can use the next pointer to continue the ordered scan. This can improve query performance considerably. You should avoid using a non-unique key in a clustered index. If a table has a clustered index, then every nonclustered index uses the clustered index key to locate the row in the table. If the key of a clustered index is not unique, then SQL Server adds a hidden 4-byte uniquifier to the key to make it unique. The entire key along with the uniqifier is then stored in every nonclustered index row. In addition, there is a significant degradation of join performance. If you put a nonclustered index on the primary key, then every join with related tables requires an additional lookup to satisfy the query. For each row, the index for the primary key constraint is consulted yielding the clustered index key. That key is then used to lookup the row in the clustered index. Therefore, if you have related tables, then the clustered index should live on the primary key. Your assertion about insert hotspots fails to mention the additional index maintenance (page splits and updates) that is required if your clustered index isn't on the IDENTITY column. If the clustered index is on an IDENTITY column, then there will never be any page splits, because inserts become appends, both in the b-tree index nodes and in the leaf nodes. You can also use a 100% fillfactor, because there will never be any inserts, only appends. The tree grows up: at each level of the b-tree, a new page is appended if it is needed, and if a new page is needed at the root level, then a new root page is also added. That's why inserts into a table with only a clustered index on the IDENTITY column perform almost as well as inserts into a table without any indexes at all. Deletes and updates in this case require minimal index maintenance, because SQL Server doesn't join index pages, it only splits them when an insert exceeds the fillfactor threshold. Deletes just leave holes, so if you expect a lot of deletes, then you should schedule a periodic DBCC INDEXDEFRAG or DBCC DBREINDEX. SQL Server uses lazy spooling, which means that changes to the database are cached and then periodically flushed to disk (checkpointed). (SQL Server only ensures that writes to the transaction log are flushed to disk before returning from a commit.) This means that more often than not, the last page is still resident in memory if there are a large number of inserts, thereby turning the insert hotspot into a performance improvement. Finally, unless otherwise forced, the exclusive locks applied by an insert should not escalate unless the page is full and all of the rows in the page were just inserted by the current transaction. Therefore, there is little if any lock contention due to inserts. Show quote "jason" <iae***@yahoo.com> wrote in message news:1126190553.831020.325290@f14g2000cwb.googlegroups.com... > if you mean that the number of pages required to define a table will > grow more slowly, that is correct. but there are many reasons not to do > this. in my experience, clustering any primary key is a pretty bad > idea, especially an identity column. the reasons are twofold: > > 1) clustering ensures that all like values will be ordered together. if > you apply this to a value that is gauranteed to be unique, all you are > accomplishing is gauranteeing the physical order of the rows on disk, > which would serve no real purpose (i.e. none of your apps should care > what order the rows are on disk). > > in my opinion, you should reserve a clustered index for a column that > has non-unique values, and is often used as a search criteria by value. > that way, all of the rows with the same value are physically grouped > together, greatly improving the performance of the disk read to collect > all the data. > > 2) creating a clustered index on an identity column is very likely to > create insert hotspots, because it is gauranteed that almost every new > row will be written to the same datapage. if you have a large number of > inserts occurring, this can create lock issues, if the current > insertion page is locked for any reason from insertion. it's far > healthier if your database can insert rows into a nicely dispersed set > of data pages, to prevent issues of lock escalation impeding inserts. > > anyway, those are just my two cents. i'm sure other people have other > opinions. > interesting, you clearly have some well informed opinions. they just
happen to disagree with some of the sources i've used to come to my understanding :) these things happen. i admit, my understanding of clustered indexing, and indexing in general, is incomplete, but here are some points that i thought i'd share, to see if perhaps there's an absolute answer: * from every text at my disposal, the primary advantage to a clustered index is the fact that the data is guaranteed to be in the same order as the index. * the leaf nodes of a clustered index don't actually point to the next and previous rows, because they don't have to. the next and previous values are implied by the next and previous positions on disk. * nonclustered indexes do point to next and previous data rows, because the index order may not match the physical order, so a pointer is required * the greatest advantages to using clustered indexes, therefore, are cases where the physical order of the data can increase your performance. chief among these are (1) columns commonly used in group by and order by clauses (2) columns on the many side of a one-to-many relationship. because there is a tremendous increase to the performance of the reads for these operations (3) columns whose values are frequently evaluated between a range * by the same reasoning, clustered indexes are wasted on unique value columns, as they serve no tangible purpose (except as you mention, perhaps ameliorating problems using identity columns without a clustered index) * using a clustered index on a unique value is wasted. the only reason is the default for new tables created in enterprise manager is because it grants advantages to some queries above a table that has zero clustered indexes (better one than none) here are my sources for this information: microsoft sql server 2000 bible, paul nielsen, pp768-771 sql server 2000 for experienced dba's, brian knight, pp252-255 http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql7/html/msdn_sql7perftune.asp http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql7/html/sql7sapr3.asp these sources cite specific examples of where and why you would want to use clustered indexes, and they are all on data that is grouped by, sorted by, or evaluated in a range. especially date fields (which are rarely unique, yes?) one caveat that is mentioned is that you shouldn't apply a clustered index to a column that is updated often, for the reason you stated, that the clustered index is appended to every nonclustered index in the table. but you cite this as a performance hindrance? my undrestanding is that the presence of the clustered index is actually increasing performance (if at a small cost of space) by skipping the pointer to data page step of a lookup. what are your sources, that we might compare / contrast? jason "jason" <iae***@yahoo.com> wrote in message And the clustered index requires one fewer IO per seek since the leaf is the news:1126206276.049357.153630@g43g2000cwa.googlegroups.com... > interesting, you clearly have some well informed opinions. they just > happen to disagree with some of the sources i've used to come to my > understanding :) these things happen. i admit, my understanding of > clustered indexing, and indexing in general, is incomplete, but here > are some points that i thought i'd share, to see if perhaps there's an > absolute answer: > > * from every text at my disposal, the primary advantage to a clustered > index is the fact that the data is guaranteed to be in the same order > as the index. > data page; plus seek performance of all non-clustered indexes requires 1-4 extra IO's. > * the leaf nodes of a clustered index don't actually point to the next No. The leaf nodes are data pages. All data pages in a table are arranged > and previous rows, because they don't have to. the next and previous > values are implied by the next and previous positions on disk. > in a doubly linked list: each page has a next page and previous page pointer. A table scan traverses the table by following the next/previous pointers from page to page. For a table with a clustered index the order of rows on a page and the order of pages determined by the next/previous pointers follow the order of the index. So after you traverse the index down to the leaf level once you can then follow the next/previous pointers to follow the index order. So you don't have to go back to the index pages to satisfy a range query: just seek to the first value and the scan to the end. A common misconception is that data pages in a clustered index are guaranteed to be physically ordered. They are not. Only the logical order of the doubly-linked list of data pages is guaranteed. Difference between the logical ordering of the pages and the physical order can arise over time due to index fragmentation. Show quote > * nonclustered indexes do point to next and previous data rows, because A row in a table with a clustered index can only be accessed through the > the index order may not match the physical order, so a pointer is > required > > * the greatest advantages to using clustered indexes, therefore, are > cases where the physical order of the data can increase your > performance. chief among these are (1) columns commonly used in group > by and order by clauses (2) columns on the many side of a one-to-many > relationship. because there is a tremendous increase to the performance > of the reads for these operations (3) columns whose values are > frequently evaluated between a range > > * by the same reasoning, clustered indexes are wasted on unique value > columns, as they serve no tangible purpose (except as you mention, > perhaps ameliorating problems using identity columns without a > clustered index) clustered index. There's simply no other way to find the row. If you have a non-clustered unique index, and a non-unique clustered index then accessing the table through the unique index will be more expensive since you must traverse the unique index only to get, at the leaf level, a non-unique clustered index key (plus 4-byte uniqifier), and then you must traverse the non-unique index down to the leaf level (data page) on which the row is located. There are situation where this is a net improvement in performance, but for applications which do primarily single-row insert/update/delete clustering the primary key optimizes the most important transaction. > This is not generally the case since choosing the clustered index always > * using a clustered index on a unique value is wasted. the only reason > is the default for new tables created in enterprise manager is because > it grants advantages to some queries above a table that has zero > clustered indexes (better one than none) optimizes some access paths and degrades others. You must analyze your workload and measure the impace of clustering a non-unique tuple. There are no valid sources for recieved wisdom or rules of thumb for this. Only knoledge of how SQL Server is implemented and testing. David > There are no valid sources for recieved wisdom or rules of thumb for this. I'm getting that impression, which is why discussions like these are> Only knoledge of how SQL Server is implemented and testing. extremely useful. They can supplement experiences you haven't had yet, and as you say, give a more complete picture of how SQL Server is implemented. So thank you very much! Jason Perhaps you should bin those books and open up BOL.
SQL Server Architecture/Database Architecture/Physical Database Architecture/Table and Index Architecture This section contains information about the physical structure of clustered indexes, nonclustered indexes, and heaps (tables without clustered indexes). Creating and Maintaining Databases/Indexes This section contains information about when to use clustered indexes, why you should keep the key small, etc. Show quote "jason" <iae***@yahoo.com> wrote in message http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql7/html/msdn_sql7perftune.aspnews:1126206276.049357.153630@g43g2000cwa.googlegroups.com... > interesting, you clearly have some well informed opinions. they just > happen to disagree with some of the sources i've used to come to my > understanding :) these things happen. i admit, my understanding of > clustered indexing, and indexing in general, is incomplete, but here > are some points that i thought i'd share, to see if perhaps there's an > absolute answer: > > * from every text at my disposal, the primary advantage to a clustered > index is the fact that the data is guaranteed to be in the same order > as the index. > > * the leaf nodes of a clustered index don't actually point to the next > and previous rows, because they don't have to. the next and previous > values are implied by the next and previous positions on disk. > > * nonclustered indexes do point to next and previous data rows, because > the index order may not match the physical order, so a pointer is > required > > * the greatest advantages to using clustered indexes, therefore, are > cases where the physical order of the data can increase your > performance. chief among these are (1) columns commonly used in group > by and order by clauses (2) columns on the many side of a one-to-many > relationship. because there is a tremendous increase to the performance > of the reads for these operations (3) columns whose values are > frequently evaluated between a range > > * by the same reasoning, clustered indexes are wasted on unique value > columns, as they serve no tangible purpose (except as you mention, > perhaps ameliorating problems using identity columns without a > clustered index) > > * using a clustered index on a unique value is wasted. the only reason > is the default for new tables created in enterprise manager is because > it grants advantages to some queries above a table that has zero > clustered indexes (better one than none) > > here are my sources for this information: > > microsoft sql server 2000 bible, paul nielsen, pp768-771 > > sql server 2000 for experienced dba's, brian knight, pp252-255 > > > http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql7/html/sql7sapr3.asp> Show quote > > these sources cite specific examples of where and why you would want to > use clustered indexes, and they are all on data that is grouped by, > sorted by, or evaluated in a range. especially date fields (which are > rarely unique, yes?) > > one caveat that is mentioned is that you shouldn't apply a clustered > index to a column that is updated often, for the reason you stated, > that the clustered index is appended to every nonclustered index in the > table. but you cite this as a performance hindrance? my undrestanding > is that the presence of the clustered index is actually increasing > performance (if at a small cost of space) by skipping the pointer to > data page step of a lookup. > > what are your sources, that we might compare / contrast? > > jason > I've just added that to my library. Though it seems shocking that four
resources (including one that cites BOL as one of its resources, and two direct microsoft articles) would be completely wrong. I suspect that, as David states in his versions of the thread, there are a fuzzy set of virtues to be implemented, each more or less advantageous in certain situations. Still, good to know as many of those implentations and situations as possible, so thanks for directing me to your sources! Jason jason wrote:
Show quote > if you mean that the number of pages required to define a table will I would like to comment on some of what you wrote. Clustering on a > grow more slowly, that is correct. but there are many reasons not to > do this. in my experience, clustering any primary key is a pretty bad > idea, especially an identity column. the reasons are twofold: > > 1) clustering ensures that all like values will be ordered together. > if you apply this to a value that is gauranteed to be unique, all you > are accomplishing is gauranteeing the physical order of the rows on > disk, which would serve no real purpose (i.e. none of your apps > should care what order the rows are on disk). > > in my opinion, you should reserve a clustered index for a column that > has non-unique values, and is often used as a search criteria by > value. that way, all of the rows with the same value are physically > grouped together, greatly improving the performance of the disk read > to collect all the data. > > 2) creating a clustered index on an identity column is very likely to > create insert hotspots, because it is gauranteed that almost every new > row will be written to the same datapage. if you have a large number > of inserts occurring, this can create lock issues, if the current > insertion page is locked for any reason from insertion. it's far > healthier if your database can insert rows into a nicely dispersed set > of data pages, to prevent issues of lock escalation impeding inserts. > > anyway, those are just my two cents. i'm sure other people have other > opinions. non-unique columns has a few implications. - SQL Server cannot handle non-unique clustered keys internally and will automatically add a UNIQUEIDENTIFIER to the rows when necessary. This increases the key length by 4-bytes for each non-unique value. Using an IDENTITY value, which is by definition unique, does not have this added overhead. - The clustered index key is the pointer into the actual data. Therefore, the clustered index key is a part of each non-clustered index key. This can make the non-clustered indexes grow considerably. Given that using a non-unique set of columns is likely to include at least one non-integer based column, you are looking at large keys all around. This will slow down inserting and updating because of the need to hit each non-clustered index. Clustered IDENTITY columns add the least amount of overhead to non-clustered indexes. - A non-unique clustered index key is likely to be more susceptible to updating. Updating a clustered index key means physically moving the row to a new location and this will likely cause page splitting, which is a slow process that causes internal fragmentation. This will slow down inserting and updating because of the need to locate/relocate the row and the need to update each non-clustered index key. IDENTITY columns cannot be updated and are immune to this problem. - Insert hotspots aren't really a problem any longer like they were in the old day of page locking. SQL Server uses row locking for inserts. Keeping the disk heads from moving around can actually increase performance because of the slow nature of the hard drive. Using an IDENTITY keeps the disk heads in the same location and prevents page splitting as new rows are added. - A non-unique clustered index may require additional maintenance to avoid page splitting. That is, you may have to implement index rebuilds using a FILLFACTOR that leaves a certain amount of space available for inserts on each page. The maintenance is ongoing and is best done during off-hours; assuming your business has off-hours. This increases the table size and number of pages and will hurt overall performance. - Page splitting causes external fragmentation. That is, even if you use a clustered index on a set of non-unique columns, there's no reason to assume that pages that contain the same key will be contiguous. So a read of two pages of a specific clustered index key can cause the disk heads to move from one end of the table to another. This can be avoided with some index maintenance. Clustered IDENTITY indexes do not have this problem. Having said that, there are reasons to use non-unique clustered indexes. As you stated, they can, in fact, help the fetching of a number of rows of a specific key. If your database is used primarily for this type of operation, a non-unique clustered index may be the way to go. I've certainly used them this way on many occasions. thanks for these caveats! very useful information. i think i will have
to see both of these implementations in practice before i can intelligently choose which is best for the database model i'm responsible for. it seems like there are virtues to each implementation, based primarily on, unsurprisingly, the manner in which the database will be primarily used. other than waiting for performance hindrances in the applications which consume this data, are there tools / metrics that i can use to monitor some of the possible degredations you mention due to non-unique clustered indexes? perhaps if i can watch the pro's and the con's in action by way of quantifiable metrics, i'll make the most informed decisions. thanks again! jason jason wrote:
Show quote > thanks for these caveats! very useful information. i think i will have You'll have to document the types of access your tables experience. For > to see both of these implementations in practice before i can > intelligently choose which is best for the database model i'm > responsible for. it seems like there are virtues to each > implementation, based primarily on, unsurprisingly, the manner in > which the database will be primarily used. > > other than waiting for performance hindrances in the applications > which consume this data, are there tools / metrics that i can use to > monitor some of the possible degredations you mention due to > non-unique clustered indexes? perhaps if i can watch the pro's and > the con's in action by way of quantifiable metrics, i'll make the > most informed decisions. > > thanks again! > > jason example, what percentage of operations are inserts and updates vs. selects? Are the PK key values ever updated? For the selects, do you normally select a single row or select a number of rows? If you select a number of rows, what columns are you filtering on? Once you have this information about each table, you should be better able to determine the best physical implementation for a PK as well as what other indexes are required for best performance. I would encourage you to stay away from any implementation that causes a good amount of page splitting as it's a very slow operation on the already lowest performing appendage of the database; the disk subsystem. On 8 Sep 2005 07:42:33 -0700, "jason" <iae***@yahoo.com> wrote: The app doesn't care but in some circumstances performance will be>1) clustering ensures that all like values will be ordered together. if >you apply this to a value that is gauranteed to be unique, all you are >accomplishing is gauranteeing the physical order of the rows on disk, >which would serve no real purpose (i.e. none of your apps should care >what order the rows are on disk). better when records are physically adjacent. >in my opinion, you should reserve a clustered index for a column that As others have said, due to the internals of SQLServer, using>has non-unique values, and is often used as a search criteria by value. >that way, all of the rows with the same value are physically grouped >together, greatly improving the performance of the disk read to collect >all the data. clustered indexes for nonunique values is inefficient. >2) creating a clustered index on an identity column is very likely to Hotspots were very bad in SQL 6.0, still bad in 6.5, and much less of>create insert hotspots, because it is gauranteed that almost every new >row will be written to the same datapage. if you have a large number of >inserts occurring, this can create lock issues, if the current >insertion page is locked for any reason from insertion. it's far >healthier if your database can insert rows into a nicely dispersed set >of data pages, to prevent issues of lock escalation impeding inserts. a problem since 7.0 when row-locking was introduced. Better an insert "hotspot" then splitting a page every couple of inserts, which is one of the more popular "alternatives". Remember, Microsoft strongly recommends the use of clustered indexes and clustered PKs particularly, and (for better or worse) SQLServer has a number of implementation details that motivate this recommendation. Probably something like 50% of new databases are designed with identity PKs - again, for better or worse. Ask Joe Celko his opinion! And another, oh, maybe 49%, use GUIDs as clustered PKs, which turns out not to be quite as absolutely horrible as it looks at first glance, but again, it's not quite a Microsoft requirement anymore for doing merge replication, but again, it's a technique Microsoft recommends very strongly and which IMHO should be used much, much more carefully than that. J. (ok, maybe it's not quite 50% and 49% for these two designs, but it sure seems that way. natural PKs? Hah!) hehe, funny you should mention CELKO, i just had an interesting debate
with him the other day about natural and artificial relational keys :) thanks for the info of course. i still find it really odd that so many (recent) microsoft sources recommend using clustered indexes on columns that are grouped by frequently, or frequently part of a ranged query. i'm hoping to implement both in a variety of circumstances so that i can measure the differences more scientifically. thanks again, jason See if this helps:
Tips on Optimizing SQL Server Clustered Indexes http://www.sql-server-performance.com/clustered_indexes.asp AMB Show quote "Pradeep Kutty" wrote: > Hi All, > > I have heard that if you create a clustered index on a identity field and > its a PK, it reduces the page splits... > But how? > > Thanks, > Prad > > > Can you fix your clock?
Show quote "Pradeep Kutty" <prade***@healthasyst.com> wrote in message news:eCh33%23HtFHA.616@TK2MSFTNGP11.phx.gbl... > Hi All, > > I have heard that if you create a clustered index on a identity field and > its a PK, it reduces the page splits... > But how? > > Thanks, > Prad > |
|||||||||||||||||||||||