|
database
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
Why is Count(*) slow?Reading the above thread about "Count" got me thinking about why it's
generally considered bad practice to use Count(*) (or Count anything come to think of it). Considering that rows are "fixed length", surely it's easy for the database to count all the rows in a table without scanning the whole table? How does it work under the hood so to speak, as to make the case of counting the rows in a table so slow? It does an index scan if an index exists. On a large table, this can be
very expensive. On a heap, it can be even worse because of the I/O required to do a scan... though I'm sure you don't have any heaps, right? I am not sure why speed is a concern. How often are you blindly throwing "SELECT COUNT(*) FROM MegaTable" at the database? sysindexes is a better place to get this information if you are more worried about speed than accuracy. And you can improve accuracy by running DBCC UPDATEUSAGE regularly. Show quote "Robin Tucker" <idontwanttobespammedanymore@reallyidont.com> wrote in message news:daluao$bsj$1$8302bc10@news.demon.co.uk... > Reading the above thread about "Count" got me thinking about why it's > generally considered bad practice to use Count(*) (or Count anything come > to think of it). Considering that rows are "fixed length", surely it's > easy for the database to count all the rows in a table without scanning > the whole table? > > How does it work under the hood so to speak, as to make the case of > counting the rows in a table so slow? > > > I often do "SELECT COUNT(*) From MegaTable WHERE Condition1, ...n"
If it really is a huge table & I really query it often (I'm thinking of one place I query every 30 seconds and the table exceeds 1,000,000 records) I create an index that covers the conditions in my WHERE clause. Is there a better way to do this? dwilson Show quote "Aaron Bertrand [SQL Server MVP]" <ten.xoc@dnartreb.noraa> wrote in message news:ezYCOC8gFHA.2060@TK2MSFTNGP10.phx.gbl... > It does an index scan if an index exists. On a large table, this can be > very expensive. On a heap, it can be even worse because of the I/O required > to do a scan... though I'm sure you don't have any heaps, right? > > I am not sure why speed is a concern. How often are you blindly throwing > "SELECT COUNT(*) FROM MegaTable" at the database? > > sysindexes is a better place to get this information if you are more worried > about speed than accuracy. And you can improve accuracy by running DBCC > UPDATEUSAGE regularly. > > > > > "Robin Tucker" <idontwanttobespammedanymore@reallyidont.com> wrote in > message news:daluao$bsj$1$8302bc10@news.demon.co.uk... > > Reading the above thread about "Count" got me thinking about why it's > > generally considered bad practice to use Count(*) (or Count anything come > > to think of it). Considering that rows are "fixed length", surely it's > > easy for the database to count all the rows in a table without scanning > > the whole table? > > > > How does it work under the hood so to speak, as to make the case of > > counting the rows in a table so slow? > > > > > > > > Hi
COUNT(*) has to physically look at each row in a table and count it. A Row does not match a page, as multiple rows can exist on a page so there is no quick way to count it. "Inside SQL Server 2000" by Kalen Delaney is a excellent reference on internals. Regards -------------------------------- Mike Epprecht, Microsoft SQL Server MVP Zurich, Switzerland MVP Program: http://www.microsoft.com/mvp Blog: http://www.msmvps.com/epprecht/ Show quote "Robin Tucker" wrote: > Reading the above thread about "Count" got me thinking about why it's > generally considered bad practice to use Count(*) (or Count anything come to > think of it). Considering that rows are "fixed length", surely it's easy > for the database to count all the rows in a table without scanning the whole > table? > > How does it work under the hood so to speak, as to make the case of counting > the rows in a table so slow? > > > > Robin,
If you are thinking that the table is stored like an array, and the length can be divided by the size of an element, you are misguided here. To store data this way would be terribly inefficient. For example, to delete 1 row from a 1,000,000-row table would require moving the data from 500,000 rows on average. Instead, SQL Server deletes the target row from its data page only, and only by inspecting each data page can the query processor determine how many rows are on each page, and therefore in the table. The storage model for a SQL Server table is much different from that of an array. COUNT(*) will be evaluated using the leaf level of the narrowest table index, but each data page of that index must be considered. If it is important to evaluate COUNT(*) efficiently, you can create an indexed view to maintain the row count. See http://groups.google.co.uk/groups?q=AF626850-1EB2-4886-959A-040461F99ACD for an example. Steve Kass Drew University Robin Tucker wrote: Show quote >Reading the above thread about "Count" got me thinking about why it's >generally considered bad practice to use Count(*) (or Count anything come to >think of it). Considering that rows are "fixed length", surely it's easy >for the database to count all the rows in a table without scanning the whole >table? > >How does it work under the hood so to speak, as to make the case of counting >the rows in a table so slow? > > > > > Ahhhh, yes thats it.
I visualise a table as a large array which is totally wrong. (I don't use Count (*) FROM GiganticTable by the way in any of my production code, I was just curious to know why I avoid it!). Show quote "Steve Kass" <sk***@drew.edu> wrote in message news:uAUCVH8gFHA.2072@TK2MSFTNGP14.phx.gbl... > Robin, > > If you are thinking that the table is stored like an > array, and the length can be divided by the size of > an element, you are misguided here. To store data > this way would be terribly inefficient. For example, > to delete 1 row from a 1,000,000-row table would > require moving the data from 500,000 rows on > average. Instead, SQL Server deletes the target > row from its data page only, and only by inspecting > each data page can the query processor determine > how many rows are on each page, and therefore in > the table. > > The storage model for a SQL Server table is much > different from that of an array. COUNT(*) will be > evaluated using the leaf level of the narrowest table index, > but each data page of that index must be considered. > > If it is important to evaluate COUNT(*) efficiently, you > can create an indexed view to maintain the row count. > > See > http://groups.google.co.uk/groups?q=AF626850-1EB2-4886-959A-040461F99ACD > for an example. > > Steve Kass > Drew University > > > Robin Tucker wrote: > >>Reading the above thread about "Count" got me thinking about why it's >>generally considered bad practice to use Count(*) (or Count anything come >>to think of it). Considering that rows are "fixed length", surely it's >>easy for the database to count all the rows in a table without scanning >>the whole table? >> >>How does it work under the hood so to speak, as to make the case of >>counting the rows in a table so slow? >> >> >> >> It is probably best not to design an application that depends on making
frequent record counts. If there is an identity column on the table, if the identity started at 1, and if there are no deletes from the table, then querying the maximum value of this column should give you a quick and correct count. If the identity started at something other than 1 or if there are gaps, then you can make whatever + / - adjustment is needed to the result for a somewhat accurate count. Show quote "Robin Tucker" <idontwanttobespammedanymore@reallyidont.com> wrote in message news:daluao$bsj$1$8302bc10@news.demon.co.uk... > Reading the above thread about "Count" got me thinking about why it's > generally considered bad practice to use Count(*) (or Count anything come to > think of it). Considering that rows are "fixed length", surely it's easy > for the database to count all the rows in a table without scanning the whole > table? > > How does it work under the hood so to speak, as to make the case of counting > the rows in a table so slow? > > > That, IMO, would be a extraordinarily bad use of the Identity column. It is far
too easy to create gaps in your sequence and not realize that gaps exist. A better way would be to have a small table and a couple of triggers that update the count into this table on each insert or delete. I suppose my first recommendation would be better indexing, second would be the indexed view and last would be the trigger idea. Thomas Show quote "JT" <some***@microsoft.com> wrote in message news:%233%23b%23R9gFHA.1468@TK2MSFTNGP14.phx.gbl... > It is probably best not to design an application that depends on making > frequent record counts. If there is an identity column on the table, if the > identity started at 1, and if there are no deletes from the table, then > querying the maximum value of this column should give you a quick and > correct count. If the identity started at something other than 1 or if there > are gaps, then you can make whatever + / - adjustment is needed to the > result for a somewhat accurate count. > > "Robin Tucker" <idontwanttobespammedanymore@reallyidont.com> wrote in > message news:daluao$bsj$1$8302bc10@news.demon.co.uk... >> Reading the above thread about "Count" got me thinking about why it's >> generally considered bad practice to use Count(*) (or Count anything come > to >> think of it). Considering that rows are "fixed length", surely it's easy >> for the database to count all the rows in a table without scanning the > whole >> table? >> >> How does it work under the hood so to speak, as to make the case of > counting >> the rows in a table so slow? >> >> >> > > > better way would be to have a small table and a couple of triggers that Or better yet, to avoid bogging down OLTP operations and transactions, a > update the count into this table on each insert or delete. scheduled job that polls the table periodically and grabs a count (optionally with nolock), or updates usage and uses sysindexes. If it is a high activity database, this will be less work than maintaining counts for every operation, and while on a less busy system you will take counts more often than you may need, it will not interfere with any of your application code (and will not require triggers). A Agreed. After I wrote the trigger idea I thought about and realized I wouldn't
want to do it that way. Either way, using an Identity is definitely not the way to go. Thomas Show quote "Aaron Bertrand [SQL Server MVP]" <ten.xoc@dnartreb.noraa> wrote in message news:e6YSNx9gFHA.2632@TK2MSFTNGP09.phx.gbl... >> better way would be to have a small table and a couple of triggers that >> update the count into this table on each insert or delete. > > Or better yet, to avoid bogging down OLTP operations and transactions, a > scheduled job that polls the table periodically and grabs a count (optionally > with nolock), or updates usage and uses sysindexes. If it is a high activity > database, this will be less work than maintaining counts for every operation, > and while on a less busy system you will take counts more often than you may > need, it will not interfere with any of your application code (and will not > require triggers). > > A > Yes, in an OLTP system there can be gaps in identity values due to aborted
transactions, deletions, etc. and this would make it unreliable. In the case where I used max(id) to get a count of rows, it is part of a datawarehousing ETL process where a text file is bulk copied into a new table, and a count of the records inserted is needed for inclusion on a report. In this specific situation, I've found the count to be consistently accurate. Show quote "JT" <some***@microsoft.com> wrote in message news:%233%23b%23R9gFHA.1468@TK2MSFTNGP14.phx.gbl... > It is probably best not to design an application that depends on making > frequent record counts. If there is an identity column on the table, if the > identity started at 1, and if there are no deletes from the table, then > querying the maximum value of this column should give you a quick and > correct count. If the identity started at something other than 1 or if there > are gaps, then you can make whatever + / - adjustment is needed to the > result for a somewhat accurate count. > > "Robin Tucker" <idontwanttobespammedanymore@reallyidont.com> wrote in > message news:daluao$bsj$1$8302bc10@news.demon.co.uk... > > Reading the above thread about "Count" got me thinking about why it's > > generally considered bad practice to use Count(*) (or Count anything come > to > > think of it). Considering that rows are "fixed length", surely it's easy > > for the database to count all the rows in a table without scanning the > whole > > table? > > > > How does it work under the hood so to speak, as to make the case of > counting > > the rows in a table so slow? > > > > > > > > > In the case where I used max(id) to get a count of rows, it is part of a So why not do SELECT @@ROWCOUNT right after the BULK INSERT statement? > datawarehousing ETL process where a text file is bulk copied into a new > table, and a count of the records inserted is needed Adding a surrogate key just to keep track of rowcounts is not a very efficient way to do this... The ETL is performed by a Visual Basic application using the command line
version of the bulk copy program, so I don't have access to @@rowcount. The identity column is also the table's primary key. Record count is always instantaneous and accurate. Show quote "Aaron Bertrand [SQL Server MVP]" <ten.xoc@dnartreb.noraa> wrote in message news:OjoxO$%23gFHA.2916@TK2MSFTNGP14.phx.gbl... > > In the case where I used max(id) to get a count of rows, it is part of a > > datawarehousing ETL process where a text file is bulk copied into a new > > table, and a count of the records inserted is needed > > So why not do SELECT @@ROWCOUNT right after the BULK INSERT statement? > Adding a surrogate key just to keep track of rowcounts is not a very > efficient way to do this... > > You'll never need to do a select count(*) again. I've been looking at the
sys table very closely over the last few weeks. There is actually a field in one of the systables that contains the number of rows in the table!!!!!!! Talk about great features hidden in the back of sql server! So how do you get the count? And I'll bet that it'll be quicker than practically every select count(*) no matter how many indexes you create. Here's a stored procedures to do it.... (you'll need to have the stored procedure in the same database at the table, but I'm sure that someone will alter this to work anywhere) Hey, MS how about shipping that version in the master database in SQL 2005? Maybe call it sp_rowcount Ceate Procedure ProcRowCount @TableName sysname As Select si.rows From sysindexes si Join sysobjects so On so.id = si.id Where si.indid <= 1 and so.name = @TableName Go I've not tested this version of the stored procedure, it should work though. You could change this so that it's a UDF, that way you can use it just like select count(*). Show quote "Robin Tucker" <idontwanttobespammedanymore@reallyidont.com> wrote in message news:daluao$bsj$1$8302bc10@news.demon.co.uk... > Reading the above thread about "Count" got me thinking about why it's > generally considered bad practice to use Count(*) (or Count anything come > to think of it). Considering that rows are "fixed length", surely it's > easy for the database to count all the rows in a table without scanning > the whole table? > > How does it work under the hood so to speak, as to make the case of > counting the rows in a table so slow? > > > Hi
Sysindexes is not guaranteed to be up to date. In the trade off for performance, some system tables are only updated at a later stage. There have been many discussions on this in these groups. Regards -------------------------------- Mike Epprecht, Microsoft SQL Server MVP Zurich, Switzerland IM: m***@epprecht.net MVP Program: http://www.microsoft.com/mvp Blog: http://www.msmvps.com/epprecht/ Show quote "Colin Dawson" <newsgro***@cjdawson.com> wrote in message news:kwWze.66356$G8.5246@text.news.blueyonder.co.uk... > You'll never need to do a select count(*) again. I've been looking at the > sys table very closely over the last few weeks. There is actually a > field in one of the systables that contains the number of rows in the > table!!!!!!! Talk about great features hidden in the back of sql server! > > So how do you get the count? And I'll bet that it'll be quicker than > practically every select count(*) no matter how many indexes you create. > > Here's a stored procedures to do it.... (you'll need to have the stored > procedure in the same database at the table, but I'm sure that someone > will alter this to work anywhere) Hey, MS how about shipping that > version in the master database in SQL 2005? Maybe call it sp_rowcount > > Ceate Procedure ProcRowCount > @TableName sysname > As > Select > si.rows > From sysindexes si > Join sysobjects so On so.id = si.id > Where si.indid <= 1 > and so.name = @TableName > Go > > I've not tested this version of the stored procedure, it should work > though. You could change this so that it's a UDF, that way you can use it > just like select count(*). > > > > "Robin Tucker" <idontwanttobespammedanymore@reallyidont.com> wrote in > message news:daluao$bsj$1$8302bc10@news.demon.co.uk... >> Reading the above thread about "Count" got me thinking about why it's >> generally considered bad practice to use Count(*) (or Count anything come >> to think of it). Considering that rows are "fixed length", surely it's >> easy for the database to count all the rows in a table without scanning >> the whole table? >> >> How does it work under the hood so to speak, as to make the case of >> counting the rows in a table so slow? >> >> >> > > > You'll never need to do a select count(*) again. I've been looking at the Yes, however it is only updated periodically. So, to rely on a row count > sys table very closely over the last few weeks. There is actually a > field in one of the systables that contains the number of rows in the > table!!!!!!! for any table, you will have to do DBCC UPDATEUSAGE first. A |
|||||||||||||||||||||||