Home All Groups Group Topic Archive Search About

Single 5 mil row update or many 1000 row updates?

Author
13 Aug 2005 3:49 PM
Richard C
Hi,

I'm looking for some feedback as to which of these two approaches to a
problem is better.  We're dealing with 5 million records of 100 million to be
updated, and both plans come up as clustered primary key scanning. On the one
hand, a DBA has come up with:

(the primary key of sourcetable is a triplet of key1 varchar(40), key2
char(1), key3 int)

* Option 1 *
create table #temp... (varchar(40), char(1), int, bit)
insert into #temp select <keyfields: varchar, char, int>, case processed
when 'Y' then 1 else 0 end from sourcetable
where date between x and y
create clustered index icx_temp on #temp (key1, key2, key3)
update sourcetable set processed = 'Y'
from sourcetable (paglock), #temp
where sourcetable.key1 = #temp.key1 and ... key2... key3
and .... <other conditions>

A developer has come up with

* Option 2 *
declare @key1of3 varchar(40) set @key1of3 = ''
declare @key2of3 char(1) set @key2of3 = ''
declare @key3of3 int set @key3of3 = 0
select <keyfields: varchar, char, int> from sourcetable into #temp where 1=0
alter table #temp add constraint primary key (key1, key2, key3)
while (1=1) begin
  insert into #temp select top 1000 <keyfields: varchar, char, int>
  from sourcetable
  where date between x and y and .... <other conditions>
  and ((key1 > @key1of3)
    or (key1 = @key1of3 and key2 >= @key2of3)
    or (key1 = @key1of3 and key2 = @key2of3 and key3 > @key3of3))
  order by key1, key2, key3
  update sourcetable set processed = 'Y'
  from sourcetable, #temp
  where sourcetable.key1 = #temp.key1 and ... key2... key3
  if @rowcount = 0 break
  select top 1 @key1of3 = key1, @key2of3 = key2, @key3of3 = key3
  from #temp order by key1 desc, key2 desc, key3 desc
  truncate table #temp
end

If you can't already tell, the purpose is to update all records in
sourcetable to Processed = 'Y' where it meets certain conditions, including a
date range. The date range is big enough that it will hit 5 million to be
updated, from a total table size of 100 million records. Planning on both
reveal a traversal of the clustered primary key (key1 varchar(40), key2
char(1), key3 int) in both cases.

Questions:

1. Which of these two will normally be faster? And if one is faster, by how
much, i.e. marginally, significantly, about 10 times faster?

2. Three (3) ways of creating a temp table - which of these is best, and
why? Or is it case by case basis?
a) create table #temp .... <explicit create, explicit column types>
b) select <fields> into #temp from ... where 1=0 <explicit create, implicit
column types>; followed by a separate insert
c) select <fields> into #temp from ..... <implicit create, implicit column
types>
I've heard that SQL Server 6.5 locks sysobjects and syscolumns in tempdb
when using option 3, but I've done some tests and can see no problem in SQL
Server 2000 - comments?

3) In the DBA's option, will the PAGLOCKs be obtained and released as the
updates occur, i.e. only one PAGELOCK will be active at a time, or will it
issue enough PAGELOCKs to cover all the 5 million rows to be updated - in one
single implicit transaction? More importantly, will updates to the table be
able to occur during the UPDATE phase?

4) Given that option 1 takes 1.5 hrs to complete the task, and depending on
the answers to (1) and (3), will you recommend option 1 or 2?

5) If your answer was option 2 in (4), if the date range were decreased to a
single day, i.e. a day with 300,000 records to update, option 1 has the
advantage of using a cross of two indexes (anotherdatefield, processed) and
(date) instead of the clustered primary key scan. Therefore this reduces time
in the SELECT step. Option 2 is tied to incrementally scanning on the
clustered primary key. Option 1 will finish in 4 minutes - does this change
your answer to (4)?

If you cannot understand the above, please do not attempt to answer! If it's
my expressive ability, I apologize.

Thanks,
Richard

Author
13 Aug 2005 4:22 PM
David Browne
Show quote
"Richard C" <Richard C@discussions.microsoft.com> wrote in message
news:23A0024B-B2E7-433C-86A1-A397A0631DC0@microsoft.com...
> Hi,
>
> I'm looking for some feedback as to which of these two approaches to a
> problem is better.  We're dealing with 5 million records of 100 million to
> be
> updated, and both plans come up as clustered primary key scanning. On the
> one
> hand, a DBA has come up with:
.. . .
>
> 1. Which of these two will normally be faster? And if one is faster, by
> how
> much, i.e. marginally, significantly, about 10 times faster?
>

A single bulk operation is both faster and easier to manage.  Only break a
single logical operation into multiple batches when foreced to do so.
Things which might force you to process in batches might be running out of
log space or unacceptable lock wait times for other sessions.

David
Author
13 Aug 2005 5:45 PM
Brian Selzer
You should be careful when splitting any single update into several separate
updates.  You lose the protection afforded by the implicit transaction
within which each update statement operates.  There must be no
interdependence between the data in the updated rows--key value changes,
etc.  There must also be a mechanism to restart the job so that it can pick
up where it left off.  If a failure occurs during a single update, then all
work performed by that update is rolled back.  If you split up the updates,
you must build in a recovery mechanism in case one or more of the updates
fail.

Since you lose the protection in either of the cases below, why not simply
use SET ROWCOUNT ?
For example:

SET ROWCOUNT 1000
WHILE 1=1
BEGIN
    UPDATE SET Processed = 'Y' WHERE Processed != 'Y' -- add other
conditions here
    IF @@ROWCOUNT = 0 BREAK
END

This is a much more readable solution, can easily be recovered from, and
will perform better than either of the solutions below because no temp
tables are involved.

Show quote
"Richard C" <Richard C@discussions.microsoft.com> wrote in message
news:23A0024B-B2E7-433C-86A1-A397A0631DC0@microsoft.com...
> Hi,
>
> I'm looking for some feedback as to which of these two approaches to a
> problem is better.  We're dealing with 5 million records of 100 million to
be
> updated, and both plans come up as clustered primary key scanning. On the
one
> hand, a DBA has come up with:
>
> (the primary key of sourcetable is a triplet of key1 varchar(40), key2
> char(1), key3 int)
>
> * Option 1 *
> create table #temp... (varchar(40), char(1), int, bit)
> insert into #temp select <keyfields: varchar, char, int>, case processed
> when 'Y' then 1 else 0 end from sourcetable
> where date between x and y
> create clustered index icx_temp on #temp (key1, key2, key3)
> update sourcetable set processed = 'Y'
> from sourcetable (paglock), #temp
> where sourcetable.key1 = #temp.key1 and ... key2... key3
> and .... <other conditions>
>
> A developer has come up with
>
> * Option 2 *
> declare @key1of3 varchar(40) set @key1of3 = ''
> declare @key2of3 char(1) set @key2of3 = ''
> declare @key3of3 int set @key3of3 = 0
> select <keyfields: varchar, char, int> from sourcetable into #temp where
1=0
> alter table #temp add constraint primary key (key1, key2, key3)
> while (1=1) begin
>   insert into #temp select top 1000 <keyfields: varchar, char, int>
>   from sourcetable
>   where date between x and y and .... <other conditions>
>   and ((key1 > @key1of3)
>     or (key1 = @key1of3 and key2 >= @key2of3)
>     or (key1 = @key1of3 and key2 = @key2of3 and key3 > @key3of3))
>   order by key1, key2, key3
>   update sourcetable set processed = 'Y'
>   from sourcetable, #temp
>   where sourcetable.key1 = #temp.key1 and ... key2... key3
>   if @rowcount = 0 break
>   select top 1 @key1of3 = key1, @key2of3 = key2, @key3of3 = key3
>   from #temp order by key1 desc, key2 desc, key3 desc
>   truncate table #temp
> end
>
> If you can't already tell, the purpose is to update all records in
> sourcetable to Processed = 'Y' where it meets certain conditions,
including a
> date range. The date range is big enough that it will hit 5 million to be
> updated, from a total table size of 100 million records. Planning on both
> reveal a traversal of the clustered primary key (key1 varchar(40), key2
> char(1), key3 int) in both cases.
>
> Questions:
>
> 1. Which of these two will normally be faster? And if one is faster, by
how
> much, i.e. marginally, significantly, about 10 times faster?
>
> 2. Three (3) ways of creating a temp table - which of these is best, and
> why? Or is it case by case basis?
>  a) create table #temp .... <explicit create, explicit column types>
>  b) select <fields> into #temp from ... where 1=0 <explicit create,
implicit
> column types>; followed by a separate insert
>  c) select <fields> into #temp from ..... <implicit create, implicit
column
> types>
> I've heard that SQL Server 6.5 locks sysobjects and syscolumns in tempdb
> when using option 3, but I've done some tests and can see no problem in
SQL
> Server 2000 - comments?
>
> 3) In the DBA's option, will the PAGLOCKs be obtained and released as the
> updates occur, i.e. only one PAGELOCK will be active at a time, or will it
> issue enough PAGELOCKs to cover all the 5 million rows to be updated - in
one
> single implicit transaction? More importantly, will updates to the table
be
> able to occur during the UPDATE phase?
>
> 4) Given that option 1 takes 1.5 hrs to complete the task, and depending
on
> the answers to (1) and (3), will you recommend option 1 or 2?
>
> 5) If your answer was option 2 in (4), if the date range were decreased to
a
> single day, i.e. a day with 300,000 records to update, option 1 has the
> advantage of using a cross of two indexes (anotherdatefield, processed)
and
> (date) instead of the clustered primary key scan. Therefore this reduces
time
> in the SELECT step. Option 2 is tied to incrementally scanning on the
> clustered primary key. Option 1 will finish in 4 minutes - does this
change
> your answer to (4)?
>
> If you cannot understand the above, please do not attempt to answer! If
it's
> my expressive ability, I apologize.
>
> Thanks,
> Richard
>

AddThis Social Bookmark Button