|
database
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
Single 5 mil row update or many 1000 row updates?I'm looking for some feedback as to which of these two approaches to a problem is better. We're dealing with 5 million records of 100 million to be updated, and both plans come up as clustered primary key scanning. On the one hand, a DBA has come up with: (the primary key of sourcetable is a triplet of key1 varchar(40), key2 char(1), key3 int) * Option 1 * create table #temp... (varchar(40), char(1), int, bit) insert into #temp select <keyfields: varchar, char, int>, case processed when 'Y' then 1 else 0 end from sourcetable where date between x and y create clustered index icx_temp on #temp (key1, key2, key3) update sourcetable set processed = 'Y' from sourcetable (paglock), #temp where sourcetable.key1 = #temp.key1 and ... key2... key3 and .... <other conditions> A developer has come up with * Option 2 * declare @key1of3 varchar(40) set @key1of3 = '' declare @key2of3 char(1) set @key2of3 = '' declare @key3of3 int set @key3of3 = 0 select <keyfields: varchar, char, int> from sourcetable into #temp where 1=0 alter table #temp add constraint primary key (key1, key2, key3) while (1=1) begin insert into #temp select top 1000 <keyfields: varchar, char, int> from sourcetable where date between x and y and .... <other conditions> and ((key1 > @key1of3) or (key1 = @key1of3 and key2 >= @key2of3) or (key1 = @key1of3 and key2 = @key2of3 and key3 > @key3of3)) order by key1, key2, key3 update sourcetable set processed = 'Y' from sourcetable, #temp where sourcetable.key1 = #temp.key1 and ... key2... key3 if @rowcount = 0 break select top 1 @key1of3 = key1, @key2of3 = key2, @key3of3 = key3 from #temp order by key1 desc, key2 desc, key3 desc truncate table #temp end If you can't already tell, the purpose is to update all records in sourcetable to Processed = 'Y' where it meets certain conditions, including a date range. The date range is big enough that it will hit 5 million to be updated, from a total table size of 100 million records. Planning on both reveal a traversal of the clustered primary key (key1 varchar(40), key2 char(1), key3 int) in both cases. Questions: 1. Which of these two will normally be faster? And if one is faster, by how much, i.e. marginally, significantly, about 10 times faster? 2. Three (3) ways of creating a temp table - which of these is best, and why? Or is it case by case basis? a) create table #temp .... <explicit create, explicit column types> b) select <fields> into #temp from ... where 1=0 <explicit create, implicit column types>; followed by a separate insert c) select <fields> into #temp from ..... <implicit create, implicit column types> I've heard that SQL Server 6.5 locks sysobjects and syscolumns in tempdb when using option 3, but I've done some tests and can see no problem in SQL Server 2000 - comments? 3) In the DBA's option, will the PAGLOCKs be obtained and released as the updates occur, i.e. only one PAGELOCK will be active at a time, or will it issue enough PAGELOCKs to cover all the 5 million rows to be updated - in one single implicit transaction? More importantly, will updates to the table be able to occur during the UPDATE phase? 4) Given that option 1 takes 1.5 hrs to complete the task, and depending on the answers to (1) and (3), will you recommend option 1 or 2? 5) If your answer was option 2 in (4), if the date range were decreased to a single day, i.e. a day with 300,000 records to update, option 1 has the advantage of using a cross of two indexes (anotherdatefield, processed) and (date) instead of the clustered primary key scan. Therefore this reduces time in the SELECT step. Option 2 is tied to incrementally scanning on the clustered primary key. Option 1 will finish in 4 minutes - does this change your answer to (4)? If you cannot understand the above, please do not attempt to answer! If it's my expressive ability, I apologize. Thanks, Richard
Show quote
"Richard C" <Richard C@discussions.microsoft.com> wrote in message A single bulk operation is both faster and easier to manage. Only break a news:23A0024B-B2E7-433C-86A1-A397A0631DC0@microsoft.com... > Hi, > > I'm looking for some feedback as to which of these two approaches to a > problem is better. We're dealing with 5 million records of 100 million to > be > updated, and both plans come up as clustered primary key scanning. On the > one > hand, a DBA has come up with: .. . . > > 1. Which of these two will normally be faster? And if one is faster, by > how > much, i.e. marginally, significantly, about 10 times faster? > single logical operation into multiple batches when foreced to do so. Things which might force you to process in batches might be running out of log space or unacceptable lock wait times for other sessions. David You should be careful when splitting any single update into several separate
updates. You lose the protection afforded by the implicit transaction within which each update statement operates. There must be no interdependence between the data in the updated rows--key value changes, etc. There must also be a mechanism to restart the job so that it can pick up where it left off. If a failure occurs during a single update, then all work performed by that update is rolled back. If you split up the updates, you must build in a recovery mechanism in case one or more of the updates fail. Since you lose the protection in either of the cases below, why not simply use SET ROWCOUNT ? For example: SET ROWCOUNT 1000 WHILE 1=1 BEGIN UPDATE SET Processed = 'Y' WHERE Processed != 'Y' -- add other conditions here IF @@ROWCOUNT = 0 BREAK END This is a much more readable solution, can easily be recovered from, and will perform better than either of the solutions below because no temp tables are involved. Show quote "Richard C" <Richard C@discussions.microsoft.com> wrote in message news:23A0024B-B2E7-433C-86A1-A397A0631DC0@microsoft.com... > Hi, > > I'm looking for some feedback as to which of these two approaches to a > problem is better. We're dealing with 5 million records of 100 million to be > updated, and both plans come up as clustered primary key scanning. On the one > hand, a DBA has come up with: > > (the primary key of sourcetable is a triplet of key1 varchar(40), key2 > char(1), key3 int) > > * Option 1 * > create table #temp... (varchar(40), char(1), int, bit) > insert into #temp select <keyfields: varchar, char, int>, case processed > when 'Y' then 1 else 0 end from sourcetable > where date between x and y > create clustered index icx_temp on #temp (key1, key2, key3) > update sourcetable set processed = 'Y' > from sourcetable (paglock), #temp > where sourcetable.key1 = #temp.key1 and ... key2... key3 > and .... <other conditions> > > A developer has come up with > > * Option 2 * > declare @key1of3 varchar(40) set @key1of3 = '' > declare @key2of3 char(1) set @key2of3 = '' > declare @key3of3 int set @key3of3 = 0 > select <keyfields: varchar, char, int> from sourcetable into #temp where 1=0 > alter table #temp add constraint primary key (key1, key2, key3) > while (1=1) begin > insert into #temp select top 1000 <keyfields: varchar, char, int> > from sourcetable > where date between x and y and .... <other conditions> > and ((key1 > @key1of3) > or (key1 = @key1of3 and key2 >= @key2of3) > or (key1 = @key1of3 and key2 = @key2of3 and key3 > @key3of3)) > order by key1, key2, key3 > update sourcetable set processed = 'Y' > from sourcetable, #temp > where sourcetable.key1 = #temp.key1 and ... key2... key3 > if @rowcount = 0 break > select top 1 @key1of3 = key1, @key2of3 = key2, @key3of3 = key3 > from #temp order by key1 desc, key2 desc, key3 desc > truncate table #temp > end > > If you can't already tell, the purpose is to update all records in > sourcetable to Processed = 'Y' where it meets certain conditions, including a > date range. The date range is big enough that it will hit 5 million to be > updated, from a total table size of 100 million records. Planning on both > reveal a traversal of the clustered primary key (key1 varchar(40), key2 > char(1), key3 int) in both cases. > > Questions: > > 1. Which of these two will normally be faster? And if one is faster, by how > much, i.e. marginally, significantly, about 10 times faster? > > 2. Three (3) ways of creating a temp table - which of these is best, and > why? Or is it case by case basis? > a) create table #temp .... <explicit create, explicit column types> > b) select <fields> into #temp from ... where 1=0 <explicit create, implicit > column types>; followed by a separate insert > c) select <fields> into #temp from ..... <implicit create, implicit column > types> > I've heard that SQL Server 6.5 locks sysobjects and syscolumns in tempdb > when using option 3, but I've done some tests and can see no problem in SQL > Server 2000 - comments? > > 3) In the DBA's option, will the PAGLOCKs be obtained and released as the > updates occur, i.e. only one PAGELOCK will be active at a time, or will it > issue enough PAGELOCKs to cover all the 5 million rows to be updated - in one > single implicit transaction? More importantly, will updates to the table be > able to occur during the UPDATE phase? > > 4) Given that option 1 takes 1.5 hrs to complete the task, and depending on > the answers to (1) and (3), will you recommend option 1 or 2? > > 5) If your answer was option 2 in (4), if the date range were decreased to a > single day, i.e. a day with 300,000 records to update, option 1 has the > advantage of using a cross of two indexes (anotherdatefield, processed) and > (date) instead of the clustered primary key scan. Therefore this reduces time > in the SELECT step. Option 2 is tied to incrementally scanning on the > clustered primary key. Option 1 will finish in 4 minutes - does this change > your answer to (4)? > > If you cannot understand the above, please do not attempt to answer! If it's > my expressive ability, I apologize. > > Thanks, > Richard > |
|||||||||||||||||||||||