|
database
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
Data Warehouse Load Hints?(SQL 2000) So if we are loading large tables while no one else is logged in
and no other processes will occur, should we provide hints for the table inserts and updates for a performance boost? If so, which hints? Can any help avoid deadlocks? And will this be different for 2005? Does setting the db to singleuser help speed-up updates? Thanks Greg C A deadlock is when two processes have locked resources (typically in an
uncommitted transaction) that the other needs to continue. How would this happen if the ETL is the only process running? http://support.microsoft.com/default.aspx?scid=kb;en-us;169960 http://msdn.microsoft.com/library/default.asp?url=/library/en-us/acdata/ac_8_con_7a_3hdf.asp As for maximizing the performance of your ETL, I would reccomend the "bulk load" method, becuase it supports features such as minimal transaction logging. http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/incbulkload.mspx Setting the database to DBO Use Only would also be a good idea; it may improve performance, but mostly to insure that noone or nothing else logs in. Show quote "gc" <nospam@hotmail.com> wrote in message news:Xns977C861DB2E90nospam@24.93.43.121... > (SQL 2000) So if we are loading large tables while no one else is logged > in > and no other processes will occur, should we provide hints for the table > inserts and updates for a performance boost? If so, which hints? Can any > help avoid deadlocks? And will this be different for 2005? > > Does setting the db to singleuser help speed-up updates? > > Thanks > Greg C > > > "JT" <some***@microsoft.com> wrote in Well...you are asking someone who really needs to ask the question. Did news:uTboj28PGHA.720@TK2MSFTNGP14.phx.gbl: that make sense? :-) I am not a dba, and have learned through trial and error what seems to work, at least most of the time. I thought 'bulk load' recovery was just beneficial when importing text files or using 'select into', no? In our ETL, we have a number of procedures (which can call other procedures) to test and process the data after the raw data is imported w/ DTS. The DTS part is always fast, some of the subsequent processing will lock-up periodically...I assume this would be a deadlock or lock escalation? (When I look at the locks I will see many pages on one object). It just happened to occurr on one machine...I stopped and restarted w/o mods to the code and the next time it processed w/o halting. I drop indexes for the DTS, for the data transformation I optimize sql for performance, individual statements are broken into transactions, most of the largest updates/inserts are broken into smaller transactions, temp tables are used when needed, I aggressively manage the multiple data and log files on multiple raids and hard drives. But I have not been able to understand the hints but more importantly why a process will run fine most times then decide to freeze once (btw, I also check for open transactions to make certain that there was no minor error leaving an open tran). thanks for the references > A deadlock is when two processes have locked resources (typically in ?url=/library/en-us/acdata/ac_8_con_7a_3hdf.asp As for maximizing the > an uncommitted transaction) that the other needs to continue. How > would this happen if the ETL is the only process running? > http://support.microsoft.com/default.aspx?scid=kb;en-us;169960 > http://msdn.microsoft.com/library/default.asp performance of your ETL, I > would reccomend the "bulk load" method, becuase it supports features http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/incbulkloa> such as minimal transaction logging. d.mspx Setting the database to DBO Use Only would also be a good Show quote > idea; it may improve performance, but mostly to insure that noone or > nothing else logs in. > > "gc" <nospam@hotmail.com> wrote in message > news:Xns977C861DB2E90nospam@24.93.43.121... >> (SQL 2000) So if we are loading large tables while no one else is >> logged in >> and no other processes will occur, should we provide hints for the >> table inserts and updates for a performance boost? If so, which >> hints? Can any help avoid deadlocks? And will this be different for >> 2005? >> >> Does setting the db to singleuser help speed-up updates? >> >> Thanks >> Greg C >> >> >> > > > You can know if a process if being blocked by another process by executing
sp_who2 and seeing if the [blkby] column for a spid (process id) contains the spid of another blocking process. When importing large amounts of data, use the bulk copy program (BCP.EXE) or BULK INSERT command. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/adminsql/ad_impt_bcp_9esz.asp Try reducing the batch size of your bulk copy process: http://support.microsoft.com/default.aspx?scid=kb;en-us;81339 Consider what impact transaction logging is having on your process. There are a few conditions (such as dropping indexes and specifying the TABLOCK hint) that are required for minimal logging to take effect. http://support.microsoft.com/default.aspx?scid=kb;en-us;59462 http://support.microsoft.com/default.aspx?scid=kb;en-us;110139 Show quote "gc" <nospam@hotmail.com> wrote in message news:Xns977CEEB88B493nospam@24.93.43.121... > "JT" <some***@microsoft.com> wrote in > news:uTboj28PGHA.720@TK2MSFTNGP14.phx.gbl: > > Well...you are asking someone who really needs to ask the question. Did > that make sense? :-) I am not a dba, and have learned through trial and > error what seems to work, at least most of the time. > > I thought 'bulk load' recovery was just beneficial when importing text > files or using 'select into', no? > > In our ETL, we have a number of procedures (which can call other > procedures) to test and process the data after the raw data is imported > w/ DTS. The DTS part is always fast, some of the subsequent processing > will lock-up periodically...I assume this would be a deadlock or lock > escalation? (When I look at the locks I will see many pages on one > object). It just happened to occurr on one machine...I stopped and > restarted w/o mods to the code and the next time it processed w/o > halting. > > I drop indexes for the DTS, for the data transformation I optimize sql > for performance, individual statements are broken into transactions, > most of the largest updates/inserts are broken into smaller transactions, > temp tables are used when needed, I aggressively manage the multiple data > and log files on multiple raids and hard drives. But I have not been > able to understand the hints but more importantly why a process will run > fine most times then decide to freeze once (btw, I also check for open > transactions to make certain that there was no minor error leaving an > open tran). > > thanks for the references > > >> A deadlock is when two processes have locked resources (typically in >> an uncommitted transaction) that the other needs to continue. How >> would this happen if the ETL is the only process running? >> http://support.microsoft.com/default.aspx?scid=kb;en-us;169960 >> http://msdn.microsoft.com/library/default.asp > ?url=/library/en-us/acdata/ac_8_con_7a_3hdf.asp As for maximizing the > performance of your ETL, I >> would reccomend the "bulk load" method, becuase it supports features >> such as minimal transaction logging. > http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/incbulkloa > d.mspx Setting the database to DBO Use Only would also be a good >> idea; it may improve performance, but mostly to insure that noone or >> nothing else logs in. >> >> "gc" <nospam@hotmail.com> wrote in message >> news:Xns977C861DB2E90nospam@24.93.43.121... >>> (SQL 2000) So if we are loading large tables while no one else is >>> logged in >>> and no other processes will occur, should we provide hints for the >>> table inserts and updates for a performance boost? If so, which >>> hints? Can any help avoid deadlocks? And will this be different for >>> 2005? >>> >>> Does setting the db to singleuser help speed-up updates? >>> >>> Thanks >>> Greg C >>> >>> >>> >> >> >> > gc (nospam@hotmail.com) writes:
> In our ETL, we have a number of procedures (which can call other A deadlock is when two (or more) processes are waiting for each other> procedures) to test and process the data after the raw data is imported > w/ DTS. The DTS part is always fast, some of the subsequent processing > will lock-up periodically...I assume this would be a deadlock or lock > escalation? (When I look at the locks I will see many pages on one > object). It just happened to occurr on one machine...I stopped and > restarted w/o mods to the code and the next time it processed w/o > halting. to release resources. SQL Server detects a deadlock, and will select one of the processes as a deadlock victim and cancel execution for that process, so that at least one of the processes can continue working. If you processing "locks up" this can be due to blocking, but also due to long-running queries. With the information you have provided, it is difficult to tell. As JT said, use sp_who2 to check for blocking. Another issue to watch out for is auto-grow of the database. By default, SQL Server auto-grows a database with 10% when you run out of space. If your database is huge - and data warehouses often are - then 10% can take quite some to time grow. This can be dealt with expanding the database to a reasonable size in advance. -- Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se Books Online for SQL Server 2005 at http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx Just a couple of tips that I've learned:
1. If you have a clustered index, make sure that it's located on a monotonically increasing value. This will minimize page splitting. 2. The clustered index does not have to be your primary key; in fact, if you use a natural key, then it should not be. Our data is very date and time sensitive, so I index the date and time of load; others use a sequential numbering system (like an identity column), but I needed the date of load for other reasons, and I prefer not to add columns that have no meaning. 3 Depending on the size of your data, it may be more appropriate to drop all of your non-clustered indexes , load the data, and rebuild the indexes. HTH, Stu "Stu" <stuart.ainswo***@gmail.com> wrote in news:1141535576.352832.290970 @e56g2000cwe.googlegroups.com:> 1. If you have a clustered index, make sure that it's located on a Not sure that I understand this one.> monotonically increasing value. This will minimize page splitting. If I load some data into the field of the clustered key, are you saying that it makes a difference what order that I load the records? If so, is that different whether the clustered index is the primary key or not? > 2. The clustered index does not have to be your primary key; in fact, I use natural keys for something like a client id, but convert almost all > if you use a natural key, then it should not be. Our data is very date > and time sensitive, so I index the date and time of load; others use a > sequential numbering system (like an identity column), but I needed the > date of load for other reasons, and I prefer not to add columns that > have no meaning. of the dimensions to surrogate (identity) keys. > 3 Depending on the size of your data, it may be more appropriate to I do this everytime. 1.5 out of 3... :-)> drop all of your non-clustered indexes , load the data, and rebuild the > indexes. > Ugh. Page splitting is difficult to explain; think of your table as a
blank notebook. The clustered index is kind of like a page number for your other indexes to go and retrieve information; it's a way for the optimizer to go figure out where row X is located in your table. As you write data to your notebook, it begins to fill up each page. As you continue to write, it skips to the next page; if your clustered index does not increase in a sequential fashion, when the optimizer encounters a a value that is out of order, it has to split the page at the insertion point, and move records below that point on a page to a new page; that's why you should cluster on a sequential value that is independent of the order of the data outside of the warehouse. That's why I recommend a datetime representation so it increases without splitting the pages. If your primary key is a natural key (like a combination of clientID and rowID), then the likelihood is that your rows will be out of order as they load into the target database, and hence some reordering will be necessary if you cluster on that. Clear as mud? Stu Also consider that if a good candidate for a clustered index cannot be
found, then best not to implement one. Show quote "Stu" <stuart.ainswo***@gmail.com> wrote in message news:1141539867.013564.240070@e56g2000cwe.googlegroups.com... > Ugh. Page splitting is difficult to explain; think of your table as a > blank notebook. The clustered index is kind of like a page number for > your other indexes to go and retrieve information; it's a way for the > optimizer to go figure out where row X is located in your table. > > As you write data to your notebook, it begins to fill up each page. As > you continue to write, it skips to the next page; if your clustered > index does not increase in a sequential fashion, when the optimizer > encounters a a value that is out of order, it has to split the page at > the insertion point, and move records below that point on a page to a > new page; that's why you should cluster on a sequential value that is > independent of the order of the data outside of the warehouse. That's > why I recommend a datetime representation so it increases without > splitting the pages. > > If your primary key is a natural key (like a combination of clientID > and rowID), then the likelihood is that your rows will be out of order > as they load into the target database, and hence some reordering will > be necessary if you cluster on that. > > Clear as mud? > > Stu > |
|||||||||||||||||||||||