Home All Groups Group Topic Archive Search About

Best Data Type for a Tracking ID?

Author
28 Dec 2005 5:27 PM
Mark S.
Hello,

We're in the table design process and could really use outside thoughts on
our options.

The table in question will grow to several million records. This table will
update and insert depending on what the tracking ID does as it lives out it
day long life span.

Example Stored Proc:
UPDATE tbTracking
SET myCount = myCount +1
WHERE  trackingID = @trackingID AND productID = @productID
IF (@@rowcount = 0)
    BEGIN
        INSERT INTO tbTracking (...) VALUES (...)   ;
    END

Everyday at midnight the table will be truncate just afer many other queries
generated reports and store cumulative data into other tables.

Our goal is extreme performance, with that in mind what is the best data
type for trackingID and productID? We have the luxury to make them whatever
is best, as this is a completely new project.

Internally debated NVARCHR(50), INT, Binary, UNIQUEIDENTIFIER and the like,
but we're in need of advise from those with more experience with these
matters.

Thank you for your time.

Mark

Author
28 Dec 2005 5:44 PM
David Portas
Mark S. wrote:

Show quote
> Hello,
>
> We're in the table design process and could really use outside thoughts on
> our options.
>
> The table in question will grow to several million records. This table will
> update and insert depending on what the tracking ID does as it lives out it
> day long life span.
>
> Example Stored Proc:
> UPDATE tbTracking
> SET myCount = myCount +1
> WHERE  trackingID = @trackingID AND productID = @productID
> IF (@@rowcount = 0)
>     BEGIN
>         INSERT INTO tbTracking (...) VALUES (...)   ;
>     END
>
> Everyday at midnight the table will be truncate just afer many other queries
> generated reports and store cumulative data into other tables.
>
> Our goal is extreme performance, with that in mind what is the best data
> type for trackingID and productID? We have the luxury to make them whatever
> is best, as this is a completely new project.
>
> Internally debated NVARCHR(50), INT, Binary, UNIQUEIDENTIFIER and the like,
> but we're in need of advise from those with more experience with these
> matters.
>
> Thank you for your time.
>
> Mark

This doesn't make much sense to me. As described it isn't clear whether
trackingid is a surrogate key or not. If it isn't then I don't
understand what its purpose is. What is the business meaning of
trackingid? Why isn't the datatype predetermined? What key or keys
exist in this table?

Maybe we are talking surrogate keys here, in which case see:
http://www.aspfaq.com/show.asp?id=2504

> Everyday at midnight the table will be truncate just afer many other queries
> generated reports and store cumulative data into other tables.

A guiding in principle in data warehouse applications is usually to
capture the data at the finest possible level of granularity. I expect
you've considered this, but I just thought it worth restating for the
benefit of all.

--
David Portas
SQL Server MVP
--
Author
28 Dec 2005 6:25 PM
JT
Who or what determines the value of TrackingID?

If this key is to uniquely identify an event (such as a support call), then
perhaps (8 byte) datetime would be the logical choice. It has inherent
meaning, and unless you have hundreds of support calls coming in per minute,
it is unlikely that this key value would be duplicated, and retry logic on
the insert procedure or statement can handle the unlikely event that does.
There may even be a need to purge the tbTracking table based on < a specific
date/time rather than truncating the entire table at the end of the day.

As for ProductID, this should be an existing attribute in your database.

Show quote
"Mark S." <ma***@yahoo.com> wrote in message
news:Ow6CKQ9CGHA.4004@tk2msftngp13.phx.gbl...
> Hello,
>
> We're in the table design process and could really use outside thoughts on
> our options.
>
> The table in question will grow to several million records. This table
> will update and insert depending on what the tracking ID does as it lives
> out it day long life span.
>
> Example Stored Proc:
> UPDATE tbTracking
> SET myCount = myCount +1
> WHERE  trackingID = @trackingID AND productID = @productID
> IF (@@rowcount = 0)
>    BEGIN
>        INSERT INTO tbTracking (...) VALUES (...)   ;
>    END
>
> Everyday at midnight the table will be truncate just afer many other
> queries generated reports and store cumulative data into other tables.
>
> Our goal is extreme performance, with that in mind what is the best data
> type for trackingID and productID? We have the luxury to make them
> whatever is best, as this is a completely new project.
>
> Internally debated NVARCHR(50), INT, Binary, UNIQUEIDENTIFIER and the
> like, but we're in need of advise from those with more experience with
> these matters.
>
> Thank you for your time.
>
> Mark
>
>
Author
28 Dec 2005 7:26 PM
Joe from WI
In my experiences, int data type is the best performer especially when it
comes to indexing, joins, etc.  (This comment relates to 32-bit O/S and SQL
Server 2000.)

I would make a meaningless key:
TrackingRowId int, identity, primary key, clustered
ProductRowId int, identity, primary key, clustered

SQL Server takes care of assigning the next number.  The clustered index
will always insert data at the end of the data pages which avoids page splits.

If you want to reset the tracking id each midnight, simply change the reseed
the tables as part of your processing.

Personally...
I would NOT use a character string for a key because its slower to compare
strings.
I would NOT use a UNIQUEIDENTIFIER for a key because it is a pain in the
butt when you have to write a manual query to pull data.

Finally, I have to reiterate David's point of capturing data at the lowest
level.  Are you sure you want to summarize and truncate on a nightly basis. 
What if a bug is discovered?  You won't have the original data to recreate
the totals.  What if a new way of summarizing is added?  You won't have the
original data to drill into or summarize.

And, to reiterate JT's point of having a datetime column that can be very
handy.  It is debateable whether to use it as the key or not.  If you're
truncating the tables each night, it may be a waste.

Just my two cents,
Joe

Show quote
"Mark S." wrote:

> Hello,
>
> We're in the table design process and could really use outside thoughts on
> our options.
>
> The table in question will grow to several million records. This table will
> update and insert depending on what the tracking ID does as it lives out it
> day long life span.
>
> Example Stored Proc:
> UPDATE tbTracking
> SET myCount = myCount +1
> WHERE  trackingID = @trackingID AND productID = @productID
> IF (@@rowcount = 0)
>     BEGIN
>         INSERT INTO tbTracking (...) VALUES (...)   ;
>     END
>
> Everyday at midnight the table will be truncate just afer many other queries
> generated reports and store cumulative data into other tables.
>
> Our goal is extreme performance, with that in mind what is the best data
> type for trackingID and productID? We have the luxury to make them whatever
> is best, as this is a completely new project.
>
> Internally debated NVARCHR(50), INT, Binary, UNIQUEIDENTIFIER and the like,
> but we're in need of advise from those with more experience with these
> matters.
>
> Thank you for your time.
>
> Mark
>
>
>
Author
29 Dec 2005 12:36 AM
Mark S.
Gentleman thank you for your feedback.

Joe, other than the UniqueIndentifier being unfriendly for humans, do you
find it faster than INTs?

As to th other questions and caveats, all that has been considered
previously, and wasn't mentioned in my question in the interests of brevity
and an attempt to focus my question on a single point. Regardless, thank you
for your full consideration.

Cheers,

Mark





Show quote
"Joe from WI" <Joefro***@discussions.microsoft.com> wrote in message
news:50A0324B-1446-4044-A850-706B59A04169@microsoft.com...
> In my experiences, int data type is the best performer especially when it
> comes to indexing, joins, etc.  (This comment relates to 32-bit O/S and
> SQL
> Server 2000.)
>
> I would make a meaningless key:
> TrackingRowId int, identity, primary key, clustered
> ProductRowId int, identity, primary key, clustered
>
> SQL Server takes care of assigning the next number.  The clustered index
> will always insert data at the end of the data pages which avoids page
> splits.
>
> If you want to reset the tracking id each midnight, simply change the
> reseed
> the tables as part of your processing.
>
> Personally...
> I would NOT use a character string for a key because its slower to compare
> strings.
> I would NOT use a UNIQUEIDENTIFIER for a key because it is a pain in the
> butt when you have to write a manual query to pull data.
>
> Finally, I have to reiterate David's point of capturing data at the lowest
> level.  Are you sure you want to summarize and truncate on a nightly
> basis.
> What if a bug is discovered?  You won't have the original data to recreate
> the totals.  What if a new way of summarizing is added?  You won't have
> the
> original data to drill into or summarize.
>
> And, to reiterate JT's point of having a datetime column that can be very
> handy.  It is debateable whether to use it as the key or not.  If you're
> truncating the tables each night, it may be a waste.
>
> Just my two cents,
> Joe
>
> "Mark S." wrote:
>
>> Hello,
>>
>> We're in the table design process and could really use outside thoughts
>> on
>> our options.
>>
>> The table in question will grow to several million records. This table
>> will
>> update and insert depending on what the tracking ID does as it lives out
>> it
>> day long life span.
>>
>> Example Stored Proc:
>> UPDATE tbTracking
>> SET myCount = myCount +1
>> WHERE  trackingID = @trackingID AND productID = @productID
>> IF (@@rowcount = 0)
>>     BEGIN
>>         INSERT INTO tbTracking (...) VALUES (...)   ;
>>     END
>>
>> Everyday at midnight the table will be truncate just afer many other
>> queries
>> generated reports and store cumulative data into other tables.
>>
>> Our goal is extreme performance, with that in mind what is the best data
>> type for trackingID and productID? We have the luxury to make them
>> whatever
>> is best, as this is a completely new project.
>>
>> Internally debated NVARCHR(50), INT, Binary, UNIQUEIDENTIFIER and the
>> like,
>> but we're in need of advise from those with more experience with these
>> matters.
>>
>> Thank you for your time.
>>
>> Mark
>>
>>
>>
Author
29 Dec 2005 12:44 AM
Mark S.
The last line of the article David suggested
http://www.aspfaq.com/show.asp?id=2504

Says alot about GUI() not being optimized, if anyone disagrees, feel free to
speak up:
"the wider datatype leads to a drop in index performance (if clustered, each
insert almost guaranteed to 'dirty' a different page), and an increase in
storage requirements; " and five other cons.

Thank you.
Author
29 Dec 2005 5:00 AM
Brian Selzer
To maximize INSERT performance: use INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED
WITH FILLFACTOR = 100

This causes every new row to be added at the end of the table--minimizing
index maintenance and eliminating page splits.  Use caution, however: since
SQL Server doesn't automatically reorganize indexes, a high volume of DELETE
activity will cause the index to become sparse, which can reduce SELECT
performance, but it will not affect INSERT performance.

The INT datatype matches the word size of most Intel processors (32-bit), so
comparisons require fewer CPU cycles.

The stored procedure is an example of what not to do.  It's a recipie for
primary key constraint violations.  There's nothing that prevents two
transactions from trying to INSERT the same information at the same time.
Most of the time what will happen is that one connection will succeed with
the INSERT and the other will UPDATE the newly inserted row, but a collision
will occur if identical UPDATE statements occur simultaneously on two
unbound connections followed by (since the row doesn't yet exist)
simultaneous identical INSERT statements.

The correct method is to use something like this:

BEGIN TRAN
IF EXISTS (SELECT WITH(UPDLOCK, HOLDLOCK))
    UPDATE
ELSE
    INSERT
COMMIT

Some people use the following instead, but I prefer the above method since
it is easier to read and understand

BEGIN TRAN
INSERT...SELECT...WHERE NOT EXISTS(SELECT WITH(UPDLOCK, HOLDLOCK))
IF @@ROWCOUNT = 0
    UPDATE
COMMIT

Note that there isn't any marked reduction in performance or concurrency
between this and your sample, because UPDLOCK doesn't block SELECTs, and any
blocking that does occur is necessary to maintain integrity.  Without the
EXISTS clause, the INSERT or UPDATE will be applying an exclusive lock
anyway which involves reading the index page into memory.  With the EXISTS
clause, the SELECT reads the index page and applies an update lock, and the
INSERT or UPDATE simply transition from an update lock to an exclusive lock
in memory--no additional physical read is necessary.

Show quote
"Mark S." <ma***@yahoo.com> wrote in message
news:Ow6CKQ9CGHA.4004@tk2msftngp13.phx.gbl...
> Hello,
>
> We're in the table design process and could really use outside thoughts on
> our options.
>
> The table in question will grow to several million records. This table
> will update and insert depending on what the tracking ID does as it lives
> out it day long life span.
>
> Example Stored Proc:
> UPDATE tbTracking
> SET myCount = myCount +1
> WHERE  trackingID = @trackingID AND productID = @productID
> IF (@@rowcount = 0)
>    BEGIN
>        INSERT INTO tbTracking (...) VALUES (...)   ;
>    END
>
> Everyday at midnight the table will be truncate just afer many other
> queries generated reports and store cumulative data into other tables.
>
> Our goal is extreme performance, with that in mind what is the best data
> type for trackingID and productID? We have the luxury to make them
> whatever is best, as this is a completely new project.
>
> Internally debated NVARCHR(50), INT, Binary, UNIQUEIDENTIFIER and the
> like, but we're in need of advise from those with more experience with
> these matters.
>
> Thank you for your time.
>
> Mark
>
>
Author
29 Dec 2005 6:02 AM
Mark S.
Thank you very much.

Show quote
"Brian Selzer" <br***@selzer-software.com> wrote in message
news:%23oG2PTDDGHA.3980@TK2MSFTNGP12.phx.gbl...
> To maximize INSERT performance: use INT IDENTITY(1, 1) PRIMARY KEY
> CLUSTERED WITH FILLFACTOR = 100
>
> This causes every new row to be added at the end of the table--minimizing
> index maintenance and eliminating page splits.  Use caution, however:
> since SQL Server doesn't automatically reorganize indexes, a high volume
> of DELETE activity will cause the index to become sparse, which can reduce
> SELECT performance, but it will not affect INSERT performance.
>
> The INT datatype matches the word size of most Intel processors (32-bit),
> so comparisons require fewer CPU cycles.
>
> The stored procedure is an example of what not to do.  It's a recipie for
> primary key constraint violations.  There's nothing that prevents two
> transactions from trying to INSERT the same information at the same time.
> Most of the time what will happen is that one connection will succeed with
> the INSERT and the other will UPDATE the newly inserted row, but a
> collision will occur if identical UPDATE statements occur simultaneously
> on two unbound connections followed by (since the row doesn't yet exist)
> simultaneous identical INSERT statements.
>
> The correct method is to use something like this:
>
> BEGIN TRAN
> IF EXISTS (SELECT WITH(UPDLOCK, HOLDLOCK))
>    UPDATE
> ELSE
>    INSERT
> COMMIT
>
> Some people use the following instead, but I prefer the above method since
> it is easier to read and understand
>
> BEGIN TRAN
> INSERT...SELECT...WHERE NOT EXISTS(SELECT WITH(UPDLOCK, HOLDLOCK))
> IF @@ROWCOUNT = 0
>    UPDATE
> COMMIT
>
> Note that there isn't any marked reduction in performance or concurrency
> between this and your sample, because UPDLOCK doesn't block SELECTs, and
> any blocking that does occur is necessary to maintain integrity.  Without
> the EXISTS clause, the INSERT or UPDATE will be applying an exclusive lock
> anyway which involves reading the index page into memory.  With the EXISTS
> clause, the SELECT reads the index page and applies an update lock, and
> the INSERT or UPDATE simply transition from an update lock to an exclusive
> lock in memory--no additional physical read is necessary.
>
> "Mark S." <ma***@yahoo.com> wrote in message
> news:Ow6CKQ9CGHA.4004@tk2msftngp13.phx.gbl...
>> Hello,
>>
>> We're in the table design process and could really use outside thoughts
>> on our options.
>>
>> The table in question will grow to several million records. This table
>> will update and insert depending on what the tracking ID does as it lives
>> out it day long life span.
>>
>> Example Stored Proc:
>> UPDATE tbTracking
>> SET myCount = myCount +1
>> WHERE  trackingID = @trackingID AND productID = @productID
>> IF (@@rowcount = 0)
>>    BEGIN
>>        INSERT INTO tbTracking (...) VALUES (...)   ;
>>    END
>>
>> Everyday at midnight the table will be truncate just afer many other
>> queries generated reports and store cumulative data into other tables.
>>
>> Our goal is extreme performance, with that in mind what is the best data
>> type for trackingID and productID? We have the luxury to make them
>> whatever is best, as this is a completely new project.
>>
>> Internally debated NVARCHR(50), INT, Binary, UNIQUEIDENTIFIER and the
>> like, but we're in need of advise from those with more experience with
>> these matters.
>>
>> Thank you for your time.
>>
>> Mark
>>
>>
>
>

AddThis Social Bookmark Button