Home All Groups Group Topic Archive Search About

Detect Changes Using Checksum

Author
30 Aug 2006 2:36 PM
Dave
Hi,

I am working on a project which integrates diverse and often unreliable data
sources. I need to check for changes in source data on a daily basis. I
import data into MSSQL 2000 and using checksum functionality I compare
yesterday's snapshot of the data with today's. I run a SProc which allocates
a checksum value to each record before the comparison. I can therefore
determine which records have changed and further process them.

But is checksum sophisticated/robust enough to rely on in a production
environment. Testing has not unearthed any problems so far but I was
wondering is my approach fundamentally sound or have I missed something?

Thanks,

Dave.

Author
30 Aug 2006 2:59 PM
ML
Why use checksums if you can compare the values directly?

Different values could produce the same checksum.


ML

---
http://milambda.blogspot.com/
Author
30 Aug 2006 3:03 PM
Steve Kass
Dave,

Checksum will not detect all changes. It's not a question of
robustness, but mathematical impossibility. There are only
4 billion different possible checksum values, and far more than
4 billion different possible contents of a single row for a typical table.

Without comparing all the data, you are taking a chance - not a big
one perhaps, but a chance nonetheless - that you will get the same
checksum for different data.

See http://groups.google.com/groups/search?q=gabcdefe for an
example of when this happens.

Steve Kass
Drew University
www.stevekass.com

Dave wrote:

Show quote
>Hi,
>
>I am working on a project which integrates diverse and often unreliable data
>sources. I need to check for changes in source data on a daily basis. I
>import data into MSSQL 2000 and using checksum functionality I compare
>yesterday's snapshot of the data with today's. I run a SProc which allocates
>a checksum value to each record before the comparison. I can therefore
>determine which records have changed and further process them.
>
>But is checksum sophisticated/robust enough to rely on in a production
>environment. Testing has not unearthed any problems so far but I was
>wondering is my approach fundamentally sound or have I missed something?
>
>Thanks,
>
>Dave.
>
>

>
Author
30 Aug 2006 3:22 PM
Dave
The reason I was using checksum (binary_checksum) was for speed and
simplicity (I Imagined). It was hoped to run the process considerably more
frequently than nightly. The risk of error is small but too big. I will
reconfigure my SProcs and perform proper data comparison.

Thanks ML and Steve.

Dave.
Author
30 Aug 2006 5:46 PM
Alexander Kuznetsov
> frequently than nightly. The risk of error is small but too big. I will
> reconfigure my SProcs and perform proper data comparison.

Dave,

instead of compatring data column by column, as follows

c1.col1 <> c2.col1 OR (c1.col1 IS NULL AND c2.col1 IS NOT NULL) OR
(c1.col1 IS NOT NULL AND c2.col1 IS NULL)

use

SELECT COUNT(*) FROM(
SELECT col1, col2, ... FROM YesterdaysVersion
UNION
SELECT col1, col2, ... FROM TodaysVersion) t

If it's 1 , versions are identical
Author
30 Aug 2006 7:35 PM
Jim Underwood
Alex,
If both YesterdaysVersion and TodaysVersion contain 100 identical rows,
won't the count(*) return 100, rahter than 1?

Did you mean to do an except?


Show quote
"Alexander Kuznetsov" <AK_TIREDOFSPAM@hotmail.COM> wrote in message
news:1156959990.514484.105740@74g2000cwt.googlegroups.com...
> > frequently than nightly. The risk of error is small but too big. I will
> > reconfigure my SProcs and perform proper data comparison.
>
> Dave,
>
> instead of compatring data column by column, as follows
>
> c1.col1 <> c2.col1 OR (c1.col1 IS NULL AND c2.col1 IS NOT NULL) OR
> (c1.col1 IS NOT NULL AND c2.col1 IS NULL)
>
> use
>
> SELECT COUNT(*) FROM(
> SELECT col1, col2, ... FROM YesterdaysVersion
> UNION
> SELECT col1, col2, ... FROM TodaysVersion) t
>
> If it's 1 , versions are identical
>
Author
30 Aug 2006 9:27 PM
Hugo Kornelis
On Wed, 30 Aug 2006 08:22:02 -0700, Dave wrote:

>The reason I was using checksum (binary_checksum) was for speed and
>simplicity (I Imagined). It was hoped to run the process considerably more
>frequently than nightly. The risk of error is small but too big. I will
>reconfigure my SProcs and perform proper data comparison.
>
>Thanks ML and Steve.
>
>Dave.

Hi Dave,

In addition to Alexander's suggestion, here's yet another way to compare
two tables:

SELECT   col1, col2, ..., colN, MIN(version) AS version
FROM    (SELECT col1, col2, ..., colN, 'Yesterday' AS version
         FROM   YesterdaysVersion
         UNION ALL
         SELECT col1, col2, ..., colN, 'Today' AS version
         FROM   YesterdaysVersion)
GROUP BY col1, col2, ..., colN
WHERE    COUNT(*) = 1

--
Hugo Kornelis, SQL Server MVP
Author
31 Aug 2006 7:20 AM
Uri Dimant
Hugo
Perhaps I missunderstood the OP's request , but the below example shows that
both tables have an idetrnical data.

SELECT COUNT(*) FROM(
SELECT c FROM #t
UNION
SELECT c FROM #t1) t

create table #t (c int)
insert into #t values (50)
insert into #t values (100)
-----------
create table #t1(c int)
insert into #t1 values (50)
insert into #t1 values (100)




Show quote
"Hugo Kornelis" <h***@perFact.REMOVETHIS.info.INVALID> wrote in message
news:hg0cf2970le3o6vslg7gvnv0464k6fe3sr@4ax.com...
> On Wed, 30 Aug 2006 08:22:02 -0700, Dave wrote:
>
>>The reason I was using checksum (binary_checksum) was for speed and
>>simplicity (I Imagined). It was hoped to run the process considerably more
>>frequently than nightly. The risk of error is small but too big. I will
>>reconfigure my SProcs and perform proper data comparison.
>>
>>Thanks ML and Steve.
>>
>>Dave.
>
> Hi Dave,
>
> In addition to Alexander's suggestion, here's yet another way to compare
> two tables:
>
> SELECT   col1, col2, ..., colN, MIN(version) AS version
> FROM    (SELECT col1, col2, ..., colN, 'Yesterday' AS version
>         FROM   YesterdaysVersion
>         UNION ALL
>         SELECT col1, col2, ..., colN, 'Today' AS version
>         FROM   YesterdaysVersion)
> GROUP BY col1, col2, ..., colN
> WHERE    COUNT(*) = 1
>
> --
> Hugo Kornelis, SQL Server MVP
Author
1 Sep 2006 9:51 PM
Hugo Kornelis
On Thu, 31 Aug 2006 10:20:36 +0300, Uri Dimant wrote:

Show quote
>Hugo
>Perhaps I missunderstood the OP's request , but the below example shows that
>both tables have an idetrnical data.
>
>SELECT COUNT(*) FROM(
>SELECT c FROM #t
>UNION
>SELECT c FROM #t1) t
>
>create table #t (c int)
>insert into #t values (50)
>insert into #t values (100)
>-----------
>create table #t1(c int)
>insert into #t1 values (50)
>insert into #t1 values (100)

Hi Uri,

If I remove one of the rows from one of the tables, the result of the
query doesn't change even though the tables are no longer identical. If
I then remove the "other" row from the other table, the output is STILL
the same.

Your query can be used to find if the data is identical, but only in
conjunction with two extra queries, for the COUNT(*) of each of the two
tables individually. And once those have run, you know if the tables are
identical or not, but you don't know what the differences are.

My query shows the differences immediately, and is probably faster too
(one table scan for each table vs two scans for your query).

--
Hugo Kornelis, SQL Server MVP
Author
30 Aug 2006 5:45 PM
Arnie Rowland
Building upon the previous remarks, and your concern for speed and
efficiency, you may be able to compare the indexed columns first, and then
only if there is a match, compare the non-indexed columns.

--
Arnie Rowland, Ph.D.
Westwood Consulting, Inc

Most good judgment comes from experience.
Most experience comes from bad judgment.
- Anonymous


Show quote
"Dave" <D***@discussions.microsoft.com> wrote in message
news:F4FF896C-E231-4073-B90D-CE8EC7B1461A@microsoft.com...
> Hi,
>
> I am working on a project which integrates diverse and often unreliable
> data
> sources. I need to check for changes in source data on a daily basis. I
> import data into MSSQL 2000 and using checksum functionality I compare
> yesterday's snapshot of the data with today's. I run a SProc which
> allocates
> a checksum value to each record before the comparison. I can therefore
> determine which records have changed and further process them.
>
> But is checksum sophisticated/robust enough to rely on in a production
> environment. Testing has not unearthed any problems so far but I was
> wondering is my approach fundamentally sound or have I missed something?
>
> Thanks,
>
> Dave.
>
>

AddThis Social Bookmark Button