|
database
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
Detect Changes Using ChecksumHi,
I am working on a project which integrates diverse and often unreliable data sources. I need to check for changes in source data on a daily basis. I import data into MSSQL 2000 and using checksum functionality I compare yesterday's snapshot of the data with today's. I run a SProc which allocates a checksum value to each record before the comparison. I can therefore determine which records have changed and further process them. But is checksum sophisticated/robust enough to rely on in a production environment. Testing has not unearthed any problems so far but I was wondering is my approach fundamentally sound or have I missed something? Thanks, Dave. Why use checksums if you can compare the values directly?
Different values could produce the same checksum. ML --- http://milambda.blogspot.com/ Dave,
Checksum will not detect all changes. It's not a question of robustness, but mathematical impossibility. There are only 4 billion different possible checksum values, and far more than 4 billion different possible contents of a single row for a typical table. Without comparing all the data, you are taking a chance - not a big one perhaps, but a chance nonetheless - that you will get the same checksum for different data. See http://groups.google.com/groups/search?q=gabcdefe for an example of when this happens. Steve Kass Drew University www.stevekass.com Dave wrote: Show quote >Hi, > >I am working on a project which integrates diverse and often unreliable data >sources. I need to check for changes in source data on a daily basis. I >import data into MSSQL 2000 and using checksum functionality I compare >yesterday's snapshot of the data with today's. I run a SProc which allocates >a checksum value to each record before the comparison. I can therefore >determine which records have changed and further process them. > >But is checksum sophisticated/robust enough to rely on in a production >environment. Testing has not unearthed any problems so far but I was >wondering is my approach fundamentally sound or have I missed something? > >Thanks, > >Dave. > > > > The reason I was using checksum (binary_checksum) was for speed and
simplicity (I Imagined). It was hoped to run the process considerably more frequently than nightly. The risk of error is small but too big. I will reconfigure my SProcs and perform proper data comparison. Thanks ML and Steve. Dave. > frequently than nightly. The risk of error is small but too big. I will Dave,> reconfigure my SProcs and perform proper data comparison. instead of compatring data column by column, as follows c1.col1 <> c2.col1 OR (c1.col1 IS NULL AND c2.col1 IS NOT NULL) OR (c1.col1 IS NOT NULL AND c2.col1 IS NULL) use SELECT COUNT(*) FROM( SELECT col1, col2, ... FROM YesterdaysVersion UNION SELECT col1, col2, ... FROM TodaysVersion) t If it's 1 , versions are identical Alex,
If both YesterdaysVersion and TodaysVersion contain 100 identical rows, won't the count(*) return 100, rahter than 1? Did you mean to do an except? Show quote "Alexander Kuznetsov" <AK_TIREDOFSPAM@hotmail.COM> wrote in message news:1156959990.514484.105740@74g2000cwt.googlegroups.com... > > frequently than nightly. The risk of error is small but too big. I will > > reconfigure my SProcs and perform proper data comparison. > > Dave, > > instead of compatring data column by column, as follows > > c1.col1 <> c2.col1 OR (c1.col1 IS NULL AND c2.col1 IS NOT NULL) OR > (c1.col1 IS NOT NULL AND c2.col1 IS NULL) > > use > > SELECT COUNT(*) FROM( > SELECT col1, col2, ... FROM YesterdaysVersion > UNION > SELECT col1, col2, ... FROM TodaysVersion) t > > If it's 1 , versions are identical > On Wed, 30 Aug 2006 08:22:02 -0700, Dave wrote:
>The reason I was using checksum (binary_checksum) was for speed and Hi Dave,>simplicity (I Imagined). It was hoped to run the process considerably more >frequently than nightly. The risk of error is small but too big. I will >reconfigure my SProcs and perform proper data comparison. > >Thanks ML and Steve. > >Dave. In addition to Alexander's suggestion, here's yet another way to compare two tables: SELECT col1, col2, ..., colN, MIN(version) AS version FROM (SELECT col1, col2, ..., colN, 'Yesterday' AS version FROM YesterdaysVersion UNION ALL SELECT col1, col2, ..., colN, 'Today' AS version FROM YesterdaysVersion) GROUP BY col1, col2, ..., colN WHERE COUNT(*) = 1 -- Hugo Kornelis, SQL Server MVP Hugo
Perhaps I missunderstood the OP's request , but the below example shows that both tables have an idetrnical data. SELECT COUNT(*) FROM( SELECT c FROM #t UNION SELECT c FROM #t1) t create table #t (c int) insert into #t values (50) insert into #t values (100) ----------- create table #t1(c int) insert into #t1 values (50) insert into #t1 values (100) Show quote "Hugo Kornelis" <h***@perFact.REMOVETHIS.info.INVALID> wrote in message news:hg0cf2970le3o6vslg7gvnv0464k6fe3sr@4ax.com... > On Wed, 30 Aug 2006 08:22:02 -0700, Dave wrote: > >>The reason I was using checksum (binary_checksum) was for speed and >>simplicity (I Imagined). It was hoped to run the process considerably more >>frequently than nightly. The risk of error is small but too big. I will >>reconfigure my SProcs and perform proper data comparison. >> >>Thanks ML and Steve. >> >>Dave. > > Hi Dave, > > In addition to Alexander's suggestion, here's yet another way to compare > two tables: > > SELECT col1, col2, ..., colN, MIN(version) AS version > FROM (SELECT col1, col2, ..., colN, 'Yesterday' AS version > FROM YesterdaysVersion > UNION ALL > SELECT col1, col2, ..., colN, 'Today' AS version > FROM YesterdaysVersion) > GROUP BY col1, col2, ..., colN > WHERE COUNT(*) = 1 > > -- > Hugo Kornelis, SQL Server MVP On Thu, 31 Aug 2006 10:20:36 +0300, Uri Dimant wrote:
Show quote >Hugo Hi Uri,>Perhaps I missunderstood the OP's request , but the below example shows that >both tables have an idetrnical data. > >SELECT COUNT(*) FROM( >SELECT c FROM #t >UNION >SELECT c FROM #t1) t > >create table #t (c int) >insert into #t values (50) >insert into #t values (100) >----------- >create table #t1(c int) >insert into #t1 values (50) >insert into #t1 values (100) If I remove one of the rows from one of the tables, the result of the query doesn't change even though the tables are no longer identical. If I then remove the "other" row from the other table, the output is STILL the same. Your query can be used to find if the data is identical, but only in conjunction with two extra queries, for the COUNT(*) of each of the two tables individually. And once those have run, you know if the tables are identical or not, but you don't know what the differences are. My query shows the differences immediately, and is probably faster too (one table scan for each table vs two scans for your query). -- Hugo Kornelis, SQL Server MVP Building upon the previous remarks, and your concern for speed and
efficiency, you may be able to compare the indexed columns first, and then only if there is a match, compare the non-indexed columns. -- Show quoteArnie Rowland, Ph.D. Westwood Consulting, Inc Most good judgment comes from experience. Most experience comes from bad judgment. - Anonymous "Dave" <D***@discussions.microsoft.com> wrote in message news:F4FF896C-E231-4073-B90D-CE8EC7B1461A@microsoft.com... > Hi, > > I am working on a project which integrates diverse and often unreliable > data > sources. I need to check for changes in source data on a daily basis. I > import data into MSSQL 2000 and using checksum functionality I compare > yesterday's snapshot of the data with today's. I run a SProc which > allocates > a checksum value to each record before the comparison. I can therefore > determine which records have changed and further process them. > > But is checksum sophisticated/robust enough to rely on in a production > environment. Testing has not unearthed any problems so far but I was > wondering is my approach fundamentally sound or have I missed something? > > Thanks, > > Dave. > > |
|||||||||||||||||||||||