|
database
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
Toughy Normalization ProblemI'll gladly give you a free source license of GDI+ Architect (if you want one). See mrgsoft dot com for more info about that. I receive a data feed into my SQL Server with the bizarre property that the keys (called tags) can be reused arbitrarily after three days. This is because the tags are being used worldwide, and the system is based on short barcodes which quickly get recirculated. To clarify, a tag might be used after three days, but not necessarily in three day intervals (I can't simply group by every three days using a number table for example, as I might accidently split keys that go together). Tags can be reused again after five days and then again after three days for example. My goal is to normalize this data by creating a new ID that groups tags used within three day periods. Here's some example data that illustrates what I'm up against. RecordedAt Tag 2/1/2006 3:00 PM ABC 2/1/2006 3:30 PM ABC 2/2/2006 4:00 PM FRED 2/2/2006 5:00 PM ABC 5/1/2006 3:00 PM ABC 6/1/2006 4:30 PM ABC 6/2/2006 4:30 PM ABC This desired output would look as follows: GroupID RecordedAt Tag 1 2/1/2006 3:00 PM ABC 1 2/1/2006 3:30 PM ABC 2 2/2/2006 4:00 PM FRED 1 2/2/2006 5:00 PM ABC 3 5/1/2006 3:00 PM ABC 4 6/1/2006 4:30 PM ABC 4 6/2/2006 4:30 PM ABC All tags used "within three days of each other" should go into the same group. If a tag is seen outside of three days of any other tag, it gets its own group. The only limitations I have is performance and the inability to change how data is inserted into the table. In other words, I can't rewrite the insertion process into this table. I can only respond after data gets there. I can alter the underlying table structure as needed, create new tables, etc. For example, adding a rowID would probably be useful in the solution. I worked out a cursor based SP that does this task by taking the rows one at a time, but you can imagine how sluggish it is. Eventually I want to code a trigger that handles the normalization. But meanwhile, I'm just looking for a good script I can execute to populate the table with the groupIDs for existing data sets. The script below will populate a table with the sample data from above. Myself and your luggage thank you. Justin Weinberg MCAD/MCSD .NET jweinberg@spammyoff_mrgsoft.com if exists (select * from sysobjects where id = object_id('ExampleLineItemsJAW') and sysstat & 0xf = 3) drop table ExampleLineItemsJAW GO CREATE TABLE ExampleLineItemsJAW ( GroupID int, RecordedAt datetime not null, Tag char(10) not null, ) GO Set NOCOUNT ON Insert into ExampleLineItemsJAW (RecordedAt, Tag) VALUES ('2/1/2006 3:00 PM', 'ABC') Insert into ExampleLineItemsJAW (RecordedAt, Tag) VALUES ('2/1/2006 3:30 PM', 'ABC') Insert into ExampleLineItemsJAW (RecordedAt, Tag) VALUES ('2/2/2006 4:00 PM', 'FRED') Insert into ExampleLineItemsJAW (RecordedAt, Tag) VALUES ('2/2/2006 5:00 PM', 'ABC') Insert into ExampleLineItemsJAW (RecordedAt, Tag) VALUES ('5/1/2006 3:00 PM', 'ABC') Insert into ExampleLineItemsJAW (RecordedAt, Tag) VALUES ('6/1/2006 4:30 PM', 'ABC') Insert into ExampleLineItemsJAW (RecordedAt, Tag) VALUES ('6/2/2006 4:30 PM', 'ABC') SET NOCOUNT OFF GO Hello, Justin
This script accomplishes the desired result: ALTER TABLE ExampleLineItemsJAW ADD ID int IDENTITY UNIQUE, PreviousID int NULL, PRIMARY KEY (RecordedAt, Tag) GO UPDATE ExampleLineItemsJAW SET PreviousID=( SELECT ID FROM ExampleLineItemsJAW C INNER JOIN ( SELECT MAX(RecordedAt) as PreviousRecording FROM ExampleLineItemsJAW B WHERE A.Tag=B.Tag AND B.RecordedAt<A.RecordedAt AND B.RecordedAt>A.RecordedAt-3 ) X ON C.Tag=A.Tag AND C.RecordedAt=X.PreviousRecording ) FROM ExampleLineItemsJAW A WHERE EXISTS ( SELECT * FROM ExampleLineItemsJAW D WHERE A.Tag=D.Tag AND D.RecordedAt<A.RecordedAt AND D.RecordedAt>A.RecordedAt-3 ) UPDATE ExampleLineItemsJAW SET GroupID=( SELECT COUNT(*) FROM ExampleLineItemsJAW B WHERE B.ID<=A.ID AND B.PreviousID IS NULL ) FROM ExampleLineItemsJAW A WHERE A.PreviousID IS NULL WHILE EXISTS ( SELECT * FROM ExampleLineItemsJAW WHERE GroupID IS NULL ) BEGIN UPDATE ExampleLineItemsJAW SET GroupID=( SELECT B.GroupID FROM ExampleLineItemsJAW B WHERE B.ID=A.PreviousID AND B.GroupID IS NOT NULL ) FROM ExampleLineItemsJAW A WHERE A.GroupID IS NULL END SELECT GroupID, RecordedAt, Tag FROM ExampleLineItemsJAW The algorithm is this: 1. Find the PreviousID for each row, if the previous row with the same tag is not more than 3 days back 2. For each row that has no PreviousID, assign a different GroupID 3. While there are rows with no GroupID-s, fill the GroupID from the previous row The number of executions of last UPDATE will be the maximum number of rows for any tag minus one (in this case, 2 executions because GroupID=1 contains 3 rows). We need to be sure that it's not allowed to have more than one row for the same tag, recorded at the same time, so I added a primary key to enforce this (it also improves the performance of the queries). If performance is an issue, you may want to add an index on PreviousID and/or GroupID (just before the WHILE loop). You should benchmark with your data to see which indexes are the most appropriate. Razvan I wouldn't consider this a normalization problem, and I don't think you can
avoid a scan. Whenever there is a dependency between rows in the same table, you must either use a correlated subquery based on a self-join, or scan through the rows with a cursor. This dependency is a little more complicated because it doesn't necessarily involve adjacent rows (based on the order in which they were recorded). The group each row belongs to is dependent upon the time that the first row in that group was recorded, meaning that you must use TOP, MAX or MIN along with '>=' or '<=' in order to determine which group each row belongs. I think that in this case a cursor will outperform any non-iterative set-based solution, and depending on the execution plan and whether or not a clustered index exists on (Tag, RecordedAt), the performance may be orders of magnitude better with a cursor. A correlated subquery would have to scan the index more than once for each row, whereas you can maintain the first RecordedAt for the current Tag in a local variable as you fetch each row with the cursor, thus accomplishing your goal with only one pass through the table. Show quote "Justin Weinberg" <jweinberg@_spammyoff_mrgsoft.com> wrote in message news:excJfYGNGHA.3100@tk2msftngp13.phx.gbl... > > Figure this out without using a cursor or loop that hits each record, and > I'll gladly give you a free source license of GDI+ Architect (if you want > one). See mrgsoft dot com for more info about that. > > I receive a data feed into my SQL Server with the bizarre property that > the keys (called tags) can be reused arbitrarily after three days. This > is because the tags are being used worldwide, and the system is based on > short barcodes which quickly get recirculated. > > To clarify, a tag might be used after three days, but not necessarily in > three day intervals (I can't simply group by every three days using a > number table for example, as I might accidently split keys that go > together). Tags can be reused again after five days and then again after > three days for example. > > My goal is to normalize this data by creating a new ID that groups tags > used within three day periods. Here's some example data that illustrates > what I'm up against. > > RecordedAt Tag > 2/1/2006 3:00 PM ABC > 2/1/2006 3:30 PM ABC > 2/2/2006 4:00 PM FRED > 2/2/2006 5:00 PM ABC > 5/1/2006 3:00 PM ABC > 6/1/2006 4:30 PM ABC > 6/2/2006 4:30 PM ABC > > This desired output would look as follows: > > GroupID RecordedAt Tag > 1 2/1/2006 3:00 PM ABC > 1 2/1/2006 3:30 PM ABC > 2 2/2/2006 4:00 PM FRED > 1 2/2/2006 5:00 PM ABC > 3 5/1/2006 3:00 PM ABC > 4 6/1/2006 4:30 PM ABC > 4 6/2/2006 4:30 PM ABC > > > All tags used "within three days of each other" should go into the same > group. If a tag is seen outside of three days of any other tag, it gets > its own group. > > The only limitations I have is performance and the inability to change how > data is inserted into the table. In other words, I can't rewrite the > insertion process into this table. I can only respond after data gets > there. > > I can alter the underlying table structure as needed, create new tables, > etc. For example, adding a rowID would probably be useful in the > solution. > > I worked out a cursor based SP that does this task by taking the rows one > at a time, but you can imagine how sluggish it is. Eventually I want to > code a trigger that handles the normalization. But meanwhile, I'm just > looking for a good script I can execute to populate the table with the > groupIDs for existing data sets. > > The script below will populate a table with the sample data from above. > > Myself and your luggage thank you. > > Justin Weinberg > MCAD/MCSD .NET > jweinberg@spammyoff_mrgsoft.com > > > if exists (select * from sysobjects where id = > object_id('ExampleLineItemsJAW') and sysstat & 0xf = 3) > drop table ExampleLineItemsJAW > GO > > CREATE TABLE ExampleLineItemsJAW > ( > GroupID int, > RecordedAt datetime not null, > Tag char(10) not null, > ) > GO > > Set NOCOUNT ON > > Insert into ExampleLineItemsJAW (RecordedAt, Tag) > VALUES ('2/1/2006 3:00 PM', 'ABC') > > Insert into ExampleLineItemsJAW (RecordedAt, Tag) > VALUES ('2/1/2006 3:30 PM', 'ABC') > > Insert into ExampleLineItemsJAW (RecordedAt, Tag) > VALUES ('2/2/2006 4:00 PM', 'FRED') > > Insert into ExampleLineItemsJAW (RecordedAt, Tag) > VALUES ('2/2/2006 5:00 PM', 'ABC') > > Insert into ExampleLineItemsJAW (RecordedAt, Tag) > VALUES ('5/1/2006 3:00 PM', 'ABC') > > Insert into ExampleLineItemsJAW (RecordedAt, Tag) > VALUES ('6/1/2006 4:30 PM', 'ABC') > > Insert into ExampleLineItemsJAW (RecordedAt, Tag) > VALUES ('6/2/2006 4:30 PM', 'ABC') > > SET NOCOUNT OFF > GO > > > > > Justin,
This one was interesting. A couple of points about the solution below. First, while it uses a loop, it will go much faster than a cursor based solution. I have used the general approach many times, it has always been fast. Second, I did not create any indexes on the test data, but with a large data set you will want some. Most important, I assumed that "3 days" meant 36 hours exactly. It is possible that it should have meant Monday+Tuesday+Wednesday, or Tuesday+Wednesday+Thursday, in other words whole days. That would require some adjustments to the date range tests, of course. Hope you like it! Roy CREATE TABLE dbo.Strange ( RecordedAt datetime NOT NULL , Tag varchar (12) NOT NULL , GroupStatus char(1) NOT NULL DEFAULT ' ', GroupStart datetime NULL ) GO --The test data you described INSERT Strange (RecordedAt, Tag) VALUES('2/1/2006 3:00 PM', 'ABC') INSERT Strange (RecordedAt, Tag) VALUES('2/1/2006 3:30 PM', 'ABC') INSERT Strange (RecordedAt, Tag) VALUES('2/2/2006 4:00 PM', 'FRED') INSERT Strange (RecordedAt, Tag) VALUES('2/2/2006 5:00 PM', 'ABC') INSERT Strange (RecordedAt, Tag) VALUES('5/1/2006 3:00 PM', 'ABC') INSERT Strange (RecordedAt, Tag) VALUES('6/1/2006 4:30 PM', 'ABC') INSERT Strange (RecordedAt, Tag) VALUES('6/2/2006 4:30 PM', 'ABC') --We need more, and more difficult, test data. INSERT Strange (RecordedAt, Tag) SELECT DATEADD(day,2,RecordedAt), Tag FROM Strange UNION ALL SELECT DATEADD(day,4,RecordedAt), Tag FROM Strange UNION ALL SELECT DATEADD(day,6,RecordedAt), Tag FROM Strange UNION ALL SELECT DATEADD(day,8,RecordedAt), Tag FROM Strange --What does the data look like? select * from Strange order by Tag, RecordedAt --First pass, identify SOME of the "first-in-a-group" rows. --These are the ones with nothing in the prior three days. UPDATE Strange SET GroupStatus = 'S', --S for Start GroupStart = RecordedAt WHERE GroupStatus = ' ' --All are this way for the first pass AND NOT EXISTS (select * from Strange as X where Strange.Tag = X.Tag and X.GroupStatus = ' ' and X.RecordedAt >= dateadd(day,-3,Strange.RecordedAt) and X.RecordedAt < Strange.RecordedAt) WHILE @@rowcount <> 0 BEGIN --Whatever new start points have been found, the matching --rows for the subsequent three days need to be marked as --matching. UPDATE Strange SET GroupStatus = 'F', --F for Following GroupStart = M.GroupStart FROM Strange, Strange as M WHERE Strange.Tag = M.Tag AND M.GroupStatus = 'S' AND Strange.GroupStatus = ' ' AND Strange.RecordedAt >= M.RecordedAt AND Strange.RecordedAt <= DATEADD(day,3,M.RecordedAt) --Now we mark new starting points, but we ignore rows --that have already been assigned to batches. UPDATE Strange SET GroupStatus = 'S', --S for Start GroupStart = RecordedAt WHERE GroupStatus = ' ' AND NOT EXISTS (select * from Strange as X where Strange.Tag = X.Tag and X.GroupStatus = ' ' and X.RecordedAt >= dateadd(day,-3,Strange.RecordedAt) and X.RecordedAt < Strange.RecordedAt) --If we find no new batches, we are done. Or at least, should be. END SELECT Tag, GroupStart, MAX(RecordedAt) as LastInGroup, count(*) as Rows, count(*) - count(GroupStart) as Unassigned FROM Strange GROUP BY Tag, GroupStart ORDER BY Tag, GroupStart On Sat, 18 Feb 2006 02:29:56 -0600, "Justin Weinberg" <jweinberg@_spammyoff_mrgsoft.com> wrote: Show quote > >Figure this out without using a cursor or loop that hits each record, and >I'll gladly give you a free source license of GDI+ Architect (if you want >one). See mrgsoft dot com for more info about that. > >I receive a data feed into my SQL Server with the bizarre property that the >keys (called tags) can be reused arbitrarily after three days. This is >because the tags are being used worldwide, and the system is based on short >barcodes which quickly get recirculated. > >To clarify, a tag might be used after three days, but not necessarily in >three day intervals (I can't simply group by every three days using a number >table for example, as I might accidently split keys that go together). Tags >can be reused again after five days and then again after three days for >example. > >My goal is to normalize this data by creating a new ID that groups tags used >within three day periods. Here's some example data that illustrates what >I'm up against. > >RecordedAt Tag >2/1/2006 3:00 PM ABC >2/1/2006 3:30 PM ABC >2/2/2006 4:00 PM FRED >2/2/2006 5:00 PM ABC >5/1/2006 3:00 PM ABC >6/1/2006 4:30 PM ABC >6/2/2006 4:30 PM ABC > >This desired output would look as follows: > > GroupID RecordedAt Tag > 1 2/1/2006 3:00 PM ABC > 1 2/1/2006 3:30 PM ABC > 2 2/2/2006 4:00 PM FRED > 1 2/2/2006 5:00 PM ABC > 3 5/1/2006 3:00 PM ABC > 4 6/1/2006 4:30 PM ABC > 4 6/2/2006 4:30 PM ABC > > >All tags used "within three days of each other" should go into the same >group. If a tag is seen outside of three days of any other tag, it gets its >own group. > >The only limitations I have is performance and the inability to change how >data is inserted into the table. In other words, I can't rewrite the >insertion process into this table. I can only respond after data gets >there. > >I can alter the underlying table structure as needed, create new tables, >etc. For example, adding a rowID would probably be useful in the solution. > >I worked out a cursor based SP that does this task by taking the rows one at >a time, but you can imagine how sluggish it is. Eventually I want to code a >trigger that handles the normalization. But meanwhile, I'm just looking for >a good script I can execute to populate the table with the groupIDs for >existing data sets. > >The script below will populate a table with the sample data from above. > >Myself and your luggage thank you. > >Justin Weinberg >MCAD/MCSD .NET >jweinberg@spammyoff_mrgsoft.com > > >if exists (select * from sysobjects where id = >object_id('ExampleLineItemsJAW') and sysstat & 0xf = 3) > drop table ExampleLineItemsJAW >GO > >CREATE TABLE ExampleLineItemsJAW >( > GroupID int, > RecordedAt datetime not null, > Tag char(10) not null, >) >GO > >Set NOCOUNT ON > >Insert into ExampleLineItemsJAW (RecordedAt, Tag) >VALUES ('2/1/2006 3:00 PM', 'ABC') > >Insert into ExampleLineItemsJAW (RecordedAt, Tag) >VALUES ('2/1/2006 3:30 PM', 'ABC') > >Insert into ExampleLineItemsJAW (RecordedAt, Tag) >VALUES ('2/2/2006 4:00 PM', 'FRED') > >Insert into ExampleLineItemsJAW (RecordedAt, Tag) >VALUES ('2/2/2006 5:00 PM', 'ABC') > >Insert into ExampleLineItemsJAW (RecordedAt, Tag) >VALUES ('5/1/2006 3:00 PM', 'ABC') > >Insert into ExampleLineItemsJAW (RecordedAt, Tag) >VALUES ('6/1/2006 4:30 PM', 'ABC') > >Insert into ExampleLineItemsJAW (RecordedAt, Tag) >VALUES ('6/2/2006 4:30 PM', 'ABC') > >SET NOCOUNT OFF >GO > > > > I'm always impressed by not only the generosity, but the intelligence of the
people who respond to questions on the sql server programming group. Razvan and Roy, I can't think either of you enough for taking the time to think this one through. Both of you provided thoughtful compelling solutions. This really effects the bottom line on my deadlines and the system performance as a whole. Please, either of you contact me if you want to take me up on that GDI+ Architect source code license. Just as an aside to Brian, the reason I classified this as a normalization problem is because I believe that the data violates fourth normal form as it has multivalue dependencies. Again, thanks to all of the posters. Justin Weinberg MCAD/MCSD .NET jweinberg@spammyoff_mrgsoft.com Show quote > Justin, > > This one was interesting. > > A couple of points about the solution below. First, while it uses a > loop, it will go much faster than a cursor based solution. I have > used the general approach many times, it has always been fast. > > Second, I did not create any indexes on the test data, but with a > large data set you will want some. > > Most important, I assumed that "3 days" meant 36 hours exactly. It is > possible that it should have meant Monday+Tuesday+Wednesday, or > Tuesday+Wednesday+Thursday, in other words whole days. That would > require some adjustments to the date range tests, of course. > > Hope you like it! > > Roy > > CREATE TABLE dbo.Strange ( > RecordedAt datetime NOT NULL , > Tag varchar (12) NOT NULL , > GroupStatus char(1) NOT NULL DEFAULT ' ', > GroupStart datetime NULL > ) > GO > > --The test data you described > INSERT Strange (RecordedAt, Tag) VALUES('2/1/2006 3:00 PM', 'ABC') > INSERT Strange (RecordedAt, Tag) VALUES('2/1/2006 3:30 PM', 'ABC') > INSERT Strange (RecordedAt, Tag) VALUES('2/2/2006 4:00 PM', 'FRED') > INSERT Strange (RecordedAt, Tag) VALUES('2/2/2006 5:00 PM', 'ABC') > INSERT Strange (RecordedAt, Tag) VALUES('5/1/2006 3:00 PM', 'ABC') > INSERT Strange (RecordedAt, Tag) VALUES('6/1/2006 4:30 PM', 'ABC') > INSERT Strange (RecordedAt, Tag) VALUES('6/2/2006 4:30 PM', 'ABC') > > --We need more, and more difficult, test data. > INSERT Strange (RecordedAt, Tag) > SELECT DATEADD(day,2,RecordedAt), Tag FROM Strange > UNION ALL > SELECT DATEADD(day,4,RecordedAt), Tag FROM Strange > UNION ALL > SELECT DATEADD(day,6,RecordedAt), Tag FROM Strange > UNION ALL > SELECT DATEADD(day,8,RecordedAt), Tag FROM Strange > > --What does the data look like? > select * > from Strange > order by Tag, RecordedAt > > --First pass, identify SOME of the "first-in-a-group" rows. > --These are the ones with nothing in the prior three days. > UPDATE Strange > SET GroupStatus = 'S', --S for Start > GroupStart = RecordedAt > WHERE GroupStatus = ' ' --All are this way for the first pass > AND NOT EXISTS > (select * from Strange as X > where Strange.Tag = X.Tag > and X.GroupStatus = ' ' > and X.RecordedAt >= dateadd(day,-3,Strange.RecordedAt) > and X.RecordedAt < Strange.RecordedAt) > > WHILE @@rowcount <> 0 > BEGIN > --Whatever new start points have been found, the matching > --rows for the subsequent three days need to be marked as > --matching. > UPDATE Strange > SET GroupStatus = 'F', --F for Following > GroupStart = M.GroupStart > FROM Strange, Strange as M > WHERE Strange.Tag = M.Tag > AND M.GroupStatus = 'S' > AND Strange.GroupStatus = ' ' > AND Strange.RecordedAt >= M.RecordedAt > AND Strange.RecordedAt <= DATEADD(day,3,M.RecordedAt) > > --Now we mark new starting points, but we ignore rows > --that have already been assigned to batches. > UPDATE Strange > SET GroupStatus = 'S', --S for Start > GroupStart = RecordedAt > WHERE GroupStatus = ' ' > AND NOT EXISTS > (select * from Strange as X > where Strange.Tag = X.Tag > and X.GroupStatus = ' ' > and X.RecordedAt >= dateadd(day,-3,Strange.RecordedAt) > and X.RecordedAt < Strange.RecordedAt) > --If we find no new batches, we are done. Or at least, should be. > END > > SELECT Tag, > GroupStart, > MAX(RecordedAt) as LastInGroup, > count(*) as Rows, > count(*) - count(GroupStart) as Unassigned > FROM Strange > GROUP BY Tag, > GroupStart > ORDER BY Tag, > GroupStart > > > On Sat, 18 Feb 2006 02:29:56 -0600, "Justin Weinberg" > <jweinberg@_spammyoff_mrgsoft.com> wrote: > >> >>Figure this out without using a cursor or loop that hits each record, and >>I'll gladly give you a free source license of GDI+ Architect (if you want >>one). See mrgsoft dot com for more info about that. >> >>I receive a data feed into my SQL Server with the bizarre property that >>the >>keys (called tags) can be reused arbitrarily after three days. This is >>because the tags are being used worldwide, and the system is based on >>short >>barcodes which quickly get recirculated. >> >>To clarify, a tag might be used after three days, but not necessarily in >>three day intervals (I can't simply group by every three days using a >>number >>table for example, as I might accidently split keys that go together). >>Tags >>can be reused again after five days and then again after three days for >>example. >> >>My goal is to normalize this data by creating a new ID that groups tags >>used >>within three day periods. Here's some example data that illustrates what >>I'm up against. >> >>RecordedAt Tag >>2/1/2006 3:00 PM ABC >>2/1/2006 3:30 PM ABC >>2/2/2006 4:00 PM FRED >>2/2/2006 5:00 PM ABC >>5/1/2006 3:00 PM ABC >>6/1/2006 4:30 PM ABC >>6/2/2006 4:30 PM ABC >> >>This desired output would look as follows: >> >> GroupID RecordedAt Tag >> 1 2/1/2006 3:00 PM ABC >> 1 2/1/2006 3:30 PM ABC >> 2 2/2/2006 4:00 PM FRED >> 1 2/2/2006 5:00 PM ABC >> 3 5/1/2006 3:00 PM ABC >> 4 6/1/2006 4:30 PM ABC >> 4 6/2/2006 4:30 PM ABC >> >> >>All tags used "within three days of each other" should go into the same >>group. If a tag is seen outside of three days of any other tag, it gets >>its >>own group. >> >>The only limitations I have is performance and the inability to change how >>data is inserted into the table. In other words, I can't rewrite the >>insertion process into this table. I can only respond after data gets >>there. >> >>I can alter the underlying table structure as needed, create new tables, >>etc. For example, adding a rowID would probably be useful in the >>solution. >> >>I worked out a cursor based SP that does this task by taking the rows one >>at >>a time, but you can imagine how sluggish it is. Eventually I want to code >>a >>trigger that handles the normalization. But meanwhile, I'm just looking >>for >>a good script I can execute to populate the table with the groupIDs for >>existing data sets. >> >>The script below will populate a table with the sample data from above. >> >>Myself and your luggage thank you. >> >>Justin Weinberg >>MCAD/MCSD .NET >>jweinberg@spammyoff_mrgsoft.com >> >> >>if exists (select * from sysobjects where id = >>object_id('ExampleLineItemsJAW') and sysstat & 0xf = 3) >> drop table ExampleLineItemsJAW >>GO >> >>CREATE TABLE ExampleLineItemsJAW >>( >> GroupID int, >> RecordedAt datetime not null, >> Tag char(10) not null, >>) >>GO >> >>Set NOCOUNT ON >> >>Insert into ExampleLineItemsJAW (RecordedAt, Tag) >>VALUES ('2/1/2006 3:00 PM', 'ABC') >> >>Insert into ExampleLineItemsJAW (RecordedAt, Tag) >>VALUES ('2/1/2006 3:30 PM', 'ABC') >> >>Insert into ExampleLineItemsJAW (RecordedAt, Tag) >>VALUES ('2/2/2006 4:00 PM', 'FRED') >> >>Insert into ExampleLineItemsJAW (RecordedAt, Tag) >>VALUES ('2/2/2006 5:00 PM', 'ABC') >> >>Insert into ExampleLineItemsJAW (RecordedAt, Tag) >>VALUES ('5/1/2006 3:00 PM', 'ABC') >> >>Insert into ExampleLineItemsJAW (RecordedAt, Tag) >>VALUES ('6/1/2006 4:30 PM', 'ABC') >> >>Insert into ExampleLineItemsJAW (RecordedAt, Tag) >>VALUES ('6/2/2006 4:30 PM', 'ABC') >> >>SET NOCOUNT OFF >>GO >> >> >> >> My understanding of a multi-valued dependency is much different from what
you've posted as an example. The main difference between a functional dependency and a multi-valued dependency is that the dependent for a functional dependency is a single value whereas the dependent for a multi-valued dependency is a well-defined set of values. The determinant and the determinant alone identifies the set of values. In your example, the values in other tuples play a part in determining the set of acceptable values. In addition, I believe that there must be at least three attributes in order for a multi-valued dependency to exist in a relation. As I see it, the dependency in your example is more in the nature of a temporal constraint. Essentially, the values in one tuple determine what is acceptable in other tuples, and the one of the attributes is a timestamp. By the way, I still think you'll get better performance with a cursor. Show quote "Justin Weinberg" <jweinberg@_spammyoff_mrgsoft.com> wrote in message news:%237BnaVNNGHA.984@tk2msftngp13.phx.gbl... > I'm always impressed by not only the generosity, but the intelligence of > the people who respond to questions on the sql server programming group. > > Razvan and Roy, I can't think either of you enough for taking the time to > think this one through. Both of you provided thoughtful compelling > solutions. > > This really effects the bottom line on my deadlines and the system > performance as a whole. Please, either of you contact me if you want to > take me up on that GDI+ Architect source code license. > > Just as an aside to Brian, the reason I classified this as a normalization > problem is because I believe that the data violates fourth normal form as > it has multivalue dependencies. > > Again, thanks to all of the posters. > > Justin Weinberg > MCAD/MCSD .NET > jweinberg@spammyoff_mrgsoft.com > > >> Justin, >> >> This one was interesting. >> >> A couple of points about the solution below. First, while it uses a >> loop, it will go much faster than a cursor based solution. I have >> used the general approach many times, it has always been fast. >> >> Second, I did not create any indexes on the test data, but with a >> large data set you will want some. >> >> Most important, I assumed that "3 days" meant 36 hours exactly. It is >> possible that it should have meant Monday+Tuesday+Wednesday, or >> Tuesday+Wednesday+Thursday, in other words whole days. That would >> require some adjustments to the date range tests, of course. >> >> Hope you like it! >> >> Roy >> >> CREATE TABLE dbo.Strange ( >> RecordedAt datetime NOT NULL , >> Tag varchar (12) NOT NULL , >> GroupStatus char(1) NOT NULL DEFAULT ' ', >> GroupStart datetime NULL >> ) >> GO >> >> --The test data you described >> INSERT Strange (RecordedAt, Tag) VALUES('2/1/2006 3:00 PM', 'ABC') >> INSERT Strange (RecordedAt, Tag) VALUES('2/1/2006 3:30 PM', 'ABC') >> INSERT Strange (RecordedAt, Tag) VALUES('2/2/2006 4:00 PM', 'FRED') >> INSERT Strange (RecordedAt, Tag) VALUES('2/2/2006 5:00 PM', 'ABC') >> INSERT Strange (RecordedAt, Tag) VALUES('5/1/2006 3:00 PM', 'ABC') >> INSERT Strange (RecordedAt, Tag) VALUES('6/1/2006 4:30 PM', 'ABC') >> INSERT Strange (RecordedAt, Tag) VALUES('6/2/2006 4:30 PM', 'ABC') >> >> --We need more, and more difficult, test data. >> INSERT Strange (RecordedAt, Tag) >> SELECT DATEADD(day,2,RecordedAt), Tag FROM Strange >> UNION ALL >> SELECT DATEADD(day,4,RecordedAt), Tag FROM Strange >> UNION ALL >> SELECT DATEADD(day,6,RecordedAt), Tag FROM Strange >> UNION ALL >> SELECT DATEADD(day,8,RecordedAt), Tag FROM Strange >> >> --What does the data look like? >> select * >> from Strange >> order by Tag, RecordedAt >> >> --First pass, identify SOME of the "first-in-a-group" rows. >> --These are the ones with nothing in the prior three days. >> UPDATE Strange >> SET GroupStatus = 'S', --S for Start >> GroupStart = RecordedAt >> WHERE GroupStatus = ' ' --All are this way for the first pass >> AND NOT EXISTS >> (select * from Strange as X >> where Strange.Tag = X.Tag >> and X.GroupStatus = ' ' >> and X.RecordedAt >= dateadd(day,-3,Strange.RecordedAt) >> and X.RecordedAt < Strange.RecordedAt) >> >> WHILE @@rowcount <> 0 >> BEGIN >> --Whatever new start points have been found, the matching >> --rows for the subsequent three days need to be marked as >> --matching. >> UPDATE Strange >> SET GroupStatus = 'F', --F for Following >> GroupStart = M.GroupStart >> FROM Strange, Strange as M >> WHERE Strange.Tag = M.Tag >> AND M.GroupStatus = 'S' >> AND Strange.GroupStatus = ' ' >> AND Strange.RecordedAt >= M.RecordedAt >> AND Strange.RecordedAt <= DATEADD(day,3,M.RecordedAt) >> >> --Now we mark new starting points, but we ignore rows >> --that have already been assigned to batches. >> UPDATE Strange >> SET GroupStatus = 'S', --S for Start >> GroupStart = RecordedAt >> WHERE GroupStatus = ' ' >> AND NOT EXISTS >> (select * from Strange as X >> where Strange.Tag = X.Tag >> and X.GroupStatus = ' ' >> and X.RecordedAt >= dateadd(day,-3,Strange.RecordedAt) >> and X.RecordedAt < Strange.RecordedAt) >> --If we find no new batches, we are done. Or at least, should be. >> END >> >> SELECT Tag, >> GroupStart, >> MAX(RecordedAt) as LastInGroup, >> count(*) as Rows, >> count(*) - count(GroupStart) as Unassigned >> FROM Strange >> GROUP BY Tag, >> GroupStart >> ORDER BY Tag, >> GroupStart >> >> >> On Sat, 18 Feb 2006 02:29:56 -0600, "Justin Weinberg" >> <jweinberg@_spammyoff_mrgsoft.com> wrote: >> >>> >>>Figure this out without using a cursor or loop that hits each record, and >>>I'll gladly give you a free source license of GDI+ Architect (if you want >>>one). See mrgsoft dot com for more info about that. >>> >>>I receive a data feed into my SQL Server with the bizarre property that >>>the >>>keys (called tags) can be reused arbitrarily after three days. This is >>>because the tags are being used worldwide, and the system is based on >>>short >>>barcodes which quickly get recirculated. >>> >>>To clarify, a tag might be used after three days, but not necessarily in >>>three day intervals (I can't simply group by every three days using a >>>number >>>table for example, as I might accidently split keys that go together). >>>Tags >>>can be reused again after five days and then again after three days for >>>example. >>> >>>My goal is to normalize this data by creating a new ID that groups tags >>>used >>>within three day periods. Here's some example data that illustrates what >>>I'm up against. >>> >>>RecordedAt Tag >>>2/1/2006 3:00 PM ABC >>>2/1/2006 3:30 PM ABC >>>2/2/2006 4:00 PM FRED >>>2/2/2006 5:00 PM ABC >>>5/1/2006 3:00 PM ABC >>>6/1/2006 4:30 PM ABC >>>6/2/2006 4:30 PM ABC >>> >>>This desired output would look as follows: >>> >>> GroupID RecordedAt Tag >>> 1 2/1/2006 3:00 PM ABC >>> 1 2/1/2006 3:30 PM ABC >>> 2 2/2/2006 4:00 PM FRED >>> 1 2/2/2006 5:00 PM ABC >>> 3 5/1/2006 3:00 PM ABC >>> 4 6/1/2006 4:30 PM ABC >>> 4 6/2/2006 4:30 PM ABC >>> >>> >>>All tags used "within three days of each other" should go into the same >>>group. If a tag is seen outside of three days of any other tag, it gets >>>its >>>own group. >>> >>>The only limitations I have is performance and the inability to change >>>how >>>data is inserted into the table. In other words, I can't rewrite the >>>insertion process into this table. I can only respond after data gets >>>there. >>> >>>I can alter the underlying table structure as needed, create new tables, >>>etc. For example, adding a rowID would probably be useful in the >>>solution. >>> >>>I worked out a cursor based SP that does this task by taking the rows one >>>at >>>a time, but you can imagine how sluggish it is. Eventually I want to >>>code a >>>trigger that handles the normalization. But meanwhile, I'm just looking >>>for >>>a good script I can execute to populate the table with the groupIDs for >>>existing data sets. >>> >>>The script below will populate a table with the sample data from above. >>> >>>Myself and your luggage thank you. >>> >>>Justin Weinberg >>>MCAD/MCSD .NET >>>jweinberg@spammyoff_mrgsoft.com >>> >>> >>>if exists (select * from sysobjects where id = >>>object_id('ExampleLineItemsJAW') and sysstat & 0xf = 3) >>> drop table ExampleLineItemsJAW >>>GO >>> >>>CREATE TABLE ExampleLineItemsJAW >>>( >>> GroupID int, >>> RecordedAt datetime not null, >>> Tag char(10) not null, >>>) >>>GO >>> >>>Set NOCOUNT ON >>> >>>Insert into ExampleLineItemsJAW (RecordedAt, Tag) >>>VALUES ('2/1/2006 3:00 PM', 'ABC') >>> >>>Insert into ExampleLineItemsJAW (RecordedAt, Tag) >>>VALUES ('2/1/2006 3:30 PM', 'ABC') >>> >>>Insert into ExampleLineItemsJAW (RecordedAt, Tag) >>>VALUES ('2/2/2006 4:00 PM', 'FRED') >>> >>>Insert into ExampleLineItemsJAW (RecordedAt, Tag) >>>VALUES ('2/2/2006 5:00 PM', 'ABC') >>> >>>Insert into ExampleLineItemsJAW (RecordedAt, Tag) >>>VALUES ('5/1/2006 3:00 PM', 'ABC') >>> >>>Insert into ExampleLineItemsJAW (RecordedAt, Tag) >>>VALUES ('6/1/2006 4:30 PM', 'ABC') >>> >>>Insert into ExampleLineItemsJAW (RecordedAt, Tag) >>>VALUES ('6/2/2006 4:30 PM', 'ABC') >>> >>>SET NOCOUNT OFF >>>GO >>> >>> >>> >>> > > Do:
UPDATE tbl SET GroupID = ( SELECT COUNT(DISTINCT New) FROM ( SELECT t1.RecordedAt, t1.Tag, MIN( t2.RecordedAt ) FROM tbl t1 INNER JOIN tbl t2 ON t2.Tag = t1.Tag WHERE DATEDIFF( d, t2.RecordedAt, t1.RecordedAt ) <= 3 GROUP BY t1.RecordedAt, t1.Tag ) v1 ( RecordedAt, Tag, New ) WHERE v1.New <= ( SELECT MIN( t3.RecordedAt ) FROM tbl t3 WHERE t3.Tag = tbl.Tag AND DATEDIFF( d, t3.RecordedAt, tbl.RecordedAt ) <= 3 ) ) ; The easier way to understand this query is to write them as simple SELECT statements. Perhaps a view will make it easier to understand: CREATE VIEW vw ( record_time, tag, group_time ) AS SELECT t1.RecordedAt, t1.Tag, MIN( t2.RecordedAt ) FROM tbl t1 INNER JOIN tbl t2 ON t2.Tag = t1.Tag WHERE DATEDIFF( d, t2.RecordedAt, t1.RecordedAt ) <= 3 GROUP BY t1.RecordedAt, t1.Tag ; Now the resultset from the following query based on this view will have the values that will make up the GroupId column: SELECT RecordedAt, Tag, ( SELECT COUNT( DISTINCT New ) FROM vw v1 WHERE v1.New <= vw.New ) FROM vw ; In 2005, you can make the code a bit more simpler with a WITH CTE clause. -- Anith This is very cleaver stuff.The simple rank even fell into place for you.
I wonder if you saw this prior to getting the starting dates or you just got lucky and it fell into your lap:) Just kidding its very,very good. You were a little long winded in trying to explain it though:) I even took a crack at it with the idea that a rank could be constructed based on the difference between consecutive dates based on sort of tag,date and a running sum of the differences.So far my toy is just off:) Justin,
I'm not sure what your requirements are, but maybe you could use 2 tables: CREATE TABLE Tag ( AvailableAt datetime not null, Tag char(10) not null, ) CREATE TABLE TagUsage ( UsedAt datetime not null, Tag char(10) not null, ) Every time a tag is used, you insert a row into TagUsed and up AvailableAt in Tag. Another version of the same idea:
CREATE TABLE AvailableTags (available_date DATETIME NOT NULL, tag_nbr CHAR(10) NOT NULL, PRIMARY KEY (neartest_available_date, tag_nbr)); Look at the CURRENT_TIMESTAMP, and return rows that are now available. Grab the one that has been available the longest. Then UPDATE the available_date with the three-day formula. This gets us down to one table and a quick way to ask about the re-cycling. First iteration, don't put the GroupId in each record at all.
Instead, create a second table like: create table mygroups ( groupid int, startdate datetime, enddate datetime ) Populate table. When you want to know a data row's group, just search this small table by date. (actually, that's a little tricky, since you're effectively trying to join a specific date to a table that has only a range. I did this for a guy the other day, and he had some freaky SQL that joined on a between, or something like that, and it worked - slowly! I had to munge some freaky SQL to get the performance, but I'm leaving this as an exercise for the reader, it's Saturday, for gosh sakes! Maybe Celko already has it in a book, I dunno.) Second iteration, if you really want to avoid the joins at runtime, denormalize your database by putting the groupid field into your data table after all. A reasonably simple update should match each row to the mygroups table and do what you need. You could also do that trick in a trigger at data row create time. J. On Sat, 18 Feb 2006 02:29:56 -0600, "Justin Weinberg" <jweinberg@_spammyoff_mrgsoft.com> wrote: Show quote >Figure this out without using a cursor or loop that hits each record, and >I'll gladly give you a free source license of GDI+ Architect (if you want >one). See mrgsoft dot com for more info about that. > >I receive a data feed into my SQL Server with the bizarre property that the >keys (called tags) can be reused arbitrarily after three days. This is >because the tags are being used worldwide, and the system is based on short >barcodes which quickly get recirculated. > >To clarify, a tag might be used after three days, but not necessarily in >three day intervals (I can't simply group by every three days using a number >table for example, as I might accidently split keys that go together). Tags >can be reused again after five days and then again after three days for >example. > >My goal is to normalize this data by creating a new ID that groups tags used >within three day periods. Here's some example data that illustrates what >I'm up against. > >RecordedAt Tag >2/1/2006 3:00 PM ABC >2/1/2006 3:30 PM ABC >2/2/2006 4:00 PM FRED >2/2/2006 5:00 PM ABC >5/1/2006 3:00 PM ABC >6/1/2006 4:30 PM ABC >6/2/2006 4:30 PM ABC > >This desired output would look as follows: > > GroupID RecordedAt Tag > 1 2/1/2006 3:00 PM ABC > 1 2/1/2006 3:30 PM ABC > 2 2/2/2006 4:00 PM FRED > 1 2/2/2006 5:00 PM ABC > 3 5/1/2006 3:00 PM ABC > 4 6/1/2006 4:30 PM ABC > 4 6/2/2006 4:30 PM ABC > > >All tags used "within three days of each other" should go into the same >group. If a tag is seen outside of three days of any other tag, it gets its >own group. > >The only limitations I have is performance and the inability to change how >data is inserted into the table. In other words, I can't rewrite the >insertion process into this table. I can only respond after data gets >there. > >I can alter the underlying table structure as needed, create new tables, >etc. For example, adding a rowID would probably be useful in the solution. > >I worked out a cursor based SP that does this task by taking the rows one at >a time, but you can imagine how sluggish it is. Eventually I want to code a >trigger that handles the normalization. But meanwhile, I'm just looking for >a good script I can execute to populate the table with the groupIDs for >existing data sets. > >The script below will populate a table with the sample data from above. > >Myself and your luggage thank you. > >Justin Weinberg >MCAD/MCSD .NET >jweinberg@spammyoff_mrgsoft.com > > >if exists (select * from sysobjects where id = >object_id('ExampleLineItemsJAW') and sysstat & 0xf = 3) > drop table ExampleLineItemsJAW >GO > >CREATE TABLE ExampleLineItemsJAW >( > GroupID int, > RecordedAt datetime not null, > Tag char(10) not null, >) >GO > >Set NOCOUNT ON > >Insert into ExampleLineItemsJAW (RecordedAt, Tag) >VALUES ('2/1/2006 3:00 PM', 'ABC') > >Insert into ExampleLineItemsJAW (RecordedAt, Tag) >VALUES ('2/1/2006 3:30 PM', 'ABC') > >Insert into ExampleLineItemsJAW (RecordedAt, Tag) >VALUES ('2/2/2006 4:00 PM', 'FRED') > >Insert into ExampleLineItemsJAW (RecordedAt, Tag) >VALUES ('2/2/2006 5:00 PM', 'ABC') > >Insert into ExampleLineItemsJAW (RecordedAt, Tag) >VALUES ('5/1/2006 3:00 PM', 'ABC') > >Insert into ExampleLineItemsJAW (RecordedAt, Tag) >VALUES ('6/1/2006 4:30 PM', 'ABC') > >Insert into ExampleLineItemsJAW (RecordedAt, Tag) >VALUES ('6/2/2006 4:30 PM', 'ABC') > >SET NOCOUNT OFF >GO > > > > Hi There,
I tried to comeout with this solution. See if it can help you. Select Tag ,identity(int,1,1) MyID into dummyTable From ExampleLineItemsJAW Group By Tag Update XY1 Set GroupID = datediff(dd,BaseDate,RecordedAt)/3 +MyID --Select datediff(dd,BaseDate,RecordedAt)/3 +MyID,* >From ExampleLineItemsJAW XY1 Inner Join (Select Tag,Min(convert(varchar,RecordedAt,112)) BaseDate FromExampleLineItemsJAW Group By Tag) XY On XY.Tag = XY1.Tag Inner Join DummyTable DM on DM.Tag = XY1.Tag Select * From ExampleLineItemsJAW With Warm regards Jatinder Singh |
|||||||||||||||||||||||