Home All Groups Group Topic Archive Search About

Toughy Normalization Problem

Author
18 Feb 2006 8:29 AM
Justin Weinberg
Figure this out without using a cursor or loop that hits each record, and
I'll gladly give you a free source license of GDI+ Architect (if you want
one). See mrgsoft dot com for more info about that.

I receive a data feed into my SQL Server with the bizarre property that the
keys (called tags) can be reused arbitrarily after three days.  This is
because the tags are being used worldwide, and the system is based on short
barcodes which quickly get recirculated.

To clarify, a tag might be used after three days, but not necessarily in
three day intervals (I can't simply group by every three days using a number
table for example, as I might accidently split keys that go together).  Tags
can be reused again after five days and then again after three days for
example.

My goal is to normalize this data by creating a new ID that groups tags used
within three day periods.  Here's some example data that illustrates what
I'm up against.

RecordedAt                Tag
2/1/2006 3:00 PM      ABC
2/1/2006 3:30 PM      ABC
2/2/2006 4:00 PM      FRED
2/2/2006 5:00 PM      ABC
5/1/2006 3:00 PM      ABC
6/1/2006 4:30 PM      ABC
6/2/2006 4:30 PM      ABC

This desired output would look as follows:

  GroupID        RecordedAt             Tag
      1            2/1/2006 3:00 PM      ABC
      1            2/1/2006 3:30 PM      ABC
      2            2/2/2006 4:00 PM      FRED
      1            2/2/2006 5:00 PM      ABC
      3            5/1/2006 3:00 PM      ABC
      4            6/1/2006 4:30 PM      ABC
      4            6/2/2006 4:30 PM      ABC


All tags used "within three days of each other" should go into the same
group. If a tag is seen outside of three days of any other tag, it gets its
own group.

The only limitations I have is performance and the inability to change how
data is inserted into the table.  In other words, I can't rewrite the
insertion process into this table.  I can only respond after data gets
there.

I can alter the underlying table structure as needed, create new tables,
etc.   For example, adding a rowID would probably be useful in the solution.

I worked out a cursor based SP that does this task by taking the rows one at
a time, but you can imagine how sluggish it is.  Eventually I want to code a
trigger that handles the normalization.  But meanwhile, I'm just looking for
a good script I can execute to populate the table with the groupIDs for
existing data sets.

The script below will populate a table with the sample data from above.

Myself and your luggage thank you.

Justin Weinberg
MCAD/MCSD .NET
jweinberg@spammyoff_mrgsoft.com


if exists (select * from sysobjects where id =
object_id('ExampleLineItemsJAW') and sysstat & 0xf = 3)
drop table ExampleLineItemsJAW
GO

CREATE TABLE ExampleLineItemsJAW
(
GroupID int,
RecordedAt datetime not null,
Tag char(10) not null,
)
GO

Set NOCOUNT ON

Insert into ExampleLineItemsJAW (RecordedAt, Tag)
VALUES ('2/1/2006 3:00 PM', 'ABC')

Insert into ExampleLineItemsJAW (RecordedAt, Tag)
VALUES ('2/1/2006 3:30 PM', 'ABC')

Insert into ExampleLineItemsJAW (RecordedAt, Tag)
VALUES ('2/2/2006 4:00 PM', 'FRED')

Insert into ExampleLineItemsJAW (RecordedAt, Tag)
VALUES ('2/2/2006 5:00 PM', 'ABC')

Insert into ExampleLineItemsJAW (RecordedAt, Tag)
VALUES ('5/1/2006 3:00 PM', 'ABC')

Insert into ExampleLineItemsJAW (RecordedAt, Tag)
VALUES ('6/1/2006 4:30 PM', 'ABC')

Insert into ExampleLineItemsJAW (RecordedAt, Tag)
VALUES ('6/2/2006 4:30 PM', 'ABC')

SET NOCOUNT OFF
GO

Author
18 Feb 2006 9:51 AM
Razvan Socol
Hello, Justin

This script accomplishes the desired result:

ALTER TABLE ExampleLineItemsJAW ADD
    ID int IDENTITY UNIQUE,
    PreviousID int NULL,
    PRIMARY KEY (RecordedAt, Tag)

GO
UPDATE ExampleLineItemsJAW SET PreviousID=(
        SELECT ID FROM ExampleLineItemsJAW C
        INNER JOIN (
            SELECT MAX(RecordedAt) as PreviousRecording
            FROM ExampleLineItemsJAW B
            WHERE A.Tag=B.Tag
            AND B.RecordedAt<A.RecordedAt
            AND B.RecordedAt>A.RecordedAt-3
        ) X ON C.Tag=A.Tag
        AND C.RecordedAt=X.PreviousRecording
    )
FROM ExampleLineItemsJAW A
WHERE EXISTS (
    SELECT * FROM ExampleLineItemsJAW D
    WHERE A.Tag=D.Tag
    AND D.RecordedAt<A.RecordedAt
    AND D.RecordedAt>A.RecordedAt-3
)

UPDATE ExampleLineItemsJAW SET GroupID=(
        SELECT COUNT(*) FROM ExampleLineItemsJAW B
        WHERE B.ID<=A.ID
        AND B.PreviousID IS NULL
    )
FROM ExampleLineItemsJAW A
WHERE A.PreviousID IS NULL

WHILE EXISTS (
    SELECT * FROM ExampleLineItemsJAW
    WHERE GroupID IS NULL
) BEGIN
    UPDATE ExampleLineItemsJAW SET GroupID=(
            SELECT B.GroupID FROM ExampleLineItemsJAW B
            WHERE B.ID=A.PreviousID
            AND B.GroupID IS NOT NULL
        )
    FROM ExampleLineItemsJAW A
    WHERE A.GroupID IS NULL
END

SELECT GroupID, RecordedAt, Tag FROM ExampleLineItemsJAW

The algorithm is this:
1. Find the PreviousID for each row, if the previous row with the same
tag is not more than 3 days back
2. For each row that has no PreviousID, assign a different GroupID
3. While there are rows with no GroupID-s, fill the GroupID from the
previous row

The number of executions of last UPDATE will be the maximum number of
rows for any tag minus one (in this case, 2 executions because
GroupID=1 contains 3 rows).

We need to be sure that it's not allowed to have more than one row for
the same tag, recorded at the same time, so I added a primary key to
enforce this (it also improves the performance of the queries). If
performance is an issue, you may want to add an index on PreviousID
and/or GroupID (just before the WHILE loop). You should benchmark with
your data to see which indexes are the most appropriate.

Razvan
Author
18 Feb 2006 12:53 PM
Brian Selzer
I wouldn't consider this a normalization problem, and I don't think you can
avoid a scan.  Whenever there is a dependency between rows in the same
table, you must either use a correlated subquery based on a self-join, or
scan through the rows with a cursor.  This dependency is a little more
complicated because it doesn't necessarily involve adjacent rows (based on
the order in which they were recorded).  The group each row belongs to is
dependent upon the time that the first row in that group was recorded,
meaning that you must use TOP, MAX or MIN along with '>=' or '<=' in order
to determine which group each row belongs.  I think that in this case a
cursor will outperform any non-iterative set-based solution, and depending
on the execution plan and whether or not a clustered index exists on (Tag,
RecordedAt), the performance may be orders of magnitude better with a
cursor.  A correlated subquery would have to scan the index more than once
for each row, whereas you can maintain the first RecordedAt for the current
Tag in a local variable as you fetch each row with the cursor, thus
accomplishing your goal with only one pass through the table.


Show quote
"Justin Weinberg" <jweinberg@_spammyoff_mrgsoft.com> wrote in message
news:excJfYGNGHA.3100@tk2msftngp13.phx.gbl...
>
> Figure this out without using a cursor or loop that hits each record, and
> I'll gladly give you a free source license of GDI+ Architect (if you want
> one). See mrgsoft dot com for more info about that.
>
> I receive a data feed into my SQL Server with the bizarre property that
> the keys (called tags) can be reused arbitrarily after three days.  This
> is because the tags are being used worldwide, and the system is based on
> short barcodes which quickly get recirculated.
>
> To clarify, a tag might be used after three days, but not necessarily in
> three day intervals (I can't simply group by every three days using a
> number table for example, as I might accidently split keys that go
> together).  Tags can be reused again after five days and then again after
> three days for example.
>
> My goal is to normalize this data by creating a new ID that groups tags
> used within three day periods.  Here's some example data that illustrates
> what I'm up against.
>
> RecordedAt                Tag
> 2/1/2006 3:00 PM      ABC
> 2/1/2006 3:30 PM      ABC
> 2/2/2006 4:00 PM      FRED
> 2/2/2006 5:00 PM      ABC
> 5/1/2006 3:00 PM      ABC
> 6/1/2006 4:30 PM      ABC
> 6/2/2006 4:30 PM      ABC
>
> This desired output would look as follows:
>
>  GroupID        RecordedAt             Tag
>      1            2/1/2006 3:00 PM      ABC
>      1            2/1/2006 3:30 PM      ABC
>      2            2/2/2006 4:00 PM      FRED
>      1            2/2/2006 5:00 PM      ABC
>      3            5/1/2006 3:00 PM      ABC
>      4            6/1/2006 4:30 PM      ABC
>      4            6/2/2006 4:30 PM      ABC
>
>
> All tags used "within three days of each other" should go into the same
> group. If a tag is seen outside of three days of any other tag, it gets
> its own group.
>
> The only limitations I have is performance and the inability to change how
> data is inserted into the table.  In other words, I can't rewrite the
> insertion process into this table.  I can only respond after data gets
> there.
>
> I can alter the underlying table structure as needed, create new tables,
> etc.   For example, adding a rowID would probably be useful in the
> solution.
>
> I worked out a cursor based SP that does this task by taking the rows one
> at a time, but you can imagine how sluggish it is.  Eventually I want to
> code a trigger that handles the normalization.  But meanwhile, I'm just
> looking for a good script I can execute to populate the table with the
> groupIDs for existing data sets.
>
> The script below will populate a table with the sample data from above.
>
> Myself and your luggage thank you.
>
> Justin Weinberg
> MCAD/MCSD .NET
> jweinberg@spammyoff_mrgsoft.com
>
>
> if exists (select * from sysobjects where id =
> object_id('ExampleLineItemsJAW') and sysstat & 0xf = 3)
> drop table ExampleLineItemsJAW
> GO
>
> CREATE TABLE ExampleLineItemsJAW
> (
> GroupID int,
> RecordedAt datetime not null,
> Tag char(10) not null,
> )
> GO
>
> Set NOCOUNT ON
>
> Insert into ExampleLineItemsJAW (RecordedAt, Tag)
> VALUES ('2/1/2006 3:00 PM', 'ABC')
>
> Insert into ExampleLineItemsJAW (RecordedAt, Tag)
> VALUES ('2/1/2006 3:30 PM', 'ABC')
>
> Insert into ExampleLineItemsJAW (RecordedAt, Tag)
> VALUES ('2/2/2006 4:00 PM', 'FRED')
>
> Insert into ExampleLineItemsJAW (RecordedAt, Tag)
> VALUES ('2/2/2006 5:00 PM', 'ABC')
>
> Insert into ExampleLineItemsJAW (RecordedAt, Tag)
> VALUES ('5/1/2006 3:00 PM', 'ABC')
>
> Insert into ExampleLineItemsJAW (RecordedAt, Tag)
> VALUES ('6/1/2006 4:30 PM', 'ABC')
>
> Insert into ExampleLineItemsJAW (RecordedAt, Tag)
> VALUES ('6/2/2006 4:30 PM', 'ABC')
>
> SET NOCOUNT OFF
> GO
>
>
>
>
>
Author
18 Feb 2006 1:09 PM
Roy Harvey
Justin,

This one was interesting.

A couple of points about the solution below.  First, while it uses a
loop, it will go much faster than a cursor based solution.  I have
used the general approach many times, it has always been fast.

Second, I did not create any indexes on the test data, but with a
large data set you will want some.

Most important, I assumed that "3 days" meant 36 hours exactly.  It is
possible that it should have meant Monday+Tuesday+Wednesday, or
Tuesday+Wednesday+Thursday, in other words whole days.  That would
require some adjustments to the date range tests, of course.

Hope you like it!

Roy

CREATE TABLE dbo.Strange (
    RecordedAt datetime NOT NULL ,
    Tag varchar (12) NOT NULL ,
        GroupStatus char(1) NOT NULL DEFAULT ' ',
        GroupStart datetime NULL
)
GO

--The test data you described
INSERT Strange (RecordedAt, Tag) VALUES('2/1/2006 3:00 PM', 'ABC')
INSERT Strange (RecordedAt, Tag) VALUES('2/1/2006 3:30 PM', 'ABC')
INSERT Strange (RecordedAt, Tag) VALUES('2/2/2006 4:00 PM', 'FRED')
INSERT Strange (RecordedAt, Tag) VALUES('2/2/2006 5:00 PM', 'ABC')
INSERT Strange (RecordedAt, Tag) VALUES('5/1/2006 3:00 PM', 'ABC')
INSERT Strange (RecordedAt, Tag) VALUES('6/1/2006 4:30 PM', 'ABC')
INSERT Strange (RecordedAt, Tag) VALUES('6/2/2006 4:30 PM', 'ABC')

--We need more, and more difficult, test data.
INSERT Strange (RecordedAt, Tag)
SELECT DATEADD(day,2,RecordedAt), Tag FROM Strange
UNION ALL
SELECT DATEADD(day,4,RecordedAt), Tag FROM Strange
UNION ALL
SELECT DATEADD(day,6,RecordedAt), Tag FROM Strange
UNION ALL
SELECT DATEADD(day,8,RecordedAt), Tag FROM Strange

--What does the data look like?
select *
  from Strange
order by Tag, RecordedAt

--First pass, identify SOME of the "first-in-a-group" rows.
--These are the ones with nothing in the prior three days.
UPDATE Strange
   SET GroupStatus = 'S', --S for Start
       GroupStart = RecordedAt
WHERE GroupStatus = ' '  --All are this way for the first pass
   AND NOT EXISTS
       (select * from Strange as X
         where Strange.Tag = X.Tag
           and X.GroupStatus = ' '
           and X.RecordedAt >= dateadd(day,-3,Strange.RecordedAt)
           and X.RecordedAt <  Strange.RecordedAt)

WHILE @@rowcount <> 0
BEGIN
--Whatever new start points have been found, the matching
--rows for the subsequent three days need to be marked as
--matching.
UPDATE Strange
   SET GroupStatus = 'F', --F for Following
       GroupStart = M.GroupStart
  FROM Strange, Strange as M
WHERE Strange.Tag = M.Tag
   AND M.GroupStatus = 'S'
   AND Strange.GroupStatus = ' '
   AND Strange.RecordedAt >= M.RecordedAt
   AND Strange.RecordedAt <= DATEADD(day,3,M.RecordedAt)

--Now we mark new starting points, but we ignore rows
--that have already been assigned to batches.
UPDATE Strange
   SET GroupStatus = 'S', --S for Start
       GroupStart = RecordedAt
WHERE GroupStatus = ' '
   AND NOT EXISTS
       (select * from Strange as X
         where Strange.Tag = X.Tag
           and X.GroupStatus = ' '
           and X.RecordedAt >= dateadd(day,-3,Strange.RecordedAt)
           and X.RecordedAt <  Strange.RecordedAt)
--If we find no new batches, we are done.  Or at least, should be.
END

SELECT Tag,
       GroupStart,
       MAX(RecordedAt) as LastInGroup,
       count(*) as Rows,
       count(*) - count(GroupStart) as Unassigned
  FROM Strange
GROUP BY Tag,
       GroupStart
ORDER BY Tag,
       GroupStart


On Sat, 18 Feb 2006 02:29:56 -0600, "Justin Weinberg"
<jweinberg@_spammyoff_mrgsoft.com> wrote:

Show quote
>
>Figure this out without using a cursor or loop that hits each record, and
>I'll gladly give you a free source license of GDI+ Architect (if you want
>one). See mrgsoft dot com for more info about that.
>
>I receive a data feed into my SQL Server with the bizarre property that the
>keys (called tags) can be reused arbitrarily after three days.  This is
>because the tags are being used worldwide, and the system is based on short
>barcodes which quickly get recirculated.
>
>To clarify, a tag might be used after three days, but not necessarily in
>three day intervals (I can't simply group by every three days using a number
>table for example, as I might accidently split keys that go together).  Tags
>can be reused again after five days and then again after three days for
>example.
>
>My goal is to normalize this data by creating a new ID that groups tags used
>within three day periods.  Here's some example data that illustrates what
>I'm up against.
>
>RecordedAt                Tag
>2/1/2006 3:00 PM      ABC
>2/1/2006 3:30 PM      ABC
>2/2/2006 4:00 PM      FRED
>2/2/2006 5:00 PM      ABC
>5/1/2006 3:00 PM      ABC
>6/1/2006 4:30 PM      ABC
>6/2/2006 4:30 PM      ABC
>
>This desired output would look as follows:
>
>  GroupID        RecordedAt             Tag
>      1            2/1/2006 3:00 PM      ABC
>      1            2/1/2006 3:30 PM      ABC
>      2            2/2/2006 4:00 PM      FRED
>      1            2/2/2006 5:00 PM      ABC
>      3            5/1/2006 3:00 PM      ABC
>      4            6/1/2006 4:30 PM      ABC
>      4            6/2/2006 4:30 PM      ABC
>
>
>All tags used "within three days of each other" should go into the same
>group. If a tag is seen outside of three days of any other tag, it gets its
>own group.
>
>The only limitations I have is performance and the inability to change how
>data is inserted into the table.  In other words, I can't rewrite the
>insertion process into this table.  I can only respond after data gets
>there.
>
>I can alter the underlying table structure as needed, create new tables,
>etc.   For example, adding a rowID would probably be useful in the solution.
>
>I worked out a cursor based SP that does this task by taking the rows one at
>a time, but you can imagine how sluggish it is.  Eventually I want to code a
>trigger that handles the normalization.  But meanwhile, I'm just looking for
>a good script I can execute to populate the table with the groupIDs for
>existing data sets.
>
>The script below will populate a table with the sample data from above.
>
>Myself and your luggage thank you.
>
>Justin Weinberg
>MCAD/MCSD .NET
>jweinberg@spammyoff_mrgsoft.com
>
>
>if exists (select * from sysobjects where id =
>object_id('ExampleLineItemsJAW') and sysstat & 0xf = 3)
> drop table ExampleLineItemsJAW
>GO
>
>CREATE TABLE ExampleLineItemsJAW
>(
> GroupID int,
> RecordedAt datetime not null,
> Tag char(10) not null,
>)
>GO
>
>Set NOCOUNT ON
>
>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>VALUES ('2/1/2006 3:00 PM', 'ABC')
>
>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>VALUES ('2/1/2006 3:30 PM', 'ABC')
>
>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>VALUES ('2/2/2006 4:00 PM', 'FRED')
>
>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>VALUES ('2/2/2006 5:00 PM', 'ABC')
>
>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>VALUES ('5/1/2006 3:00 PM', 'ABC')
>
>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>VALUES ('6/1/2006 4:30 PM', 'ABC')
>
>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>VALUES ('6/2/2006 4:30 PM', 'ABC')
>
>SET NOCOUNT OFF
>GO
>
>
>
>
Author
18 Feb 2006 9:46 PM
Justin Weinberg
I'm always impressed by not only the generosity, but the intelligence of the
people who respond to questions on the sql server programming group.

Razvan and Roy, I can't think either of you enough for taking the time to
think this one through. Both of you provided thoughtful compelling
solutions.

This really effects the bottom line on my deadlines and the system
performance as a whole.  Please, either of you contact me if you want to
take me up on that GDI+ Architect source code license.

Just as an aside to Brian, the reason I classified this as a normalization
problem is because I believe that the data violates fourth normal form as it
has multivalue dependencies.

Again, thanks to all of the posters.

Justin Weinberg
MCAD/MCSD .NET
jweinberg@spammyoff_mrgsoft.com


Show quote
> Justin,
>
> This one was interesting.
>
> A couple of points about the solution below.  First, while it uses a
> loop, it will go much faster than a cursor based solution.  I have
> used the general approach many times, it has always been fast.
>
> Second, I did not create any indexes on the test data, but with a
> large data set you will want some.
>
> Most important, I assumed that "3 days" meant 36 hours exactly.  It is
> possible that it should have meant Monday+Tuesday+Wednesday, or
> Tuesday+Wednesday+Thursday, in other words whole days.  That would
> require some adjustments to the date range tests, of course.
>
> Hope you like it!
>
> Roy
>
> CREATE TABLE dbo.Strange (
> RecordedAt datetime NOT NULL ,
> Tag varchar (12) NOT NULL ,
>        GroupStatus char(1) NOT NULL DEFAULT ' ',
>        GroupStart datetime NULL
> )
> GO
>
> --The test data you described
> INSERT Strange (RecordedAt, Tag) VALUES('2/1/2006 3:00 PM', 'ABC')
> INSERT Strange (RecordedAt, Tag) VALUES('2/1/2006 3:30 PM', 'ABC')
> INSERT Strange (RecordedAt, Tag) VALUES('2/2/2006 4:00 PM', 'FRED')
> INSERT Strange (RecordedAt, Tag) VALUES('2/2/2006 5:00 PM', 'ABC')
> INSERT Strange (RecordedAt, Tag) VALUES('5/1/2006 3:00 PM', 'ABC')
> INSERT Strange (RecordedAt, Tag) VALUES('6/1/2006 4:30 PM', 'ABC')
> INSERT Strange (RecordedAt, Tag) VALUES('6/2/2006 4:30 PM', 'ABC')
>
> --We need more, and more difficult, test data.
> INSERT Strange (RecordedAt, Tag)
> SELECT DATEADD(day,2,RecordedAt), Tag FROM Strange
> UNION ALL
> SELECT DATEADD(day,4,RecordedAt), Tag FROM Strange
> UNION ALL
> SELECT DATEADD(day,6,RecordedAt), Tag FROM Strange
> UNION ALL
> SELECT DATEADD(day,8,RecordedAt), Tag FROM Strange
>
> --What does the data look like?
> select *
>  from Strange
> order by Tag, RecordedAt
>
> --First pass, identify SOME of the "first-in-a-group" rows.
> --These are the ones with nothing in the prior three days.
> UPDATE Strange
>   SET GroupStatus = 'S', --S for Start
>       GroupStart = RecordedAt
> WHERE GroupStatus = ' '  --All are this way for the first pass
>   AND NOT EXISTS
>       (select * from Strange as X
>         where Strange.Tag = X.Tag
>           and X.GroupStatus = ' '
>           and X.RecordedAt >= dateadd(day,-3,Strange.RecordedAt)
>           and X.RecordedAt <  Strange.RecordedAt)
>
> WHILE @@rowcount <> 0
> BEGIN
> --Whatever new start points have been found, the matching
> --rows for the subsequent three days need to be marked as
> --matching.
> UPDATE Strange
>   SET GroupStatus = 'F', --F for Following
>       GroupStart = M.GroupStart
>  FROM Strange, Strange as M
> WHERE Strange.Tag = M.Tag
>   AND M.GroupStatus = 'S'
>   AND Strange.GroupStatus = ' '
>   AND Strange.RecordedAt >= M.RecordedAt
>   AND Strange.RecordedAt <= DATEADD(day,3,M.RecordedAt)
>
> --Now we mark new starting points, but we ignore rows
> --that have already been assigned to batches.
> UPDATE Strange
>   SET GroupStatus = 'S', --S for Start
>       GroupStart = RecordedAt
> WHERE GroupStatus = ' '
>   AND NOT EXISTS
>       (select * from Strange as X
>         where Strange.Tag = X.Tag
>           and X.GroupStatus = ' '
>           and X.RecordedAt >= dateadd(day,-3,Strange.RecordedAt)
>           and X.RecordedAt <  Strange.RecordedAt)
> --If we find no new batches, we are done.  Or at least, should be.
> END
>
> SELECT Tag,
>       GroupStart,
>       MAX(RecordedAt) as LastInGroup,
>       count(*) as Rows,
>       count(*) - count(GroupStart) as Unassigned
>  FROM Strange
> GROUP BY Tag,
>       GroupStart
> ORDER BY Tag,
>       GroupStart
>
>
> On Sat, 18 Feb 2006 02:29:56 -0600, "Justin Weinberg"
> <jweinberg@_spammyoff_mrgsoft.com> wrote:
>
>>
>>Figure this out without using a cursor or loop that hits each record, and
>>I'll gladly give you a free source license of GDI+ Architect (if you want
>>one). See mrgsoft dot com for more info about that.
>>
>>I receive a data feed into my SQL Server with the bizarre property that
>>the
>>keys (called tags) can be reused arbitrarily after three days.  This is
>>because the tags are being used worldwide, and the system is based on
>>short
>>barcodes which quickly get recirculated.
>>
>>To clarify, a tag might be used after three days, but not necessarily in
>>three day intervals (I can't simply group by every three days using a
>>number
>>table for example, as I might accidently split keys that go together).
>>Tags
>>can be reused again after five days and then again after three days for
>>example.
>>
>>My goal is to normalize this data by creating a new ID that groups tags
>>used
>>within three day periods.  Here's some example data that illustrates what
>>I'm up against.
>>
>>RecordedAt                Tag
>>2/1/2006 3:00 PM      ABC
>>2/1/2006 3:30 PM      ABC
>>2/2/2006 4:00 PM      FRED
>>2/2/2006 5:00 PM      ABC
>>5/1/2006 3:00 PM      ABC
>>6/1/2006 4:30 PM      ABC
>>6/2/2006 4:30 PM      ABC
>>
>>This desired output would look as follows:
>>
>>  GroupID        RecordedAt             Tag
>>      1            2/1/2006 3:00 PM      ABC
>>      1            2/1/2006 3:30 PM      ABC
>>      2            2/2/2006 4:00 PM      FRED
>>      1            2/2/2006 5:00 PM      ABC
>>      3            5/1/2006 3:00 PM      ABC
>>      4            6/1/2006 4:30 PM      ABC
>>      4            6/2/2006 4:30 PM      ABC
>>
>>
>>All tags used "within three days of each other" should go into the same
>>group. If a tag is seen outside of three days of any other tag, it gets
>>its
>>own group.
>>
>>The only limitations I have is performance and the inability to change how
>>data is inserted into the table.  In other words, I can't rewrite the
>>insertion process into this table.  I can only respond after data gets
>>there.
>>
>>I can alter the underlying table structure as needed, create new tables,
>>etc.   For example, adding a rowID would probably be useful in the
>>solution.
>>
>>I worked out a cursor based SP that does this task by taking the rows one
>>at
>>a time, but you can imagine how sluggish it is.  Eventually I want to code
>>a
>>trigger that handles the normalization.  But meanwhile, I'm just looking
>>for
>>a good script I can execute to populate the table with the groupIDs for
>>existing data sets.
>>
>>The script below will populate a table with the sample data from above.
>>
>>Myself and your luggage thank you.
>>
>>Justin Weinberg
>>MCAD/MCSD .NET
>>jweinberg@spammyoff_mrgsoft.com
>>
>>
>>if exists (select * from sysobjects where id =
>>object_id('ExampleLineItemsJAW') and sysstat & 0xf = 3)
>> drop table ExampleLineItemsJAW
>>GO
>>
>>CREATE TABLE ExampleLineItemsJAW
>>(
>> GroupID int,
>> RecordedAt datetime not null,
>> Tag char(10) not null,
>>)
>>GO
>>
>>Set NOCOUNT ON
>>
>>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>>VALUES ('2/1/2006 3:00 PM', 'ABC')
>>
>>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>>VALUES ('2/1/2006 3:30 PM', 'ABC')
>>
>>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>>VALUES ('2/2/2006 4:00 PM', 'FRED')
>>
>>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>>VALUES ('2/2/2006 5:00 PM', 'ABC')
>>
>>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>>VALUES ('5/1/2006 3:00 PM', 'ABC')
>>
>>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>>VALUES ('6/1/2006 4:30 PM', 'ABC')
>>
>>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>>VALUES ('6/2/2006 4:30 PM', 'ABC')
>>
>>SET NOCOUNT OFF
>>GO
>>
>>
>>
>>
Author
19 Feb 2006 5:29 AM
Brian Selzer
My understanding of a multi-valued dependency is much different from what
you've posted as an example.  The main difference between a functional
dependency and a multi-valued dependency is that the dependent for a
functional dependency is a single value whereas the dependent for a
multi-valued dependency is a well-defined set of values.  The determinant
and the determinant alone identifies the set of values.  In your example,
the values in other tuples play a part in determining the set of acceptable
values.  In addition, I believe that there must be at least three attributes
in order for a multi-valued dependency to exist in a relation.  As I see it,
the dependency in your example is more in the nature of a temporal
constraint.  Essentially, the values in one tuple determine what is
acceptable in other tuples, and the one of the attributes is a timestamp.

By the way, I still think you'll get better performance with a cursor.

Show quote
"Justin Weinberg" <jweinberg@_spammyoff_mrgsoft.com> wrote in message
news:%237BnaVNNGHA.984@tk2msftngp13.phx.gbl...
> I'm always impressed by not only the generosity, but the intelligence of
> the people who respond to questions on the sql server programming group.
>
> Razvan and Roy, I can't think either of you enough for taking the time to
> think this one through. Both of you provided thoughtful compelling
> solutions.
>
> This really effects the bottom line on my deadlines and the system
> performance as a whole.  Please, either of you contact me if you want to
> take me up on that GDI+ Architect source code license.
>
> Just as an aside to Brian, the reason I classified this as a normalization
> problem is because I believe that the data violates fourth normal form as
> it has multivalue dependencies.
>
> Again, thanks to all of the posters.
>
> Justin Weinberg
> MCAD/MCSD .NET
> jweinberg@spammyoff_mrgsoft.com
>
>
>> Justin,
>>
>> This one was interesting.
>>
>> A couple of points about the solution below.  First, while it uses a
>> loop, it will go much faster than a cursor based solution.  I have
>> used the general approach many times, it has always been fast.
>>
>> Second, I did not create any indexes on the test data, but with a
>> large data set you will want some.
>>
>> Most important, I assumed that "3 days" meant 36 hours exactly.  It is
>> possible that it should have meant Monday+Tuesday+Wednesday, or
>> Tuesday+Wednesday+Thursday, in other words whole days.  That would
>> require some adjustments to the date range tests, of course.
>>
>> Hope you like it!
>>
>> Roy
>>
>> CREATE TABLE dbo.Strange (
>> RecordedAt datetime NOT NULL ,
>> Tag varchar (12) NOT NULL ,
>>        GroupStatus char(1) NOT NULL DEFAULT ' ',
>>        GroupStart datetime NULL
>> )
>> GO
>>
>> --The test data you described
>> INSERT Strange (RecordedAt, Tag) VALUES('2/1/2006 3:00 PM', 'ABC')
>> INSERT Strange (RecordedAt, Tag) VALUES('2/1/2006 3:30 PM', 'ABC')
>> INSERT Strange (RecordedAt, Tag) VALUES('2/2/2006 4:00 PM', 'FRED')
>> INSERT Strange (RecordedAt, Tag) VALUES('2/2/2006 5:00 PM', 'ABC')
>> INSERT Strange (RecordedAt, Tag) VALUES('5/1/2006 3:00 PM', 'ABC')
>> INSERT Strange (RecordedAt, Tag) VALUES('6/1/2006 4:30 PM', 'ABC')
>> INSERT Strange (RecordedAt, Tag) VALUES('6/2/2006 4:30 PM', 'ABC')
>>
>> --We need more, and more difficult, test data.
>> INSERT Strange (RecordedAt, Tag)
>> SELECT DATEADD(day,2,RecordedAt), Tag FROM Strange
>> UNION ALL
>> SELECT DATEADD(day,4,RecordedAt), Tag FROM Strange
>> UNION ALL
>> SELECT DATEADD(day,6,RecordedAt), Tag FROM Strange
>> UNION ALL
>> SELECT DATEADD(day,8,RecordedAt), Tag FROM Strange
>>
>> --What does the data look like?
>> select *
>>  from Strange
>> order by Tag, RecordedAt
>>
>> --First pass, identify SOME of the "first-in-a-group" rows.
>> --These are the ones with nothing in the prior three days.
>> UPDATE Strange
>>   SET GroupStatus = 'S', --S for Start
>>       GroupStart = RecordedAt
>> WHERE GroupStatus = ' '  --All are this way for the first pass
>>   AND NOT EXISTS
>>       (select * from Strange as X
>>         where Strange.Tag = X.Tag
>>           and X.GroupStatus = ' '
>>           and X.RecordedAt >= dateadd(day,-3,Strange.RecordedAt)
>>           and X.RecordedAt <  Strange.RecordedAt)
>>
>> WHILE @@rowcount <> 0
>> BEGIN
>> --Whatever new start points have been found, the matching
>> --rows for the subsequent three days need to be marked as
>> --matching.
>> UPDATE Strange
>>   SET GroupStatus = 'F', --F for Following
>>       GroupStart = M.GroupStart
>>  FROM Strange, Strange as M
>> WHERE Strange.Tag = M.Tag
>>   AND M.GroupStatus = 'S'
>>   AND Strange.GroupStatus = ' '
>>   AND Strange.RecordedAt >= M.RecordedAt
>>   AND Strange.RecordedAt <= DATEADD(day,3,M.RecordedAt)
>>
>> --Now we mark new starting points, but we ignore rows
>> --that have already been assigned to batches.
>> UPDATE Strange
>>   SET GroupStatus = 'S', --S for Start
>>       GroupStart = RecordedAt
>> WHERE GroupStatus = ' '
>>   AND NOT EXISTS
>>       (select * from Strange as X
>>         where Strange.Tag = X.Tag
>>           and X.GroupStatus = ' '
>>           and X.RecordedAt >= dateadd(day,-3,Strange.RecordedAt)
>>           and X.RecordedAt <  Strange.RecordedAt)
>> --If we find no new batches, we are done.  Or at least, should be.
>> END
>>
>> SELECT Tag,
>>       GroupStart,
>>       MAX(RecordedAt) as LastInGroup,
>>       count(*) as Rows,
>>       count(*) - count(GroupStart) as Unassigned
>>  FROM Strange
>> GROUP BY Tag,
>>       GroupStart
>> ORDER BY Tag,
>>       GroupStart
>>
>>
>> On Sat, 18 Feb 2006 02:29:56 -0600, "Justin Weinberg"
>> <jweinberg@_spammyoff_mrgsoft.com> wrote:
>>
>>>
>>>Figure this out without using a cursor or loop that hits each record, and
>>>I'll gladly give you a free source license of GDI+ Architect (if you want
>>>one). See mrgsoft dot com for more info about that.
>>>
>>>I receive a data feed into my SQL Server with the bizarre property that
>>>the
>>>keys (called tags) can be reused arbitrarily after three days.  This is
>>>because the tags are being used worldwide, and the system is based on
>>>short
>>>barcodes which quickly get recirculated.
>>>
>>>To clarify, a tag might be used after three days, but not necessarily in
>>>three day intervals (I can't simply group by every three days using a
>>>number
>>>table for example, as I might accidently split keys that go together).
>>>Tags
>>>can be reused again after five days and then again after three days for
>>>example.
>>>
>>>My goal is to normalize this data by creating a new ID that groups tags
>>>used
>>>within three day periods.  Here's some example data that illustrates what
>>>I'm up against.
>>>
>>>RecordedAt                Tag
>>>2/1/2006 3:00 PM      ABC
>>>2/1/2006 3:30 PM      ABC
>>>2/2/2006 4:00 PM      FRED
>>>2/2/2006 5:00 PM      ABC
>>>5/1/2006 3:00 PM      ABC
>>>6/1/2006 4:30 PM      ABC
>>>6/2/2006 4:30 PM      ABC
>>>
>>>This desired output would look as follows:
>>>
>>>  GroupID        RecordedAt             Tag
>>>      1            2/1/2006 3:00 PM      ABC
>>>      1            2/1/2006 3:30 PM      ABC
>>>      2            2/2/2006 4:00 PM      FRED
>>>      1            2/2/2006 5:00 PM      ABC
>>>      3            5/1/2006 3:00 PM      ABC
>>>      4            6/1/2006 4:30 PM      ABC
>>>      4            6/2/2006 4:30 PM      ABC
>>>
>>>
>>>All tags used "within three days of each other" should go into the same
>>>group. If a tag is seen outside of three days of any other tag, it gets
>>>its
>>>own group.
>>>
>>>The only limitations I have is performance and the inability to change
>>>how
>>>data is inserted into the table.  In other words, I can't rewrite the
>>>insertion process into this table.  I can only respond after data gets
>>>there.
>>>
>>>I can alter the underlying table structure as needed, create new tables,
>>>etc.   For example, adding a rowID would probably be useful in the
>>>solution.
>>>
>>>I worked out a cursor based SP that does this task by taking the rows one
>>>at
>>>a time, but you can imagine how sluggish it is.  Eventually I want to
>>>code a
>>>trigger that handles the normalization.  But meanwhile, I'm just looking
>>>for
>>>a good script I can execute to populate the table with the groupIDs for
>>>existing data sets.
>>>
>>>The script below will populate a table with the sample data from above.
>>>
>>>Myself and your luggage thank you.
>>>
>>>Justin Weinberg
>>>MCAD/MCSD .NET
>>>jweinberg@spammyoff_mrgsoft.com
>>>
>>>
>>>if exists (select * from sysobjects where id =
>>>object_id('ExampleLineItemsJAW') and sysstat & 0xf = 3)
>>> drop table ExampleLineItemsJAW
>>>GO
>>>
>>>CREATE TABLE ExampleLineItemsJAW
>>>(
>>> GroupID int,
>>> RecordedAt datetime not null,
>>> Tag char(10) not null,
>>>)
>>>GO
>>>
>>>Set NOCOUNT ON
>>>
>>>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>>>VALUES ('2/1/2006 3:00 PM', 'ABC')
>>>
>>>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>>>VALUES ('2/1/2006 3:30 PM', 'ABC')
>>>
>>>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>>>VALUES ('2/2/2006 4:00 PM', 'FRED')
>>>
>>>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>>>VALUES ('2/2/2006 5:00 PM', 'ABC')
>>>
>>>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>>>VALUES ('5/1/2006 3:00 PM', 'ABC')
>>>
>>>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>>>VALUES ('6/1/2006 4:30 PM', 'ABC')
>>>
>>>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>>>VALUES ('6/2/2006 4:30 PM', 'ABC')
>>>
>>>SET NOCOUNT OFF
>>>GO
>>>
>>>
>>>
>>>
>
>
Author
19 Feb 2006 7:02 AM
Anith Sen
Do:

UPDATE tbl
   SET GroupID =
      ( SELECT COUNT(DISTINCT New)
          FROM ( SELECT t1.RecordedAt, t1.Tag, MIN( t2.RecordedAt )
                   FROM tbl t1
                  INNER JOIN tbl t2 ON t2.Tag = t1.Tag
                  WHERE DATEDIFF( d, t2.RecordedAt, t1.RecordedAt ) <= 3
                  GROUP BY t1.RecordedAt, t1.Tag )  v1 ( RecordedAt, Tag,
New )
         WHERE v1.New <= ( SELECT MIN( t3.RecordedAt )
                             FROM tbl t3
                            WHERE t3.Tag = tbl.Tag
                              AND DATEDIFF( d, t3.RecordedAt,
tbl.RecordedAt ) <= 3 ) ) ;

The easier way to understand this query is to write them as simple SELECT
statements. Perhaps a view will make it easier to understand:

CREATE VIEW vw ( record_time, tag, group_time ) AS
SELECT t1.RecordedAt, t1.Tag, MIN( t2.RecordedAt )
  FROM tbl t1
INNER JOIN tbl t2
    ON t2.Tag = t1.Tag
WHERE DATEDIFF( d, t2.RecordedAt, t1.RecordedAt ) <= 3
GROUP BY t1.RecordedAt, t1.Tag ;

Now the resultset from the following query based on this view will have the
values that will make up the GroupId column:

SELECT RecordedAt, Tag,
       ( SELECT COUNT( DISTINCT New )
           FROM vw v1 WHERE v1.New <= vw.New )
  FROM vw ;

In 2005, you can make the code a bit more simpler with a WITH CTE clause.

--
Anith
Author
20 Feb 2006 6:35 AM
05ponyGT
This is very cleaver stuff.The simple rank even fell into place for you.
I wonder if you saw this prior to getting the starting dates or you
just got lucky and it fell into your lap:) Just kidding its very,very good.
You were a little long winded in trying to explain it though:)
I even took a crack at it with the idea that a rank could be constructed
based on  the difference between consecutive dates based on sort
of tag,date and a running sum of the differences.So far my toy is
just off:)
Author
18 Feb 2006 4:40 PM
Alexander Kuznetsov
Justin,

I'm not sure what your requirements are, but maybe you could use 2
tables:

CREATE TABLE Tag
(
AvailableAt datetime not null,
Tag char(10) not null,
)

CREATE TABLE TagUsage
(
UsedAt datetime not null,
Tag char(10) not null,
)

Every time a tag is used, you insert a row into TagUsed and up
AvailableAt in Tag.
Author
18 Feb 2006 8:25 PM
--CELKO--
Another version of the same idea:

CREATE TABLE AvailableTags
(available_date DATETIME NOT NULL,
tag_nbr CHAR(10)  NOT NULL,
PRIMARY KEY (neartest_available_date, tag_nbr));

Look at the CURRENT_TIMESTAMP, and return rows that are now available.
Grab the one   that has been available the longest.  Then UPDATE the
available_date with the three-day formula.  This gets us down to one
table and a quick way to ask about the re-cycling.
Author
18 Feb 2006 9:37 PM
JXStern
First iteration, don't put the GroupId in each record at all.
Instead, create a second table like:

create table mygroups
(
  groupid int,
  startdate datetime,
  enddate datetime
)

Populate table.

When you want to know a data row's group, just search this small table
by date.

(actually, that's a little tricky, since you're effectively trying to
join a specific date to a table that has only a range.  I did this for
a guy the other day, and he had some freaky SQL that joined on a
between, or something like that, and it worked - slowly!  I had to
munge some freaky SQL to get the performance, but I'm leaving this as
an exercise for the reader, it's Saturday, for gosh sakes!  Maybe
Celko already has it in a book, I dunno.)


Second iteration, if you really want to avoid the joins at runtime,
denormalize your database by putting the groupid field into your data
table after all.  A reasonably simple update should match each row to
the mygroups table and do what you need.

You could also do that trick in a trigger at data row create time.

J.



On Sat, 18 Feb 2006 02:29:56 -0600, "Justin Weinberg"
<jweinberg@_spammyoff_mrgsoft.com> wrote:
Show quote
>Figure this out without using a cursor or loop that hits each record, and
>I'll gladly give you a free source license of GDI+ Architect (if you want
>one). See mrgsoft dot com for more info about that.
>
>I receive a data feed into my SQL Server with the bizarre property that the
>keys (called tags) can be reused arbitrarily after three days.  This is
>because the tags are being used worldwide, and the system is based on short
>barcodes which quickly get recirculated.
>
>To clarify, a tag might be used after three days, but not necessarily in
>three day intervals (I can't simply group by every three days using a number
>table for example, as I might accidently split keys that go together).  Tags
>can be reused again after five days and then again after three days for
>example.
>
>My goal is to normalize this data by creating a new ID that groups tags used
>within three day periods.  Here's some example data that illustrates what
>I'm up against.
>
>RecordedAt                Tag
>2/1/2006 3:00 PM      ABC
>2/1/2006 3:30 PM      ABC
>2/2/2006 4:00 PM      FRED
>2/2/2006 5:00 PM      ABC
>5/1/2006 3:00 PM      ABC
>6/1/2006 4:30 PM      ABC
>6/2/2006 4:30 PM      ABC
>
>This desired output would look as follows:
>
>  GroupID        RecordedAt             Tag
>      1            2/1/2006 3:00 PM      ABC
>      1            2/1/2006 3:30 PM      ABC
>      2            2/2/2006 4:00 PM      FRED
>      1            2/2/2006 5:00 PM      ABC
>      3            5/1/2006 3:00 PM      ABC
>      4            6/1/2006 4:30 PM      ABC
>      4            6/2/2006 4:30 PM      ABC
>
>
>All tags used "within three days of each other" should go into the same
>group. If a tag is seen outside of three days of any other tag, it gets its
>own group.
>
>The only limitations I have is performance and the inability to change how
>data is inserted into the table.  In other words, I can't rewrite the
>insertion process into this table.  I can only respond after data gets
>there.
>
>I can alter the underlying table structure as needed, create new tables,
>etc.   For example, adding a rowID would probably be useful in the solution.
>
>I worked out a cursor based SP that does this task by taking the rows one at
>a time, but you can imagine how sluggish it is.  Eventually I want to code a
>trigger that handles the normalization.  But meanwhile, I'm just looking for
>a good script I can execute to populate the table with the groupIDs for
>existing data sets.
>
>The script below will populate a table with the sample data from above.
>
>Myself and your luggage thank you.
>
>Justin Weinberg
>MCAD/MCSD .NET
>jweinberg@spammyoff_mrgsoft.com
>
>
>if exists (select * from sysobjects where id =
>object_id('ExampleLineItemsJAW') and sysstat & 0xf = 3)
> drop table ExampleLineItemsJAW
>GO
>
>CREATE TABLE ExampleLineItemsJAW
>(
> GroupID int,
> RecordedAt datetime not null,
> Tag char(10) not null,
>)
>GO
>
>Set NOCOUNT ON
>
>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>VALUES ('2/1/2006 3:00 PM', 'ABC')
>
>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>VALUES ('2/1/2006 3:30 PM', 'ABC')
>
>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>VALUES ('2/2/2006 4:00 PM', 'FRED')
>
>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>VALUES ('2/2/2006 5:00 PM', 'ABC')
>
>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>VALUES ('5/1/2006 3:00 PM', 'ABC')
>
>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>VALUES ('6/1/2006 4:30 PM', 'ABC')
>
>Insert into ExampleLineItemsJAW (RecordedAt, Tag)
>VALUES ('6/2/2006 4:30 PM', 'ABC')
>
>SET NOCOUNT OFF
>GO
>
>
>
>
Author
20 Feb 2006 2:03 PM
jsfromynr
Hi There,
I tried to comeout with this solution. See if it can help you.



Select Tag ,identity(int,1,1) MyID into dummyTable From
ExampleLineItemsJAW Group By Tag

Update XY1 Set GroupID = datediff(dd,BaseDate,RecordedAt)/3 +MyID
--Select datediff(dd,BaseDate,RecordedAt)/3 +MyID,*
>From ExampleLineItemsJAW XY1 Inner Join
(Select Tag,Min(convert(varchar,RecordedAt,112)) BaseDate From
ExampleLineItemsJAW Group By Tag) XY
On XY.Tag = XY1.Tag
Inner Join DummyTable DM on DM.Tag = XY1.Tag


Select * From ExampleLineItemsJAW

With Warm regards
Jatinder Singh

AddThis Social Bookmark Button