Home All Groups Group Topic Archive Search About

Most recent data point?

Author
13 Jan 2006 8:34 PM
Michael Bray
I have a table that stores data points for several different data sources. 

The general format is:

DECLARE @Data TABLE
(
        DataID int,
        TimeCollected int,
        DataValue decimal(9,9)
)

INSERT @Data VALUES (1, 1000, 0.75)
INSERT @Data VALUES (1, 1001, 0.69)
INSERT @Data VALUES (1, 1002, 0.77)
INSERT @Data VALUES (2, 1000, 0.45)
INSERT @Data VALUES (2, 1001, 0.49)
INSERT @Data VALUES (2, 1002, 0.53)
INSERT @Data VALUES (3, 1000, 0.50)
INSERT @Data VALUES (3, 1001, 0.35)
INSERT @Data VALUES (3, 1002, 0.41)

I want to pull the most recent datapoint for each of the DataIDs.  I can do
it like this:

SELECT d.*
FROM @Data d
INNER JOIN
(
    SELECT DataID, MAX(TimeCollected) AS MaxTime
    FROM @Data
    GROUP BY DataID
) dt ON (d.DataID=dt.DataID) AND (d.TimeCollected=dt.MaxTime)

Is there a more efficient way than this?  I may have hundreds of different
DataIDs with tens or hundreds of thousands of data points for each DataID.

Thanks!

-mdb

Author
13 Jan 2006 9:48 PM
Louis Davidson
First, THANK YOU FOR INCLUDING INSERTs.  You rock and made this so much more
fun/easy to do!

It is good (assuming you have unique values in the combination of columns
DataId,TimeCollected, otherwise you might get duplicates.)  I might suggest
to change to an exists, since you really aren't joining, you are selecting
from d, based on a criteria.  It makes the query easier to understand,
though it does seem to make it more complex.  (I redid the data with temp
tables so I could rerun the queries and look at the plan sans all of the
inserts.)  When I ran the two queries together, this one took 32% to the
others 68:

SELECT d.*
FROM #Data d
WHERE EXISTS
(SELECT *
FROM  (
         SELECT DataID, MAX(TimeCollected) AS MaxTime
         FROM #Data
         GROUP BY DataID
        ) dt
WHERE (d.DataID=dt.DataID) AND (d.TimeCollected=dt.MaxTime))

StmtText
---------------------------------------------------------------------------------------------------------------------------------------------------------
  |--Nested Loops(Left Semi Join, OUTER REFERENCES:([d].[DataID],
[d].[TimeCollected]))
       |--Table Scan(OBJECT:([tempdb].[dbo].[#Data] AS [d]))
       |--Filter(WHERE:([tempdb].[dbo].[#Data].[TimeCollected] as
[d].[TimeCollected]=[Expr1007]))
            |--Stream
Aggregate(DEFINE:([Expr1007]=MAX([tempdb].[dbo].[#Data].[TimeCollected])))
                 |--Table Scan(OBJECT:([tempdb].[dbo].[#Data]),
WHERE:([tempdb].[dbo].[#Data].[DataID] as
[d].[DataID]=[tempdb].[dbo].[#Data].[DataID]))


The biggest increase in cost was that the orignal query needed a sort, as it
took 78% of the execution:

StmtText
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SELECT d.*
FROM #Data d
INNER JOIN
(
SELECT DataID, MAX(TimeCollected) AS MaxTime
FROM #Data
GROUP BY DataID
) dt ON (d.DataID=dt.DataID) AND (d.TimeCollected=dt.MaxTime)

StmtText
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |--Top(TOP EXPRESSION:((1)))
       |--Segment
            |--Sort(ORDER BY:([d].[DataID] DESC, [d].[TimeCollected] DESC))
                 |--Table Scan(OBJECT:([tempdb].[dbo].[#Data] AS [d]),
WHERE:([tempdb].[dbo].[#Data].[DataID] as [d].[DataID] IS NOT NULL AND
[tempdb].[dbo].[#Data].[TimeCollected] as [d].[TimeCollected] IS NOT NULL))

When I added a unique constraint:
alter table #data add unique (dataId, timeCollected)
The plans were equivalent.


So you might test it out and see for yourself:
|--Top(TOP EXPRESSION:((1)))
|--Segment

|--Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1000]))

|--Index
Seek(OBJECT:([tempdb].[dbo].[#Data].[UQ__#Data_____________15502E78] AS
[d]), SEEK:([d].[DataID] IsNotNull),
WHERE:([tempdb].[dbo].[#Data].[TimeCollected] as [d].[TimeCollected] IS NOT
NULL) ORDERED BACKWARD)

|--RID Lookup(OBJECT:([tempdb].[dbo].[#Data] AS [d]),
SEEK:([Bmk1000]=[Bmk1000]) LOOKUP ORDERED FORWARD)


----------------------------------------------------------------------------
Louis Davidson - http://spaces.msn.com/members/drsql/
SQL Server MVP
"Arguments are to be avoided: they are always vulgar and often convincing."
(Oscar Wilde)

Show quote
"Michael Bray" <mbray@makeDIntoDot_ctiusaDcom> wrote in message
news:Xns974A9E81CA7A0mbrayctiusacom@207.46.248.16...
>I have a table that stores data points for several different data sources.
>
> The general format is:
>
> DECLARE @Data TABLE
> (
>    DataID int,
>    TimeCollected int,
>    DataValue decimal(9,9)
> )
>
> INSERT @Data VALUES (1, 1000, 0.75)
> INSERT @Data VALUES (1, 1001, 0.69)
> INSERT @Data VALUES (1, 1002, 0.77)
> INSERT @Data VALUES (2, 1000, 0.45)
> INSERT @Data VALUES (2, 1001, 0.49)
> INSERT @Data VALUES (2, 1002, 0.53)
> INSERT @Data VALUES (3, 1000, 0.50)
> INSERT @Data VALUES (3, 1001, 0.35)
> INSERT @Data VALUES (3, 1002, 0.41)
>
> I want to pull the most recent datapoint for each of the DataIDs.  I can
> do
> it like this:
>
> SELECT d.*
> FROM @Data d
> INNER JOIN
> (
> SELECT DataID, MAX(TimeCollected) AS MaxTime
> FROM @Data
> GROUP BY DataID
> ) dt ON (d.DataID=dt.DataID) AND (d.TimeCollected=dt.MaxTime)
>
> Is there a more efficient way than this?  I may have hundreds of different
> DataIDs with tens or hundreds of thousands of data points for each DataID.
>
> Thanks!
>
> -mdb
Author
13 Jan 2006 10:19 PM
Michael Bray
"Louis Davidson" <dr_dontspamme_sql@hotmail.com> wrote in
news:emLVFsIGGHA.2064@TK2MSFTNGP09.phx.gbl:

> When I added a unique constraint:
> alter table #data add unique (dataId, timeCollected)
> The plans were equivalent.
>

That's actually quite funny.  When I tested your statement against my
statement on the REAL table (the example I provided was not the real data
table - it was just for presentation here in the newsgroup), I found that
your statement and mine had exactly the same execution plan.  Then I
checked the indexes on the table, and indeed, it does have an index on
(dataId, timeCollected). 

Bravo!

-mdb
Author
13 Jan 2006 10:43 PM
Louis Davidson
Cool.  I am always a bit worried when presenting a plan based on execution
on my single user enterprise (equivalent) edition laptop versus a real live
running system with disks that run faster than 4800 RPM :)

--
----------------------------------------------------------------------------
Louis Davidson - http://spaces.msn.com/members/drsql/
SQL Server MVP
"Arguments are to be avoided: they are always vulgar and often convincing."
(Oscar Wilde)

Show quote
"Michael Bray" <mbray@makeDIntoDot_ctiusaDcom> wrote in message
news:Xns974AB04EC7F73mbrayctiusacom@207.46.248.16...
> "Louis Davidson" <dr_dontspamme_sql@hotmail.com> wrote in
> news:emLVFsIGGHA.2064@TK2MSFTNGP09.phx.gbl:
>
>> When I added a unique constraint:
>> alter table #data add unique (dataId, timeCollected)
>> The plans were equivalent.
>>
>
> That's actually quite funny.  When I tested your statement against my
> statement on the REAL table (the example I provided was not the real data
> table - it was just for presentation here in the newsgroup), I found that
> your statement and mine had exactly the same execution plan.  Then I
> checked the indexes on the table, and indeed, it does have an index on
> (dataId, timeCollected).
>
> Bravo!
>
> -mdb

AddThis Social Bookmark Button