Home All Groups Group Topic Archive Search About

Self-join based on datetime

Author
4 Aug 2006 10:24 PM
stavros
I'm trying to write a self-joining query that simply returns records
from a table alongside the previous record (based on a timestamp) for
each employee.  I created a small mock-up table and wrote a query that
almost works, but not quite.  Here are some details:

Table:

   CREATE TABLE #TMPSALES (
      TID INT IDENTITY(1,1),
      EMPLOYEE VARCHAR(30),
      TSTAMP DATETIME,
      SALES INT
   )

Sample table data:

TID   EMPLOYEE       TSTAMP                  SALES
1     bob            2006-07-01 09:15:08     100
2     ed             2006-07-01 09:32:19     200
3     jack           2006-07-01 09:41:30     300
4     bob            2006-07-01 11:08:51     400
5     jack           2006-07-01 11:10:12     500
6     ed             2006-07-01 11:15:41     600
7     bob            2006-07-02 10:03:14     700
8     jack           2006-07-02 10:05:19     800
9     ed             2006-07-02 10:34:46     900
10    bob            2006-07-03 09:24:13     700
11    jack           2006-07-03 10:02:09     800
12    ed             2006-07-03 10:06:20     900


Desired query output:

TID   EMP    TSTAMP_CUR            SALES_CUR  TID_PREV  TSTAMP_PREV
      SALES_PREV
4     bob    2006-07-01 11:08:51   400        1         2006-07-01
09:15:08   100
5     jack   2006-07-01 11:10:12   500        3         2006-07-01
09:41:30   300
6     ed     2006-07-01 11:15:41   600        2         2006-07-01
09:32:19   200
7     bob    2006-07-02 10:03:14   700        4         2006-07-01
11:08:51   400
8     jack   2006-07-02 10:05:19   800        5         2006-07-01
11:10:12   500
9     ed     2006-07-02 10:34:46   900        6         2006-07-01
11:15:41   600
10    bob    2006-07-03 09:24:13   700        7         2006-07-02
10:03:14   700
11    jack   2006-07-03 10:02:09   800        8         2006-07-02
10:05:19   800
12    ed     2006-07-03 10:06:20   900        9         2006-07-02
10:34:46   900

My failed query attempt:

-------
   select p1.*, p2.*
   from #tmpsales p1
      join #tmpsales p2
      on    p1.employee = p2.employee
      and   p1.tstamp = (
            select max(tstamp)
            from #tmpsales
            where tstamp < p2.tstamp
         )
   order by p1.employee, p1.tstamp
--------

(My query's column selection and order doesn't match the desired
output, but you get the idea.)

For some reason the query has trouble with the datetime comparisons,
unless I do something like use CONVERT to lop off part of the time
portion, or make sure the sample data is in even seconds.  Also, I
suspect this is going to perform very poorly over a huge table, with
hundreds of employees joining over thousands of days.

Is there a better approach I can use here?

Author
4 Aug 2006 11:35 PM
Chris Lim
stav***@mailinator.com wrote:
> For some reason the query has trouble with the datetime comparisons,
> unless I do something like use CONVERT to lop off part of the time
> portion, or make sure the sample data is in even seconds.  Also, I

Try this. It's untested as I don't have time right now to try it
against your sample data (thanks for providing that by the way) but I
think it should solve your datetime comparison problem.

   select p1.*, p2.*
   from #tmpsales p1
      join #tmpsales p2
      on    p2.employee = p1.employee
      and   p2.tstamp = (
            select max(tstamp)
            from #tmpsales
            where employee = p2.employee
            and tstamp < p1.tstamp
         )
   order by p1.employee, p1.tstamp

Make sure you have an unique index on employee and tstamp since you are
joining on them.

> suspect this is going to perform very poorly over a huge table, with
> hundreds of employees joining over thousands of days.
>
> Is there a better approach I can use here?

If your TID's are in chronological order, at least within an employee,
you could do this.

   SELECT p1.*, p2.*
   FROM #tmpsales p1
   INNER JOIN (  SELECT p1.TID, prevTID = MAX(p2.TID)
                   FROM #tmpsales p1
                   INNER JOIN #tmpsales p2
                      ON    p2.employee = p1.employee
                      AND   p2.tstamp < p1.tstamp
                   GROUP BY p1.TID
               ) tmp
        ON  tmp.TID = p1.TID
    INNER JOIN #tmpsales p2
        ON  p2.TID = tmp.prevTID
   ORDER BY p1.employee, p1.tstamp

This avoids having a correlated sub-query which should improve
performance when your table is large.

In SQL Server 2005 you can use the OVER() clause to simplify the
queries.

Chris
Author
15 Aug 2006 4:29 PM
stavros
Thanks, that worked great!

Chris Lim wrote:
Show quote
> stav***@mailinator.com wrote:
> > For some reason the query has trouble with the datetime comparisons,
> > unless I do something like use CONVERT to lop off part of the time
> > portion, or make sure the sample data is in even seconds.  Also, I
>
> Try this. It's untested as I don't have time right now to try it
> against your sample data (thanks for providing that by the way) but I
> think it should solve your datetime comparison problem.
>
>    select p1.*, p2.*
>    from #tmpsales p1
>       join #tmpsales p2
>       on    p2.employee = p1.employee
>       and   p2.tstamp = (
>             select max(tstamp)
>             from #tmpsales
>             where employee = p2.employee
>             and tstamp < p1.tstamp
>          )
>    order by p1.employee, p1.tstamp
>
> Make sure you have an unique index on employee and tstamp since you are
> joining on them.
>
> > suspect this is going to perform very poorly over a huge table, with
> > hundreds of employees joining over thousands of days.
> >
> > Is there a better approach I can use here?
>
> If your TID's are in chronological order, at least within an employee,
> you could do this.
>
>    SELECT p1.*, p2.*
>    FROM #tmpsales p1
>    INNER JOIN (  SELECT p1.TID, prevTID = MAX(p2.TID)
>                    FROM #tmpsales p1
>                    INNER JOIN #tmpsales p2
>                       ON    p2.employee = p1.employee
>                       AND   p2.tstamp < p1.tstamp
>                    GROUP BY p1.TID
>                ) tmp
>         ON  tmp.TID = p1.TID
>     INNER JOIN #tmpsales p2
>         ON  p2.TID = tmp.prevTID
>    ORDER BY p1.employee, p1.tstamp
>
> This avoids having a correlated sub-query which should improve
> performance when your table is large.
>
> In SQL Server 2005 you can use the OVER() clause to simplify the
> queries.
>
> Chris
Author
4 Aug 2006 11:45 PM
Arnie Rowland
You were very close...


SELECT
     t1.TID
   , t1.Employee
   , TStamp_Cur  = t1.TStamp
   , Sales_Cur   = t1.Sales
   , TID_Prev    = t2.TID
   , TStamp_Prev = t2.TStamp
   , Sales_Prev  = t2.Sales
FROM #TmpSales t1
   JOIN #TmpSales t2
      ON (   t1.Employee = t2.Employee
         AND t2.TStamp = ( SELECT max( TStamp )
                           FROM #TmpSales
                           WHERE (   Employee = t1.Employee
                                 AND TStamp   < t1.TStamp
                                 )
                         )
         )
ORDER BY
     t1.Employee
   , t1.TStamp

--
Arnie Rowland, Ph.D.
Westwood Consulting, Inc

Most good judgment comes from experience.
Most experience comes from bad judgment.
- Anonymous


Show quote
<stav***@mailinator.com> wrote in message news:1154730289.226075.308300@p79g2000cwp.googlegroups.com...
> I'm trying to write a self-joining query that simply returns records
> from a table alongside the previous record (based on a timestamp) for
> each employee.  I created a small mock-up table and wrote a query that
> almost works, but not quite.  Here are some details:
>
> Table:
>
>   CREATE TABLE #TMPSALES (
>      TID INT IDENTITY(1,1),
>      EMPLOYEE VARCHAR(30),
>      TSTAMP DATETIME,
>      SALES INT
>   )
>
> Sample table data:
>
> TID   EMPLOYEE       TSTAMP                  SALES
> 1     bob            2006-07-01 09:15:08     100
> 2     ed             2006-07-01 09:32:19     200
> 3     jack           2006-07-01 09:41:30     300
> 4     bob            2006-07-01 11:08:51     400
> 5     jack           2006-07-01 11:10:12     500
> 6     ed             2006-07-01 11:15:41     600
> 7     bob            2006-07-02 10:03:14     700
> 8     jack           2006-07-02 10:05:19     800
> 9     ed             2006-07-02 10:34:46     900
> 10    bob            2006-07-03 09:24:13     700
> 11    jack           2006-07-03 10:02:09     800
> 12    ed             2006-07-03 10:06:20     900
>
>
> Desired query output:
>
> TID   EMP    TSTAMP_CUR            SALES_CUR  TID_PREV  TSTAMP_PREV
>      SALES_PREV
> 4     bob    2006-07-01 11:08:51   400        1         2006-07-01
> 09:15:08   100
> 5     jack   2006-07-01 11:10:12   500        3         2006-07-01
> 09:41:30   300
> 6     ed     2006-07-01 11:15:41   600        2         2006-07-01
> 09:32:19   200
> 7     bob    2006-07-02 10:03:14   700        4         2006-07-01
> 11:08:51   400
> 8     jack   2006-07-02 10:05:19   800        5         2006-07-01
> 11:10:12   500
> 9     ed     2006-07-02 10:34:46   900        6         2006-07-01
> 11:15:41   600
> 10    bob    2006-07-03 09:24:13   700        7         2006-07-02
> 10:03:14   700
> 11    jack   2006-07-03 10:02:09   800        8         2006-07-02
> 10:05:19   800
> 12    ed     2006-07-03 10:06:20   900        9         2006-07-02
> 10:34:46   900
>
> My failed query attempt:
>
> -------
>   select p1.*, p2.*
>   from #tmpsales p1
>      join #tmpsales p2
>      on    p1.employee = p2.employee
>      and   p1.tstamp = (
>            select max(tstamp)
>            from #tmpsales
>            where tstamp < p2.tstamp
>         )
>   order by p1.employee, p1.tstamp
> --------
>
> (My query's column selection and order doesn't match the desired
> output, but you get the idea.)
>
> For some reason the query has trouble with the datetime comparisons,
> unless I do something like use CONVERT to lop off part of the time
> portion, or make sure the sample data is in even seconds.  Also, I
> suspect this is going to perform very poorly over a huge table, with
> hundreds of employees joining over thousands of days.
>
> Is there a better approach I can use here?
>
Author
5 Aug 2006 7:20 PM
--CELKO--
>> Is there a better approach I can use here? <<

Your real problem is that your DDL stinks.  You have no idea that rows
are not records, how to name a data element, what a relational key is
(never IDENTITY by definition!!) or how to model time in SQL. Or even
what a NULL means.  Your table mimics a paper list on a clipboard and
not an RDBMS.

Let's get this right:

CREATE TABLE  Sales
(emp_name  VARCHAR(30) NOT NULL,
sales_start_time  DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
sales_end_time  DATETIME,  -- null means currently open
CHECK (sales_start_time, sales_end_time),
sales_cnt INTEGER NOT NULL
   CHECK  (sales_cnt > 0),
PRIMARY KEY (emp_name, sales_time));

1) a relational key is never, never, never IDENTITY by definition!!
That is how newbies mimic  record number or the line numbers on a
clipboard -- your case.

2) Remember Zeno and Einstein?  Time is always in durations modeled as
half-open intervals, so you should have that atomic fact in two scalar
(start, end) columns. When your event is not finished, use (<start
value>, NULL)

3) Rows and columns are nothing like records and fields.  We have
constraints, data types and things that to guarantee the data is good.
We do not use temp table; we use VIEWs on base tables, derived tables
or CTEs.

What you have done is called attribute splitting.  You took the single
fact about an event and split the (start, end) pair across two rows.
You could also screw up by splitting the values of an attribute into
tables, tables into separate DBs, etc.

AddThis Social Bookmark Button