Home All Groups Group Topic Archive Search About

working with gaps in date series

Author
6 Sep 2006 1:27 PM
Steve
I've seen some pretty creative SQL statements that locate
first/last/missing elements in a series but I haven't been able to
adapt any of them to work speedily with my data set.

Here's the problem:

We have a table of services for our clients (about 2 million rows).
The rows are simply the client's ID and the date of service.  Rows may
dupe as clients may have more than one service per date.

We need to be able to define a case start and end date for each client
ID.  The end date is defined as the last service date with no further
activity within some number (usually 180) of days.  Furthermore, I'd
need to enumerate the cases per client.


So using the example below, I'd be looking for results such as:

Client  Start       End          Case
55577   2/01/2004   5/11/2004    1
55577   1/09/2005   1/09/2005    2
55577   3/04/2006   OPEN    3
72395   4/04/2006   OPEN    1


In these cases, the OPEN dates indicate that there has not been a 180
day period of inactivity since the most recent date.


I currently do this in VB, looping through the ServList dataset and
populating a CaseList recordset.  It takes about 30mins to run the job.
However, I'd prefer doing it all in a stored procedure and I'd prefer
to do it without use of cursors.  Possible?

Thanks for any help/thoughts,
Steve




CREATE TABLE #ServList (
  ClientID int
, ServDate smalldatetime)


INSERT INTO #Servlist VALUES (55577, '2/01/2004')
INSERT INTO #Servlist VALUES (55577, '2/01/2004')
INSERT INTO #Servlist VALUES (55577, '5/11/2004')
INSERT INTO #Servlist VALUES (55577, '1/09/2005')
INSERT INTO #Servlist VALUES (55577, '3/04/2006')
INSERT INTO #Servlist VALUES (55577, '5/17/2006')
INSERT INTO #Servlist VALUES (72395, '4/04/2006')
INSERT INTO #Servlist VALUES (72395, '4/05/2006')
INSERT INTO #Servlist VALUES (72395, '4/06/2006')

Author
6 Sep 2006 1:55 PM
Alexander Kuznetsov
Steve,

start from this:

SELECT DISTINCT ClientID, ServDate AS Start
, (
  SELECT MIN(s1.ServDate)
    FROM #ServList s1
    WHERE s1.ClientID = s.ClientID
    AND s1.ServDate >= s.ServDate
---------- instead of getdate(), use last midnight
    AND s1.ServDate < DATEADD(DAY, -179, GETDATE())
    AND NOT EXISTS(SELECT 1 FROM #ServList s2 WHERE s2.ClientID =
s1.ClientID AND s2.ServDate > s1.ServDate AND s2.ServDate <=
DATEADD(DAY, 179, s1.ServDate))
)
FROM #ServList s
WHERE NOT EXISTS(SELECT 1 FROM #ServList s1 WHERE s1.ClientID =
s.ClientID AND s1.ServDate >= DATEADD(DAY, -179, s.ServDate) AND
s1.ServDate < s.ServDate)

ClientID    Start

----------- ------------------------------------------------------
------------------------------------------------------
55577       2004-02-01 00:00:00
2004-05-11 00:00:00
55577       2005-01-09 00:00:00
2005-01-09 00:00:00
55577       2006-03-04 00:00:00                                    NULL
72395       2006-04-04 00:00:00                                    NULL

(4 row(s) affected)

Enumerating should be easy.
Author
6 Sep 2006 5:53 PM
Steve
Alexander,

Perfect!  Against 2 million rows, yours ran in just over 3 minutes.
I'd been trying something similar but clearly missed the boat as mine
was 45 mins on only 100,000 rows.  Thanks for setting me straight.

-Steve

Alexander Kuznetsov wrote:
Show quote
> Steve,
>
> start from this:
>
> SELECT DISTINCT ClientID, ServDate AS Start
> , (
>   SELECT MIN(s1.ServDate)
>     FROM #ServList s1
>     WHERE s1.ClientID = s.ClientID
>     AND s1.ServDate >= s.ServDate
> ---------- instead of getdate(), use last midnight
>     AND s1.ServDate < DATEADD(DAY, -179, GETDATE())
>     AND NOT EXISTS(SELECT 1 FROM #ServList s2 WHERE s2.ClientID =
> s1.ClientID AND s2.ServDate > s1.ServDate AND s2.ServDate <=
> DATEADD(DAY, 179, s1.ServDate))
> )
> FROM #ServList s
> WHERE NOT EXISTS(SELECT 1 FROM #ServList s1 WHERE s1.ClientID =
> s.ClientID AND s1.ServDate >= DATEADD(DAY, -179, s.ServDate) AND
> s1.ServDate < s.ServDate)
>
> ClientID    Start
>
> ----------- ------------------------------------------------------
> ------------------------------------------------------
> 55577       2004-02-01 00:00:00
> 2004-05-11 00:00:00
> 55577       2005-01-09 00:00:00
> 2005-01-09 00:00:00
> 55577       2006-03-04 00:00:00                                    NULL
> 72395       2006-04-04 00:00:00                                    NULL
>
> (4 row(s) affected)
>
> Enumerating should be easy.
Author
6 Sep 2006 6:22 PM
Anith Sen
In SQL 2005, you can use the ROW_NUMBER() function to derive the case
column. For the remainder of the query, you can try:

SELECT clientid,
       CONVERT( VARCHAR, MIN( start_dt ), 101 ) AS "start",
       CASE WHEN DATEDIFF( d, end_dt, CURRENT_TIMESTAMP ) > 180
            THEN CONVERT( VARCHAR, end_dt, 101 )
            ELSE 'OPEN'
       END AS "end"
  FROM ( SELECT s1.ClientId, s1.ServDate, MAX( s2.ServDate )
           FROM #ServList s1
           JOIN #ServList s2
             ON s2.ClientID = s1.ClientID
            AND s2.ServDate <= DATEADD( d, 180, s1.ServDate )
          GROUP BY s1.ClientId, s1.ServDate ) D ( clientid, start_dt,
end_dt )
GROUP BY clientid, end_dt
ORDER BY clientid, end_dt ;

--
Anith
Author
6 Sep 2006 6:44 PM
Alexander Kuznetsov
Anith,

because of this line in your query:

            AND s2.ServDate <= DATEADD( d, 180, s1.ServDate )

your intervals cannot be longer than 181 days.
Consider this sample data, where service dates are 5 months apart:


INSERT INTO #Servlist VALUES (55577, '1/01/2004')
INSERT INTO #Servlist VALUES (55577, '6/01/2004')
INSERT INTO #Servlist VALUES (55577, '11/01/2004')
INSERT INTO #Servlist VALUES (55577, '4/01/2005')
INSERT INTO #Servlist VALUES (55577, '9/01/2005')
INSERT INTO #Servlist VALUES (55577, '2/01/2006')

Your query returns a set of intervals with common boundaries

clientid    start                          end

----------- ------------------------------
------------------------------
55577       01/01/2004                     06/01/2004
55577       06/01/2004                     11/01/2004
55577       11/01/2004                     04/01/2005
55577       04/01/2005                     09/01/2005
55577       09/01/2005                     02/01/2006

(5 row(s) affected)

I believe it should return just one interval:

ClientID    Start

----------- ------------------------------------------------------
------------------------------------------------------
55577       2004-01-01 00:00:00
2006-02-01 00:00:00

(1 row(s) affected)

Steve, did I get the requirements right?
Author
6 Sep 2006 8:51 PM
Anith Sen
>> your intervals cannot be longer than 181 days.

I thought that is the requirement when he meant "with no further activity
within some number (usually 180) of days.". Perhaps, I might have misread
it, but your response seems to satisfy his needs.

--
Anith
Author
8 Sep 2006 12:48 PM
Steve
Alexander,

Yes, you've got it exactly.
-Thanks again!

Alexander Kuznetsov wrote:
Show quote
> Anith,
>
> because of this line in your query:
>
>             AND s2.ServDate <= DATEADD( d, 180, s1.ServDate )
>
> your intervals cannot be longer than 181 days.
> Consider this sample data, where service dates are 5 months apart:
>
>
> INSERT INTO #Servlist VALUES (55577, '1/01/2004')
> INSERT INTO #Servlist VALUES (55577, '6/01/2004')
> INSERT INTO #Servlist VALUES (55577, '11/01/2004')
> INSERT INTO #Servlist VALUES (55577, '4/01/2005')
> INSERT INTO #Servlist VALUES (55577, '9/01/2005')
> INSERT INTO #Servlist VALUES (55577, '2/01/2006')
>
> Your query returns a set of intervals with common boundaries
>
> clientid    start                          end
>
> ----------- ------------------------------
> ------------------------------
> 55577       01/01/2004                     06/01/2004
> 55577       06/01/2004                     11/01/2004
> 55577       11/01/2004                     04/01/2005
> 55577       04/01/2005                     09/01/2005
> 55577       09/01/2005                     02/01/2006
>
> (5 row(s) affected)
>
> I believe it should return just one interval:
>
> ClientID    Start
>
> ----------- ------------------------------------------------------
> ------------------------------------------------------
> 55577       2004-01-01 00:00:00
> 2006-02-01 00:00:00
>
> (1 row(s) affected)
>
> Steve, did I get the requirements right?
Author
6 Sep 2006 9:36 PM
--CELKO--
>>  We have a table of services for our clients (about 2 million rows). The rows are simply the client's ID and the date of service.  Rows may dupe as clients may have more than one service per date.  <<

Clear specs, thank you!!  But weak DDL and you do not seem to know that
SQL uses ISO-8601 date formats, like all other ISO standards do, in the
sample data.  Here is my guess at the real DDL:

CREATE TABLE ServiceTickets
(client_id CHAR(5) NOT NULL
   REFERENCES Clients(client_id)
   ON UPDATE CASCADE
   ON DELETE CASCADE,
service_code CHAR(5) NOT NULL
   REFERENCES Services(service_code)
   ON UPDATE CASCADE,
service_date DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
service_period INTEGER DEFAULT 0 NOT NULL,
  PRIMARY KEY (client_id, service_code, service_date));

>> We need to be able to define a case start and end date for each client ID.  The end date is defined as the last service date with no further activity within some number (usually 180) of days.  Furthermore, I'd need to enumerate the cases per client. <<

Change the way you think for a minute.  Data and declarations, not
procedures and computations.  SQL, not VB.

>> I currently do this in VB, looping through the ServList dataset and populating a CaseList recordset.  It takes about 30 mins to run the job. <<

Build a table with (n) years of these reporting periods of 180 (or
whatever) days.  A spreadsheet is great for this kind of thing.

CREATE TABLE CasePeriods -- needs better name
(service_period INTEGER NOT NULL PRIMARY KEY
  CHECK (case_period_nbr > 0),
start_date DATETIME NOT NULL,
end_date DATETIME NOT NULL);

Now we can find the first service_period that a client has.

SELECT T.client_id, MIN(C.service_period)
  FROM ServiceTickets AS T, CasePeriods AS C
WHERE T.service_date BETWEEN  C.start_date AND C.end_date
   AND end_date < CURRENT_TIMESTAMP -- completed periods only
   AND service_period = 0 -- unassigned periods
   AND ??
GROUP BY T.client_id
HAVING COUNT(*) > 0;

I have a question about the rules.  If the guy comes in on day 1, then
comes back on day 180 and 181, are we in the same case or not?  If the
guy comes in on day 1, then comes back on day 180 and 182, are we in
the same case or not?

Test this, then use it in an UPDATE statement to put a value in the
service_period column for those rows that are within your (n) day
range.  We could hide this in a VIEW, but it looks complex enough to
materialize the service_period values.

>> In these cases, the OPEN dates indicate that there has not been a 180 day period of inactivity since the most recent date. <<

Ther ain't no such date as "Open"; this is why I used a zero reporting
period for the things in process.
Author
8 Sep 2006 1:14 PM
Steve
Joe,

Thanks for the response - it looks like you've guessed that we're
talking about medical claims data.  However, in my case I think that it
may be irrelevant as to the type of service_code.  In our DDL, the
primary key would actually be based on (client_id, service_code,
service_code_modifier_1, servdate, provider_id).  For simplicity, we
use a view that's keyed on distinct client_id, servdate.

What we're needing to do is look for periods of inactivity and declare
episodes of care.  So, as an example, If the guy comes in on day 1 and
the next time is day 180; that's in the same case.  And then if he
show's up on day 181, that's still in the same case (because he showed
up on day 180).  If a guy shows on day 1 and then the next day is 181,
those are 2 different cases.  What will break the series into the next
case is when a servdate (S2) is more than 180 days from the most recent
date (S1).  Case 1 then is defined as the client_id, the first date
(whatever that was, as long as it was within the past 180 days), and S1
as the closing date.  Case 2 will be defined as the client_id, S2 as
the startdate, and some new close date (or open if most recent date
within the past 6 months).

And, you are most correct in my need for a shift in thinking from
procedural-based execution vs set-based!!  Learning ...

Thanks very much for taking the time to assist!
-Steve Lord


--CELKO-- wrote:
-- snip --
Show quote
>
> I have a question about the rules.  If the guy comes in on day 1, then
> comes back on day 180 and 181, are we in the same case or not?  If the
> guy comes in on day 1, then comes back on day 180 and 182, are we in
> the same case or not?
>
> Test this, then use it in an UPDATE statement to put a value in the
> service_period column for those rows that are within your (n) day
> range.  We could hide this in a VIEW, but it looks complex enough to
> materialize the service_period values.
>
> >> In these cases, the OPEN dates indicate that there has not been a 180 day period of inactivity since the most recent date. <<
>
> Ther ain't no such date as "Open"; this is why I used a zero reporting
> period for the things in process.

AddThis Social Bookmark Button