Home All Groups Group Topic Archive Search About

Is there a smart set-based solution?

Author
29 Jul 2005 10:53 PM
Mowgli
I have a column of starting times below (converted to seconds for
simplicity):

500
505
510
535
910
939
944
977

I need to assign a Group ID to the above values, based on a time
interval of 30 seconds, so the correct result would be:

     Group ID
500, 1
505, 1
510, 1
535, 2
910, 3
939, 3
944, 4
977, 5

The value of Group ID isn't important (could be 100, 200, etc.) as long
as values in the same 30-second interval have the same Group ID. When a
new group is started, a new 30-second interval is also started.

I have created a solution using SQL loops (row-based processing) but
was wondering if there is an efficient/elegant set-based solution when
there are thousands of records?


Thanks

Mowgli

Author
30 Jul 2005 12:55 AM
Brian Selzer
No.  Any set based solution will involve nested loops because the value of
the group id is dependent on the values of the starting time in the previous
rows.  This is one of the rare cases in which a cursor will actually perform
faster than a set based solution because it can accomplish what you want in
a single pass through the table.

Show quote
"Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message
news:1122677634.927113.124950@g43g2000cwa.googlegroups.com...
> I have a column of starting times below (converted to seconds for
> simplicity):
>
> 500
> 505
> 510
> 535
> 910
> 939
> 944
> 977
>
> I need to assign a Group ID to the above values, based on a time
> interval of 30 seconds, so the correct result would be:
>
>      Group ID
> 500, 1
> 505, 1
> 510, 1
> 535, 2
> 910, 3
> 939, 3
> 944, 4
> 977, 5
>
> The value of Group ID isn't important (could be 100, 200, etc.) as long
> as values in the same 30-second interval have the same Group ID. When a
> new group is started, a new 30-second interval is also started.
>
> I have created a solution using SQL loops (row-based processing) but
> was wondering if there is an efficient/elegant set-based solution when
> there are thousands of records?
>
>
> Thanks
>
> Mowgli
>
Author
30 Jul 2005 1:02 AM
Brian Selzer
If there are a lot of indexes on the table, you may want to cache the
results generated by the cursor in a table variable so that they can be
committed in a single set-based statement.

Show quote
"Brian Selzer" <br***@selzer-software.com> wrote in message
news:uf4PvFKlFHA.3260@TK2MSFTNGP10.phx.gbl...
> No.  Any set based solution will involve nested loops because the value of
> the group id is dependent on the values of the starting time in the
previous
> rows.  This is one of the rare cases in which a cursor will actually
perform
> faster than a set based solution because it can accomplish what you want
in
> a single pass through the table.
>
> "Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message
> news:1122677634.927113.124950@g43g2000cwa.googlegroups.com...
> > I have a column of starting times below (converted to seconds for
> > simplicity):
> >
> > 500
> > 505
> > 510
> > 535
> > 910
> > 939
> > 944
> > 977
> >
> > I need to assign a Group ID to the above values, based on a time
> > interval of 30 seconds, so the correct result would be:
> >
> >      Group ID
> > 500, 1
> > 505, 1
> > 510, 1
> > 535, 2
> > 910, 3
> > 939, 3
> > 944, 4
> > 977, 5
> >
> > The value of Group ID isn't important (could be 100, 200, etc.) as long
> > as values in the same 30-second interval have the same Group ID. When a
> > new group is started, a new 30-second interval is also started.
> >
> > I have created a solution using SQL loops (row-based processing) but
> > was wondering if there is an efficient/elegant set-based solution when
> > there are thousands of records?
> >
> >
> > Thanks
> >
> > Mowgli
> >
>
>
Author
30 Jul 2005 2:16 AM
--CELKO--
>> I have created a solution using SQL loops (row-based processing) but was wondering if there is an efficient/elegant set-based solution when there are thousands of records [sic]? <<

Rows are not records.  If you keep thinking in the wrong terms, you
will always think that you have to use loops.  Create a simple table
for each group and then use a BETWEEN predicate.  You can get an entire
year into about a million row table.

CREATE TABLE TimeGroups
(group_id INTEGER NOT NULL PRIMARY KEY
start_time INTEGER NOT NULL,
end_time INTEGER NOT NULL,
CHECK (start_time +30 = end_time));

You should find that this is about 10 times faster than a cursor as the
set gets larger.
Author
30 Jul 2005 3:06 AM
Brian Selzer
Unfortunately, your solution doesn't answer the original question.  How do
you build the table in the first place?  In this case a cursor is the
fastest way to find the group id's, because the alternative is a self-join,
which will always require more than one pass through the table.  Not every
cursor is evil.  Actually, you'll find that the time that it takes for the
cursor based solution increases linearly with the number of rows, whereas
the time that it takes for the set-based solution increases geometrically.

By the way, E. F. Codd used the term "record" in some of his papers.  I
wonder, do you add [sic] when you refer to his work?

"--CELKO--" <jcelko***@earthlink.net> wrote in message
news:1122689807.624405.127910@g49g2000cwa.googlegroups.com...
> >> I have created a solution using SQL loops (row-based processing) but
was wondering if there is an efficient/elegant set-based solution when there
are thousands of records [sic]? <<
Show quote
>
> Rows are not records.  If you keep thinking in the wrong terms, you
> will always think that you have to use loops.  Create a simple table
> for each group and then use a BETWEEN predicate.  You can get an entire
> year into about a million row table.
>
> CREATE TABLE TimeGroups
> (group_id INTEGER NOT NULL PRIMARY KEY
>  start_time INTEGER NOT NULL,
>  end_time INTEGER NOT NULL,
>  CHECK (start_time +30 = end_time));
>
> You should find that this is about 10 times faster than a cursor as the
> set gets larger.
>
Author
30 Jul 2005 12:22 PM
--CELKO--
>> your solution doesn't answer the original question.  How do you build the table in the first place?  <<

I find a spreadsheet is pretty fast for stuff like this, especailly
temporal stuff.

>> In this case a cursor is the fastest way to find the group id's, because the alternative is a self-join, which will always require more than one pass through the table.<<

Doidn't i just igve a solution which requires a single join?

>> .. the cursor based solution increases linearly with the number of rows, whereas the time that it takes for the set-based solution increases geometrically. <<

yes, the cursor is linear, but the join can be done in parallel.  Given
a clusterd indexing the joins are no worse than linear.
Author
30 Jul 2005 1:22 PM
Brian Selzer
The question was: How do you load the spreadsheet?

Joining two tables is always slower than accessing one.  A merge or hash
join of two tables requires reading at least twice as much information than
an index scan of a single table--even with clustered indexes in the same
order and a parallel execution plan.  A loop join does a scan of one table
and repetitive index seeks into another.

I really wish you would give a few moments' thought to your responses,
because your answers don't make any sense.  I'd like to believe that you're
not stupid, but come on!  Even the mentally challenged can see that reading
two tables takes longer than reading one.  Are you so religiously opposed to
cursors that you would deliberately post misleading and incorect
information?

"--CELKO--" <jcelko***@earthlink.net> wrote in message
news:1122726160.642393.161830@g44g2000cwa.googlegroups.com...
> >> your solution doesn't answer the original question.  How do you build
the table in the first place?  <<
>
> I find a spreadsheet is pretty fast for stuff like this, especailly
> temporal stuff.
>
> >> In this case a cursor is the fastest way to find the group id's,
because the alternative is a self-join, which will always require more than
one pass through the table.<<
>
> Doidn't i just igve a solution which requires a single join?
>
> >> .. the cursor based solution increases linearly with the number of
rows, whereas the time that it takes for the set-based solution increases
geometrically. <<
Show quote
>
> yes, the cursor is linear, but the join can be done in parallel.  Given
> a clusterd indexing the joins are no worse than linear.
>
Author
30 Jul 2005 3:28 AM
Shahriar
Hi Mr. Celko
You wrote... Rows are not records.  If you keep thinking in the wrong terms,
you
will always think that you have to use loops.

Could you elaborate on that a bit.

many thanks
Shahriar

Show quote
"--CELKO--" <jcelko***@earthlink.net> wrote in message
news:1122689807.624405.127910@g49g2000cwa.googlegroups.com...
>>> I have created a solution using SQL loops (row-based processing) but was
>>> wondering if there is an efficient/elegant set-based solution when there
>>> are thousands of records [sic]? <<
>
> Rows are not records.  If you keep thinking in the wrong terms, you
> will always think that you have to use loops.  Create a simple table
> for each group and then use a BETWEEN predicate.  You can get an entire
> year into about a million row table.
>
> CREATE TABLE TimeGroups
> (group_id INTEGER NOT NULL PRIMARY KEY
> start_time INTEGER NOT NULL,
> end_time INTEGER NOT NULL,
> CHECK (start_time +30 = end_time));
>
> You should find that this is about 10 times faster than a cursor as the
> set gets larger.
>
Author
30 Jul 2005 12:24 PM
--CELKO--
Like most new ideas, the hard part of understanding what the relational
model is comes in un-learning what you know about file systems.  As
Artemus Ward (William Graham Sumner, 1840-1910) put it, "It ain't so
much the things we don't know that get us into trouble. It's the things
we know that just ain't so."

If you already have a background in data processing with traditional
file systems, the first things to un-learn are:

(0) Databases are not file sets.
(1) Tables are not files.
(2) Rows are not records.
(3) Columns are not fields.

Modern data processing began with punch cards, or Hollerith cards used
by the Bureau of the Census.  Their original size was that of a United
States Dollar bill.  This was set by their inventor, Herman Hollerith,
because he could get furniture to store the cards from the United
States Treasury Department, just across the street.  Likewise, physical
constraints limited each card to 80 columns of holes in which to record
a symbol.

The influence of the punch card lingered on long after the invention of
magnetic tapes and disk for data storage.  This is why early video
display  terminals were 80 columns across.  Even today, files which
were migrated from cards to magnetic tape files or disk storage still
use 80 column records.

But the influence was not just on the physical side of data processing.
The methods for handling data from the prior media were imitated in
the new  media.

Data processing first consisted of sorting and merging decks of punch
cards (later, sequential magnetic tape files) in a series of distinct
steps.  The result of each step feed into the next step in the process.


Relational databases do not work that way.  Each user connects to the
entire database all at once, not to one file at time in a sequence of
steps.  The users might not all have the same database access rights
once they are connected, however.  Magnetic tapes could not be shared
among users at the same time, but shared data is the point of a
database.

Tables versus Files

A file is closely related to its physical storage media.  A table may
or may  not be a physical file.  DB2 from IBM uses one file per table,
while Sybase puts several entire databases inside one file.  A table is
a <i>set<i> of rows of the same kind of thing.  A set has no ordering
and it makes no sense to ask for the first or last row.

A deck of punch cards is sequential, and so are magnetic tape files.
Therefore, a <i>physical<i> file of ordered sequential records also
became the <i>mental<i> model for data processing and it is still hard
to shake.  Anytime you look at data, it is in some physical ordering.

The various access methods for disk storage system came later, but even
these access methods could not shake the mental model.

Another conceptual difference is that a file is usually data that deals
with  a whole business process.  A file has to have enough data in
itself to support applications for that business process.  Files tend
to be "mixed"  data which can be described by the name of the business
process, such as "The Payroll file" or something like that.

Tables can be either entities or relationships within a business
process.  This means that the data which was held in one file is often
put into several tables.  Tables tend to be "pure" data which can be
described by single words.  The payroll would now have separate tables
for timecards, employees, projects and so forth.

Tables as Entities

An entity is physical or conceptual "thing" which has meaning be
itself.  A  person, a sale or a product would be an example.  In a
relational database, an entity is defined by its attributes, which are
shown as values in columns in rows in a table.

To remind users that tables are sets of entities, I like to use plural
or collective nouns that describe the function of the entities within
the system for the names of tables.  Thus "Employee" is a bad name
because it is singular; "Employees" is a better name because it is
plural; "Personnel" is best because it is collective and does not
summon up a mental picture of individual persons.

If you have tables with exactly the same structure, then they are sets
of the same kind of elements.  But you should have only one set for
each kind of data element!  Files, on the other hand, were PHYSICALLY
separate units of storage which coudl be alike -- each tape or disk
file represents a step in the PROCEDURE , such as moving from raw data,
to edited data, and finally to archived data.  In SQL, this should be a
status flag in a table.

Tables as Relationships

A relationship is shown in a table by columns which reference one or
more entity tables.  Without the entities, the relationship has no
meaning, but the relationship can have attributes of its own.  For
example, a show  business contract might have an agent, an employer and
a talent.  The method of payment is an attribute of the contract
itself, and not of any of the three parties.

Rows versus Records

Rows are not records.  A record is defined in the application program
which  reads it; a row is defined in the database schema and not by a
program at all.  The name of the field in the READ or INPUT statements
of the application; a row is named in the database schema.

All empty files look alike; they are a directory entry in the operating
system with a name and a length of zero bytes of storage.  Empty tables
still have columns, constraints, security privileges and other
structures, even tho they have no rows.

This is in keeping with the set theoretical model, in which the empty
set is a perfectly good set.  The difference between SQL's set model
and standard mathematical set theory is that set theory has only one
empty set, but in SQL  each table has a different structure, so they
cannot be used in places where non-empty versions of themselves could
not be used.

Another characteristic of rows in a table is that they are all alike in
structure and they are all the "same kind of thing" in the model.  In a
file system, records can vary in size, datatypes and structure by
having flags in the data stream that tell the program reading the data
how to interpret it.  The most common examples are Pascal's variant
record, C's struct syntax and Cobol's OCCURS clause.

The OCCURS keyword in Cobol and the Variant records in Pascal have a
number which tells the program how many time a record structure is to
be repeated in the current record.

Unions in 'C' are not variant records, but variant mappings for the
same physical memory. For example:

union x {int ival; char j[4];} myStuff;

defines myStuff to be either an integer (which are 4 bytes on most
modern C  compilers, but this code is non-portable) or an array of 4
bytes, depending on whether you say myStuff.ival or myStuff.j[0];

But even more than that, files often contained records which were
summaries of subsets of the other records -- so called control break
reports.  There is no requirement that the records in a file be related
in any way -- they are literally a stream of binary data whose meaning
is assigned by the program reading them.

Columns versus Fields

A field within a record is defined by the application program that
reads it.  A column in a row in a table is defined by the database
schema.  The datatypes in a column are always scalar.

The order of the application program variables in the READ or INPUT
statements is important because the values are read into the program
variables in that order.  In SQL, columns are referenced only by their
names.  Yes, there are shorthands like the SELECT * clause and INSERT
INTO <table name> statements which expand into a list of column names
in the physical order in which the column names appear within their
table declaration, but these are shorthands which resolve to named
lists.

The use of NULLs in SQL is also unique to the language.  Fields do not
support a missing data marker as part of the field, record or file
itself.  Nor do fields have constraints which can be added to them in
the record, like the DEFAULT and CHECK() clauses in SQL.

Relationships among tables within a database

Files are pretty passive creatures and will take whatever an
application program throws at them without much objection.  Files are
also independent of each other simply because they are connected to one
application program at a time and therefore have no idea what other
files looks like.

A database actively seeks to maintain the correctness of all its data.
The methods used are triggers, constraints and declarative referential
integrity.

Declarative referential integrity (DRI) says, in effect, that data in
one table has a particular relationship with data in a second (possibly
the same)  table.  It is also possible to have the database change
itself via referential actions associated with the DRI.

For example, a business rule might be that we do not sell products
which are not in inventory.  This rule would be enforce by a REFERENCES
clause on the Orders table which references the Inventory table and a
referential action of ON DELETE CASCADE

Triggers are a more general way of doing much the same thing as DRI.  A
trigger is a block of procedural code which is executed before, after
or instead of an INSERT INTO or UPDATE statement.  You can do anything
with a trigger that you can do with DRI and more.

However, there are problems with TRIGGERs.  While there is a standard
syntax  for them in the SQL-92 standard, most vendors have not
implemented it.  What they have is very proprietary syntax instead.
Secondly, a trigger cannot pass information to the optimizer like DRI.
In the example in this section, I know that for every product number in
the Orders table, I have that same product number in the Inventory
table.  The optimizer can use that information in setting up EXISTS()
predicates and JOINs in the queries.  There is no reasonable way to
parse procedural trigger code to determine this relationship.

The CREATE ASSERTION statement in SQL-92 will allow the database to
enforce conditions on the entire database as a whole.  An ASSERTION is
not like a CHECK() clause, but the difference is subtle.  A CHECK()
clause is executed when there are rows in the table to which it is
attached.  If the table is empty then all CHECK() clauses are
effectively TRUE.  Thus, if we wanted to be sure that the Inventory
table is never empty, and we wrote:

CREATE TABLE Inventory
( ...
  CONSTRAINT inventory_not_empty
       CHECK ((SELECT COUNT(*) FROM Inventory) > 0), ... );

it would not work.  However, we could write:

CREATE ASSERTION Inventory_not_empty
        CHECK ((SELECT COUNT(*) FROM Inventory) > 0);

and we would get the desired results.  The assertion is checked at the
schema level and not at the table level.
Author
30 Jul 2005 10:19 AM
Mowgli
CELKO

Thanks for your comments... I do try to think using the right mind"set"
but it can be difficult sometimes...
Author
30 Jul 2005 12:39 PM
--CELKO--
CREATE TABLE TimeGroups
(group_id INTEGER NOT NULL PRIMARY KEY -- cliuster this
start_time DECIMAL (4,1) NOT NULL
   CHECK (start_time = ((10 * event_time)/ 10)), -- no decimal
end_time IDECIMAL (4,1) NOT NULL,
CHECK (end_time = start_time - 0.1));

CREATE TABLE Events
(event_time DECIMAL (4,1) NOT NULL PRIMARY KEY -- cliuster this
   CHECK (event_time = ((10 * event_time)/ 10)) -- no decimal
);

INSERT INTO TimeGroups VALUES (1, 500.0, 529.9);
INSERT INTO TimeGroups VALUES (2, 530.0, 559.9);
INSERT INTO TimeGroups VALUES (3, 560.0, 589.9);
INSERT INTO TimeGroups VALUES (4, 590.0, 619.9);

Now the query is simply:

SELECT G.group_id, E.event_time
   FROM TimeGroups AS G,  Events AS E
WHERE  E.event_time  BETWEEN G.start_time AND G.end_time;
Author
30 Jul 2005 2:27 PM
Mowgli
CELKO

Yes, I hope to reach the stage where I can write a simple query like
that...
I first need to populate TimeGroups and only have the event start time.

Is the solution in "SQL for Smarties"?!
Author
30 Jul 2005 3:34 PM
--CELKO--
This is an example of using an auxillary table, whcih is covered in SQL
FOR SMARTIES.  The third edition will have more details.

Consider the problem of loading the Time Groups table.  The typical
approach is to write a loop, but if you have an auxilary Seqence table
you can write this:

INSERT INTO TimeGroups (group_id, start_time, end_time)
SELECT seq, (seq + @start_value), (seq + @start_value)+ 029.9
  FROM Sequence AS S
WHERE S.seq < @my_size;

You can also use this SELECT as a derived table in the other query.  Do
not create a table variable; they take disk time at run time, will not
port, hide from the otimizer and cannot be re-used by other queries.
But if you only want to use the groupings once, then write it as one
query:

SELECT G.group_id, E.event_time
   FROM (SELECT seq, (seq + @start_value), (seq + @start_value)+ 029.9
        FROM Sequence AS S
     WHERE S.seq < @my_size) AS G(group_id, start_time, end_time),
        Events AS E
WHERE  E.event_time  BETWEEN G.start_time AND G.end_time;

Another important SQL programming principle is that time is modeled as
durations.  If you fail to do this in your tables, you will have to
write complicated code that uses self-joins to construct these
durations, like Brian has shown you.
Author
30 Jul 2005 5:28 PM
Mowgli
CELKO

I vaguely remember sequence tables in the first edition of the book - I
can never find books when I need them!
Perhaps hypnosis will help.

One of the difficulties I face is the need for non-uniform grouping
intervals...
The first group is 500 to 529.9, next one is 535 to 564.9, etc.

But I can see how the techniques you have illustrated will assist me.
Thank you.
Author
30 Jul 2005 5:42 AM
Shahriar
Mowgli

I think this may just do it.
Table1 contains the field times.
Table2 contains 2 fields, Times and groupid


insert into table2 (times,groupid) (select times,times/30 from Table1)

Shahriar


Show quote
"Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message
news:1122677634.927113.124950@g43g2000cwa.googlegroups.com...
>I have a column of starting times below (converted to seconds for
> simplicity):
>
> 500
> 505
> 510
> 535
> 910
> 939
> 944
> 977
>
> I need to assign a Group ID to the above values, based on a time
> interval of 30 seconds, so the correct result would be:
>
>     Group ID
> 500, 1
> 505, 1
> 510, 1
> 535, 2
> 910, 3
> 939, 3
> 944, 4
> 977, 5
>
> The value of Group ID isn't important (could be 100, 200, etc.) as long
> as values in the same 30-second interval have the same Group ID. When a
> new group is started, a new 30-second interval is also started.
>
> I have created a solution using SQL loops (row-based processing) but
> was wondering if there is an efficient/elegant set-based solution when
> there are thousands of records?
>
>
> Thanks
>
> Mowgli
>
Author
30 Jul 2005 6:55 AM
Brian Selzer
I tried that: it doesn't work correctly.  The start time for each group
after the first depends on whether the start time of the first row for that
group is more than 30 seconds past the start time of the first row for the
previous group.  If you use times/30 the interval that includes 500 runs
from 480 through 509, so the row with the 510 start time is in the next
interval.  The expected results clearly show that the rows with 500 and 510
are in the same interval.

There is no set-based operation that can accurately calculate the group id
in a single pass.  With a cursor, you can cache the start time of the
current group in a variable, and bump the current group id if the start time
for the current cursor row is more than 30 seconds past the cached start
time for the current group.

"Shahriar" <HelloShahr***@hotmail.com> wrote in message
news:IqEGe.97$4e6.80@trnddc04...
Show quote
> Mowgli
>
> I think this may just do it.
> Table1 contains the field times.
> Table2 contains 2 fields, Times and groupid
>
>
> insert into table2 (times,groupid) (select times,times/30 from Table1)
>
> Shahriar
>
>
> "Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message
> news:1122677634.927113.124950@g43g2000cwa.googlegroups.com...
> >I have a column of starting times below (converted to seconds for
> > simplicity):
> >
> > 500
> > 505
> > 510
> > 535
> > 910
> > 939
> > 944
> > 977
> >
> > I need to assign a Group ID to the above values, based on a time
> > interval of 30 seconds, so the correct result would be:
> >
> >     Group ID
> > 500, 1
> > 505, 1
> > 510, 1
> > 535, 2
> > 910, 3
> > 939, 3
> > 944, 4
> > 977, 5
> >
> > The value of Group ID isn't important (could be 100, 200, etc.) as long
> > as values in the same 30-second interval have the same Group ID. When a
> > new group is started, a new 30-second interval is also started.
> >
> > I have created a solution using SQL loops (row-based processing) but
> > was wondering if there is an efficient/elegant set-based solution when
> > there are thousands of records?
> >
> >
> > Thanks
> >
> > Mowgli
> >
>
>
Author
30 Jul 2005 7:32 AM
Mowgli
Brian, Shahriar

Thanks for your input and for concentrating on the problem rather than
on my terminology!

The example I gave uses a time interval of 30 seconds but this can also
be variable (20, 40, 60, etc.).
Brian, perhaps you could demonstrate the inefficient set-based solution
to illustrate your point?
.... As the issues may not be clear to anyone who hasn't attempted this
type of "real-life" problem.
Author
30 Jul 2005 11:36 AM
Brian Selzer
Here's an example set-based solution.  It's pretty ugly, but it's the only
way I can think of to meet your requirement.  A cursor in this case is
looking better and better.

create table tableName (startSeconds int not null)
go
insert tableName values (500)
insert tableName values (505)
insert tableName values (510)
insert tableName values (535)
insert tableName values (910)
insert tableName values (939)
insert tableName values (944)
insert tableName values (977)
go
declare @interval int set @interval = 30
declare @x table
(
start int not null,
stop int not null,
groupID int identity(1, 1) not null primary key clustered
)
insert @x (start, stop)
select min(start) as start, stop
from
  (select case
     when d.start in
      (select distinct
       (select top 1 startSeconds
        from tableName
        where startSeconds >= b.startSeconds
         and startSeconds < b.startSeconds + @interval
        order by startSeconds desc) as stop
       from tableName b)
      then d.stop
     else d.start
    end as start, stop
   from
    (select a.startSeconds as start,
     (select top 1 startSeconds
      from tableName
      where startSeconds >= a.startSeconds
       and startSeconds < a.startSeconds + @interval
      order by startSeconds desc) as stop
     from tableName a) d) e
group by stop
order by start
select startSeconds, groupID
from tablename
  join @x
   on (startSeconds between start and stop)



Show quote
"Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message
news:1122708734.727361.324820@g43g2000cwa.googlegroups.com...
> Brian, Shahriar
>
> Thanks for your input and for concentrating on the problem rather than
> on my terminology!
>
> The example I gave uses a time interval of 30 seconds but this can also
> be variable (20, 40, 60, etc.).
> Brian, perhaps you could demonstrate the inefficient set-based solution
> to illustrate your point?
> ... As the issues may not be clear to anyone who hasn't attempted this
> type of "real-life" problem.
>
Author
30 Jul 2005 2:10 PM
Mowgli
Brian

That's the best attempt at calculating the Group ID values so far and
quick work as well...
Your solution works for most sets of "normal" data and even when
@interval is changed.
When I tried a seemingly innocuous set of data:

truncate table tableName
go
insert tableName values (1)
insert tableName values (29)
insert tableName values (30)
insert tableName values (31)

The result was...

startSeconds groupID
1            1
29           1
29           2
30           1
30           2
31           2


The table variable (@x) contains:

start       stop        groupID
1           30          1
29          31          2

So I assume that there is a problem populating @x...
However, your attempt is better than any of mine and probably close to
the final solution!
Author
30 Jul 2005 5:02 PM
Brian Selzer
Here's the cursor based solution.

declare @interval int set @interval = 30
declare @startSeconds int, @groupID int, @groupStart int
select @startSeconds = 0, @groupID = 0, @groupStart = - @interval * 2
declare @x table (startSeconds int not null, groupID int not null)
declare @c cursor
set @c = cursor local fast_forward for
select startSeconds from tableName order by startSeconds
open @c
if @@cursor_rows != 0
fetch @c into @startSeconds
while @@fetch_status = 0
begin
if @groupStart + @interval < @startSeconds
  select @groupID = @groupID + 1, @groupStart = @startSeconds
insert @x (startSeconds, groupID) values (@startSeconds, @groupID)
fetch @c into @startSeconds
end
close @c
deallocate @c
select * from @x

Show quote
"Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message
news:1122732601.149568.274190@o13g2000cwo.googlegroups.com...
> Brian
>
> That's the best attempt at calculating the Group ID values so far and
> quick work as well...
> Your solution works for most sets of "normal" data and even when
> @interval is changed.
> When I tried a seemingly innocuous set of data:
>
> truncate table tableName
> go
> insert tableName values (1)
> insert tableName values (29)
> insert tableName values (30)
> insert tableName values (31)
>
> The result was...
>
> startSeconds groupID
> 1            1
> 29           1
> 29           2
> 30           1
> 30           2
> 31           2
>
>
> The table variable (@x) contains:
>
> start       stop        groupID
> 1           30          1
> 29          31          2
>
> So I assume that there is a problem populating @x...
> However, your attempt is better than any of mine and probably close to
> the final solution!
>
Author
30 Jul 2005 5:40 PM
Brian Selzer
sorry, slight bug in previous post.  here's the correct version:

declare @interval int set @interval = 30
declare @startSeconds int, @groupID int, @groupStart int
select @startSeconds = 0, @groupID = 0, @groupStart = - @interval * 2
declare @x table (startSeconds int not null, groupID int not null)
declare @c cursor
set @c = cursor local fast_forward for
select startSeconds from tableName order by startSeconds
open @c
if @@cursor_rows != 0
begin
fetch @c into @startSeconds
while @@fetch_status = 0
begin
  if @groupStart + @interval <= @startSeconds
   select @groupID = @groupID + 1, @groupStart = @startSeconds
  insert @x (startSeconds, groupID) values (@startSeconds, @groupID)
  fetch @c into @startSeconds
end
end
close @c
deallocate @c
select * from @x
go

This makes one only one pass through the table, so it will perform much
better than the set-based solution for larger data sets.

Show quote
"Brian Selzer" <br***@selzer-software.com> wrote in message
news:#OZ7FiSlFHA.1044@tk2msftngp13.phx.gbl...
> Here's the cursor based solution.
>
> declare @interval int set @interval = 30
> declare @startSeconds int, @groupID int, @groupStart int
> select @startSeconds = 0, @groupID = 0, @groupStart = - @interval * 2
> declare @x table (startSeconds int not null, groupID int not null)
> declare @c cursor
> set @c = cursor local fast_forward for
>  select startSeconds from tableName order by startSeconds
> open @c
> if @@cursor_rows != 0
> fetch @c into @startSeconds
> while @@fetch_status = 0
> begin
>  if @groupStart + @interval < @startSeconds
>   select @groupID = @groupID + 1, @groupStart = @startSeconds
>  insert @x (startSeconds, groupID) values (@startSeconds, @groupID)
>  fetch @c into @startSeconds
> end
> close @c
> deallocate @c
> select * from @x
>
> "Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message
> news:1122732601.149568.274190@o13g2000cwo.googlegroups.com...
> > Brian
> >
> > That's the best attempt at calculating the Group ID values so far and
> > quick work as well...
> > Your solution works for most sets of "normal" data and even when
> > @interval is changed.
> > When I tried a seemingly innocuous set of data:
> >
> > truncate table tableName
> > go
> > insert tableName values (1)
> > insert tableName values (29)
> > insert tableName values (30)
> > insert tableName values (31)
> >
> > The result was...
> >
> > startSeconds groupID
> > 1            1
> > 29           1
> > 29           2
> > 30           1
> > 30           2
> > 31           2
> >
> >
> > The table variable (@x) contains:
> >
> > start       stop        groupID
> > 1           30          1
> > 29          31          2
> >
> > So I assume that there is a problem populating @x...
> > However, your attempt is better than any of mine and probably close to
> > the final solution!
> >
>
>
Author
30 Jul 2005 6:09 PM
Mowgli
Brian

Don't worry, the end of a group interval is open to interpretation!

I tested your code on my aging laptop (85,000 rows in the table) and it
completed in less than a minute. In terms of logic that's the most
efficient cursor solution I've come across. I don't think anyone can
produce a better row-based processing solution than that.

Thank you for your time.
Author
30 Jul 2005 7:09 PM
Brian Selzer
> Don't worry, the end of a group interval is open to interpretation!
Yeah, but the results were wrong, {1, 29, 30, 31} should produce {1, 1, 1,
2}, not {1, 1, 1, 1}, so I fixed it.

The solution will perform better if the source data has an index (preferably
clustered) on startSeconds.  A clustered index scan finds the first matching
row and then walks through the leaf pages, whereas a nonclustered index has
to look up each row in the clustered index.  Either type of index eliminates
the requisite sort of the input data.

I've honed my cursor technique over the years, and I have a few pointers,
for anybody who's interested.

(1) avoid selects from normal tables within a fetch loop.  Select all of the
needed rows using set-based operations before the first row is fetched.

(2) avoid updates to normal tables within a fetch loop.  Cache updates in
temp tables or table variables as the fetch loop executes, and then commit
them using set-based logic after the cursor is closed.  This way the server
can optimize index updates by committing all of the changes at once to each
separate index, instead of hundreds or thousands of single updates.  In
addition, triggers fire only once for each update, further enhancing
performance.  Also table variables are created and operated on in memory,
provided you have enough, so the impact of caching is usually minimal.

(3) avoid using dynamic cursors.  Use FAST_FORWARD cursors and implement
optimistic concurrency using rowversion columns.  Dynamic cursors hold on to
locks, which can cause blocking and deadlocks.

(4)  avoid executing a fetch loop within a transaction.  See (3).

(5) avoid executing stored procedures from within a fetch loop.  Another
stored procedure may access (see 1) or update (see 2) normal tables.  In
addition, the logic within them may change. Better to keep the code executed
within a fetch loop within the same procedure.  That way the performance of
the fetch loop isn't dependent on the performance of other procedures that
may change.

A scan with a cursor typically takes twice as long as a set-based scan, so
set-based operations are to be preferred over row-based; however, in very
rare instances a cursor can improve performance dramatically.  If you can
eliminate one or more iterations (loop joins) from an execution plan by
using a cursor, then the performance improvement can be astronomical.  It
depends on the number of rows in the result set on the left hand side of
each iteration and the number of operations on the right hand side.


Show quote
"Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message
news:1122746989.689049.99390@g44g2000cwa.googlegroups.com...
> Brian
>
> Don't worry, the end of a group interval is open to interpretation!
>
> I tested your code on my aging laptop (85,000 rows in the table) and it
> completed in less than a minute. In terms of logic that's the most
> efficient cursor solution I've come across. I don't think anyone can
> produce a better row-based processing solution than that.
>
> Thank you for your time.
>
Author
30 Jul 2005 7:30 PM
Mowgli
Brian

Thanks for the pointers - very much appreciated.
Author
30 Jul 2005 7:39 AM
Shahriar
Brian

I wasnt looking at his data.  What I understood from his question was that
he wants to group the times in chunks of 30 secs.  However, even so, its
still simple.
All you have to do is add the base number to the formula.

Something like this.
set @base=500
insert into table2 (times,groupid) (select times+@base,times/30 from
Table1)

Of course, you can also just dynamically get the smallest time and store it
in @base.

Result set based on his data set:
Time   Groupid
500    33
505    33
510    33
535    34
910    47
934    47
944    48
977    49


Shahriar


Show quote
"Brian Selzer" <br***@selzer-software.com> wrote in message
news:OiasqONlFHA.576@TK2MSFTNGP15.phx.gbl...
>I tried that: it doesn't work correctly.  The start time for each group
> after the first depends on whether the start time of the first row for
> that
> group is more than 30 seconds past the start time of the first row for the
> previous group.  If you use times/30 the interval that includes 500 runs
> from 480 through 509, so the row with the 510 start time is in the next
> interval.  The expected results clearly show that the rows with 500 and
> 510
> are in the same interval.
>
> There is no set-based operation that can accurately calculate the group id
> in a single pass.  With a cursor, you can cache the start time of the
> current group in a variable, and bump the current group id if the start
> time
> for the current cursor row is more than 30 seconds past the cached start
> time for the current group.
>
> "Shahriar" <HelloShahr***@hotmail.com> wrote in message
> news:IqEGe.97$4e6.80@trnddc04...
>> Mowgli
>>
>> I think this may just do it.
>> Table1 contains the field times.
>> Table2 contains 2 fields, Times and groupid
>>
>>
>> insert into table2 (times,groupid) (select times,times/30 from Table1)
>>
>> Shahriar
>>
>>
>> "Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message
>> news:1122677634.927113.124950@g43g2000cwa.googlegroups.com...
>> >I have a column of starting times below (converted to seconds for
>> > simplicity):
>> >
>> > 500
>> > 505
>> > 510
>> > 535
>> > 910
>> > 939
>> > 944
>> > 977
>> >
>> > I need to assign a Group ID to the above values, based on a time
>> > interval of 30 seconds, so the correct result would be:
>> >
>> >     Group ID
>> > 500, 1
>> > 505, 1
>> > 510, 1
>> > 535, 2
>> > 910, 3
>> > 939, 3
>> > 944, 4
>> > 977, 5
>> >
>> > The value of Group ID isn't important (could be 100, 200, etc.) as long
>> > as values in the same 30-second interval have the same Group ID. When a
>> > new group is started, a new 30-second interval is also started.
>> >
>> > I have created a solution using SQL loops (row-based processing) but
>> > was wondering if there is an efficient/elegant set-based solution when
>> > there are thousands of records?
>> >
>> >
>> > Thanks
>> >
>> > Mowgli
>> >
>>
>>
>
>
Author
30 Jul 2005 9:03 AM
Mowgli
Shahriar

Not sure if you mis-typed, do you mean...

set @base=500
select times, (times+@base) / 30 from Table1

That looks correct for the data provided but if we insert 2 more
values:

Time   Groupid
500    33
505    33
510    33
535    34
638    37
667    38
910    47
934    47
944    48
977    49

We find that 638 and 667 have different Group IDs when they should be
the same!
Author
30 Jul 2005 7:23 AM
Chandra
hi,

if u are not particular about the groupID, i have a solution for u.
but as per the requirement, group ID definately falls in 30sec intervals.

select num, (num-(select min(num)from #test))/30 GroupID from #test

please let me know if this answers your question

--
best Regards,
Chandra
http://chanduas.blogspot.com/
http://groups.msn.com/SQLResource/
---------------------------------------



Show quote
"Mowgli" wrote:

> I have a column of starting times below (converted to seconds for
> simplicity):
>
> 500
> 505
> 510
> 535
> 910
> 939
> 944
> 977
>
> I need to assign a Group ID to the above values, based on a time
> interval of 30 seconds, so the correct result would be:
>
>      Group ID
> 500, 1
> 505, 1
> 510, 1
> 535, 2
> 910, 3
> 939, 3
> 944, 4
> 977, 5
>
> The value of Group ID isn't important (could be 100, 200, etc.) as long
> as values in the same 30-second interval have the same Group ID. When a
> new group is started, a new 30-second interval is also started.
>
> I have created a solution using SQL loops (row-based processing) but
> was wondering if there is an efficient/elegant set-based solution when
> there are thousands of records?
>
>
> Thanks
>
> Mowgli
>
>
Author
30 Jul 2005 7:45 AM
Mowgli
Chandra,

Nice try! Your query gives us the following result set:

500, 0
505, 0
510, 0
535, 1
910, 13
939, 14
944, 14
977, 15

As you can see, 910 and 939 are in different groups when they should be
in the same group...
Perhaps you could tweak your original solution?
Author
30 Jul 2005 10:06 PM
Hugo Kornelis
On 29 Jul 2005 15:53:54 -0700, Mowgli wrote:

(snip)
>I have created a solution using SQL loops (row-based processing) but
>was wondering if there is an efficient/elegant set-based solution when
>there are thousands of records?

Hi Mowgli,

I agree with Brian that there is probably no efficient way to do this in
one single set-based SQL statement. But I don't think that this means
row by row processing is the only efficient way.

It's about midnight here now - too late for me to write SQL of this
level of complexity. But here's a rough outline of an alternative
approach that I'd suggest. I'm not sure if it will run faster than a
straight row-by-row version, but I gather that it's worth a try.

Step 1: For each row that is not within 30 seconds past any other row,
set the group value equal to the seconds value.

Step 2: Add all rows within 30 seconds of any of the group values to the
same group.

Step 3: Repeat step 1, but exclude all rows that are already in a group.

Step 4: Repeat step 2.

Repeat steps 3 and 4 until @@ROWCOUNT on either of them is 0. At that
point, all rows should have a group set.

A slightly different version of this would repeat only step 1 and
postpone the filling of all rows in the same group to the end. That
would reduce the number of UPDATE statements executed, but it would
probably increrase the complexity of the SQL for step 1.

Let me know if you want to try this, but need help writing the SQL
statements. I'll have a go at it when I have more time and when I am
less tired.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
30 Jul 2005 10:57 PM
Mowgli
Hugo

Yes, your alternative method has merit and I would welcome any attempt
you make to write the SQL statements.

I will have a go myself but it will probably take me longer... and I'm
not even tired!

Thanks
Author
31 Jul 2005 10:57 PM
Hugo Kornelis
On 30 Jul 2005 15:57:52 -0700, Mowgli wrote:

>Hugo
>
>Yes, your alternative method has merit and I would welcome any attempt
>you make to write the SQL statements.

Hi Mowgli,

I have put together the SQL for the two variations of my idea that I
posted, and tested them on your sample data + some extra rows. I think
the results are as you want them.

My guess is that the performance will depend for a large part on the
actual distribution of your data. If you have many gaps of over 30
seconds between rows, performance will be fast. If you have hundreds of
rows with no 30+-second gap in between them, either the cursor or
Steve's code will probably win.

Let me know how these versions work on your data!


Version 1:
=========

CREATE TABLE StartTimes
            (StartSeconds int NOT NULL PRIMARY KEY,
             GroupID int DEFAULT NULL)
go
INSERT INTO StartTimes (StartSeconds)
SELECT 500
UNION ALL
SELECT 505
UNION ALL
SELECT 510
UNION ALL
SELECT 529
UNION ALL
SELECT 530
UNION ALL
SELECT 531
UNION ALL
SELECT 535
UNION ALL
SELECT 550
UNION ALL
SELECT 910
UNION ALL
SELECT 939
UNION ALL
SELECT 944
UNION ALL
SELECT 977
go
Step1:
UPDATE StartTimes
SET    GroupID = StartSeconds
WHERE NOT EXISTS
(SELECT *
  FROM   StartTimes AS b
  WHERE  b.StartSeconds BETWEEN StartTimes.StartSeconds - 29
                            AND StartTimes.StartSeconds - 1
  AND    b.GroupID IS NULL)
AND    GroupID IS NULL
IF @@ROWCOUNT = 0 GOTO Done
Step2:
UPDATE StartTimes
SET    GroupID =
(SELECT MAX(GroupID)
  FROM   StartTimes AS b
  WHERE  b.StartSeconds BETWEEN StartTimes.StartSeconds - 29
                            AND StartTimes.StartSeconds - 1
  AND    b.GroupID = b.StartSeconds)
WHERE  GroupID IS NULL
IF @@ROWCOUNT > 0 GOTO Step1
Done:
-- Show results
SELECT   StartSeconds, GroupID
FROM     StartTimes
ORDER BY StartSeconds
go
DROP TABLE StartTimes
go


Version 2:
=========

CREATE TABLE StartTimes
            (StartSeconds int NOT NULL PRIMARY KEY,
             GroupID int DEFAULT NULL)
go
INSERT INTO StartTimes (StartSeconds)
SELECT 500
UNION ALL
SELECT 505
UNION ALL
SELECT 510
UNION ALL
SELECT 529
UNION ALL
SELECT 530
UNION ALL
SELECT 531
UNION ALL
SELECT 535
UNION ALL
SELECT 550
UNION ALL
SELECT 910
UNION ALL
SELECT 939
UNION ALL
SELECT 944
UNION ALL
SELECT 977
go
Step1:
UPDATE StartTimes
SET    GroupID = StartSeconds
WHERE  GroupID IS NULL
AND NOT EXISTS
(SELECT *
  FROM   StartTimes AS b
  WHERE  b.StartSeconds BETWEEN StartTimes.StartSeconds - 29
                            AND StartTimes.StartSeconds - 1
  AND (  b.GroupID IS NOT NULL
  OR (   b.GroupID IS NULL
     AND NOT EXISTS
   (SELECT *
    FROM   StartTimes AS c
    WHERE  c.StartSeconds BETWEEN b.StartSeconds - 29
                              AND b.StartSeconds - 1
    AND    c.GroupID = c.StartSeconds))))
IF @@ROWCOUNT > 0 GOTO Step1
Step2:
UPDATE StartTimes
SET    GroupID =
(SELECT MAX(GroupID)
  FROM   StartTimes AS b
  WHERE  b.StartSeconds BETWEEN StartTimes.StartSeconds - 29
                            AND StartTimes.StartSeconds - 1
  AND    b.GroupID = b.StartSeconds)
WHERE  GroupID IS NULL
Done:
-- Show results
SELECT   StartSeconds, GroupID
FROM     StartTimes
ORDER BY StartSeconds
go
DROP TABLE StartTimes
go


Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
1 Aug 2005 2:09 PM
Mowgli
Hugo

Thanks - your solutions do produce the correct results, so we can now
do a (brief) performance comparison:

1. Iterative Solution: 20 seconds (Steve)
2. Cursor Solution: 21 seconds (Brian)
3. Alternative Solution 1: 45 seconds (Hugo)
4. Alternative Solution 2: 126 seconds (Hugo)

I have excluded the time taken to display the generated data so as to
focus on the relative performance of each algorithm. I have also added
an extra step (timed) for the non-cursor solutions whereby the data to
be processed is placed in a temporary table (avoids some concurrency
issues).

For the Iterative/Cursor solutions, the above results are average times
for processing 85,000 rows of sample data. I also performed tests on
different group intervals and smaller data sets - the differences in
relative performance changed but the final ranking (in the main)
didn't.

In fact, the performance of the iterative and cursor solutions were
very similar!

The results of the "Alternative Solutions" were based on the processing
of only 1,700 rows - so these were much slower than the others.
However, I may know why that is...

Alternative Solution 1:
I think the UPDATE statement in step 2 processes too many rows at each
iteration. If there are 1,700 rows with start times that are close
together, with a 30 second interval setting, at each iteration hundreds
of rows could be updated. So NULL Group IDs may be assigned NULL many
times.

If the above solution can be tweaked I think it will perform much, much
better. It's too good an idea to be ignored!
Hugo, thanks for bringing it to my attention.
Author
1 Aug 2005 8:01 PM
Hugo Kornelis
On 1 Aug 2005 07:09:51 -0700, Mowgli wrote:

>Hugo
>
>Thanks - your solutions do produce the correct results, so we can now
>do a (brief) performance comparison:
>
>1. Iterative Solution: 20 seconds (Steve)
>2. Cursor Solution: 21 seconds (Brian)
>3. Alternative Solution 1: 45 seconds (Hugo)
>4. Alternative Solution 2: 126 seconds (Hugo)

Hi Mowgli,

Thanks for posting the results. I expected solution 2 to be slower than
solution 1. Good to see my suspicion confirmed.

(snip)
>The results of the "Alternative Solutions" were based on the processing
>of only 1,700 rows - so these were much slower than the others.
>However, I may know why that is...
>
>Alternative Solution 1:
>I think the UPDATE statement in step 2 processes too many rows at each
>iteration. If there are 1,700 rows with start times that are close
>together, with a 30 second interval setting, at each iteration hundreds
>of rows could be updated. So NULL Group IDs may be assigned NULL many
>times.

You got it. When posting my solution, I was aware that large numbers of
rows with small starttime differences would kill performance. For data
with lots of 30+-second intervals, my alternative solution would perform
better (or should I say: less bad). Maybe still not as fast as the other
solutions, though. That's because it starts at each 30+-sec gap
simultaneously. To gain more insight in how the two solutions work, add
a SELECT * FROM StartTimes at strategic places - but not just before the
IF @@ROWCOUNT statements! (I did that yesterday, and it took me several
minutes of staring at the screen before I finally realised why the *$^#%
thing insisted on endless looping....)

In the extreme case of many rows with no single 30+-second interval, my
solution would follow the exact same steps as David's - but it'd use
more complicated code to get there.
In the other extreme, where each of the final groups is 30+ seconds
apart from each other group, my solution would finish after the first
iteration. But unfortunately, your data is not like that. :-)


>If the above solution can be tweaked I think it will perform much, much
>better. It's too good an idea to be ignored!

To prevent the needless UPDATEs, you'd have to repeat the correlated
subquery of the SET in the WHERE clause of the UPDATE. Like this:

AND EXISTS (SELECT *
            FROM   StartTimes AS b
            WHERE  b.StartSeconds BETWEEN StartTimes.StartSeconds - 29
                                      AND StartTimes.StartSeconds - 1
            AND    b.GroupID = b.StartSeconds)

I doubt f it will help much, though.

If you do want to see how much faster you can get this, there are only
two things I could consider.

1. Indexing. My feeling is that the clustered unique index on the
StartSeconds column will be the best. (That index is automatically
created if you declare StartSeconds the primary key, as I did in my
sample data). But it never hurts to try sopme alternatives. Or to run
the code with profiler active and feed the output to the index tuning
wizard.

2. I think some speed can be gained by changing the GroupID column to
    GroupID int NOT NULL DEFAULT 0  -- Or any other unused value.
Then, change the queries to test for = 0 or <> 0 instead of IS NULL and
IS NOT NULL. I think that this will result in a big decrease of the
number of page splits necessary during the UPDATE statements.

However, at the end of the day, I suspect that David's code and the
cursor will still be faster. As explained above - this solution is
optimized for a data distribution that differs from yours.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
4 Aug 2005 7:00 PM
Mowgli
Hugo

Your suggested changes have improved the query processing on my data
sets:
4,000 rows in 45 seconds! (was 1,700 rows in 45 seconds previously)

But as you say, your solution works much better on less dense data sets.
Author
4 Aug 2005 7:50 PM
Hugo Kornelis
On 4 Aug 2005 12:00:13 -0700, Mowgli wrote:

>Hugo
>
>Your suggested changes have improved the query processing on my data
>sets:
>4,000 rows in 45 seconds! (was 1,700 rows in 45 seconds previously)
>
>But as you say, your solution works much better on less dense data sets.

Hi Mowgli,

Thanks for letting me know how it worked out!

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
30 Jul 2005 11:08 PM
Alejandro Mesa
Mowgli,

Here is a possible solution. I am using a view to make the main select
readable. If you need a range diff from 30, you can create a inline udf that
expect this value as a parameter. I am posting both methods.

create table t1 (
c1 int
)
go

insert into t1 values(500)
insert into t1 values(505)
insert into t1 values(510)
insert into t1 values(535)
insert into t1 values(910)
insert into t1 values(939)
insert into t1 values(944)
insert into t1 values(977)
go

create view v1
as
select
    c1,
    (select min(b.c1) from t1 as b where b.c1 between a.c1 - 30 and a.c1) as c2
from
    t1 as a
go

-- using a view
select
    c1,
    (select count(distinct b.c2) from v1 as b where b.c2 <= a.c2) as rank
from
    v1 as a
order by
    rank, c1
go

create function f1 (
@i int
)
returns table
as
return (
select
    c1,
    (select min(b.c1) from t1 as b where b.c1 between a.c1 - @i and a.c1) as c2
from
    t1 as a
)
go

-- using a udf
select
    c1,
    (select count(distinct b.c2) from dbo.f1(30) as b where b.c2 <= a.c2) as rank
from
    dbo.f1(30) as a
order by
    rank, c1
go

select
    c1,
    (select count(distinct b.c2) from dbo.f1(25) as b where b.c2 <= a.c2) as rank
from
    dbo.f1(25) as a
order by
    rank, c1
go

drop view v1
go

drop funcion f1
go

drop table t1
go


AMB


Show quote
"Mowgli" wrote:

> I have a column of starting times below (converted to seconds for
> simplicity):
>
> 500
> 505
> 510
> 535
> 910
> 939
> 944
> 977
>
> I need to assign a Group ID to the above values, based on a time
> interval of 30 seconds, so the correct result would be:
>
>      Group ID
> 500, 1
> 505, 1
> 510, 1
> 535, 2
> 910, 3
> 939, 3
> 944, 4
> 977, 5
>
> The value of Group ID isn't important (could be 100, 200, etc.) as long
> as values in the same 30-second interval have the same Group ID. When a
> new group is started, a new 30-second interval is also started.
>
> I have created a solution using SQL loops (row-based processing) but
> was wondering if there is an efficient/elegant set-based solution when
> there are thousands of records?
>
>
> Thanks
>
> Mowgli
>
>
Author
31 Jul 2005 12:03 AM
Alejandro Mesa
You can also use an inner join instead a correlated subquery, to calculate
the rank. May be having a clustered index by c1 can improve performance.

select
    a.c1,
    count(distinct b.c2) as rank
from
    dbo.f1(30) as a
    inner join
    dbo.f1(30) as b
    on b.c2 <= a.c2
group by
    a.c1
order by
    rank, a.c1
go


AMB

Show quote
"Alejandro Mesa" wrote:

> Mowgli,
>
> Here is a possible solution. I am using a view to make the main select
> readable. If you need a range diff from 30, you can create a inline udf that
> expect this value as a parameter. I am posting both methods.
>
> create table t1 (
> c1 int
> )
> go
>
> insert into t1 values(500)
> insert into t1 values(505)
> insert into t1 values(510)
> insert into t1 values(535)
> insert into t1 values(910)
> insert into t1 values(939)
> insert into t1 values(944)
> insert into t1 values(977)
> go
>
> create view v1
> as
> select
>     c1,
>     (select min(b.c1) from t1 as b where b.c1 between a.c1 - 30 and a.c1) as c2
> from
>     t1 as a
> go
>
> -- using a view
> select
>     c1,
>     (select count(distinct b.c2) from v1 as b where b.c2 <= a.c2) as rank
> from
>     v1 as a
> order by
>     rank, c1
> go
>
> create function f1 (
> @i int
> )
> returns table
> as
> return (
> select
>     c1,
>     (select min(b.c1) from t1 as b where b.c1 between a.c1 - @i and a.c1) as c2
> from
>     t1 as a
> )
> go
>
> -- using a udf
> select
>     c1,
>     (select count(distinct b.c2) from dbo.f1(30) as b where b.c2 <= a.c2) as rank
> from
>     dbo.f1(30) as a
> order by
>     rank, c1
> go
>
> select
>     c1,
>     (select count(distinct b.c2) from dbo.f1(25) as b where b.c2 <= a.c2) as rank
> from
>     dbo.f1(25) as a
> order by
>     rank, c1
> go
>
> drop view v1
> go
>
> drop funcion f1
> go
>
> drop table t1
> go
>
>
> AMB
>
>
> "Mowgli" wrote:
>
> > I have a column of starting times below (converted to seconds for
> > simplicity):
> >
> > 500
> > 505
> > 510
> > 535
> > 910
> > 939
> > 944
> > 977
> >
> > I need to assign a Group ID to the above values, based on a time
> > interval of 30 seconds, so the correct result would be:
> >
> >      Group ID
> > 500, 1
> > 505, 1
> > 510, 1
> > 535, 2
> > 910, 3
> > 939, 3
> > 944, 4
> > 977, 5
> >
> > The value of Group ID isn't important (could be 100, 200, etc.) as long
> > as values in the same 30-second interval have the same Group ID. When a
> > new group is started, a new 30-second interval is also started.
> >
> > I have created a solution using SQL loops (row-based processing) but
> > was wondering if there is an efficient/elegant set-based solution when
> > there are thousands of records?
> >
> >
> > Thanks
> >
> > Mowgli
> >
> >
Author
31 Jul 2005 12:46 AM
Mowgli
Alejandro

Yes, it's a good idea to compare inner join and correlated subquery
performance as it can be difficult predicting which is best (if the SQL
optimiser doesn't choose for you).
Your solutions work well for "normal" data but when using the following
data set:

truncate table t1
go
insert t1 values (1)
insert t1 values (2)
insert t1 values (3)
insert t1 values (29)
insert t1 values (30)
insert t1 values (32)
insert t1 values (33)


The result was...

c1          rank
----------- -----------
1           1
2           1
3           1
29          1
30          1
32          2
33          3


We would expect values 32 and 33 to have the same rank.
I think a set-based solution to this problem is quite tricky... but I
do like the elegance of your queries - very neat.
Author
31 Jul 2005 12:10 PM
Alejandro Mesa
Well, now I understand what "pure coincidence" means.


AMB

Show quote
"Mowgli" wrote:

> Alejandro
>
> Yes, it's a good idea to compare inner join and correlated subquery
> performance as it can be difficult predicting which is best (if the SQL
> optimiser doesn't choose for you).
> Your solutions work well for "normal" data but when using the following
> data set:
>
> truncate table t1
> go
> insert t1 values (1)
> insert t1 values (2)
> insert t1 values (3)
> insert t1 values (29)
> insert t1 values (30)
> insert t1 values (32)
> insert t1 values (33)
>
>
> The result was...
>
> c1          rank
> ----------- -----------
> 1           1
> 2           1
> 3           1
> 29          1
> 30          1
> 32          2
> 33          3
>
>
> We would expect values 32 and 33 to have the same rank.
> I think a set-based solution to this problem is quite tricky... but I
> do like the elegance of your queries - very neat.
>
>
Author
31 Jul 2005 2:42 AM
Steve Kass
Mowgli,

  Someone may have suggested this, but here's an iterative solution
that doesn't use cursors.  You won't find a non-recursive set-based
solution, I don't think, because the definition of the group ID itself
is recursive.

This solution will do a few reads for each group, and may or may
not be faster than a cursor-based solution, since the range size
here can comprise at most 30 (a small number) of rows.  If your
actual scenario has groups with very many rows, this should be
a better solution than one using cursors.

-- sample data
create table StartTimes (
  s int primary key
)

insert into StartTimes
select distinct cast(10*Freight as int)
from Northwind..Orders
where EmployeeID < 4
and Freight between 50 and 300
go

-- peek at sample data
select top 10 * from StartTimes
order by s
go

  declare @t table (s int, groupID int)

  declare @group int set @group = 1
  declare @upto int set @upto = -2147483648
  declare @lastupto int

  while @@rowcount > 0 begin
    set @lastupto = @upto
    set @upto = (
      select min(s) from StartTimes
      where s > @upto
    )
    if @upto is null or @upto > 2147483617 begin
      insert into @t
      select s, @group
      from StartTimes
      where s > @lastupto
      break
    end
    set @upto = @upto + 30

    insert into @t
    select s, @group
    from StartTimes
    where s > @lastupto
    and s <= @upto
    set @group = @group + 1 --break
  end
  select * from @t
go

drop table StartTimes


-- Steve Kass
-- Drew University
-- 36F7CD49-68D3-4451-9FAE-5E48B524A1A5

Mowgli wrote:

Show quote
>I have a column of starting times below (converted to seconds for
>simplicity):
>
>500
>505
>510
>535
>910
>939
>944
>977
>
>I need to assign a Group ID to the above values, based on a time
>interval of 30 seconds, so the correct result would be:
>
>     Group ID
>500, 1
>505, 1
>510, 1
>535, 2
>910, 3
>939, 3
>944, 4
>977, 5
>
>The value of Group ID isn't important (could be 100, 200, etc.) as long
>as values in the same 30-second interval have the same Group ID. When a
>new group is started, a new 30-second interval is also started.
>
>I have created a solution using SQL loops (row-based processing) but
>was wondering if there is an efficient/elegant set-based solution when
>there are thousands of records?
>
>
>Thanks
>
>Mowgli
>

>
Author
31 Jul 2005 7:02 AM
Mowgli
Steve

I modified the group interval from 30 to 29 so that your code output
could be directly compared with other solutions.

I compared your code performance with Brian's cursor solution submitted
earlier (85,000 rows in the table) and your code
ran around 35% faster. The proviso is that the start time column in the
source table needs to be indexed.
Without the index, the cursor solution is at least 10 times faster -
but that was to be expected.

I don't think that your iterative solution can be improved much more as
it's currently very efficient and (as you pointed out) improves as the
interval length increases. I'm not sure how Hugo's suggested strategy
will compare to yours in practice but I would love to find out.

Thanks very much for your contribution!
Author
31 Jul 2005 12:30 PM
Brian Selzer
Be careful when using iterated solutions instead of fast_forward cursors.
By its nature, an iterated solution lowers the transaction isolation level
of the batch to READ COMMITTED unless there is a transaction outstanding for
the duration of the loop.  This can lead to inconsistent results.  Another
transaction may change a row that you've already read and a row that you
haven't yet read.  For example, another transaction may decrease one account
by $1000 and increase another account by $1000.  If you've already read out
the one row before the decrease occurs, but read out the other row after the
increase, your results will be off by $1000.  The normal way to prevent this
kind of problem is to increase the transaction isolation level for reads to
REPEATABLE READ or SERIALIZABLE.  REPEATABLE READ ensures that no rows that
have already been read can be changed or deleted at least until the SELECT
statement finishes executing.  SERIALIZABLE also prevents the insert of a
row into a range that has already been read.

With a fast_forward cursor, you can simply add WITH(HOLDLOCK) to the select
statement that populates the cursor.  Since a fast_forward cursor creates a
snapshot in tempdb from which rows are fetched, the shared locks are only
held for as long as it takes to fill the snapshot.  There is no need to have
an outstanding transaction during the fetch loop.  An iterated solution, on
the other hand, provides no such mechanism.  To prevent the problem
described above, you must have an outstanding transaction for the duration
of the loop.  In addition, the uncertainty inherent in the READ COMMITTED
transaction isolation level is amplified when using an iterated solution,
because the interval between the time that the first row is read and the
time that the last row is read is extended to the entire duration of the
loop.

If performance is paramount, and the situation described above isn't
relevant to the problem at hand, then by all means use the iterated solution
instead of the fast_forward cursor.  I'm not trying to say that the iterated
solution is a bad one; I just wanted to be sure that you were aware of the
ramifications.


Show quote
"Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message
news:1122793353.689637.241840@z14g2000cwz.googlegroups.com...
> Steve
>
> I modified the group interval from 30 to 29 so that your code output
> could be directly compared with other solutions.
>
> I compared your code performance with Brian's cursor solution submitted
> earlier (85,000 rows in the table) and your code
> ran around 35% faster. The proviso is that the start time column in the
> source table needs to be indexed.
> Without the index, the cursor solution is at least 10 times faster -
> but that was to be expected.
>
> I don't think that your iterative solution can be improved much more as
> it's currently very efficient and (as you pointed out) improves as the
> interval length increases. I'm not sure how Hugo's suggested strategy
> will compare to yours in practice but I would love to find out.
>
> Thanks very much for your contribution!
>
Author
31 Jul 2005 1:07 PM
Mowgli
Brian

Yes, those are valid points.
In this case, the table that contains the start times does contain
millions of records - this table could be accessed/updated by other
processes. The grouping logic is likely to only process a small subset
of these records (thousands) when run at a given time.

We could create a snapshot of the relevant data (in a temp table or
table variable) for the iterative solution as well - which may mean the
overall performance of the iterative solution could be similar/worse
than the cursor in some cases.

I much prefer performance testing a number of solutions than just the
one!
Author
31 Jul 2005 11:00 PM
Hugo Kornelis
On 31 Jul 2005 06:07:12 -0700, Mowgli wrote:

>In this case, the table that contains the start times does contain
>millions of records
(snip)
> The grouping logic is likely to only process a small subset
>of these records (thousands) when run at a given time.

Hi Mowgli,

In that case, most solutions would probably see a huge performance
increase if you first copy only the subset of rows that have to be
grouped to a temporary table. Especially if you also customize the
indexes for that temp table to the algorithm chosen.

Of course, that would increase the chance of wrong results due to data
being changed while the grouping algorithm is working even further. You
will have to weigh the pro's and con's.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
31 Jul 2005 1:11 PM
Mowgli
Brian


Yes, those are valid points.
In this case, the table that contains the start times does have
millions of rows - this table could be accessed/updated by other
processes. The grouping logic is likely to only process a small subset
of these rows (thousands) when run at a given time.


We could create a snapshot of the relevant data (in a temp table or
table variable) for the iterative solution as well - which may mean the

overall performance of the iterative solution could be similar/worse
than the cursor in some cases.


I much prefer performance testing a number of solutions than just the
one!

AddThis Social Bookmark Button