|
database
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
Is there a smart set-based solution?simplicity): 500 505 510 535 910 939 944 977 I need to assign a Group ID to the above values, based on a time interval of 30 seconds, so the correct result would be: Group ID 500, 1 505, 1 510, 1 535, 2 910, 3 939, 3 944, 4 977, 5 The value of Group ID isn't important (could be 100, 200, etc.) as long as values in the same 30-second interval have the same Group ID. When a new group is started, a new 30-second interval is also started. I have created a solution using SQL loops (row-based processing) but was wondering if there is an efficient/elegant set-based solution when there are thousands of records? Thanks Mowgli No. Any set based solution will involve nested loops because the value of
the group id is dependent on the values of the starting time in the previous rows. This is one of the rare cases in which a cursor will actually perform faster than a set based solution because it can accomplish what you want in a single pass through the table. Show quote "Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message news:1122677634.927113.124950@g43g2000cwa.googlegroups.com... > I have a column of starting times below (converted to seconds for > simplicity): > > 500 > 505 > 510 > 535 > 910 > 939 > 944 > 977 > > I need to assign a Group ID to the above values, based on a time > interval of 30 seconds, so the correct result would be: > > Group ID > 500, 1 > 505, 1 > 510, 1 > 535, 2 > 910, 3 > 939, 3 > 944, 4 > 977, 5 > > The value of Group ID isn't important (could be 100, 200, etc.) as long > as values in the same 30-second interval have the same Group ID. When a > new group is started, a new 30-second interval is also started. > > I have created a solution using SQL loops (row-based processing) but > was wondering if there is an efficient/elegant set-based solution when > there are thousands of records? > > > Thanks > > Mowgli > If there are a lot of indexes on the table, you may want to cache the
results generated by the cursor in a table variable so that they can be committed in a single set-based statement. Show quote "Brian Selzer" <br***@selzer-software.com> wrote in message news:uf4PvFKlFHA.3260@TK2MSFTNGP10.phx.gbl... > No. Any set based solution will involve nested loops because the value of > the group id is dependent on the values of the starting time in the previous > rows. This is one of the rare cases in which a cursor will actually perform > faster than a set based solution because it can accomplish what you want in > a single pass through the table. > > "Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message > news:1122677634.927113.124950@g43g2000cwa.googlegroups.com... > > I have a column of starting times below (converted to seconds for > > simplicity): > > > > 500 > > 505 > > 510 > > 535 > > 910 > > 939 > > 944 > > 977 > > > > I need to assign a Group ID to the above values, based on a time > > interval of 30 seconds, so the correct result would be: > > > > Group ID > > 500, 1 > > 505, 1 > > 510, 1 > > 535, 2 > > 910, 3 > > 939, 3 > > 944, 4 > > 977, 5 > > > > The value of Group ID isn't important (could be 100, 200, etc.) as long > > as values in the same 30-second interval have the same Group ID. When a > > new group is started, a new 30-second interval is also started. > > > > I have created a solution using SQL loops (row-based processing) but > > was wondering if there is an efficient/elegant set-based solution when > > there are thousands of records? > > > > > > Thanks > > > > Mowgli > > > > >> I have created a solution using SQL loops (row-based processing) but was wondering if there is an efficient/elegant set-based solution when there are thousands of records [sic]? << Rows are not records. If you keep thinking in the wrong terms, youwill always think that you have to use loops. Create a simple table for each group and then use a BETWEEN predicate. You can get an entire year into about a million row table. CREATE TABLE TimeGroups (group_id INTEGER NOT NULL PRIMARY KEY start_time INTEGER NOT NULL, end_time INTEGER NOT NULL, CHECK (start_time +30 = end_time)); You should find that this is about 10 times faster than a cursor as the set gets larger. Unfortunately, your solution doesn't answer the original question. How do
you build the table in the first place? In this case a cursor is the fastest way to find the group id's, because the alternative is a self-join, which will always require more than one pass through the table. Not every cursor is evil. Actually, you'll find that the time that it takes for the cursor based solution increases linearly with the number of rows, whereas the time that it takes for the set-based solution increases geometrically. By the way, E. F. Codd used the term "record" in some of his papers. I wonder, do you add [sic] when you refer to his work? "--CELKO--" <jcelko***@earthlink.net> wrote in message was wondering if there is an efficient/elegant set-based solution when therenews:1122689807.624405.127910@g49g2000cwa.googlegroups.com... > >> I have created a solution using SQL loops (row-based processing) but are thousands of records [sic]? << Show quote > > Rows are not records. If you keep thinking in the wrong terms, you > will always think that you have to use loops. Create a simple table > for each group and then use a BETWEEN predicate. You can get an entire > year into about a million row table. > > CREATE TABLE TimeGroups > (group_id INTEGER NOT NULL PRIMARY KEY > start_time INTEGER NOT NULL, > end_time INTEGER NOT NULL, > CHECK (start_time +30 = end_time)); > > You should find that this is about 10 times faster than a cursor as the > set gets larger. > >> your solution doesn't answer the original question. How do you build the table in the first place? << I find a spreadsheet is pretty fast for stuff like this, especaillytemporal stuff. >> In this case a cursor is the fastest way to find the group id's, because the alternative is a self-join, which will always require more than one pass through the table.<< Doidn't i just igve a solution which requires a single join?>> .. the cursor based solution increases linearly with the number of rows, whereas the time that it takes for the set-based solution increases geometrically. << yes, the cursor is linear, but the join can be done in parallel. Givena clusterd indexing the joins are no worse than linear. The question was: How do you load the spreadsheet?
Joining two tables is always slower than accessing one. A merge or hash join of two tables requires reading at least twice as much information than an index scan of a single table--even with clustered indexes in the same order and a parallel execution plan. A loop join does a scan of one table and repetitive index seeks into another. I really wish you would give a few moments' thought to your responses, because your answers don't make any sense. I'd like to believe that you're not stupid, but come on! Even the mentally challenged can see that reading two tables takes longer than reading one. Are you so religiously opposed to cursors that you would deliberately post misleading and incorect information? "--CELKO--" <jcelko***@earthlink.net> wrote in message the table in the first place? <<news:1122726160.642393.161830@g44g2000cwa.googlegroups.com... > >> your solution doesn't answer the original question. How do you build > because the alternative is a self-join, which will always require more than> I find a spreadsheet is pretty fast for stuff like this, especailly > temporal stuff. > > >> In this case a cursor is the fastest way to find the group id's, one pass through the table.<< > rows, whereas the time that it takes for the set-based solution increases> Doidn't i just igve a solution which requires a single join? > > >> .. the cursor based solution increases linearly with the number of geometrically. << Show quote > > yes, the cursor is linear, but the join can be done in parallel. Given > a clusterd indexing the joins are no worse than linear. > Hi Mr. Celko
You wrote... Rows are not records. If you keep thinking in the wrong terms, you will always think that you have to use loops. Could you elaborate on that a bit. many thanks Shahriar Show quote "--CELKO--" <jcelko***@earthlink.net> wrote in message news:1122689807.624405.127910@g49g2000cwa.googlegroups.com... >>> I have created a solution using SQL loops (row-based processing) but was >>> wondering if there is an efficient/elegant set-based solution when there >>> are thousands of records [sic]? << > > Rows are not records. If you keep thinking in the wrong terms, you > will always think that you have to use loops. Create a simple table > for each group and then use a BETWEEN predicate. You can get an entire > year into about a million row table. > > CREATE TABLE TimeGroups > (group_id INTEGER NOT NULL PRIMARY KEY > start_time INTEGER NOT NULL, > end_time INTEGER NOT NULL, > CHECK (start_time +30 = end_time)); > > You should find that this is about 10 times faster than a cursor as the > set gets larger. > Like most new ideas, the hard part of understanding what the relational
model is comes in un-learning what you know about file systems. As Artemus Ward (William Graham Sumner, 1840-1910) put it, "It ain't so much the things we don't know that get us into trouble. It's the things we know that just ain't so." If you already have a background in data processing with traditional file systems, the first things to un-learn are: (0) Databases are not file sets. (1) Tables are not files. (2) Rows are not records. (3) Columns are not fields. Modern data processing began with punch cards, or Hollerith cards used by the Bureau of the Census. Their original size was that of a United States Dollar bill. This was set by their inventor, Herman Hollerith, because he could get furniture to store the cards from the United States Treasury Department, just across the street. Likewise, physical constraints limited each card to 80 columns of holes in which to record a symbol. The influence of the punch card lingered on long after the invention of magnetic tapes and disk for data storage. This is why early video display terminals were 80 columns across. Even today, files which were migrated from cards to magnetic tape files or disk storage still use 80 column records. But the influence was not just on the physical side of data processing. The methods for handling data from the prior media were imitated in the new media. Data processing first consisted of sorting and merging decks of punch cards (later, sequential magnetic tape files) in a series of distinct steps. The result of each step feed into the next step in the process. Relational databases do not work that way. Each user connects to the entire database all at once, not to one file at time in a sequence of steps. The users might not all have the same database access rights once they are connected, however. Magnetic tapes could not be shared among users at the same time, but shared data is the point of a database. Tables versus Files A file is closely related to its physical storage media. A table may or may not be a physical file. DB2 from IBM uses one file per table, while Sybase puts several entire databases inside one file. A table is a <i>set<i> of rows of the same kind of thing. A set has no ordering and it makes no sense to ask for the first or last row. A deck of punch cards is sequential, and so are magnetic tape files. Therefore, a <i>physical<i> file of ordered sequential records also became the <i>mental<i> model for data processing and it is still hard to shake. Anytime you look at data, it is in some physical ordering. The various access methods for disk storage system came later, but even these access methods could not shake the mental model. Another conceptual difference is that a file is usually data that deals with a whole business process. A file has to have enough data in itself to support applications for that business process. Files tend to be "mixed" data which can be described by the name of the business process, such as "The Payroll file" or something like that. Tables can be either entities or relationships within a business process. This means that the data which was held in one file is often put into several tables. Tables tend to be "pure" data which can be described by single words. The payroll would now have separate tables for timecards, employees, projects and so forth. Tables as Entities An entity is physical or conceptual "thing" which has meaning be itself. A person, a sale or a product would be an example. In a relational database, an entity is defined by its attributes, which are shown as values in columns in rows in a table. To remind users that tables are sets of entities, I like to use plural or collective nouns that describe the function of the entities within the system for the names of tables. Thus "Employee" is a bad name because it is singular; "Employees" is a better name because it is plural; "Personnel" is best because it is collective and does not summon up a mental picture of individual persons. If you have tables with exactly the same structure, then they are sets of the same kind of elements. But you should have only one set for each kind of data element! Files, on the other hand, were PHYSICALLY separate units of storage which coudl be alike -- each tape or disk file represents a step in the PROCEDURE , such as moving from raw data, to edited data, and finally to archived data. In SQL, this should be a status flag in a table. Tables as Relationships A relationship is shown in a table by columns which reference one or more entity tables. Without the entities, the relationship has no meaning, but the relationship can have attributes of its own. For example, a show business contract might have an agent, an employer and a talent. The method of payment is an attribute of the contract itself, and not of any of the three parties. Rows versus Records Rows are not records. A record is defined in the application program which reads it; a row is defined in the database schema and not by a program at all. The name of the field in the READ or INPUT statements of the application; a row is named in the database schema. All empty files look alike; they are a directory entry in the operating system with a name and a length of zero bytes of storage. Empty tables still have columns, constraints, security privileges and other structures, even tho they have no rows. This is in keeping with the set theoretical model, in which the empty set is a perfectly good set. The difference between SQL's set model and standard mathematical set theory is that set theory has only one empty set, but in SQL each table has a different structure, so they cannot be used in places where non-empty versions of themselves could not be used. Another characteristic of rows in a table is that they are all alike in structure and they are all the "same kind of thing" in the model. In a file system, records can vary in size, datatypes and structure by having flags in the data stream that tell the program reading the data how to interpret it. The most common examples are Pascal's variant record, C's struct syntax and Cobol's OCCURS clause. The OCCURS keyword in Cobol and the Variant records in Pascal have a number which tells the program how many time a record structure is to be repeated in the current record. Unions in 'C' are not variant records, but variant mappings for the same physical memory. For example: union x {int ival; char j[4];} myStuff; defines myStuff to be either an integer (which are 4 bytes on most modern C compilers, but this code is non-portable) or an array of 4 bytes, depending on whether you say myStuff.ival or myStuff.j[0]; But even more than that, files often contained records which were summaries of subsets of the other records -- so called control break reports. There is no requirement that the records in a file be related in any way -- they are literally a stream of binary data whose meaning is assigned by the program reading them. Columns versus Fields A field within a record is defined by the application program that reads it. A column in a row in a table is defined by the database schema. The datatypes in a column are always scalar. The order of the application program variables in the READ or INPUT statements is important because the values are read into the program variables in that order. In SQL, columns are referenced only by their names. Yes, there are shorthands like the SELECT * clause and INSERT INTO <table name> statements which expand into a list of column names in the physical order in which the column names appear within their table declaration, but these are shorthands which resolve to named lists. The use of NULLs in SQL is also unique to the language. Fields do not support a missing data marker as part of the field, record or file itself. Nor do fields have constraints which can be added to them in the record, like the DEFAULT and CHECK() clauses in SQL. Relationships among tables within a database Files are pretty passive creatures and will take whatever an application program throws at them without much objection. Files are also independent of each other simply because they are connected to one application program at a time and therefore have no idea what other files looks like. A database actively seeks to maintain the correctness of all its data. The methods used are triggers, constraints and declarative referential integrity. Declarative referential integrity (DRI) says, in effect, that data in one table has a particular relationship with data in a second (possibly the same) table. It is also possible to have the database change itself via referential actions associated with the DRI. For example, a business rule might be that we do not sell products which are not in inventory. This rule would be enforce by a REFERENCES clause on the Orders table which references the Inventory table and a referential action of ON DELETE CASCADE Triggers are a more general way of doing much the same thing as DRI. A trigger is a block of procedural code which is executed before, after or instead of an INSERT INTO or UPDATE statement. You can do anything with a trigger that you can do with DRI and more. However, there are problems with TRIGGERs. While there is a standard syntax for them in the SQL-92 standard, most vendors have not implemented it. What they have is very proprietary syntax instead. Secondly, a trigger cannot pass information to the optimizer like DRI. In the example in this section, I know that for every product number in the Orders table, I have that same product number in the Inventory table. The optimizer can use that information in setting up EXISTS() predicates and JOINs in the queries. There is no reasonable way to parse procedural trigger code to determine this relationship. The CREATE ASSERTION statement in SQL-92 will allow the database to enforce conditions on the entire database as a whole. An ASSERTION is not like a CHECK() clause, but the difference is subtle. A CHECK() clause is executed when there are rows in the table to which it is attached. If the table is empty then all CHECK() clauses are effectively TRUE. Thus, if we wanted to be sure that the Inventory table is never empty, and we wrote: CREATE TABLE Inventory ( ... CONSTRAINT inventory_not_empty CHECK ((SELECT COUNT(*) FROM Inventory) > 0), ... ); it would not work. However, we could write: CREATE ASSERTION Inventory_not_empty CHECK ((SELECT COUNT(*) FROM Inventory) > 0); and we would get the desired results. The assertion is checked at the schema level and not at the table level. CELKO
Thanks for your comments... I do try to think using the right mind"set" but it can be difficult sometimes... CREATE TABLE TimeGroups
(group_id INTEGER NOT NULL PRIMARY KEY -- cliuster this start_time DECIMAL (4,1) NOT NULL CHECK (start_time = ((10 * event_time)/ 10)), -- no decimal end_time IDECIMAL (4,1) NOT NULL, CHECK (end_time = start_time - 0.1)); CREATE TABLE Events (event_time DECIMAL (4,1) NOT NULL PRIMARY KEY -- cliuster this CHECK (event_time = ((10 * event_time)/ 10)) -- no decimal ); INSERT INTO TimeGroups VALUES (1, 500.0, 529.9); INSERT INTO TimeGroups VALUES (2, 530.0, 559.9); INSERT INTO TimeGroups VALUES (3, 560.0, 589.9); INSERT INTO TimeGroups VALUES (4, 590.0, 619.9); Now the query is simply: SELECT G.group_id, E.event_time FROM TimeGroups AS G, Events AS E WHERE E.event_time BETWEEN G.start_time AND G.end_time; CELKO
Yes, I hope to reach the stage where I can write a simple query like that... I first need to populate TimeGroups and only have the event start time. Is the solution in "SQL for Smarties"?! This is an example of using an auxillary table, whcih is covered in SQL
FOR SMARTIES. The third edition will have more details. Consider the problem of loading the Time Groups table. The typical approach is to write a loop, but if you have an auxilary Seqence table you can write this: INSERT INTO TimeGroups (group_id, start_time, end_time) SELECT seq, (seq + @start_value), (seq + @start_value)+ 029.9 FROM Sequence AS S WHERE S.seq < @my_size; You can also use this SELECT as a derived table in the other query. Do not create a table variable; they take disk time at run time, will not port, hide from the otimizer and cannot be re-used by other queries. But if you only want to use the groupings once, then write it as one query: SELECT G.group_id, E.event_time FROM (SELECT seq, (seq + @start_value), (seq + @start_value)+ 029.9 FROM Sequence AS S WHERE S.seq < @my_size) AS G(group_id, start_time, end_time), Events AS E WHERE E.event_time BETWEEN G.start_time AND G.end_time; Another important SQL programming principle is that time is modeled as durations. If you fail to do this in your tables, you will have to write complicated code that uses self-joins to construct these durations, like Brian has shown you. CELKO
I vaguely remember sequence tables in the first edition of the book - I can never find books when I need them! Perhaps hypnosis will help. One of the difficulties I face is the need for non-uniform grouping intervals... The first group is 500 to 529.9, next one is 535 to 564.9, etc. But I can see how the techniques you have illustrated will assist me. Thank you. Mowgli
I think this may just do it. Table1 contains the field times. Table2 contains 2 fields, Times and groupid insert into table2 (times,groupid) (select times,times/30 from Table1) Shahriar Show quote "Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message news:1122677634.927113.124950@g43g2000cwa.googlegroups.com... >I have a column of starting times below (converted to seconds for > simplicity): > > 500 > 505 > 510 > 535 > 910 > 939 > 944 > 977 > > I need to assign a Group ID to the above values, based on a time > interval of 30 seconds, so the correct result would be: > > Group ID > 500, 1 > 505, 1 > 510, 1 > 535, 2 > 910, 3 > 939, 3 > 944, 4 > 977, 5 > > The value of Group ID isn't important (could be 100, 200, etc.) as long > as values in the same 30-second interval have the same Group ID. When a > new group is started, a new 30-second interval is also started. > > I have created a solution using SQL loops (row-based processing) but > was wondering if there is an efficient/elegant set-based solution when > there are thousands of records? > > > Thanks > > Mowgli > I tried that: it doesn't work correctly. The start time for each group
after the first depends on whether the start time of the first row for that group is more than 30 seconds past the start time of the first row for the previous group. If you use times/30 the interval that includes 500 runs from 480 through 509, so the row with the 510 start time is in the next interval. The expected results clearly show that the rows with 500 and 510 are in the same interval. There is no set-based operation that can accurately calculate the group id in a single pass. With a cursor, you can cache the start time of the current group in a variable, and bump the current group id if the start time for the current cursor row is more than 30 seconds past the cached start time for the current group. "Shahriar" <HelloShahr***@hotmail.com> wrote in message news:IqEGe.97$4e6.80@trnddc04...Show quote > Mowgli > > I think this may just do it. > Table1 contains the field times. > Table2 contains 2 fields, Times and groupid > > > insert into table2 (times,groupid) (select times,times/30 from Table1) > > Shahriar > > > "Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message > news:1122677634.927113.124950@g43g2000cwa.googlegroups.com... > >I have a column of starting times below (converted to seconds for > > simplicity): > > > > 500 > > 505 > > 510 > > 535 > > 910 > > 939 > > 944 > > 977 > > > > I need to assign a Group ID to the above values, based on a time > > interval of 30 seconds, so the correct result would be: > > > > Group ID > > 500, 1 > > 505, 1 > > 510, 1 > > 535, 2 > > 910, 3 > > 939, 3 > > 944, 4 > > 977, 5 > > > > The value of Group ID isn't important (could be 100, 200, etc.) as long > > as values in the same 30-second interval have the same Group ID. When a > > new group is started, a new 30-second interval is also started. > > > > I have created a solution using SQL loops (row-based processing) but > > was wondering if there is an efficient/elegant set-based solution when > > there are thousands of records? > > > > > > Thanks > > > > Mowgli > > > > Brian, Shahriar
Thanks for your input and for concentrating on the problem rather than on my terminology! The example I gave uses a time interval of 30 seconds but this can also be variable (20, 40, 60, etc.). Brian, perhaps you could demonstrate the inefficient set-based solution to illustrate your point? .... As the issues may not be clear to anyone who hasn't attempted this type of "real-life" problem. Here's an example set-based solution. It's pretty ugly, but it's the only
way I can think of to meet your requirement. A cursor in this case is looking better and better. create table tableName (startSeconds int not null) go insert tableName values (500) insert tableName values (505) insert tableName values (510) insert tableName values (535) insert tableName values (910) insert tableName values (939) insert tableName values (944) insert tableName values (977) go declare @interval int set @interval = 30 declare @x table ( start int not null, stop int not null, groupID int identity(1, 1) not null primary key clustered ) insert @x (start, stop) select min(start) as start, stop from (select case when d.start in (select distinct (select top 1 startSeconds from tableName where startSeconds >= b.startSeconds and startSeconds < b.startSeconds + @interval order by startSeconds desc) as stop from tableName b) then d.stop else d.start end as start, stop from (select a.startSeconds as start, (select top 1 startSeconds from tableName where startSeconds >= a.startSeconds and startSeconds < a.startSeconds + @interval order by startSeconds desc) as stop from tableName a) d) e group by stop order by start select startSeconds, groupID from tablename join @x on (startSeconds between start and stop) Show quote "Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message news:1122708734.727361.324820@g43g2000cwa.googlegroups.com... > Brian, Shahriar > > Thanks for your input and for concentrating on the problem rather than > on my terminology! > > The example I gave uses a time interval of 30 seconds but this can also > be variable (20, 40, 60, etc.). > Brian, perhaps you could demonstrate the inefficient set-based solution > to illustrate your point? > ... As the issues may not be clear to anyone who hasn't attempted this > type of "real-life" problem. > Brian
That's the best attempt at calculating the Group ID values so far and quick work as well... Your solution works for most sets of "normal" data and even when @interval is changed. When I tried a seemingly innocuous set of data: truncate table tableName go insert tableName values (1) insert tableName values (29) insert tableName values (30) insert tableName values (31) The result was... startSeconds groupID 1 1 29 1 29 2 30 1 30 2 31 2 The table variable (@x) contains: start stop groupID 1 30 1 29 31 2 So I assume that there is a problem populating @x... However, your attempt is better than any of mine and probably close to the final solution! Here's the cursor based solution.
declare @interval int set @interval = 30 declare @startSeconds int, @groupID int, @groupStart int select @startSeconds = 0, @groupID = 0, @groupStart = - @interval * 2 declare @x table (startSeconds int not null, groupID int not null) declare @c cursor set @c = cursor local fast_forward for select startSeconds from tableName order by startSeconds open @c if @@cursor_rows != 0 fetch @c into @startSeconds while @@fetch_status = 0 begin if @groupStart + @interval < @startSeconds select @groupID = @groupID + 1, @groupStart = @startSeconds insert @x (startSeconds, groupID) values (@startSeconds, @groupID) fetch @c into @startSeconds end close @c deallocate @c select * from @x Show quote "Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message news:1122732601.149568.274190@o13g2000cwo.googlegroups.com... > Brian > > That's the best attempt at calculating the Group ID values so far and > quick work as well... > Your solution works for most sets of "normal" data and even when > @interval is changed. > When I tried a seemingly innocuous set of data: > > truncate table tableName > go > insert tableName values (1) > insert tableName values (29) > insert tableName values (30) > insert tableName values (31) > > The result was... > > startSeconds groupID > 1 1 > 29 1 > 29 2 > 30 1 > 30 2 > 31 2 > > > The table variable (@x) contains: > > start stop groupID > 1 30 1 > 29 31 2 > > So I assume that there is a problem populating @x... > However, your attempt is better than any of mine and probably close to > the final solution! > sorry, slight bug in previous post. here's the correct version:
declare @interval int set @interval = 30 declare @startSeconds int, @groupID int, @groupStart int select @startSeconds = 0, @groupID = 0, @groupStart = - @interval * 2 declare @x table (startSeconds int not null, groupID int not null) declare @c cursor set @c = cursor local fast_forward for select startSeconds from tableName order by startSeconds open @c if @@cursor_rows != 0 begin fetch @c into @startSeconds while @@fetch_status = 0 begin if @groupStart + @interval <= @startSeconds select @groupID = @groupID + 1, @groupStart = @startSeconds insert @x (startSeconds, groupID) values (@startSeconds, @groupID) fetch @c into @startSeconds end end close @c deallocate @c select * from @x go This makes one only one pass through the table, so it will perform much better than the set-based solution for larger data sets. Show quote "Brian Selzer" <br***@selzer-software.com> wrote in message news:#OZ7FiSlFHA.1044@tk2msftngp13.phx.gbl... > Here's the cursor based solution. > > declare @interval int set @interval = 30 > declare @startSeconds int, @groupID int, @groupStart int > select @startSeconds = 0, @groupID = 0, @groupStart = - @interval * 2 > declare @x table (startSeconds int not null, groupID int not null) > declare @c cursor > set @c = cursor local fast_forward for > select startSeconds from tableName order by startSeconds > open @c > if @@cursor_rows != 0 > fetch @c into @startSeconds > while @@fetch_status = 0 > begin > if @groupStart + @interval < @startSeconds > select @groupID = @groupID + 1, @groupStart = @startSeconds > insert @x (startSeconds, groupID) values (@startSeconds, @groupID) > fetch @c into @startSeconds > end > close @c > deallocate @c > select * from @x > > "Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message > news:1122732601.149568.274190@o13g2000cwo.googlegroups.com... > > Brian > > > > That's the best attempt at calculating the Group ID values so far and > > quick work as well... > > Your solution works for most sets of "normal" data and even when > > @interval is changed. > > When I tried a seemingly innocuous set of data: > > > > truncate table tableName > > go > > insert tableName values (1) > > insert tableName values (29) > > insert tableName values (30) > > insert tableName values (31) > > > > The result was... > > > > startSeconds groupID > > 1 1 > > 29 1 > > 29 2 > > 30 1 > > 30 2 > > 31 2 > > > > > > The table variable (@x) contains: > > > > start stop groupID > > 1 30 1 > > 29 31 2 > > > > So I assume that there is a problem populating @x... > > However, your attempt is better than any of mine and probably close to > > the final solution! > > > > Brian
Don't worry, the end of a group interval is open to interpretation! I tested your code on my aging laptop (85,000 rows in the table) and it completed in less than a minute. In terms of logic that's the most efficient cursor solution I've come across. I don't think anyone can produce a better row-based processing solution than that. Thank you for your time. > Don't worry, the end of a group interval is open to interpretation! Yeah, but the results were wrong, {1, 29, 30, 31} should produce {1, 1, 1,2}, not {1, 1, 1, 1}, so I fixed it. The solution will perform better if the source data has an index (preferably clustered) on startSeconds. A clustered index scan finds the first matching row and then walks through the leaf pages, whereas a nonclustered index has to look up each row in the clustered index. Either type of index eliminates the requisite sort of the input data. I've honed my cursor technique over the years, and I have a few pointers, for anybody who's interested. (1) avoid selects from normal tables within a fetch loop. Select all of the needed rows using set-based operations before the first row is fetched. (2) avoid updates to normal tables within a fetch loop. Cache updates in temp tables or table variables as the fetch loop executes, and then commit them using set-based logic after the cursor is closed. This way the server can optimize index updates by committing all of the changes at once to each separate index, instead of hundreds or thousands of single updates. In addition, triggers fire only once for each update, further enhancing performance. Also table variables are created and operated on in memory, provided you have enough, so the impact of caching is usually minimal. (3) avoid using dynamic cursors. Use FAST_FORWARD cursors and implement optimistic concurrency using rowversion columns. Dynamic cursors hold on to locks, which can cause blocking and deadlocks. (4) avoid executing a fetch loop within a transaction. See (3). (5) avoid executing stored procedures from within a fetch loop. Another stored procedure may access (see 1) or update (see 2) normal tables. In addition, the logic within them may change. Better to keep the code executed within a fetch loop within the same procedure. That way the performance of the fetch loop isn't dependent on the performance of other procedures that may change. A scan with a cursor typically takes twice as long as a set-based scan, so set-based operations are to be preferred over row-based; however, in very rare instances a cursor can improve performance dramatically. If you can eliminate one or more iterations (loop joins) from an execution plan by using a cursor, then the performance improvement can be astronomical. It depends on the number of rows in the result set on the left hand side of each iteration and the number of operations on the right hand side. Show quote "Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message news:1122746989.689049.99390@g44g2000cwa.googlegroups.com... > Brian > > Don't worry, the end of a group interval is open to interpretation! > > I tested your code on my aging laptop (85,000 rows in the table) and it > completed in less than a minute. In terms of logic that's the most > efficient cursor solution I've come across. I don't think anyone can > produce a better row-based processing solution than that. > > Thank you for your time. > Brian
I wasnt looking at his data. What I understood from his question was that he wants to group the times in chunks of 30 secs. However, even so, its still simple. All you have to do is add the base number to the formula. Something like this. set @base=500 insert into table2 (times,groupid) (select times+@base,times/30 from Table1) Of course, you can also just dynamically get the smallest time and store it in @base. Result set based on his data set: Time Groupid 500 33 505 33 510 33 535 34 910 47 934 47 944 48 977 49 Shahriar Show quote "Brian Selzer" <br***@selzer-software.com> wrote in message news:OiasqONlFHA.576@TK2MSFTNGP15.phx.gbl... >I tried that: it doesn't work correctly. The start time for each group > after the first depends on whether the start time of the first row for > that > group is more than 30 seconds past the start time of the first row for the > previous group. If you use times/30 the interval that includes 500 runs > from 480 through 509, so the row with the 510 start time is in the next > interval. The expected results clearly show that the rows with 500 and > 510 > are in the same interval. > > There is no set-based operation that can accurately calculate the group id > in a single pass. With a cursor, you can cache the start time of the > current group in a variable, and bump the current group id if the start > time > for the current cursor row is more than 30 seconds past the cached start > time for the current group. > > "Shahriar" <HelloShahr***@hotmail.com> wrote in message > news:IqEGe.97$4e6.80@trnddc04... >> Mowgli >> >> I think this may just do it. >> Table1 contains the field times. >> Table2 contains 2 fields, Times and groupid >> >> >> insert into table2 (times,groupid) (select times,times/30 from Table1) >> >> Shahriar >> >> >> "Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message >> news:1122677634.927113.124950@g43g2000cwa.googlegroups.com... >> >I have a column of starting times below (converted to seconds for >> > simplicity): >> > >> > 500 >> > 505 >> > 510 >> > 535 >> > 910 >> > 939 >> > 944 >> > 977 >> > >> > I need to assign a Group ID to the above values, based on a time >> > interval of 30 seconds, so the correct result would be: >> > >> > Group ID >> > 500, 1 >> > 505, 1 >> > 510, 1 >> > 535, 2 >> > 910, 3 >> > 939, 3 >> > 944, 4 >> > 977, 5 >> > >> > The value of Group ID isn't important (could be 100, 200, etc.) as long >> > as values in the same 30-second interval have the same Group ID. When a >> > new group is started, a new 30-second interval is also started. >> > >> > I have created a solution using SQL loops (row-based processing) but >> > was wondering if there is an efficient/elegant set-based solution when >> > there are thousands of records? >> > >> > >> > Thanks >> > >> > Mowgli >> > >> >> > > Shahriar
Not sure if you mis-typed, do you mean... set @base=500 select times, (times+@base) / 30 from Table1 That looks correct for the data provided but if we insert 2 more values: Time Groupid 500 33 505 33 510 33 535 34 638 37 667 38 910 47 934 47 944 48 977 49 We find that 638 and 667 have different Group IDs when they should be the same! hi,
if u are not particular about the groupID, i have a solution for u. but as per the requirement, group ID definately falls in 30sec intervals. select num, (num-(select min(num)from #test))/30 GroupID from #test please let me know if this answers your question -- Show quotebest Regards, Chandra http://chanduas.blogspot.com/ http://groups.msn.com/SQLResource/ --------------------------------------- "Mowgli" wrote: > I have a column of starting times below (converted to seconds for > simplicity): > > 500 > 505 > 510 > 535 > 910 > 939 > 944 > 977 > > I need to assign a Group ID to the above values, based on a time > interval of 30 seconds, so the correct result would be: > > Group ID > 500, 1 > 505, 1 > 510, 1 > 535, 2 > 910, 3 > 939, 3 > 944, 4 > 977, 5 > > The value of Group ID isn't important (could be 100, 200, etc.) as long > as values in the same 30-second interval have the same Group ID. When a > new group is started, a new 30-second interval is also started. > > I have created a solution using SQL loops (row-based processing) but > was wondering if there is an efficient/elegant set-based solution when > there are thousands of records? > > > Thanks > > Mowgli > > Chandra,
Nice try! Your query gives us the following result set: 500, 0 505, 0 510, 0 535, 1 910, 13 939, 14 944, 14 977, 15 As you can see, 910 and 939 are in different groups when they should be in the same group... Perhaps you could tweak your original solution? On 29 Jul 2005 15:53:54 -0700, Mowgli wrote:
(snip) >I have created a solution using SQL loops (row-based processing) but Hi Mowgli,>was wondering if there is an efficient/elegant set-based solution when >there are thousands of records? I agree with Brian that there is probably no efficient way to do this in one single set-based SQL statement. But I don't think that this means row by row processing is the only efficient way. It's about midnight here now - too late for me to write SQL of this level of complexity. But here's a rough outline of an alternative approach that I'd suggest. I'm not sure if it will run faster than a straight row-by-row version, but I gather that it's worth a try. Step 1: For each row that is not within 30 seconds past any other row, set the group value equal to the seconds value. Step 2: Add all rows within 30 seconds of any of the group values to the same group. Step 3: Repeat step 1, but exclude all rows that are already in a group. Step 4: Repeat step 2. Repeat steps 3 and 4 until @@ROWCOUNT on either of them is 0. At that point, all rows should have a group set. A slightly different version of this would repeat only step 1 and postpone the filling of all rows in the same group to the end. That would reduce the number of UPDATE statements executed, but it would probably increrase the complexity of the SQL for step 1. Let me know if you want to try this, but need help writing the SQL statements. I'll have a go at it when I have more time and when I am less tired. Best, Hugo -- (Remove _NO_ and _SPAM_ to get my e-mail address) Hugo
Yes, your alternative method has merit and I would welcome any attempt you make to write the SQL statements. I will have a go myself but it will probably take me longer... and I'm not even tired! Thanks On 30 Jul 2005 15:57:52 -0700, Mowgli wrote:
>Hugo Hi Mowgli,> >Yes, your alternative method has merit and I would welcome any attempt >you make to write the SQL statements. I have put together the SQL for the two variations of my idea that I posted, and tested them on your sample data + some extra rows. I think the results are as you want them. My guess is that the performance will depend for a large part on the actual distribution of your data. If you have many gaps of over 30 seconds between rows, performance will be fast. If you have hundreds of rows with no 30+-second gap in between them, either the cursor or Steve's code will probably win. Let me know how these versions work on your data! Version 1: ========= CREATE TABLE StartTimes (StartSeconds int NOT NULL PRIMARY KEY, GroupID int DEFAULT NULL) go INSERT INTO StartTimes (StartSeconds) SELECT 500 UNION ALL SELECT 505 UNION ALL SELECT 510 UNION ALL SELECT 529 UNION ALL SELECT 530 UNION ALL SELECT 531 UNION ALL SELECT 535 UNION ALL SELECT 550 UNION ALL SELECT 910 UNION ALL SELECT 939 UNION ALL SELECT 944 UNION ALL SELECT 977 go Step1: UPDATE StartTimes SET GroupID = StartSeconds WHERE NOT EXISTS (SELECT * FROM StartTimes AS b WHERE b.StartSeconds BETWEEN StartTimes.StartSeconds - 29 AND StartTimes.StartSeconds - 1 AND b.GroupID IS NULL) AND GroupID IS NULL IF @@ROWCOUNT = 0 GOTO Done Step2: UPDATE StartTimes SET GroupID = (SELECT MAX(GroupID) FROM StartTimes AS b WHERE b.StartSeconds BETWEEN StartTimes.StartSeconds - 29 AND StartTimes.StartSeconds - 1 AND b.GroupID = b.StartSeconds) WHERE GroupID IS NULL IF @@ROWCOUNT > 0 GOTO Step1 Done: -- Show results SELECT StartSeconds, GroupID FROM StartTimes ORDER BY StartSeconds go DROP TABLE StartTimes go Version 2: ========= CREATE TABLE StartTimes (StartSeconds int NOT NULL PRIMARY KEY, GroupID int DEFAULT NULL) go INSERT INTO StartTimes (StartSeconds) SELECT 500 UNION ALL SELECT 505 UNION ALL SELECT 510 UNION ALL SELECT 529 UNION ALL SELECT 530 UNION ALL SELECT 531 UNION ALL SELECT 535 UNION ALL SELECT 550 UNION ALL SELECT 910 UNION ALL SELECT 939 UNION ALL SELECT 944 UNION ALL SELECT 977 go Step1: UPDATE StartTimes SET GroupID = StartSeconds WHERE GroupID IS NULL AND NOT EXISTS (SELECT * FROM StartTimes AS b WHERE b.StartSeconds BETWEEN StartTimes.StartSeconds - 29 AND StartTimes.StartSeconds - 1 AND ( b.GroupID IS NOT NULL OR ( b.GroupID IS NULL AND NOT EXISTS (SELECT * FROM StartTimes AS c WHERE c.StartSeconds BETWEEN b.StartSeconds - 29 AND b.StartSeconds - 1 AND c.GroupID = c.StartSeconds)))) IF @@ROWCOUNT > 0 GOTO Step1 Step2: UPDATE StartTimes SET GroupID = (SELECT MAX(GroupID) FROM StartTimes AS b WHERE b.StartSeconds BETWEEN StartTimes.StartSeconds - 29 AND StartTimes.StartSeconds - 1 AND b.GroupID = b.StartSeconds) WHERE GroupID IS NULL Done: -- Show results SELECT StartSeconds, GroupID FROM StartTimes ORDER BY StartSeconds go DROP TABLE StartTimes go Best, Hugo -- (Remove _NO_ and _SPAM_ to get my e-mail address) Hugo
Thanks - your solutions do produce the correct results, so we can now do a (brief) performance comparison: 1. Iterative Solution: 20 seconds (Steve) 2. Cursor Solution: 21 seconds (Brian) 3. Alternative Solution 1: 45 seconds (Hugo) 4. Alternative Solution 2: 126 seconds (Hugo) I have excluded the time taken to display the generated data so as to focus on the relative performance of each algorithm. I have also added an extra step (timed) for the non-cursor solutions whereby the data to be processed is placed in a temporary table (avoids some concurrency issues). For the Iterative/Cursor solutions, the above results are average times for processing 85,000 rows of sample data. I also performed tests on different group intervals and smaller data sets - the differences in relative performance changed but the final ranking (in the main) didn't. In fact, the performance of the iterative and cursor solutions were very similar! The results of the "Alternative Solutions" were based on the processing of only 1,700 rows - so these were much slower than the others. However, I may know why that is... Alternative Solution 1: I think the UPDATE statement in step 2 processes too many rows at each iteration. If there are 1,700 rows with start times that are close together, with a 30 second interval setting, at each iteration hundreds of rows could be updated. So NULL Group IDs may be assigned NULL many times. If the above solution can be tweaked I think it will perform much, much better. It's too good an idea to be ignored! Hugo, thanks for bringing it to my attention. On 1 Aug 2005 07:09:51 -0700, Mowgli wrote:
>Hugo Hi Mowgli,> >Thanks - your solutions do produce the correct results, so we can now >do a (brief) performance comparison: > >1. Iterative Solution: 20 seconds (Steve) >2. Cursor Solution: 21 seconds (Brian) >3. Alternative Solution 1: 45 seconds (Hugo) >4. Alternative Solution 2: 126 seconds (Hugo) Thanks for posting the results. I expected solution 2 to be slower than solution 1. Good to see my suspicion confirmed. (snip) >The results of the "Alternative Solutions" were based on the processing You got it. When posting my solution, I was aware that large numbers of>of only 1,700 rows - so these were much slower than the others. >However, I may know why that is... > >Alternative Solution 1: >I think the UPDATE statement in step 2 processes too many rows at each >iteration. If there are 1,700 rows with start times that are close >together, with a 30 second interval setting, at each iteration hundreds >of rows could be updated. So NULL Group IDs may be assigned NULL many >times. rows with small starttime differences would kill performance. For data with lots of 30+-second intervals, my alternative solution would perform better (or should I say: less bad). Maybe still not as fast as the other solutions, though. That's because it starts at each 30+-sec gap simultaneously. To gain more insight in how the two solutions work, add a SELECT * FROM StartTimes at strategic places - but not just before the IF @@ROWCOUNT statements! (I did that yesterday, and it took me several minutes of staring at the screen before I finally realised why the *$^#% thing insisted on endless looping....) In the extreme case of many rows with no single 30+-second interval, my solution would follow the exact same steps as David's - but it'd use more complicated code to get there. In the other extreme, where each of the final groups is 30+ seconds apart from each other group, my solution would finish after the first iteration. But unfortunately, your data is not like that. :-) >If the above solution can be tweaked I think it will perform much, much To prevent the needless UPDATEs, you'd have to repeat the correlated>better. It's too good an idea to be ignored! subquery of the SET in the WHERE clause of the UPDATE. Like this: AND EXISTS (SELECT * FROM StartTimes AS b WHERE b.StartSeconds BETWEEN StartTimes.StartSeconds - 29 AND StartTimes.StartSeconds - 1 AND b.GroupID = b.StartSeconds) I doubt f it will help much, though. If you do want to see how much faster you can get this, there are only two things I could consider. 1. Indexing. My feeling is that the clustered unique index on the StartSeconds column will be the best. (That index is automatically created if you declare StartSeconds the primary key, as I did in my sample data). But it never hurts to try sopme alternatives. Or to run the code with profiler active and feed the output to the index tuning wizard. 2. I think some speed can be gained by changing the GroupID column to GroupID int NOT NULL DEFAULT 0 -- Or any other unused value. Then, change the queries to test for = 0 or <> 0 instead of IS NULL and IS NOT NULL. I think that this will result in a big decrease of the number of page splits necessary during the UPDATE statements. However, at the end of the day, I suspect that David's code and the cursor will still be faster. As explained above - this solution is optimized for a data distribution that differs from yours. Best, Hugo -- (Remove _NO_ and _SPAM_ to get my e-mail address) Hugo
Your suggested changes have improved the query processing on my data sets: 4,000 rows in 45 seconds! (was 1,700 rows in 45 seconds previously) But as you say, your solution works much better on less dense data sets. On 4 Aug 2005 12:00:13 -0700, Mowgli wrote:
>Hugo Hi Mowgli,> >Your suggested changes have improved the query processing on my data >sets: >4,000 rows in 45 seconds! (was 1,700 rows in 45 seconds previously) > >But as you say, your solution works much better on less dense data sets. Thanks for letting me know how it worked out! Best, Hugo -- (Remove _NO_ and _SPAM_ to get my e-mail address) Mowgli,
Here is a possible solution. I am using a view to make the main select readable. If you need a range diff from 30, you can create a inline udf that expect this value as a parameter. I am posting both methods. create table t1 ( c1 int ) go insert into t1 values(500) insert into t1 values(505) insert into t1 values(510) insert into t1 values(535) insert into t1 values(910) insert into t1 values(939) insert into t1 values(944) insert into t1 values(977) go create view v1 as select c1, (select min(b.c1) from t1 as b where b.c1 between a.c1 - 30 and a.c1) as c2 from t1 as a go -- using a view select c1, (select count(distinct b.c2) from v1 as b where b.c2 <= a.c2) as rank from v1 as a order by rank, c1 go create function f1 ( @i int ) returns table as return ( select c1, (select min(b.c1) from t1 as b where b.c1 between a.c1 - @i and a.c1) as c2 from t1 as a ) go -- using a udf select c1, (select count(distinct b.c2) from dbo.f1(30) as b where b.c2 <= a.c2) as rank from dbo.f1(30) as a order by rank, c1 go select c1, (select count(distinct b.c2) from dbo.f1(25) as b where b.c2 <= a.c2) as rank from dbo.f1(25) as a order by rank, c1 go drop view v1 go drop funcion f1 go drop table t1 go AMB Show quote "Mowgli" wrote: > I have a column of starting times below (converted to seconds for > simplicity): > > 500 > 505 > 510 > 535 > 910 > 939 > 944 > 977 > > I need to assign a Group ID to the above values, based on a time > interval of 30 seconds, so the correct result would be: > > Group ID > 500, 1 > 505, 1 > 510, 1 > 535, 2 > 910, 3 > 939, 3 > 944, 4 > 977, 5 > > The value of Group ID isn't important (could be 100, 200, etc.) as long > as values in the same 30-second interval have the same Group ID. When a > new group is started, a new 30-second interval is also started. > > I have created a solution using SQL loops (row-based processing) but > was wondering if there is an efficient/elegant set-based solution when > there are thousands of records? > > > Thanks > > Mowgli > > You can also use an inner join instead a correlated subquery, to calculate
the rank. May be having a clustered index by c1 can improve performance. select a.c1, count(distinct b.c2) as rank from dbo.f1(30) as a inner join dbo.f1(30) as b on b.c2 <= a.c2 group by a.c1 order by rank, a.c1 go AMB Show quote "Alejandro Mesa" wrote: > Mowgli, > > Here is a possible solution. I am using a view to make the main select > readable. If you need a range diff from 30, you can create a inline udf that > expect this value as a parameter. I am posting both methods. > > create table t1 ( > c1 int > ) > go > > insert into t1 values(500) > insert into t1 values(505) > insert into t1 values(510) > insert into t1 values(535) > insert into t1 values(910) > insert into t1 values(939) > insert into t1 values(944) > insert into t1 values(977) > go > > create view v1 > as > select > c1, > (select min(b.c1) from t1 as b where b.c1 between a.c1 - 30 and a.c1) as c2 > from > t1 as a > go > > -- using a view > select > c1, > (select count(distinct b.c2) from v1 as b where b.c2 <= a.c2) as rank > from > v1 as a > order by > rank, c1 > go > > create function f1 ( > @i int > ) > returns table > as > return ( > select > c1, > (select min(b.c1) from t1 as b where b.c1 between a.c1 - @i and a.c1) as c2 > from > t1 as a > ) > go > > -- using a udf > select > c1, > (select count(distinct b.c2) from dbo.f1(30) as b where b.c2 <= a.c2) as rank > from > dbo.f1(30) as a > order by > rank, c1 > go > > select > c1, > (select count(distinct b.c2) from dbo.f1(25) as b where b.c2 <= a.c2) as rank > from > dbo.f1(25) as a > order by > rank, c1 > go > > drop view v1 > go > > drop funcion f1 > go > > drop table t1 > go > > > AMB > > > "Mowgli" wrote: > > > I have a column of starting times below (converted to seconds for > > simplicity): > > > > 500 > > 505 > > 510 > > 535 > > 910 > > 939 > > 944 > > 977 > > > > I need to assign a Group ID to the above values, based on a time > > interval of 30 seconds, so the correct result would be: > > > > Group ID > > 500, 1 > > 505, 1 > > 510, 1 > > 535, 2 > > 910, 3 > > 939, 3 > > 944, 4 > > 977, 5 > > > > The value of Group ID isn't important (could be 100, 200, etc.) as long > > as values in the same 30-second interval have the same Group ID. When a > > new group is started, a new 30-second interval is also started. > > > > I have created a solution using SQL loops (row-based processing) but > > was wondering if there is an efficient/elegant set-based solution when > > there are thousands of records? > > > > > > Thanks > > > > Mowgli > > > > Alejandro
Yes, it's a good idea to compare inner join and correlated subquery performance as it can be difficult predicting which is best (if the SQL optimiser doesn't choose for you). Your solutions work well for "normal" data but when using the following data set: truncate table t1 go insert t1 values (1) insert t1 values (2) insert t1 values (3) insert t1 values (29) insert t1 values (30) insert t1 values (32) insert t1 values (33) The result was... c1 rank ----------- ----------- 1 1 2 1 3 1 29 1 30 1 32 2 33 3 We would expect values 32 and 33 to have the same rank. I think a set-based solution to this problem is quite tricky... but I do like the elegance of your queries - very neat. Well, now I understand what "pure coincidence" means.
AMB Show quote "Mowgli" wrote: > Alejandro > > Yes, it's a good idea to compare inner join and correlated subquery > performance as it can be difficult predicting which is best (if the SQL > optimiser doesn't choose for you). > Your solutions work well for "normal" data but when using the following > data set: > > truncate table t1 > go > insert t1 values (1) > insert t1 values (2) > insert t1 values (3) > insert t1 values (29) > insert t1 values (30) > insert t1 values (32) > insert t1 values (33) > > > The result was... > > c1 rank > ----------- ----------- > 1 1 > 2 1 > 3 1 > 29 1 > 30 1 > 32 2 > 33 3 > > > We would expect values 32 and 33 to have the same rank. > I think a set-based solution to this problem is quite tricky... but I > do like the elegance of your queries - very neat. > > Mowgli,
Someone may have suggested this, but here's an iterative solution that doesn't use cursors. You won't find a non-recursive set-based solution, I don't think, because the definition of the group ID itself is recursive. This solution will do a few reads for each group, and may or may not be faster than a cursor-based solution, since the range size here can comprise at most 30 (a small number) of rows. If your actual scenario has groups with very many rows, this should be a better solution than one using cursors. -- sample data create table StartTimes ( s int primary key ) insert into StartTimes select distinct cast(10*Freight as int) from Northwind..Orders where EmployeeID < 4 and Freight between 50 and 300 go -- peek at sample data select top 10 * from StartTimes order by s go declare @t table (s int, groupID int) declare @group int set @group = 1 declare @upto int set @upto = -2147483648 declare @lastupto int while @@rowcount > 0 begin set @lastupto = @upto set @upto = ( select min(s) from StartTimes where s > @upto ) if @upto is null or @upto > 2147483617 begin insert into @t select s, @group from StartTimes where s > @lastupto break end set @upto = @upto + 30 insert into @t select s, @group from StartTimes where s > @lastupto and s <= @upto set @group = @group + 1 --break end select * from @t go drop table StartTimes -- Steve Kass -- Drew University -- 36F7CD49-68D3-4451-9FAE-5E48B524A1A5 Mowgli wrote: Show quote >I have a column of starting times below (converted to seconds for >simplicity): > >500 >505 >510 >535 >910 >939 >944 >977 > >I need to assign a Group ID to the above values, based on a time >interval of 30 seconds, so the correct result would be: > > Group ID >500, 1 >505, 1 >510, 1 >535, 2 >910, 3 >939, 3 >944, 4 >977, 5 > >The value of Group ID isn't important (could be 100, 200, etc.) as long >as values in the same 30-second interval have the same Group ID. When a >new group is started, a new 30-second interval is also started. > >I have created a solution using SQL loops (row-based processing) but >was wondering if there is an efficient/elegant set-based solution when >there are thousands of records? > > >Thanks > >Mowgli > > > Steve
I modified the group interval from 30 to 29 so that your code output could be directly compared with other solutions. I compared your code performance with Brian's cursor solution submitted earlier (85,000 rows in the table) and your code ran around 35% faster. The proviso is that the start time column in the source table needs to be indexed. Without the index, the cursor solution is at least 10 times faster - but that was to be expected. I don't think that your iterative solution can be improved much more as it's currently very efficient and (as you pointed out) improves as the interval length increases. I'm not sure how Hugo's suggested strategy will compare to yours in practice but I would love to find out. Thanks very much for your contribution! Be careful when using iterated solutions instead of fast_forward cursors.
By its nature, an iterated solution lowers the transaction isolation level of the batch to READ COMMITTED unless there is a transaction outstanding for the duration of the loop. This can lead to inconsistent results. Another transaction may change a row that you've already read and a row that you haven't yet read. For example, another transaction may decrease one account by $1000 and increase another account by $1000. If you've already read out the one row before the decrease occurs, but read out the other row after the increase, your results will be off by $1000. The normal way to prevent this kind of problem is to increase the transaction isolation level for reads to REPEATABLE READ or SERIALIZABLE. REPEATABLE READ ensures that no rows that have already been read can be changed or deleted at least until the SELECT statement finishes executing. SERIALIZABLE also prevents the insert of a row into a range that has already been read. With a fast_forward cursor, you can simply add WITH(HOLDLOCK) to the select statement that populates the cursor. Since a fast_forward cursor creates a snapshot in tempdb from which rows are fetched, the shared locks are only held for as long as it takes to fill the snapshot. There is no need to have an outstanding transaction during the fetch loop. An iterated solution, on the other hand, provides no such mechanism. To prevent the problem described above, you must have an outstanding transaction for the duration of the loop. In addition, the uncertainty inherent in the READ COMMITTED transaction isolation level is amplified when using an iterated solution, because the interval between the time that the first row is read and the time that the last row is read is extended to the entire duration of the loop. If performance is paramount, and the situation described above isn't relevant to the problem at hand, then by all means use the iterated solution instead of the fast_forward cursor. I'm not trying to say that the iterated solution is a bad one; I just wanted to be sure that you were aware of the ramifications. Show quote "Mowgli" <wmasterpr***@yahoo.co.uk> wrote in message news:1122793353.689637.241840@z14g2000cwz.googlegroups.com... > Steve > > I modified the group interval from 30 to 29 so that your code output > could be directly compared with other solutions. > > I compared your code performance with Brian's cursor solution submitted > earlier (85,000 rows in the table) and your code > ran around 35% faster. The proviso is that the start time column in the > source table needs to be indexed. > Without the index, the cursor solution is at least 10 times faster - > but that was to be expected. > > I don't think that your iterative solution can be improved much more as > it's currently very efficient and (as you pointed out) improves as the > interval length increases. I'm not sure how Hugo's suggested strategy > will compare to yours in practice but I would love to find out. > > Thanks very much for your contribution! > Brian
Yes, those are valid points. In this case, the table that contains the start times does contain millions of records - this table could be accessed/updated by other processes. The grouping logic is likely to only process a small subset of these records (thousands) when run at a given time. We could create a snapshot of the relevant data (in a temp table or table variable) for the iterative solution as well - which may mean the overall performance of the iterative solution could be similar/worse than the cursor in some cases. I much prefer performance testing a number of solutions than just the one! On 31 Jul 2005 06:07:12 -0700, Mowgli wrote:
>In this case, the table that contains the start times does contain Hi Mowgli,>millions of records (snip) > The grouping logic is likely to only process a small subset >of these records (thousands) when run at a given time. In that case, most solutions would probably see a huge performance increase if you first copy only the subset of rows that have to be grouped to a temporary table. Especially if you also customize the indexes for that temp table to the algorithm chosen. Of course, that would increase the chance of wrong results due to data being changed while the grouping algorithm is working even further. You will have to weigh the pro's and con's. Best, Hugo -- (Remove _NO_ and _SPAM_ to get my e-mail address) Brian
Yes, those are valid points. In this case, the table that contains the start times does have millions of rows - this table could be accessed/updated by other processes. The grouping logic is likely to only process a small subset of these rows (thousands) when run at a given time. We could create a snapshot of the relevant data (in a temp table or table variable) for the iterative solution as well - which may mean the overall performance of the iterative solution could be similar/worse than the cursor in some cases. I much prefer performance testing a number of solutions than just the one! |
|||||||||||||||||||||||