Home All Groups Group Topic Archive Search About

Wrong Number of Rows with Correlated Subquery + Group By

Author
7 Jul 2006 6:49 PM
mbrierst
While working on calculating medians for a table (in SQL server 2000),
we ended up with a query like the one below.  The query as written
returns one row for each row in the temp table (7 rows total).  We just
want one row for each category.  If you uncomment the "distinct",
you'll get two rows as desired.  But if you instead uncomment the
"group by t.category", you get 14 rows (# rows * number categories).  I
want to know WHY this happens.  It doesn't make any sense to me.  The
group by and distinct should act exactly the same in this case.  Is
this a bug, or is there something about subqueries that I don't
understand?

create table #temp (category int, i int)
insert #temp select 1, 1
insert #temp select 1, 3
insert #temp select 1, 5
insert #temp select 1, 5
insert #temp select 2, 1
insert #temp select 2, 3
insert #temp select 2, 3
select /*distinct*/ t.category,
    ((select max(x1.i)
      from (select top 50 percent i
              from #temp t1
              where t1.category = t.category
              order by t1.i
           ) x1) +
    (select min(x2.i)
      from (select top 50 percent i
              from #temp t2
              where t2.category = t.category
              order by t2.i desc
           ) x2))/2.0 median
  from #temp t
  /*group by t.category*/
  order by category
drop table #temp

Author
7 Jul 2006 8:19 PM
Arnie Rowland
Try this instead. (If you could change the datatype for the column 'i' to decimal, you would have to CAST in the query.)

NOTE: This provides the 'Financial' median -NOT the true or statistical median.

DECLARE @MyTable table
   (   Category int
     , i        int
   )

INSERT INTO @MyTable VALUES ( 1, 1 )
INSERT INTO @MyTable VALUES ( 1, 3 )
INSERT INTO @MyTable VALUES ( 1, 5 )
INSERT INTO @MyTable VALUES ( 1, 5 )
INSERT INTO @MyTable VALUES ( 2, 1 )
INSERT INTO @MyTable VALUES ( 2, 3 )
INSERT INTO @MyTable VALUES ( 2, 3 )

SELECT
     Category
   , avg( cast( i AS decimal ))                       AS 'Average'
   , ceiling( sum( cast(i as decimal) ) / count(1))   AS 'Median'
FROM @MyTable
GROUP BY Category

I included the average for verification purposes.

--
Arnie Rowland*
"To be successful, your heart must accompany your knowledge."



Show quoteHide quote
<mbrie***@gmail.com> wrote in message news:1152298174.409940.159850@m79g2000cwm.googlegroups.com...
> While working on calculating medians for a table (in SQL server 2000),
> we ended up with a query like the one below.  The query as written
> returns one row for each row in the temp table (7 rows total).  We just
> want one row for each category.  If you uncomment the "distinct",
> you'll get two rows as desired.  But if you instead uncomment the
> "group by t.category", you get 14 rows (# rows * number categories).  I
> want to know WHY this happens.  It doesn't make any sense to me.  The
> group by and distinct should act exactly the same in this case.  Is
> this a bug, or is there something about subqueries that I don't
> understand?
>
> create table #temp (category int, i int)
> insert #temp select 1, 1
> insert #temp select 1, 3
> insert #temp select 1, 5
> insert #temp select 1, 5
> insert #temp select 2, 1
> insert #temp select 2, 3
> insert #temp select 2, 3
> select /*distinct*/ t.category,
>    ((select max(x1.i)
>      from (select top 50 percent i
>              from #temp t1
>              where t1.category = t.category
>              order by t1.i
>           ) x1) +
>    (select min(x2.i)
>      from (select top 50 percent i
>              from #temp t2
>              where t2.category = t.category
>              order by t2.i desc
>           ) x2))/2.0 median
>  from #temp t
>  /*group by t.category*/
>  order by category
> drop table #temp
>
Are all your drivers up to date? click for free checkup

Author
7 Jul 2006 8:28 PM
mbrierst
Arnie Rowland wrote:
> Try this instead. (If you could change the datatype for the column 'i' to decimal, you would have to CAST in the query.)

Thanks for your suggestion, but I'm not interested in another way of
doing this calculation.  I know there are many ways to work around this
issue, and many ways to calculate the median.  What I really want to
know is WHY it acts like it does, with too many rows returned when
using a group by.
Author
7 Jul 2006 8:52 PM
Arnie Rowland
Sorry, that wasn't complete. I was trying to do something with this, but it still doesn't work ... -Bummer.  (Well, it worked with the limited dataset provided. But once I began expanding the data, I saw that it's just not so simple...-Friday afternoon, what can I say...)

SELECT
     Category
   , avg( cast( i AS decimal )) AS 'Average'
   , CASE ( count(1) % 2 )
      WHEN 0 THEN ceiling( sum( cast(i AS decimal)) / count(1))
      ELSE floor( sum( cast(i AS decimal)) / count(1))
     END  AS 'Median'
FROM @MyTable
GROUP BY Category
--
Arnie Rowland*
"To be successful, your heart must accompany your knowledge."



  "Arnie Rowland" <ar***@1568.com> wrote in message news:OPtaZKgoGHA.2268@TK2MSFTNGP04.phx.gbl...
  Try this instead. (If you could change the datatype for the column 'i' to decimal, you would have to CAST in the query.)

  NOTE: This provides the 'Financial' median -NOT the true or statistical median.

  DECLARE @MyTable table
     (   Category int
       , i        int
     )

  INSERT INTO @MyTable VALUES ( 1, 1 )
  INSERT INTO @MyTable VALUES ( 1, 3 )
  INSERT INTO @MyTable VALUES ( 1, 5 )
  INSERT INTO @MyTable VALUES ( 1, 5 )
  INSERT INTO @MyTable VALUES ( 2, 1 )
  INSERT INTO @MyTable VALUES ( 2, 3 )
  INSERT INTO @MyTable VALUES ( 2, 3 )

  SELECT
       Category
     , avg( cast( i AS decimal ))                       AS 'Average'
     , ceiling( sum( cast(i as decimal) ) / count(1))   AS 'Median'
  FROM @MyTable
  GROUP BY Category

  I included the average for verification purposes.

  --
  Arnie Rowland*
  "To be successful, your heart must accompany your knowledge."



Show quoteHide quote
  <mbrie***@gmail.com> wrote in message news:1152298174.409940.159850@m79g2000cwm.googlegroups.com...
  > While working on calculating medians for a table (in SQL server 2000),
  > we ended up with a query like the one below.  The query as written
  > returns one row for each row in the temp table (7 rows total).  We just
  > want one row for each category.  If you uncomment the "distinct",
  > you'll get two rows as desired.  But if you instead uncomment the
  > "group by t.category", you get 14 rows (# rows * number categories).  I
  > want to know WHY this happens.  It doesn't make any sense to me.  The
  > group by and distinct should act exactly the same in this case.  Is
  > this a bug, or is there something about subqueries that I don't
  > understand?
  >
  > create table #temp (category int, i int)
  > insert #temp select 1, 1
  > insert #temp select 1, 3
  > insert #temp select 1, 5
  > insert #temp select 1, 5
  > insert #temp select 2, 1
  > insert #temp select 2, 3
  > insert #temp select 2, 3
  > select /*distinct*/ t.category,
  >    ((select max(x1.i)
  >      from (select top 50 percent i
  >              from #temp t1
  >              where t1.category = t.category
  >              order by t1.i
  >           ) x1) +
  >    (select min(x2.i)
  >      from (select top 50 percent i
  >              from #temp t2
  >              where t2.category = t.category
  >              order by t2.i desc
  >           ) x2))/2.0 median
  >  from #temp t
  >  /*group by t.category*/
  >  order by category
  > drop table #temp
  >
Author
7 Jul 2006 10:06 PM
Anith Sen
>>  I want to know WHY this happens.  It doesn't make any sense to me.  The
>> group by and distinct should act exactly the same in this case.  Is this
>> a bug, or is there something about subqueries that I don't understand?

This is a bug and is fixed in SQL 2005.

In SQL 2000 re-writing is your best option, since the query formulation is
error-prone and the order of evaluation is potentially a bug. Just compare
the execution plans and it is easy to find out the extra branch in the
execution plan introduced by the GROUP BY that treats the correlation as an
entirely separate subset.

SELECT t.category,
       ( SELECT MAX( x1.i )
           FROM ( SELECT i FROM #temp t1
                   WHERE t1.category = t.category ) x1 )
  FROM #temp t;

SELECT t.category,
       ( SELECT MAX( x1.i )
           FROM ( SELECT i FROM #temp t1
                   WHERE t1.category = t.category ) x1 )
  FROM #temp t
GROUP BY t.category;

Even if it works, it is not easy to give a logical explanation for the
behaviour since you have no keys and you use ordered subsets in derived
tables -- this is essentially similar to "faking" ORDER BY clause in view
with a TOP clause in the SELECT.

Also, in SQL 2005, you can simply use the ROW_NUMBER() function to get this
done more efficiently.

--
Anith
Author
10 Jul 2006 1:12 PM
mbrierst
Anith Sen wrote:
> >>  I want to know WHY this happens.  It doesn't make any sense to me.  The
> >> group by and distinct should act exactly the same in this case.  Is this
> >> a bug, or is there something about subqueries that I don't understand?
>
> This is a bug and is fixed in SQL 2005.
>

Thanks very much for verifying that this is a bug.  It's not important,
but do you by any chance have a reference to somewhere on the microsoft
site where this might be mentioned?

Bookmark and Share

Post Thread options