Home All Groups Group Topic Archive Search About

Computing Medians The Joe Celko Way

Author
7 Sep 2006 7:48 PM
JLS
I have been researching the best way to calculate median values using
SQL.  Most of the threads I come across include a semi-canned response
from Joe Celko which I have some questions about.

First, Celko says,

Show quote
> The statistical median of a column with duplicate values can be found
> with a query based on the same ideas, but you have to adjust the HAVING
> clause to allow for overlap; thus, the left statistical median is found
> by

> SELECT P1.weight
>       FROM Parts AS P1, Parts AS P2
>    GROUP BY P1.weight
> HAVING SUM(CASE WHEN P2.weight <= P1.weight
>          THEN 1 ELSE 0 END)
>             >= ((COUNT(*) + 1) / 2)
>    AND SUM(CASE WHEN P2.weight >= P1.weight
>          THEN 1 ELSE 0 END)
>             >= (COUNT(*)/2 + 1);

I adapted this SQL to my situation but, unless I am mistaken, it
returns both the left and right statistical medians.  For the life of
me, I can't figure out how to get it to return only the left or right
statistical median.  Luckily, I don't need those specifically.

Celko continues...

Show quote
> I would recommend using a combination of the right and left statistical
> medians to return a set of values about the center of the data, and then
> averaging them,
> thus:
>
> SELECT AVG(P1.weight)
>    FROM Parts AS P1, Parts AS P2
> HAVING (SUM(CASE WHEN P2.weight <= P1.weight -- left median
>       THEN 1 ELSE 0 END)
>          >= ((COUNT(*) + 1) / 2)
>    AND SUM(CASE WHEN P2.weight >= P1.weight
>       THEN 1 ELSE 0 END)
>          >= (COUNT(*)/2 + 1))
>    OR (SUM(CASE WHEN P2.weight >= P1.weight  -- right median
>       THEN 1 ELSE 0 END)
>          >= ((COUNT(*) + 1) / 2)
>    AND SUM(CASE WHEN P2.weight <= P1.weight
>       THEN 1 ELSE 0 END)
>           >= (COUNT(*)/2 + 1));

There is a certain "air code" quality to this section (it appears to be
missing a GROUP BY clause and maybe some parenthesis) but again, unless
I am mistaken, the ORed section is redundant (i.e. the section
preceding the OR is identical, aside from some non-critical reordering,
to the section following the OR).

So, finally, my question.  Has anyone used a method similar to this for
computing medians?  Is there anything subtly (or grossly, I suppose)
wrong with the following SQL for calculating  a Financial Median?

SELECT AVG(P1.weight)
      FROM Parts AS P1, Parts AS P2
   GROUP BY P1.weight
HAVING SUM(CASE WHEN P2.weight <= P1.weight
         THEN 1 ELSE 0 END)
            >= ((COUNT(*) + 1) / 2)
   AND SUM(CASE WHEN P2.weight >= P1.weight
         THEN 1 ELSE 0 END)
            >= (COUNT(*)/2 + 1);

I think other formulas (such as AVG(DISTINCT P1.weight) could be
substituted in the SELECT clause depending on which "form" of median
was desired but I am more concerned about whether I will always get the
right "starting" records for the median calculation than with the
specific formula for the median calculation.

Author
7 Sep 2006 10:31 PM
Erland Sommarskog
JLS (jlspub***@hotmail.com) writes:
Show quote
> So, finally, my question.  Has anyone used a method similar to this for
> computing medians?  Is there anything subtly (or grossly, I suppose)
> wrong with the following SQL for calculating  a Financial Median?
>
> SELECT AVG(P1.weight)
>       FROM Parts AS P1, Parts AS P2
>    GROUP BY P1.weight
> HAVING SUM(CASE WHEN P2.weight <= P1.weight
>          THEN 1 ELSE 0 END)
>             >= ((COUNT(*) + 1) / 2)
>    AND SUM(CASE WHEN P2.weight >= P1.weight
>          THEN 1 ELSE 0 END)
>             >= (COUNT(*)/2 + 1);
>
> I think other formulas (such as AVG(DISTINCT P1.weight) could be
> substituted in the SELECT clause depending on which "form" of median
> was desired but I am more concerned about whether I will always get the
> right "starting" records for the median calculation than with the
> specific formula for the median calculation.

On SQL 2005 you can use row_number(). Here is a query which builds on
an idea from SQL Server Itzik Ben-Gan:

   SELECT listlen, datatype,
          cnt = MIN (cnt), median = avg(tookms), avgms = MIN(avgms)
   FROM   (SELECT listlen, datatype, optype, method, calltype,
                  cnt   = COUNT(*)    OVER (PARTITION BY listlen, datatype),
                  avgms = AVG(tookms) OVER (PARTITION BY listlen, datatype),
                  tookms,
                 rowno  = row_number() OVER (PARTITION BY listlen, datatype
                                             ORDER BY tookms)
         FROM   timings) AS x
   WHERE  rowno IN (cnt/2 + 1, (cnt + 1)/2)

tookms is the value I'm computing the median of. The idea is that if
the number of elements are odd, outer WHERE clause will only find one
row, and if there is an even number it will hit two rows, and the median
is the average of those two rows.


--
Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Author
8 Sep 2006 12:35 PM
JLS
Thanks for giving me another take on how this could be done.
Unfortunately, I haven't migrated to SQL 2005 yet.  However, since you
brought it up, I understand that SQL 2005 permits user-defined
aggregates.  Could the Median function be implemented as a user-defined
aggregate?  If it can, that would be great because I wouldn't have to
"re-code" my median logic for various grouping scenarios.  From reading
BOL, I get the impression that user-defined aggregate functions only
"see" the data one record at a time and in unsorted order so the
calculation would require a fair amount of behind the scenes storage
space for large result sets but it seems like it might work.  Any
thoughts?

Erland Sommarskog wrote:
Show quote
> On SQL 2005 you can use row_number(). Here is a query which builds on
> an idea from SQL Server Itzik Ben-Gan:
>
>    SELECT listlen, datatype,
>           cnt = MIN (cnt), median = avg(tookms), avgms = MIN(avgms)
>    FROM   (SELECT listlen, datatype, optype, method, calltype,
>                   cnt   = COUNT(*)    OVER (PARTITION BY listlen, datatype),
>                   avgms = AVG(tookms) OVER (PARTITION BY listlen, datatype),
>                   tookms,
>                  rowno  = row_number() OVER (PARTITION BY listlen, datatype
>                                              ORDER BY tookms)
>          FROM   timings) AS x
>    WHERE  rowno IN (cnt/2 + 1, (cnt + 1)/2)
>
> tookms is the value I'm computing the median of. The idea is that if
> the number of elements are odd, outer WHERE clause will only find one
> row, and if there is an even number it will hit two rows, and the median
> is the average of those two rows.
>
>
> --
> Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se
>
> Books Online for SQL Server 2005 at
> http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
> Books Online for SQL Server 2000 at
> http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Author
8 Sep 2006 2:02 PM
Erland Sommarskog
JLS (jlspub***@hotmail.com) writes:
> Thanks for giving me another take on how this could be done.
> Unfortunately, I haven't migrated to SQL 2005 yet.  However, since you
> brought it up, I understand that SQL 2005 permits user-defined
> aggregates.  Could the Median function be implemented as a user-defined
> aggregate?  If it can, that would be great because I wouldn't have to
> "re-code" my median logic for various grouping scenarios.  From reading
> BOL, I get the impression that user-defined aggregate functions only
> "see" the data one record at a time and in unsorted order so the
> calculation would require a fair amount of behind the scenes storage
> space for large result sets but it seems like it might work.  Any
> thoughts?

I have not looked into UDAs much myself, but my impression is the same as
yours. As long as the data comes running, you can only accumulate, and then
at end you can start computing. Which means that if you are ask to asked the
median on 100 million values, you will eat quite some memory.




--
Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Author
14 Sep 2006 7:40 PM
Conor Cunningham [MS]
You are correct, UDAggs in SQL 2005 only "see" one row at a time.  The
internal state could work around that, but unfortunately it too is limited
in size so you can't just store them all and compute the value at the end.

Row_number can be used to see the rows in order, and I'd recommend you use
that to build your own solution in 2005.

For previous releases, the techniques published by others as seen in this
newsgroup are what I recommend.

Thanks,

Conor
SQL Server Query Processor Team


Show quote
"Erland Sommarskog" <esq***@sommarskog.se> wrote in message
news:Xns9838A2FCA1BF6Yazorman@127.0.0.1...
> JLS (jlspub***@hotmail.com) writes:
>> Thanks for giving me another take on how this could be done.
>> Unfortunately, I haven't migrated to SQL 2005 yet.  However, since you
>> brought it up, I understand that SQL 2005 permits user-defined
>> aggregates.  Could the Median function be implemented as a user-defined
>> aggregate?  If it can, that would be great because I wouldn't have to
>> "re-code" my median logic for various grouping scenarios.  From reading
>> BOL, I get the impression that user-defined aggregate functions only
>> "see" the data one record at a time and in unsorted order so the
>> calculation would require a fair amount of behind the scenes storage
>> space for large result sets but it seems like it might work.  Any
>> thoughts?
>
> I have not looked into UDAs much myself, but my impression is the same as
> yours. As long as the data comes running, you can only accumulate, and
> then
> at end you can start computing. Which means that if you are ask to asked
> the
> median on 100 million values, you will eat quite some memory.
>
>
>
>
> --
> Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se
>
> Books Online for SQL Server 2005 at
> http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
> Books Online for SQL Server 2000 at
> http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx

AddThis Social Bookmark Button