|
database
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
Computing Medians The Joe Celko WaySQL. Most of the threads I come across include a semi-canned response from Joe Celko which I have some questions about. First, Celko says, Show quoteHide quote > The statistical median of a column with duplicate values can be found I adapted this SQL to my situation but, unless I am mistaken, it> with a query based on the same ideas, but you have to adjust the HAVING > clause to allow for overlap; thus, the left statistical median is found > by > SELECT P1.weight > FROM Parts AS P1, Parts AS P2 > GROUP BY P1.weight > HAVING SUM(CASE WHEN P2.weight <= P1.weight > THEN 1 ELSE 0 END) > >= ((COUNT(*) + 1) / 2) > AND SUM(CASE WHEN P2.weight >= P1.weight > THEN 1 ELSE 0 END) > >= (COUNT(*)/2 + 1); returns both the left and right statistical medians. For the life of me, I can't figure out how to get it to return only the left or right statistical median. Luckily, I don't need those specifically. Celko continues... Show quoteHide quote > I would recommend using a combination of the right and left statistical There is a certain "air code" quality to this section (it appears to be> medians to return a set of values about the center of the data, and then > averaging them, > thus: > > SELECT AVG(P1.weight) > FROM Parts AS P1, Parts AS P2 > HAVING (SUM(CASE WHEN P2.weight <= P1.weight -- left median > THEN 1 ELSE 0 END) > >= ((COUNT(*) + 1) / 2) > AND SUM(CASE WHEN P2.weight >= P1.weight > THEN 1 ELSE 0 END) > >= (COUNT(*)/2 + 1)) > OR (SUM(CASE WHEN P2.weight >= P1.weight -- right median > THEN 1 ELSE 0 END) > >= ((COUNT(*) + 1) / 2) > AND SUM(CASE WHEN P2.weight <= P1.weight > THEN 1 ELSE 0 END) > >= (COUNT(*)/2 + 1)); missing a GROUP BY clause and maybe some parenthesis) but again, unless I am mistaken, the ORed section is redundant (i.e. the section preceding the OR is identical, aside from some non-critical reordering, to the section following the OR). So, finally, my question. Has anyone used a method similar to this for computing medians? Is there anything subtly (or grossly, I suppose) wrong with the following SQL for calculating a Financial Median? SELECT AVG(P1.weight) FROM Parts AS P1, Parts AS P2 GROUP BY P1.weight HAVING SUM(CASE WHEN P2.weight <= P1.weight THEN 1 ELSE 0 END) >= ((COUNT(*) + 1) / 2) AND SUM(CASE WHEN P2.weight >= P1.weightTHEN 1 ELSE 0 END) >= (COUNT(*)/2 + 1); I think other formulas (such as AVG(DISTINCT P1.weight) could besubstituted in the SELECT clause depending on which "form" of median was desired but I am more concerned about whether I will always get the right "starting" records for the median calculation than with the specific formula for the median calculation. JLS (jlspub***@hotmail.com) writes:
Show quoteHide quote > So, finally, my question. Has anyone used a method similar to this for On SQL 2005 you can use row_number(). Here is a query which builds on> computing medians? Is there anything subtly (or grossly, I suppose) > wrong with the following SQL for calculating a Financial Median? > > SELECT AVG(P1.weight) > FROM Parts AS P1, Parts AS P2 > GROUP BY P1.weight > HAVING SUM(CASE WHEN P2.weight <= P1.weight > THEN 1 ELSE 0 END) > >= ((COUNT(*) + 1) / 2) > AND SUM(CASE WHEN P2.weight >= P1.weight > THEN 1 ELSE 0 END) > >= (COUNT(*)/2 + 1); > > I think other formulas (such as AVG(DISTINCT P1.weight) could be > substituted in the SELECT clause depending on which "form" of median > was desired but I am more concerned about whether I will always get the > right "starting" records for the median calculation than with the > specific formula for the median calculation. an idea from SQL Server Itzik Ben-Gan: SELECT listlen, datatype, cnt = MIN (cnt), median = avg(tookms), avgms = MIN(avgms) FROM (SELECT listlen, datatype, optype, method, calltype, cnt = COUNT(*) OVER (PARTITION BY listlen, datatype), avgms = AVG(tookms) OVER (PARTITION BY listlen, datatype), tookms, rowno = row_number() OVER (PARTITION BY listlen, datatype ORDER BY tookms) FROM timings) AS x WHERE rowno IN (cnt/2 + 1, (cnt + 1)/2) tookms is the value I'm computing the median of. The idea is that if the number of elements are odd, outer WHERE clause will only find one row, and if there is an even number it will hit two rows, and the median is the average of those two rows. -- Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se Books Online for SQL Server 2005 at http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx Thanks for giving me another take on how this could be done.
Unfortunately, I haven't migrated to SQL 2005 yet. However, since you brought it up, I understand that SQL 2005 permits user-defined aggregates. Could the Median function be implemented as a user-defined aggregate? If it can, that would be great because I wouldn't have to "re-code" my median logic for various grouping scenarios. From reading BOL, I get the impression that user-defined aggregate functions only "see" the data one record at a time and in unsorted order so the calculation would require a fair amount of behind the scenes storage space for large result sets but it seems like it might work. Any thoughts? Erland Sommarskog wrote: Show quoteHide quote > On SQL 2005 you can use row_number(). Here is a query which builds on > an idea from SQL Server Itzik Ben-Gan: > > SELECT listlen, datatype, > cnt = MIN (cnt), median = avg(tookms), avgms = MIN(avgms) > FROM (SELECT listlen, datatype, optype, method, calltype, > cnt = COUNT(*) OVER (PARTITION BY listlen, datatype), > avgms = AVG(tookms) OVER (PARTITION BY listlen, datatype), > tookms, > rowno = row_number() OVER (PARTITION BY listlen, datatype > ORDER BY tookms) > FROM timings) AS x > WHERE rowno IN (cnt/2 + 1, (cnt + 1)/2) > > tookms is the value I'm computing the median of. The idea is that if > the number of elements are odd, outer WHERE clause will only find one > row, and if there is an even number it will hit two rows, and the median > is the average of those two rows. > > > -- > Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se > > Books Online for SQL Server 2005 at > http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx > Books Online for SQL Server 2000 at > http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx JLS (jlspub***@hotmail.com) writes:
> Thanks for giving me another take on how this could be done. I have not looked into UDAs much myself, but my impression is the same as> Unfortunately, I haven't migrated to SQL 2005 yet. However, since you > brought it up, I understand that SQL 2005 permits user-defined > aggregates. Could the Median function be implemented as a user-defined > aggregate? If it can, that would be great because I wouldn't have to > "re-code" my median logic for various grouping scenarios. From reading > BOL, I get the impression that user-defined aggregate functions only > "see" the data one record at a time and in unsorted order so the > calculation would require a fair amount of behind the scenes storage > space for large result sets but it seems like it might work. Any > thoughts? yours. As long as the data comes running, you can only accumulate, and then at end you can start computing. Which means that if you are ask to asked the median on 100 million values, you will eat quite some memory. -- Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se Books Online for SQL Server 2005 at http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx You are correct, UDAggs in SQL 2005 only "see" one row at a time. The
internal state could work around that, but unfortunately it too is limited in size so you can't just store them all and compute the value at the end. Row_number can be used to see the rows in order, and I'd recommend you use that to build your own solution in 2005. For previous releases, the techniques published by others as seen in this newsgroup are what I recommend. Thanks, Conor SQL Server Query Processor Team Show quoteHide quote "Erland Sommarskog" <esq***@sommarskog.se> wrote in message news:Xns9838A2FCA1BF6Yazorman@127.0.0.1... > JLS (jlspub***@hotmail.com) writes: >> Thanks for giving me another take on how this could be done. >> Unfortunately, I haven't migrated to SQL 2005 yet. However, since you >> brought it up, I understand that SQL 2005 permits user-defined >> aggregates. Could the Median function be implemented as a user-defined >> aggregate? If it can, that would be great because I wouldn't have to >> "re-code" my median logic for various grouping scenarios. From reading >> BOL, I get the impression that user-defined aggregate functions only >> "see" the data one record at a time and in unsorted order so the >> calculation would require a fair amount of behind the scenes storage >> space for large result sets but it seems like it might work. Any >> thoughts? > > I have not looked into UDAs much myself, but my impression is the same as > yours. As long as the data comes running, you can only accumulate, and > then > at end you can start computing. Which means that if you are ask to asked > the > median on 100 million values, you will eat quite some memory. > > > > > -- > Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se > > Books Online for SQL Server 2005 at > http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx > Books Online for SQL Server 2000 at > http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
UDF that take a UDT as a parameter (CLR)
Table reaching max value of identity very early Crosstab, Pivot? Paging with ROW_NUMBER on filtered records (SQL SERVER 2005) how do I test the cost of udf? Problem tuning procedures with temporary tables sql query similar to excel pivot table current Date query FOR XML AUTO, ELEMENTS Problem Swap field values in UPDATE |
|||||||||||||||||||||||