|
database
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
Outer Join and WHERE clause strange behaviortable by joining it with a view DDL for the table: CREATE TABLE w3cexlog ( pacifictime datetime, [cs-method] varchar(20), target varchar(8000), query varchar(2000), username varchar(100), browser varchar(8000), referrer varchar(2000), status int, substatus int, win32status int, [sc-bytes] bigint, [cs-bytes] bigint, [time-taken] bigint ) The view is through ADSI, and pulls the Active Directory OU, samAccountName (which corresponds to username), and full name for the user. If the view where a base table, it's DDL would look like this CREATE TABLE PortalUsers ( name nvarchar(512), samAccountName nvarchar(512), OU nvarchar(8000) ) I want to find out how many hits where generated by users in each OU, along with visitors from each OU, by joining PortalUsers to w3cexlog on PortalUsers.samAccountName = w3cexlog.username. (The web site uses AD authentication, so they must match). Here is the query: SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS visitors FROM portalusers p LEFT JOIN w3cexlog w on p.samaccountname=w.username GROUP BY p.ou ORDER BY 2 DESC, 1 This query gives the expected results; OUs for which none of its users produced a "hit" still show in the output, but with zero's in the "hits" and "visitors" columns. However, if I add a WHERE clause to query a specific date range: SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS visitors FROM portalusers p LEFT JOIN w3cexlog w on p.samaccountname=w.username WERE w.pacifictime BETWEEN '20060102' AND '20060112' GROUP BY p.ou ORDER BY 2 DESC, 1 The OUs that produces no hits are now gone from the result set! That WHERE clause wouldn't have removed any rows for the OUs that didn't have hits; they where never there to begin with; so why does the behavior of the outer join change? In order to get around this, I constructed the following query SELECT t1.OU, t2.hits, t2.visitors FROM ( SELECT DISTINCT OU FROM PortalUsers ) t1 LEFT JOIN ( SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS visitors FROM portalusers p LEFT JOIN w3cexlog w on p.samaccountname=w.username WERE w.pacifictime BETWEEN '20060102' AND '20060112' GROUP BY p.ou ) t2 ON t1.OU = t2.OU ORDER BY 2 DESC, 1 But the performance of this is terrible. --
Show quote
"Mark Williams" <MarkWilli***@discussions.microsoft.com> wrote in message This is the way things work. An outer join may contain resultsnews:5002E6BF-4C4F-48B1-B333-D7B398C42430@microsoft.com... >I am writing a stored procedure to get aggregrate data out of a web site >log > table by joining it with a view > > > SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS > visitors > FROM portalusers p LEFT JOIN w3cexlog w on > p.samaccountname=w.username > WERE w.pacifictime BETWEEN '20060102' AND '20060112' > GROUP BY p.ou > ORDER BY 2 DESC, 1 > > The OUs that produces no hits are now gone from the result set! That WHERE > clause wouldn't have removed any rows for the OUs that didn't have hits; > they > where never there to begin with; so why does the behavior of the outer > join > change? > Mark, that fail the ON condition, but it will never contain results that fail the WHERE condition. The WHERE condition is a real filter than can be thought of as being applied to the result of the outer join. If you have result rows with 0 hits, all the result columns from w3cexlog in those result rows contain NULL. NULL will never be between '20060102' and '20060112', so you'll see no rows that failed the ON condition. Maybe what you want is just .... on p.samaccountname=w.username AND w.pacifictime BETWEEN '20060102' AND '20060112' .... Steve Kass Drew University Another option is to leave the join unchanged and alter the where caluse like
this: .... WHERE (w.pacifictime BETWEEN '20060102' AND '20060112' or w.pacifictime is null) .... ML --- http://milambda.blogspot.com/ It will work, but I wouldn't suggest this as a good option, I'm afraid.
This is like adding "and I really meant that is was an outer join" to the where clause. Or did you see some advantage with putting it in the where clause, like some problem that doing it this way might solve? -- Show quote---------------------------------------------------------------------------- Louis Davidson - http://spaces.msn.com/members/drsql/ SQL Server MVP "Arguments are to be avoided: they are always vulgar and often convincing." (Oscar Wilde) "ML" <M*@discussions.microsoft.com> wrote in message news:3F48CD19-D89E-4535-8FFD-0045DF823CA1@microsoft.com... > Another option is to leave the join unchanged and alter the where caluse > like > this: > > ... > WHERE (w.pacifictime BETWEEN '20060102' AND '20060112' or w.pacifictime is > null) > ... > > > ML > > --- > http://milambda.blogspot.com/ Another option is to move the condition to the ON clause:
SELECT t1.OU, t2.hits, t2.visitors FROM ( SELECT DISTINCT OU FROM PortalUsers ) t1 LEFT JOIN ( SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS visitors, pacifictime FROM portalusers p LEFT JOIN w3cexlog w on p.samaccountname=w.username GROUP BY p.ou ) t2 ON (t1.OU = t2.OU ) AND t2.pacifictime BETWEEN '20060102' AND '20060112' ORDER BY 2 DESC, 1 That might perform better Since both methods yield same results the only argument would be performance.
So, which one would perform better? ML --- http://milambda.blogspot.com/ On Sun, 15 Jan 2006 14:09:01 -0800, ML wrote:
>Since both methods yield same results the only argument would be performance. Hi ML,>So, which one would perform better? Have you testeed them? (I'll gladly admit I didn't - but due to the OR in your method, I'd be surprised if it turns out to be the faster version). Of course, they don't yield the same results if the column is nullible. -- Hugo Kornelis, SQL Server MVP > Of course, they don't yield the same results if the column is nullible. Which column, again? The one used in the join or the one used as the filter?ML --- http://milambda.blogspot.com/ On Mon, 16 Jan 2006 02:10:02 -0800, ML wrote:
>> Of course, they don't yield the same results if the column is nullible. Hi ML,> >Which column, again? The one used in the join or the one used as the filter? The one used in the filter. Here's a repro. Both queries return three rows, but only one is equal. CREATE TABLE Tab1 (JoinCol int NOT NULL, OtherCol int) CREATE TABLE Tab2 (JoinCol int NOT NULL, OtherCol int) go INSERT INTO Tab1 (JoinCol, OtherCol) SELECT 1, 1 UNION ALL SELECT 2, 0 UNION ALL SELECT 3, NULL INSERT INTO Tab2 (JoinCol, OtherCol) SELECT 1, 1 UNION ALL SELECT 1, NULL UNION ALL SELECT 2, 0 UNION ALL SELECT 3, NULL go SELECT * FROM Tab1 LEFT JOIN Tab2 ON Tab2.JoinCol = Tab1.JoinCol WHERE Tab2.OtherCol = 1 OR Tab2.OtherCol IS NULL go SELECT * FROM Tab1 LEFT JOIN Tab2 ON Tab2.JoinCol = Tab1.JoinCol AND Tab2.OtherCol = 1 go DROP TABLE Tab2 DROP TABLE Tab1 go -- Hugo Kornelis, SQL Server MVP Oh, I see! Thank you for the explanation. So, is it safe to assume that the
correct method of filtering outer sets would be to do it in the join? ML --- http://milambda.blogspot.com/ On Mon, 16 Jan 2006 15:21:02 -0800, ML wrote:
>Oh, I see! Thank you for the explanation. So, is it safe to assume that the Hi ML,>correct method of filtering outer sets would be to do it in the join? That depends on the results that you need to get. Your method is correct if you want to join to rows with matching account id and time in the requested range OR no time stored in the DB. My method (or Louis' adapted version of yours) is correct if you want to join to rows with matching account id and time in the requested range. In the latter case, I still rpefer my syntax to Louis' syntax becuase it shows much clearer what the join conditions are. But that's a personal issue, and in a performance-sensitive DB, I'd choose Louis' version without hesitation if it turned out to be faster. -- Hugo Kornelis, SQL Server MVP Beyond what was discussed about how if the pacifictime column was nullable
(you would want to check for the join columns being null) the argument is style and symantics. I would argue that it is very important to write readable queries that can be understood by the next reader AND that make sense. Performance is important, and if the less clear version peformed measurably better (if anything it would likely be no better than equal, but sometimes there can be reasons to do things that don't seem kosher for performance sake, but this is unlikely to be one of those cases. -- Show quote---------------------------------------------------------------------------- Louis Davidson - http://spaces.msn.com/members/drsql/ SQL Server MVP "Arguments are to be avoided: they are always vulgar and often convincing." (Oscar Wilde) "ML" <M*@discussions.microsoft.com> wrote in message news:61118E3B-FAED-4124-86F5-26219564531A@microsoft.com... > Since both methods yield same results the only argument would be > performance. > So, which one would perform better? > > > ML > > --- > http://milambda.blogspot.com/ When discussing the term "readability" two things IMHO should be taken into
account: 1) what is the widely accepted standard? Above all, the base of readability is whether a piece of code can be understood by most peers in the specific industry; and 2) what is the propriatary standard in a segment of industry (e.g. a company)? In addition to the widely accepted standards special rules and/or principles can constitute a standard used by a company - be it internally or in co-operation with other companies. This is what I meant in my pervious post. When performance is the key and the product cannot simply follow the two bases of readability an exception to the rules is made and should be designated as such, or should constitute a new (special) rule. Well, at least that's my personal rule. ML --- http://milambda.blogspot.com/ I hadn't been following this thread since after the initial responses. Here
are how the queries perform (15578 rows in w3cexlog, with nonclustered indexes on pacifictime and username) SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS visitors FROM portalusers p LEFT JOIN w3cexlog w ON p.samaccountname=w.username WHERE (w.pacifictime BETWEEN '20060102' AND '20060114') OR w.pacifictime IS NULL GROUP BY p.ou ORDER BY 2 DESC, 1 Account to SQL profiler, the above batch shows a duration of 1250 milliseconds, CPU time of 1031 milliseconds. SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS visitors FROM portalusers p LEFT JOIN w3cexlog w ON p.samaccountname=w.username AND w.pacifictime BETWEEN '20060102' AND '20060114' GROUP BY p.ou ORDER BY 2 DESC, 1 According to SQL Profiler, the above batch has a duration of 1233 milliseconds, and CPU time of 961 seconds. The query with the date range restriction applied to the join appears to run ever-so-slightly faster. As the table gets larger over time, a clearer performance difference may appear. For the record, none of the columns in the table are nullable, so there will never be a pacifictime or username of NULL. Alexander, your query as written will not work because it would have to group the 'hits' by date. I modified your suggestion into SELECT t1.OU, COUNT(t2.username) as "hits", COUNT(DISTINCT t2.username) AS "visitors" FROM ( SELECT DISTINCT OU FROM PortalUsers ) t1 LEFT JOIN ( SELECT p.ou, w.username, w.pacifictime FROM portalusers p LEFT JOIN w3cexlog w on p.samaccountname=w.username ) t2 ON (t1.OU = t2.OU ) AND t2.pacifictime BETWEEN '20060102' AND '20060112' GROUP BY t1.OU ORDER BY 2 DESC, 1 Profiler shows a duration of 2250 milliseconds for this query. In my very non-professional opinion, the query that has the date restriction in the join is slightly more readable. Show quote "ML" wrote: > When discussing the term "readability" two things IMHO should be taken into > account: > > 1) what is the widely accepted standard? Above all, the base of readability > is whether a piece of code can be understood by most peers in the specific > industry; and > > 2) what is the propriatary standard in a segment of industry (e.g. a > company)? In addition to the widely accepted standards special rules and/or > principles can constitute a standard used by a company - be it internally or > in co-operation with other companies. > > This is what I meant in my pervious post. When performance is the key and > the product cannot simply follow the two bases of readability an exception to > the rules is made and should be designated as such, or should constitute a > new (special) rule. > > Well, at least that's my personal rule. > > > ML > > --- > http://milambda.blogspot.com/ >I hadn't been following this thread since after the initial responses. This is often the case :) These things take on a life of their own as we start discussing philosopy instead of helping anymore (but it is fun nevertheless!) > milliseconds, and CPU time of 961 seconds. The query with the date range It may also get worse. Query performance gets wierd as results get bigger > restriction applied to the join appears to run ever-so-slightly faster. As > the table gets larger over time, a clearer performance difference may > appear. and bigger. Luckily it is fun for most DBA's, and it keeps us employed. > For the record, none of the columns in the table are nullable, so there This to me is the biggest concern when writing queries like this. The > will > never be a pacifictime or username of NULL. NULLibility of the column should not come into play by accident. Like I said somewhere else, just change the where clause version to a primary key column. Show quote "Mark Williams" <MarkWilli***@discussions.microsoft.com> wrote in message news:3D6794C0-4400-4154-9AF1-9AC0A71B46B7@microsoft.com... >I hadn't been following this thread since after the initial responses. Here > are how the queries perform (15578 rows in w3cexlog, with nonclustered > indexes on pacifictime and username) > > SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS > visitors > FROM portalusers p LEFT JOIN w3cexlog w > ON p.samaccountname=w.username > WHERE (w.pacifictime BETWEEN '20060102' AND '20060114') OR w.pacifictime > IS > NULL > GROUP BY p.ou > ORDER BY 2 DESC, 1 > > Account to SQL profiler, the above batch shows a duration of 1250 > milliseconds, CPU time of 1031 milliseconds. > > SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS > visitors > FROM portalusers p LEFT JOIN w3cexlog w > ON p.samaccountname=w.username AND w.pacifictime BETWEEN '20060102' AND > '20060114' > GROUP BY p.ou > ORDER BY 2 DESC, 1 > > According to SQL Profiler, the above batch has a duration of 1233 > milliseconds, and CPU time of 961 seconds. The query with the date range > restriction applied to the join appears to run ever-so-slightly faster. As > the table gets larger over time, a clearer performance difference may > appear. > > For the record, none of the columns in the table are nullable, so there > will > never be a pacifictime or username of NULL. > > Alexander, your query as written will not work because it would have to > group the 'hits' by date. I modified your suggestion into > > SELECT t1.OU, COUNT(t2.username) as "hits", COUNT(DISTINCT t2.username) AS > "visitors" > FROM > ( > SELECT DISTINCT OU FROM PortalUsers > ) t1 > LEFT JOIN > ( > SELECT p.ou, w.username, w.pacifictime > FROM portalusers p LEFT JOIN w3cexlog w on > p.samaccountname=w.username > ) t2 > ON (t1.OU = t2.OU ) > AND t2.pacifictime BETWEEN '20060102' AND '20060112' > GROUP BY t1.OU > ORDER BY 2 DESC, 1 > > Profiler shows a duration of 2250 milliseconds for this query. > > In my very non-professional opinion, the query that has the date > restriction > in the join is slightly more readable. > > > "ML" wrote: > >> When discussing the term "readability" two things IMHO should be taken >> into >> account: >> >> 1) what is the widely accepted standard? Above all, the base of >> readability >> is whether a piece of code can be understood by most peers in the >> specific >> industry; and >> >> 2) what is the propriatary standard in a segment of industry (e.g. a >> company)? In addition to the widely accepted standards special rules >> and/or >> principles can constitute a standard used by a company - be it internally >> or >> in co-operation with other companies. >> >> This is what I meant in my pervious post. When performance is the key and >> the product cannot simply follow the two bases of readability an >> exception to >> the rules is made and should be designated as such, or should constitute >> a >> new (special) rule. >> >> Well, at least that's my personal rule. >> >> >> ML >> >> --- >> http://milambda.blogspot.com/ > As the table gets larger over time, a clearer performance difference may appear. As the data grows, you could encounter at least 3 different situations:1. users have on average just a few rows in w3cexlog, just a few users don't have child rows 2. users have on average hundreds or thousands of rows in w3cexlog, just a few users don't have child rows 3. most users don't have any rows in w3cexlog, those that do have on average just a few rows What performs best in situation 2 may be the worst performer in situation 1. > Alexander, your query as written will not work because it would have to group the 'hits' by date.thanks for the correction On Sat, 14 Jan 2006 04:46:01 -0800, ML wrote:
>Another option is to leave the join unchanged and alter the where caluse like Hi ML,>this: > >... >WHERE (w.pacifictime BETWEEN '20060102' AND '20060112' or w.pacifictime is >null) >... That would also include rows with w.pacifictime equal to NULL in the base table (assuming the column is nullable); I don't think that this is what Mark wants. -- Hugo Kornelis, SQL Server MVP I didn't even think about this in the original reply. You would actually
have to use: or w.username is null (since this is the criteria that the join was done on.) -- Show quote---------------------------------------------------------------------------- Louis Davidson - http://spaces.msn.com/members/drsql/ SQL Server MVP "Arguments are to be avoided: they are always vulgar and often convincing." (Oscar Wilde) "Hugo Kornelis" <h***@perFact.REMOVETHIS.info> wrote in message news:aqfls1t9j1ii2i9muhu9e5vf6b2q2dadb0@4ax.com... > On Sat, 14 Jan 2006 04:46:01 -0800, ML wrote: > >>Another option is to leave the join unchanged and alter the where caluse >>like >>this: >> >>... >>WHERE (w.pacifictime BETWEEN '20060102' AND '20060112' or w.pacifictime is >>null) >>... > > Hi ML, > > That would also include rows with w.pacifictime equal to NULL in the > base table (assuming the column is nullable); I don't think that this is > what Mark wants. > > -- > Hugo Kornelis, SQL Server MVP Thanks Steve and ML. I should have seen it, but I was a little tired! I had
the incorrect thinking that the WHERE clause was applied before the join, instead of after. -- Show quote"Steve Kass" wrote: > > "Mark Williams" <MarkWilli***@discussions.microsoft.com> wrote in message > news:5002E6BF-4C4F-48B1-B333-D7B398C42430@microsoft.com... > >I am writing a stored procedure to get aggregrate data out of a web site > >log > > table by joining it with a view > > > > > > SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS > > visitors > > FROM portalusers p LEFT JOIN w3cexlog w on > > p.samaccountname=w.username > > WERE w.pacifictime BETWEEN '20060102' AND '20060112' > > GROUP BY p.ou > > ORDER BY 2 DESC, 1 > > > > The OUs that produces no hits are now gone from the result set! That WHERE > > clause wouldn't have removed any rows for the OUs that didn't have hits; > > they > > where never there to begin with; so why does the behavior of the outer > > join > > change? > > > Mark, > > This is the way things work. An outer join may contain results > that fail the ON condition, but it will never contain results that fail > the WHERE condition. The WHERE condition is a real filter > than can be thought of as being applied to the result of the > outer join. > > If you have result rows with 0 hits, all the result columns from > w3cexlog in those result rows contain NULL. NULL will never > be between '20060102' and '20060112', so you'll see no rows > that failed the ON condition. > > Maybe what you want is just > .... on p.samaccountname=w.username > AND w.pacifictime BETWEEN '20060102' AND '20060112' > .... > > Steve Kass > Drew University > > > > |
|||||||||||||||||||||||