Home All Groups Group Topic Archive Search About

Optimizing an IN clause

Author
12 Jan 2006 11:50 PM
Roger Garrett
I have a fairly straightforward SELECT query that includes the following:

MembersTable.MemberID IN
(
SELECT ZipcodesTable.MemberID
FROM ZipcodesTable.Zipcode IN
(
'01234','03631','55902' ...   '03036'
)

That is, it's looking for entries in the ZipcodeTable where the Zipcode
value is any one of a very large set of zipcodes, up to 500 Zipcodes. My
Zipcode field is a 5 character field.

Would the query run faster if I made the Zipcode field an int instead of 5
chars?

Are there other ways to speed up the zipcode-matching part of my query?

- Roger Garrett

Author
13 Jan 2006 12:05 AM
Jens
Changing to integer is not possible in this case, cause you got leading
zero in your zip code. Converting them to INT would trim them away. The
best way to speed this up would be top store the ZIP codes in a table
to join them. It would be even more easy to manage.

HTH, Jens Suessmeyer.
Author
14 Jan 2006 3:02 PM
Roger Garrett
Jens,

I don't see that leading zeros would cause any problem. I'm suggesting that
I change the Zipcodes column from char 5 to int and to just store the
numerical integer values of the zipcodes rather than the 5-character string.
A zipcode of "03036" would become a 3036 in the column, and when I'm
searching for an "03036" it would match with the 3036 value. Of course, I
would specify numeric values (e.g. 03036) rathet than quoted strings
('03036').

I'm assuming that a 5-char column occupies at least 6 bytes (to make it on
an even byte  boundary) and that an int occupies 4 bytes. That at a minium
saves some space in the database. And then, when I'm looking for a particular
value, it only has to compare 4 bytes (which is mostly likely a single
hardware instruction) instead of six bytes, so the queries should run a bit
faster.
Author
14 Jan 2006 5:28 PM
Steve Kass
I suggest you don't do this.  While it might save a little space
and make some pieces of your application a little faster, the
bottom line is that a ZIP code is not a number, but a string,
and you'll likely regret any decision to store it as something
it isn't.  For example, when you need to deal with ZIP+4s
some day, you won't enjoy tracking the problems down.

  select 32837 + '-3420'
  select 32837 + '-' + '3420'

You'll also forget the leading 0 somewhere, and as a result,
mail you send to me will go to Texas or the dead letter bin.

Steve Kass
Drew University
Madison, NJ 07940

Show quote
"Roger Garrett" <RogerGarr***@discussions.microsoft.com> wrote in message
news:19CF2F1A-54E9-4DB4-A51E-F0CA427B7E9F@microsoft.com...
> Jens,
>
> I don't see that leading zeros would cause any problem. I'm suggesting
> that
> I change the Zipcodes column from char 5 to int and to just store the
> numerical integer values of the zipcodes rather than the 5-character
> string.
> A zipcode of "03036" would become a 3036 in the column, and when I'm
> searching for an "03036" it would match with the 3036 value. Of course, I
> would specify numeric values (e.g. 03036) rathet than quoted strings
> ('03036').
>
> I'm assuming that a 5-char column occupies at least 6 bytes (to make it on
> an even byte  boundary) and that an int occupies 4 bytes. That at a minium
> saves some space in the database. And then, when I'm looking for a
> particular
> value, it only has to compare 4 bytes (which is mostly likely a single
> hardware instruction) instead of six bytes, so the queries should run a
> bit
> faster.
>
Author
14 Jan 2006 5:41 PM
Jens
For the performace reason: Try it out, once you converted this into
have your data as a char. (which is like I said and meanwhile also
Steve pointed out, preferable, because you don´t have to deal with
later problems around this. We has ourselves in Germany a change from
4digit numbers to 5 digits with a trailing zero. I can tell you, that
was for many software vendors like the Y2k problem).

These is my opinion, my personal experience and advice for you.

HTH, Jens Suessmeyer.
Author
13 Jan 2006 12:09 AM
ML
Relations. The performance of this query would benefit from using a join
instead of using the IN operator.

How do the values get into the query?


ML

---
http://milambda.blogspot.com/
Author
14 Jan 2006 3:15 PM
Roger Garrett
ML,

My application program constructs the query.

What's happenning here is this: I have a Zipcodes table. That table has two
columns, a MemberID column and a Zipcode column. Each member has one entry in
this table, signifying the zipcode of where he lives.
At certain times during the running of my application it needs to know the
MemebrIDs of all the members that live within a certain radius of a given
member. My application program figures out which zipcodes are within that
radius and constructs an array of strings signifying that set of zipcodes. It
then constructs a query, using that array of strings, in order to get from
the database the set of MemberIDs, from the Zipcodes table, of those members
who reside in any of those zipcodes. The query looks something like:

SELECT ZipcodesTable.MemberID
FROM ZipcodesTable.Zipcode IN
(
'01234','03631','55902' ... lots of zipcodes here ...  '03036'
)

As far as I can tell that means that SQL Server has to compare each and
every zipocde in the Zipcodes table with (possibly all of) the zipcodes
within the IN clause of the query. In fact, for MOST of the rows in the
Zipcodes table it will have to do the comparison against ALL of the zipcodes
in the IN clause, since most of the members will not be within ANY of those
zipcodes.

Now, if SQL Server were smart it might order those zipcodes from the IN
clause and determine the smallest and largest zipcode values and thereby do a
much quicker comparison at each row. But I don't know that I can rely on SQL
Server being that smart.

So I'm looking for a better way to express the query so that it runs as fast
as possible.
Author
15 Jan 2006 1:11 AM
Andrew J. Kelly
Have a look at these:

http://www.sommarskog.se/arrays-in-sql.html  Arrays & Lists
http://www.aspfaq.com/show.asp?id=2248                 Arrays & Lists
http://www.sommarskog.se/dynamic_sql.html  Dynamic SQL
http://vyaskn.tripod.com/passing_arrays_to_stored_procedures.htm   Arrays &
SP's

--
Andrew J. Kelly  SQL MVP


Show quote
"Roger Garrett" <RogerGarr***@discussions.microsoft.com> wrote in message
news:6191E649-0F16-4F91-A944-25AA34F8C831@microsoft.com...
> ML,
>
> My application program constructs the query.
>
> What's happenning here is this: I have a Zipcodes table. That table has
> two
> columns, a MemberID column and a Zipcode column. Each member has one entry
> in
> this table, signifying the zipcode of where he lives.
> At certain times during the running of my application it needs to know the
> MemebrIDs of all the members that live within a certain radius of a given
> member. My application program figures out which zipcodes are within that
> radius and constructs an array of strings signifying that set of zipcodes.
> It
> then constructs a query, using that array of strings, in order to get from
> the database the set of MemberIDs, from the Zipcodes table, of those
> members
> who reside in any of those zipcodes. The query looks something like:
>
> SELECT ZipcodesTable.MemberID
> FROM ZipcodesTable.Zipcode IN
> (
> '01234','03631','55902' ... lots of zipcodes here ...  '03036'
> )
>
> As far as I can tell that means that SQL Server has to compare each and
> every zipocde in the Zipcodes table with (possibly all of) the zipcodes
> within the IN clause of the query. In fact, for MOST of the rows in the
> Zipcodes table it will have to do the comparison against ALL of the
> zipcodes
> in the IN clause, since most of the members will not be within ANY of
> those
> zipcodes.
>
> Now, if SQL Server were smart it might order those zipcodes from the IN
> clause and determine the smallest and largest zipcode values and thereby
> do a
> much quicker comparison at each row. But I don't know that I can rely on
> SQL
> Server being that smart.
>
> So I'm looking for a better way to express the query so that it runs as
> fast
> as possible.
>
Author
13 Jan 2006 3:21 AM
Andrew J. Kelly
Have you tried placing the zipcodes in a table and using EXISTS instead of
IN?

--
Andrew J. Kelly  SQL MVP


Show quote
"Roger Garrett" <RogerGarr***@discussions.microsoft.com> wrote in message
news:5C55A707-6E72-4CDF-BFC0-7320AA8CDA12@microsoft.com...
>I have a fairly straightforward SELECT query that includes the following:
>
> MembersTable.MemberID IN
> (
> SELECT ZipcodesTable.MemberID
> FROM ZipcodesTable.Zipcode IN
> (
> '01234','03631','55902' ...   '03036'
> )
>
> That is, it's looking for entries in the ZipcodeTable where the Zipcode
> value is any one of a very large set of zipcodes, up to 500 Zipcodes. My
> Zipcode field is a 5 character field.
>
> Would the query run faster if I made the Zipcode field an int instead of 5
> chars?
>
> Are there other ways to speed up the zipcode-matching part of my query?
>
> - Roger Garrett
>
Author
14 Jan 2006 3:33 PM
Roger Garrett
Andrew,

I wasn't familiar with the EXISTS operator (I'm very new to all this) so I
just now read up on it. I don't see how EXISTS will help. How are you
suggesting that the zipcodes be put in a table? Do you mean the set of
zipcodes that I'm looking for for the specific current query? WHat would the
EXISTS query look like?

Please see my reply to Jens for a (hopefully) clearer description of what
I'm trying to accomplish.



Show quote
"Andrew J. Kelly" wrote:

> Have you tried placing the zipcodes in a table and using EXISTS instead of
> IN?
>
> --
> Andrew J. Kelly  SQL MVP
>
>
> "Roger Garrett" <RogerGarr***@discussions.microsoft.com> wrote in message
> news:5C55A707-6E72-4CDF-BFC0-7320AA8CDA12@microsoft.com...
> >I have a fairly straightforward SELECT query that includes the following:
> >
> > MembersTable.MemberID IN
> > (
> > SELECT ZipcodesTable.MemberID
> > FROM ZipcodesTable.Zipcode IN
> > (
> > '01234','03631','55902' ...   '03036'
> > )
> >
> > That is, it's looking for entries in the ZipcodeTable where the Zipcode
> > value is any one of a very large set of zipcodes, up to 500 Zipcodes. My
> > Zipcode field is a 5 character field.
> >
> > Would the query run faster if I made the Zipcode field an int instead of 5
> > chars?
> >
> > Are there other ways to speed up the zipcode-matching part of my query?
> >
> > - Roger Garrett
> >
>
>
>
Author
13 Jan 2006 4:22 AM
--CELKO--
>> Would the query run faster if I made the Zipcode field [sic] an INTEGER instead of CHAR(5)?  <<

ZIP codes are CHAR(5) and not INTEGER.  Columns are not fields.

>> Are there other ways to speed up the zipcode-matching part of my query? <<

For a large number of zip codes, you might  find using a table instead
of a list is faster.  It would have an index on its single column.

SELECT member_id
  FROM Membership
WHERE zip_code
      IN (SELECT zip_code FROM ZipLists);
Author
14 Jan 2006 3:37 PM
Roger Garrett
CELKO,

Are you suggesting that, prior to performing the query, I create a table,
fill it up with the set of zipcodes that I'm looking for, and perform the
query (and presumably then get rid of the table of zipcodes)?
Each time that I do the query it is almost guaranteed that I will be looking
for a different set of zipcodes. Is it really more efficient to create a new
table each time?

- Roger


Show quote
"--CELKO--" wrote:

> >> Would the query run faster if I made the Zipcode field [sic] an INTEGER instead of CHAR(5)?  <<
>
> ZIP codes are CHAR(5) and not INTEGER.  Columns are not fields.
>
> >> Are there other ways to speed up the zipcode-matching part of my query? <<
>
> For a large number of zip codes, you might  find using a table instead
> of a list is faster.  It would have an index on its single column.
>
> SELECT member_id
>   FROM Membership
>  WHERE zip_code
>       IN (SELECT zip_code FROM ZipLists);
>
>
Author
15 Jan 2006 8:04 PM
--CELKO--
>> Are you suggesting that, prior to performing the query, I create a
table, fill it up with the set of zipcodes that I'm looking for, and
perform the query (and presumably then get rid of the table of
zipcodes)? <<

Why would you want to get rid of that table? Just do inserts, updates
and deletes on it. That means other procedures for managing it.  Writing
a new query with a new zip code list in it will cost you compilation
time. 

If you have standard zip code lists for mailings, then the table is a
handy thing:

CREATE TABLE Mailings
(mailing_id INTEGER NOT NULL,
zip_code CHAR(5) NOT NULL,
PRIMARY KEY (mailing_id, zip_code));

Now mailing_id becomes a parameter to the original procedure.  You are
using procedures and not writing in-line code, aren't you? 

The question as to which is faster and if it is worth it depends on the
SQL product and the release and the size of the list of zip codes.  For
example, Ingres optimizes IN() and EXISTS() the same way.  DB2 builds a
hash table if the IN() list is over a certain size and a scratch table
if it is bigger than that.  Sybase and others automatically link PK-FK
references in a single index structure. 

Gross generalization: joining to another table is better as the list
gets longer.  A small table without a key can be read into main storage
and will act like a list of constants.  A larger table can be indexed or
hashed to speed up searching. 



--CELKO--
Please post DDL in a human-readable format and not a machine-generated
one. This way people do not have to guess what the keys, constraints,
DRI, datatypes, etc. in your schema are. Sample data is also a good
idea, along with clear specifications. 


*** Sent via Developersdex http://www.developersdex.com ***
Author
15 Jan 2006 8:57 PM
Tony Rogerson
> The question as to which is faster and if it is worth it depends on the
> SQL product and the release and the size of the list of zip codes.  For

Read the name of the group!

It would help if you kept it to Microsoft SQL Server.

--
Tony Rogerson
SQL Server MVP
http://sqlserverfaq.com - free video tutorials


Show quote
"--CELKO--" <remove.jcelko***@earthlink.net> wrote in message
news:e$Jco7gGGHA.1032@TK2MSFTNGP11.phx.gbl...
>>> Are you suggesting that, prior to performing the query, I create a
> table, fill it up with the set of zipcodes that I'm looking for, and
> perform the query (and presumably then get rid of the table of
> zipcodes)? <<
>
> Why would you want to get rid of that table? Just do inserts, updates
> and deletes on it. That means other procedures for managing it.  Writing
> a new query with a new zip code list in it will cost you compilation
> time.
>
> If you have standard zip code lists for mailings, then the table is a
> handy thing:
>
> CREATE TABLE Mailings
> (mailing_id INTEGER NOT NULL,
> zip_code CHAR(5) NOT NULL,
> PRIMARY KEY (mailing_id, zip_code));
>
> Now mailing_id becomes a parameter to the original procedure.  You are
> using procedures and not writing in-line code, aren't you?
>
> The question as to which is faster and if it is worth it depends on the
> SQL product and the release and the size of the list of zip codes.  For
> example, Ingres optimizes IN() and EXISTS() the same way.  DB2 builds a
> hash table if the IN() list is over a certain size and a scratch table
> if it is bigger than that.  Sybase and others automatically link PK-FK
> references in a single index structure.
>
> Gross generalization: joining to another table is better as the list
> gets longer.  A small table without a key can be read into main storage
> and will act like a list of constants.  A larger table can be indexed or
> hashed to speed up searching.
>
>
>
> --CELKO--
> Please post DDL in a human-readable format and not a machine-generated
> one. This way people do not have to guess what the keys, constraints,
> DRI, datatypes, etc. in your schema are. Sample data is also a good
> idea, along with clear specifications.
>
>
> *** Sent via Developersdex http://www.developersdex.com ***
Author
23 Jan 2006 6:20 PM
Roger Garrett
Show quote
"--CELKO--" wrote:

> >> Are you suggesting that, prior to performing the query, I create a
> table, fill it up with the set of zipcodes that I'm looking for, and
> perform the query (and presumably then get rid of the table of
> zipcodes)? <<
>
> Why would you want to get rid of that table? Just do inserts, updates
> and deletes on it. That means other procedures for managing it.  Writing
> a new query with a new zip code list in it will cost you compilation
> time. 
>
> If you have standard zip code lists for mailings, then the table is a
> handy thing:
>
> CREATE TABLE Mailings
> (mailing_id INTEGER NOT NULL,
>  zip_code CHAR(5) NOT NULL,
>  PRIMARY KEY (mailing_id, zip_code));
>
> Now mailing_id becomes a parameter to the original procedure.  You are
> using procedures and not writing in-line code, aren't you? 
>
> The question as to which is faster and if it is worth it depends on the
> SQL product and the release and the size of the list of zip codes.  For
> example, Ingres optimizes IN() and EXISTS() the same way.  DB2 builds a
> hash table if the IN() list is over a certain size and a scratch table
> if it is bigger than that.  Sybase and others automatically link PK-FK
> references in a single index structure. 
>
> Gross generalization: joining to another table is better as the list
> gets longer.  A small table without a key can be read into main storage
> and will act like a list of constants.  A larger table can be indexed or
> hashed to speed up searching. 
>
>
>
> --CELKO--
> Please post DDL in a human-readable format and not a machine-generated
> one. This way people do not have to guess what the keys, constraints,
> DRI, datatypes, etc. in your schema are. Sample data is also a good
> idea, along with clear specifications. 
>
>
> *** Sent via Developersdex http://www.developersdex.com ***
>
Author
23 Jan 2006 6:25 PM
Roger Garrett
CELKO,

My application is one in which I have millions of Members. Each member lives
in some zipocde. My query is intended to find the other members who live
within a specific radius of a given member. In order to perform the query I
therefore must create a list of zipcodes within that radius and use that list
within the query. Each such query, being as it is for a different member,
will have a different list of zipcodes. That means I have to build the query
(at least the zipcode portion of it) each time I do the query.

I don't see how I could put that zipcode info into a table, seeing as it is
different for each query. Or would it be reasonable to have a (temporary)
zipcode table that I fill up prior to my query, with the specific zipocdes
that I'm looking for, and then reference that table withinthe query, rather
than having the list of zipcodes included explicitly within the query? (and
then deleting that zipocde table after the query).

Roger Garrett
Author
29 Jan 2006 4:42 AM
--CELKO--
>> Each member lives in some zipocde. My query is intended to find the other members who live within a specific radius of a given member.<<

Ask the USPS about Zone tables.  This is a list the ZIP code by
distance from a given ZIP code.  It is updates, free and close enough
for practical work.
Author
29 Jan 2006 4:45 AM
--CELKO--
NO, I am suggesting that you go to the USPS  can get the Zone tables
tapes of CDs on s subscription basis.  Never create data that you can
buy from a trusted source.
Author
13 Jan 2006 8:03 AM
Tony Rogerson
Hi Roger,

Make sure you have an index ZipCode, MemberID on the ZipcodesTable.

Also, use EXISTS instead of IN.

AND EXISTS (
    SELECT *
    FROM ZipcodesTable zt
    WHERE zt.Zipcode IN ( ........ )
           AND zt.MemberID = MembersTable.MemberID
    )

Even better if you could put the IN clause into a table of its own...

AND EXISTS (
    SELECT *
    FROM #ZCodes zt
    WHERE zt.MemberID = MembersTable.MemberID
    )

--
Tony Rogerson
SQL Server MVP
http://sqlserverfaq.com - free video tutorials


Show quote
"Roger Garrett" <RogerGarr***@discussions.microsoft.com> wrote in message
news:5C55A707-6E72-4CDF-BFC0-7320AA8CDA12@microsoft.com...
>I have a fairly straightforward SELECT query that includes the following:
>
> MembersTable.MemberID IN
> (
> SELECT ZipcodesTable.MemberID
> FROM ZipcodesTable.Zipcode IN
> (
> '01234','03631','55902' ...   '03036'
> )
>
> That is, it's looking for entries in the ZipcodeTable where the Zipcode
> value is any one of a very large set of zipcodes, up to 500 Zipcodes. My
> Zipcode field is a 5 character field.
>
> Would the query run faster if I made the Zipcode field an int instead of 5
> chars?
>
> Are there other ways to speed up the zipcode-matching part of my query?
>
> - Roger Garrett
>
Author
16 Jan 2006 3:00 PM
Jim Underwood
Just curious...

How long does it take your code to run with the long list of zip codes?

Is MembersTable already indexed on Zipcode?

Lastly, at the risk of making things much more complicated that they need to
be....
How are you determining which zip codes are within a certain area of each
other?  I am guessing that you have some sort of coordinates attached to
each zip code in order to determine its relative location to another zip
code?

Point being that somewhere you have logic which looks up a list of zip
codes.  You may very well be able to use that same logic to join to your
MembersTable, rather than generating a list then using that list as an in
clause.  I am just guessing at how you are doing this, and tossing out an
idea, so forgive me if I am way off base.

Show quote
"Roger Garrett" <RogerGarr***@discussions.microsoft.com> wrote in message
news:5C55A707-6E72-4CDF-BFC0-7320AA8CDA12@microsoft.com...
> I have a fairly straightforward SELECT query that includes the following:
>
> MembersTable.MemberID IN
> (
> SELECT ZipcodesTable.MemberID
> FROM ZipcodesTable.Zipcode IN
> (
> '01234','03631','55902' ...   '03036'
> )
>
> That is, it's looking for entries in the ZipcodeTable where the Zipcode
> value is any one of a very large set of zipcodes, up to 500 Zipcodes. My
> Zipcode field is a 5 character field.
>
> Would the query run faster if I made the Zipcode field an int instead of 5
> chars?
>
> Are there other ways to speed up the zipcode-matching part of my query?
>
> - Roger Garrett
>

AddThis Social Bookmark Button