Home All Groups Group Topic Archive Search About

Proper Index On a LIKE and GROUP BY query

Author
16 Aug 2006 10:44 PM
Roee
Hello,

I have very big table. I don't know the exact size, but the total size of
the DB is ~80 GB and the table has ~18 million rows, its columns types are:
int, varchar 500, float, char 500, char 100.

I'm trying to do the following query and it takes a very long time:
SELECT col1, count(*), sum(col3)
FROM table
WHERE col5 LIKE 'string%'
GROUP BY col1

I thought that the most important thing in this kind of query is to build a
clustered index on col1 since this is the column that has the group by clause
but since the query is very slow I wonder whether I have the right index, or
maybe I should build another one on col5 (that has the LIKE cluse).

Will an index help when searching for a LIKE 'string%' values?
Should I change the clustered index to another columns?

I'm using SQL Server 2000.

Thanks,
Roee.

Author
16 Aug 2006 10:54 PM
Chris Lim
Roee wrote:
> I thought that the most important thing in this kind of query is to build a
> clustered index on col1 since this is the column that has the group by clause
> but since the query is very slow I wonder whether I have the right index, or
> maybe I should build another one on col5 (that has the LIKE cluse).
>
> Will an index help when searching for a LIKE 'string%' values?

You should definitely have an index on col5. Whether the optimizer uses
it depends on what you end up searching for. Doing a LIKE '%' will
result in a table scan, whereas a  LIKE 'abceef%' should produce an
index seek.

You could try a covering index (col5, col1, col3) as well.
Author
16 Aug 2006 10:56 PM
Vern Rabe
The best indexing is almost always "it depends", but I'd try a couple of
things. If the number of rows that would qualify for you "LIKE 'string%'
clause is small (3% or less), you would probably benefit from a clustered
indes on col5, or even a non-clustered index on col5, col1, and col3 (in that
order) as a covering index. Only you can determine the best via testing.

I would also re-evaluate the char(500) and char(100) datatypes. Char will
always use the full width, and if much of your data in those two columns are
smaller that the specified size, you're wasting space. Varchar may
significantly reduce the table size, and improve query performance.

HTH
Vern Rabe

Show quote
"Roee" wrote:

> Hello,
>
> I have very big table. I don't know the exact size, but the total size of
> the DB is ~80 GB and the table has ~18 million rows, its columns types are:
> int, varchar 500, float, char 500, char 100.
>
> I'm trying to do the following query and it takes a very long time:
> SELECT col1, count(*), sum(col3)
> FROM table
> WHERE col5 LIKE 'string%'
> GROUP BY col1
>
> I thought that the most important thing in this kind of query is to build a
> clustered index on col1 since this is the column that has the group by clause
> but since the query is very slow I wonder whether I have the right index, or
> maybe I should build another one on col5 (that has the LIKE cluse).
>
> Will an index help when searching for a LIKE 'string%' values?
> Should I change the clustered index to another columns?
>
> I'm using SQL Server 2000.
>
> Thanks,
> Roee.
Author
16 Aug 2006 11:48 PM
Roee
Hi,

First, it seems that the number of rows that qualify for the LIKE clause
varies from 70,000 to 1.5 million, so it hard to say of the this number is
smaller than 3% since 3% of 18 million is 540,000.
(just to be clear, I'm talking about the total number of rows the query
returns, not per each col1)

But I will definitly try to change the char columns to varchar, and I will
try to add the index on col5, col1, col3.

Out of curiosity, both you and Chris suggested that the index should be on
col3 as well, where obviously - since this column is summed - sql server
needs to scan all the rows that qualify in order to calculate the sum result.
So why is the index on this column important? Is it because the information
is positioned on the index itself and there will be no need to access the
actual rows of the table (on the disk I mean)? this is what you mean by
"covering index"?

Thanks again,
Roee.

Show quote
"Vern Rabe" wrote:

> The best indexing is almost always "it depends", but I'd try a couple of
> things. If the number of rows that would qualify for you "LIKE 'string%'
> clause is small (3% or less), you would probably benefit from a clustered
> indes on col5, or even a non-clustered index on col5, col1, and col3 (in that
> order) as a covering index. Only you can determine the best via testing.
>
> I would also re-evaluate the char(500) and char(100) datatypes. Char will
> always use the full width, and if much of your data in those two columns are
> smaller that the specified size, you're wasting space. Varchar may
> significantly reduce the table size, and improve query performance.
>
> HTH
> Vern Rabe
>
> "Roee" wrote:
>
> > Hello,
> >
> > I have very big table. I don't know the exact size, but the total size of
> > the DB is ~80 GB and the table has ~18 million rows, its columns types are:
> > int, varchar 500, float, char 500, char 100.
> >
> > I'm trying to do the following query and it takes a very long time:
> > SELECT col1, count(*), sum(col3)
> > FROM table
> > WHERE col5 LIKE 'string%'
> > GROUP BY col1
> >
> > I thought that the most important thing in this kind of query is to build a
> > clustered index on col1 since this is the column that has the group by clause
> > but since the query is very slow I wonder whether I have the right index, or
> > maybe I should build another one on col5 (that has the LIKE cluse).
> >
> > Will an index help when searching for a LIKE 'string%' values?
> > Should I change the clustered index to another columns?
> >
> > I'm using SQL Server 2000.
> >
> > Thanks,
> > Roee.
Author
17 Aug 2006 12:43 AM
Gail Erickson [MS]
> So why is the index on this column important? Is it because the
> information
> is positioned on the index itself and there will be no need to access the
> actual rows of the table (on the disk I mean)? this is what you mean by
> "covering index"?

Yes, that's exactly what a covering index does -- it contains all the
columns needed by query so that the base table or clustered index is never
accessed resulting in fewer disk I/O operations

Regards,
Gail
--
Gail Erickson [MS]
SQL Server Documentation Team
This posting is provided "AS IS" with no warranties, and confers no rights
Download the latest version of Books Online from
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx


Show quote
"Roee" <R***@discussions.microsoft.com> wrote in message
news:E61F2965-DE8B-42E2-8F53-9C7C2029F113@microsoft.com...
> Hi,
>
> First, it seems that the number of rows that qualify for the LIKE clause
> varies from 70,000 to 1.5 million, so it hard to say of the this number is
> smaller than 3% since 3% of 18 million is 540,000.
> (just to be clear, I'm talking about the total number of rows the query
> returns, not per each col1)
>
> But I will definitly try to change the char columns to varchar, and I will
> try to add the index on col5, col1, col3.
>
> Out of curiosity, both you and Chris suggested that the index should be on
> col3 as well, where obviously - since this column is summed - sql server
> needs to scan all the rows that qualify in order to calculate the sum
> result.
> So why is the index on this column important? Is it because the
> information
> is positioned on the index itself and there will be no need to access the
> actual rows of the table (on the disk I mean)? this is what you mean by
> "covering index"?
>
> Thanks again,
> Roee.
>
> "Vern Rabe" wrote:
>
>> The best indexing is almost always "it depends", but I'd try a couple of
>> things. If the number of rows that would qualify for you "LIKE 'string%'
>> clause is small (3% or less), you would probably benefit from a clustered
>> indes on col5, or even a non-clustered index on col5, col1, and col3 (in
>> that
>> order) as a covering index. Only you can determine the best via testing.
>>
>> I would also re-evaluate the char(500) and char(100) datatypes. Char will
>> always use the full width, and if much of your data in those two columns
>> are
>> smaller that the specified size, you're wasting space. Varchar may
>> significantly reduce the table size, and improve query performance.
>>
>> HTH
>> Vern Rabe
>>
>> "Roee" wrote:
>>
>> > Hello,
>> >
>> > I have very big table. I don't know the exact size, but the total size
>> > of
>> > the DB is ~80 GB and the table has ~18 million rows, its columns types
>> > are:
>> > int, varchar 500, float, char 500, char 100.
>> >
>> > I'm trying to do the following query and it takes a very long time:
>> > SELECT col1, count(*), sum(col3)
>> > FROM table
>> > WHERE col5 LIKE 'string%'
>> > GROUP BY col1
>> >
>> > I thought that the most important thing in this kind of query is to
>> > build a
>> > clustered index on col1 since this is the column that has the group by
>> > clause
>> > but since the query is very slow I wonder whether I have the right
>> > index, or
>> > maybe I should build another one on col5 (that has the LIKE cluse).
>> >
>> > Will an index help when searching for a LIKE 'string%' values?
>> > Should I change the clustered index to another columns?
>> >
>> > I'm using SQL Server 2000.
>> >
>> > Thanks,
>> > Roee.
Author
17 Aug 2006 12:44 AM
Arnie Rowland
Yes, you got it.

A 'covering' index is one where all of the values sought are in the index
itself and therefore there is no need to go to the actual data table for any
values.

An INDEX on col5, col1 is a good candidate to be selected by the query
processor since the WHERE clause filters on col5, then groups on col1. By
adding col3 to the mix, there is no need to go to the datatable for any of
the requested values. One stop shopping.

--
Arnie Rowland, Ph.D.
Westwood Consulting, Inc

Most good judgment comes from experience.
Most experience comes from bad judgment.
- Anonymous


Show quote
"Roee" <R***@discussions.microsoft.com> wrote in message
news:E61F2965-DE8B-42E2-8F53-9C7C2029F113@microsoft.com...
> Hi,
>
> First, it seems that the number of rows that qualify for the LIKE clause
> varies from 70,000 to 1.5 million, so it hard to say of the this number is
> smaller than 3% since 3% of 18 million is 540,000.
> (just to be clear, I'm talking about the total number of rows the query
> returns, not per each col1)
>
> But I will definitly try to change the char columns to varchar, and I will
> try to add the index on col5, col1, col3.
>
> Out of curiosity, both you and Chris suggested that the index should be on
> col3 as well, where obviously - since this column is summed - sql server
> needs to scan all the rows that qualify in order to calculate the sum
> result.
> So why is the index on this column important? Is it because the
> information
> is positioned on the index itself and there will be no need to access the
> actual rows of the table (on the disk I mean)? this is what you mean by
> "covering index"?
>
> Thanks again,
> Roee.
>
> "Vern Rabe" wrote:
>
>> The best indexing is almost always "it depends", but I'd try a couple of
>> things. If the number of rows that would qualify for you "LIKE 'string%'
>> clause is small (3% or less), you would probably benefit from a clustered
>> indes on col5, or even a non-clustered index on col5, col1, and col3 (in
>> that
>> order) as a covering index. Only you can determine the best via testing.
>>
>> I would also re-evaluate the char(500) and char(100) datatypes. Char will
>> always use the full width, and if much of your data in those two columns
>> are
>> smaller that the specified size, you're wasting space. Varchar may
>> significantly reduce the table size, and improve query performance.
>>
>> HTH
>> Vern Rabe
>>
>> "Roee" wrote:
>>
>> > Hello,
>> >
>> > I have very big table. I don't know the exact size, but the total size
>> > of
>> > the DB is ~80 GB and the table has ~18 million rows, its columns types
>> > are:
>> > int, varchar 500, float, char 500, char 100.
>> >
>> > I'm trying to do the following query and it takes a very long time:
>> > SELECT col1, count(*), sum(col3)
>> > FROM table
>> > WHERE col5 LIKE 'string%'
>> > GROUP BY col1
>> >
>> > I thought that the most important thing in this kind of query is to
>> > build a
>> > clustered index on col1 since this is the column that has the group by
>> > clause
>> > but since the query is very slow I wonder whether I have the right
>> > index, or
>> > maybe I should build another one on col5 (that has the LIKE cluse).
>> >
>> > Will an index help when searching for a LIKE 'string%' values?
>> > Should I change the clustered index to another columns?
>> >
>> > I'm using SQL Server 2000.
>> >
>> > Thanks,
>> > Roee.
Author
17 Aug 2006 12:56 AM
Vern Rabe
And to add to what Gail and Arnie said, a covering index should be
non-clustered. A clustered index is the data itself and would defeat the
purpose of a covering index.

Vern Rabe

Show quote
"Roee" wrote:

> Hi,
>
> First, it seems that the number of rows that qualify for the LIKE clause
> varies from 70,000 to 1.5 million, so it hard to say of the this number is
> smaller than 3% since 3% of 18 million is 540,000.
> (just to be clear, I'm talking about the total number of rows the query
> returns, not per each col1)
>
> But I will definitly try to change the char columns to varchar, and I will
> try to add the index on col5, col1, col3.
>
> Out of curiosity, both you and Chris suggested that the index should be on
> col3 as well, where obviously - since this column is summed - sql server
> needs to scan all the rows that qualify in order to calculate the sum result.
> So why is the index on this column important? Is it because the information
> is positioned on the index itself and there will be no need to access the
> actual rows of the table (on the disk I mean)? this is what you mean by
> "covering index"?
>
> Thanks again,
> Roee.
>
> "Vern Rabe" wrote:
>
> > The best indexing is almost always "it depends", but I'd try a couple of
> > things. If the number of rows that would qualify for you "LIKE 'string%'
> > clause is small (3% or less), you would probably benefit from a clustered
> > indes on col5, or even a non-clustered index on col5, col1, and col3 (in that
> > order) as a covering index. Only you can determine the best via testing.
> >
> > I would also re-evaluate the char(500) and char(100) datatypes. Char will
> > always use the full width, and if much of your data in those two columns are
> > smaller that the specified size, you're wasting space. Varchar may
> > significantly reduce the table size, and improve query performance.
> >
> > HTH
> > Vern Rabe
> >
> > "Roee" wrote:
> >
> > > Hello,
> > >
> > > I have very big table. I don't know the exact size, but the total size of
> > > the DB is ~80 GB and the table has ~18 million rows, its columns types are:
> > > int, varchar 500, float, char 500, char 100.
> > >
> > > I'm trying to do the following query and it takes a very long time:
> > > SELECT col1, count(*), sum(col3)
> > > FROM table
> > > WHERE col5 LIKE 'string%'
> > > GROUP BY col1
> > >
> > > I thought that the most important thing in this kind of query is to build a
> > > clustered index on col1 since this is the column that has the group by clause
> > > but since the query is very slow I wonder whether I have the right index, or
> > > maybe I should build another one on col5 (that has the LIKE cluse).
> > >
> > > Will an index help when searching for a LIKE 'string%' values?
> > > Should I change the clustered index to another columns?
> > >
> > > I'm using SQL Server 2000.
> > >
> > > Thanks,
> > > Roee.
Author
17 Aug 2006 1:05 AM
Roee
Thanks everybody.

I'm working on it now, will probably take a few hours (if not an entire
day...) to change all the columns (there are actually 10 columns that needs
to be altered since there are 5 tables of that kind - can't put all the data
in one table, too big), and then build the indices.

Hope for good results...
Show quote
:)

"Vern Rabe" wrote:

> And to add to what Gail and Arnie said, a covering index should be
> non-clustered. A clustered index is the data itself and would defeat the
> purpose of a covering index.
>
> Vern Rabe
>
> "Roee" wrote:
>
> > Hi,
> >
> > First, it seems that the number of rows that qualify for the LIKE clause
> > varies from 70,000 to 1.5 million, so it hard to say of the this number is
> > smaller than 3% since 3% of 18 million is 540,000.
> > (just to be clear, I'm talking about the total number of rows the query
> > returns, not per each col1)
> >
> > But I will definitly try to change the char columns to varchar, and I will
> > try to add the index on col5, col1, col3.
> >
> > Out of curiosity, both you and Chris suggested that the index should be on
> > col3 as well, where obviously - since this column is summed - sql server
> > needs to scan all the rows that qualify in order to calculate the sum result.
> > So why is the index on this column important? Is it because the information
> > is positioned on the index itself and there will be no need to access the
> > actual rows of the table (on the disk I mean)? this is what you mean by
> > "covering index"?
> >
> > Thanks again,
> > Roee.
> >
> > "Vern Rabe" wrote:
> >
> > > The best indexing is almost always "it depends", but I'd try a couple of
> > > things. If the number of rows that would qualify for you "LIKE 'string%'
> > > clause is small (3% or less), you would probably benefit from a clustered
> > > indes on col5, or even a non-clustered index on col5, col1, and col3 (in that
> > > order) as a covering index. Only you can determine the best via testing.
> > >
> > > I would also re-evaluate the char(500) and char(100) datatypes. Char will
> > > always use the full width, and if much of your data in those two columns are
> > > smaller that the specified size, you're wasting space. Varchar may
> > > significantly reduce the table size, and improve query performance.
> > >
> > > HTH
> > > Vern Rabe
> > >
> > > "Roee" wrote:
> > >
> > > > Hello,
> > > >
> > > > I have very big table. I don't know the exact size, but the total size of
> > > > the DB is ~80 GB and the table has ~18 million rows, its columns types are:
> > > > int, varchar 500, float, char 500, char 100.
> > > >
> > > > I'm trying to do the following query and it takes a very long time:
> > > > SELECT col1, count(*), sum(col3)
> > > > FROM table
> > > > WHERE col5 LIKE 'string%'
> > > > GROUP BY col1
> > > >
> > > > I thought that the most important thing in this kind of query is to build a
> > > > clustered index on col1 since this is the column that has the group by clause
> > > > but since the query is very slow I wonder whether I have the right index, or
> > > > maybe I should build another one on col5 (that has the LIKE cluse).
> > > >
> > > > Will an index help when searching for a LIKE 'string%' values?
> > > > Should I change the clustered index to another columns?
> > > >
> > > > I'm using SQL Server 2000.
> > > >
> > > > Thanks,
> > > > Roee.

AddThis Social Bookmark Button