Home All Groups Group Topic Archive Search About
Author
18 Aug 2005 4:34 PM
Mike Labosh
I am sure many of you folks are already subscribed, but for those who are
not, here's a story of some DBA job interviews of candidates that should
just be summarily shot:

(This will wrap)

http://www.sqlservercentral.com/columnists/sMcCown/howdoyouspellsql_printversion.asp

--
Peace & happy computing,

Mike Labosh, MCSD
"Musha ring dum a doo dum a da!" -- James Hetfield

Author
18 Aug 2005 5:17 PM
Mikito Harakiri
Another source for quote of the week.

Although the author is unfair emphasizing that DBA has to be aware char
vs varchar2 difference. The character string types in SQL are just
plain silly. For how long ordinary programming languages have [a
single] type String [with unlimited bounds] already? Also when did you
design a database schema with char datatype last time? In a word, there
is not a single advantage of char over varchar2. Therefore, char is not
even worth mentioned, and should just be deprecated: there are to many
really important things on DBA plate.

Mike Labosh wrote:
Show quote
> I am sure many of you folks are already subscribed, but for those who are
> not, here's a story of some DBA job interviews of candidates that should
> just be summarily shot:
>
> (This will wrap)
>
> http://www.sqlservercentral.com/columnists/sMcCown/howdoyouspellsql_printversion.asp
>
Author
18 Aug 2005 5:33 PM
JT
If this was intended to be a humorous quote from another faux DBA, then
thanks for the laugh! If that was your own opinion (and I'm assuming not),
then do society a favor by quiting your job and living off unemployment.
;-)

Show quote
"Mikito Harakiri" <mikharakiri_nosp***@yahoo.com> wrote in message
news:1124385453.899643.54220@f14g2000cwb.googlegroups.com...
> Another source for quote of the week.
>
> Although the author is unfair emphasizing that DBA has to be aware char
> vs varchar2 difference. The character string types in SQL are just
> plain silly. For how long ordinary programming languages have [a
> single] type String [with unlimited bounds] already? Also when did you
> design a database schema with char datatype last time? In a word, there
> is not a single advantage of char over varchar2. Therefore, char is not
> even worth mentioned, and should just be deprecated: there are to many
> really important things on DBA plate.
>
> Mike Labosh wrote:
>> I am sure many of you folks are already subscribed, but for those who are
>> not, here's a story of some DBA job interviews of candidates that should
>> just be summarily shot:
>>
>> (This will wrap)
>>
>> http://www.sqlservercentral.com/columnists/sMcCown/howdoyouspellsql_printversion.asp
>>
>
Author
18 Aug 2005 6:14 PM
Mikito Harakiri
No, that was my opinion. Do you imply I'm not qualified to be a DBA?
Well, I would be ashamed to be called a master of extents and segment
management.

JT wrote:
Show quote
> If this was intended to be a humorous quote from another faux DBA, then
> thanks for the laugh! If that was your own opinion (and I'm assuming not),
> then do society a favor by quiting your job and living off unemployment.
> ;-)
>
> "Mikito Harakiri" <mikharakiri_nosp***@yahoo.com> wrote in message
> news:1124385453.899643.54220@f14g2000cwb.googlegroups.com...
> > Another source for quote of the week.
> >
> > Although the author is unfair emphasizing that DBA has to be aware char
> > vs varchar2 difference. The character string types in SQL are just
> > plain silly. For how long ordinary programming languages have [a
> > single] type String [with unlimited bounds] already? Also when did you
> > design a database schema with char datatype last time? In a word, there
> > is not a single advantage of char over varchar2. Therefore, char is not
> > even worth mentioned, and should just be deprecated: there are to many
> > really important things on DBA plate.
> >
> > Mike Labosh wrote:
> >> I am sure many of you folks are already subscribed, but for those who are
> >> not, here's a story of some DBA job interviews of candidates that should
> >> just be summarily shot:
> >>
> >> (This will wrap)
> >>
> >> http://www.sqlservercentral.com/columnists/sMcCown/howdoyouspellsql_printversion.asp
> >>
> >
Author
18 Aug 2005 6:21 PM
vc
Mikito Harakiri wrote:
> No, that was my opinion. Do you imply I'm not qualified to be a DBA?
> Well, I would be ashamed to be called a master of extents and segment
> management.
>

What's so shameful about being called a DBA ?

Show quote
> >
Author
18 Aug 2005 7:31 PM
JT
You smell like someone who spends most of their time lurking around the
halls of a university or research lab rather than developing database models
for real production systems.

Show quote
"Mikito Harakiri" <mikharakiri_nosp***@yahoo.com> wrote in message
news:1124388849.145863.189590@g47g2000cwa.googlegroups.com...
> No, that was my opinion. Do you imply I'm not qualified to be a DBA?
> Well, I would be ashamed to be called a master of extents and segment
> management.
>
> JT wrote:
>> If this was intended to be a humorous quote from another faux DBA, then
>> thanks for the laugh! If that was your own opinion (and I'm assuming
>> not),
>> then do society a favor by quiting your job and living off unemployment.
>> ;-)
>>
>> "Mikito Harakiri" <mikharakiri_nosp***@yahoo.com> wrote in message
>> news:1124385453.899643.54220@f14g2000cwb.googlegroups.com...
>> > Another source for quote of the week.
>> >
>> > Although the author is unfair emphasizing that DBA has to be aware char
>> > vs varchar2 difference. The character string types in SQL are just
>> > plain silly. For how long ordinary programming languages have [a
>> > single] type String [with unlimited bounds] already? Also when did you
>> > design a database schema with char datatype last time? In a word, there
>> > is not a single advantage of char over varchar2. Therefore, char is not
>> > even worth mentioned, and should just be deprecated: there are to many
>> > really important things on DBA plate.
>> >
>> > Mike Labosh wrote:
>> >> I am sure many of you folks are already subscribed, but for those who
>> >> are
>> >> not, here's a story of some DBA job interviews of candidates that
>> >> should
>> >> just be summarily shot:
>> >>
>> >> (This will wrap)
>> >>
>> >> http://www.sqlservercentral.com/columnists/sMcCown/howdoyouspellsql_printversion.asp
>> >>
>> >
>
Author
18 Aug 2005 8:03 PM
AK
>> I would be ashamed to be called a master of extents and segment
management. <<

what is so shameful in it?
Author
18 Aug 2005 6:23 PM
Razvan Socol
>  there is not a single advantage of char over varchar2

"varchar2" ? You seem to live in an Oracle world, but here we are MS
SQL Server people. And in SQL Server there is an advantage for char
over varchar: char(n) has takes two bytes less to store than
varchar(n), if the string has always n characters.

Razvan
Author
18 Aug 2005 6:44 PM
Mikito Harakiri
Razvan Socol wrote:
> >  there is not a single advantage of char over varchar2
>
> "varchar2" ? You seem to live in an Oracle world, but here we are MS
> SQL Server people. And in SQL Server there is an advantage for char
> over varchar: char(n) has takes two bytes less to store than
> varchar(n), if the string has always n characters.
>
> Razvan

Hmm... With todays terabytes of disc memory, are 2 extra bytes really
that important? If storage size is really critical, then maybe column
compression might help?
Author
18 Aug 2005 6:55 PM
Stu
Not that I totally disagree with the concept of deprecating char vs
varchar, but I see this argument used a lot, and there's two flaws with
it.

1.  Storage size may be cheap and plentiful, but performance should
always be foremost in the DBA mind.  2 bytes in a single column pf
storage may not be much, but you also need to write queries that
retrieve that extra 2 bytes.  It's not just disk space; it's also
memory and CPU.

2.  Failing to appreciate the differences between varchar and char sets
us on a path of lazy design.  If it doesn't matter if I use char(10) vs
varchar(10), then what's the matter with varchar(50)?  How about
varchar(51) etc?  Eventually we could get away with "oh screw it; disks
are cheap, memory is cheap, CPU's are powerful, let's just put
everything in an text column and parse it on the fly".

I realize that what you are saying is a long way from suggesting that
we just throw away relational design altogether, but it's the small
steps that lead to the bigger leaps that lead to the cliffs.

Just my .02 
Stu
Author
18 Aug 2005 8:17 PM
Mikito Harakiri
Stu wrote:
> Not that I totally disagree with the concept of deprecating char vs
> varchar, but I see this argument used a lot, and there's two flaws with
> it.
>
> 1.  Storage size may be cheap and plentiful, but performance should
> always be foremost in the DBA mind.  2 bytes in a single column pf
> storage may not be much, but you also need to write queries that
> retrieve that extra 2 bytes.  It's not just disk space; it's also
> memory and CPU.

How much performance difference does it really make? Keep in mind that
in a typical table you can declare char a couple of boolean (Y/N)
columns at most.

> 2.  Failing to appreciate the differences between varchar and char sets
> us on a path of lazy design.  If it doesn't matter if I use char(10) vs
> varchar(10), then what's the matter with varchar(50)?  How about
> varchar(51) etc?  Eventually we could get away with "oh screw it; disks
> are cheap, memory is cheap, CPU's are powerful, let's just put
> everything in an text column and parse it on the fly".

Failing to appreciate the differences between varchar and char sets is
an ability to raise the level of abstraction. (BTW, the skill many DBAs
lack.) With low level of abstraction you would never quit chaising
perceived problems (eg. trying to figure out the "optimal" block size,
pondering if

select * from table

is faster than

select col1, col2, ... from table

etc.)
Author
18 Aug 2005 8:22 PM
AK
>> > storage may not be much, but you also need to write queries that
> retrieve that extra 2 bytes.  It's not just disk space; it's also
> memory and CPU.



How much performance difference does it really make?
<<

beleive me or not, in some cases it's 100% or more
Author
18 Aug 2005 8:28 PM
Mikito Harakiri
AK wrote:
> >> > storage may not be much, but you also need to write queries that
> > retrieve that extra 2 bytes.  It's not just disk space; it's also
> > memory and CPU.
>
>
>
> How much performance difference does it really make?
> <<
>
> beleive me or not, in some cases it's 100% or more

Do you have a bencmark to prove it?
Author
19 Aug 2005 10:03 AM
Roy Hann
"AK" <AK_TIREDOFSPAM@hotmail.COM> wrote in message
news:1124396559.557686.77080@g43g2000cwa.googlegroups.com...
> >> > storage may not be much, but you also need to write queries that
> > retrieve that extra 2 bytes.  It's not just disk space; it's also
> > memory and CPU.

> How much performance difference does it really make?
> <<
>
> beleive me or not, in some cases it's 100% or more

I simply don't believe you.  I don't believe the difference can be 100%.  I
certainly don't believe it can be more.  And I don't believe you can even
measure what difference there is.  And even if I did believe all that, I
don't believe it is anything but a bug in your software.

So I 100% don't believe you.  However, I am willing to be persuaded by
proof.

Roy
Author
22 Aug 2005 1:22 PM
AK
>>
> beleive me or not, in some cases it's 100% or more



I simply don't believe you.  I don't believe the difference can be
100%.  I
certainly don't believe it can be more.  And I don't believe you can
even
measure what difference there is.  And even if I did believe all that,
I
don't believe it is anything but a bug in your software.

So I 100% don't believe you.  However, I am willing to be persuaded by
proof.
<<

Vow, I missed such a statement.
Well, imagine a table with rows that are almost 4K each. That's 2 rows
per an 8K page. Add some additional width, and you end up with just one
row per data page, a 100% increase in table size. Believe me or not,
that's not a bug in our software, that's the way SQL Server stores
data.
On one hand, moving from 2K pages in 65 to 8K pages in 70 in 200 is a
big improvement. On the other hand, in Oracle we can choose page size
(blocksize in Oracle universe)
Author
18 Aug 2005 8:50 PM
Hugo Kornelis
On 18 Aug 2005 13:17:37 -0700, Mikito Harakiri wrote:

Show quote
>Stu wrote:
>> Not that I totally disagree with the concept of deprecating char vs
>> varchar, but I see this argument used a lot, and there's two flaws with
>> it.
>>
>> 1.  Storage size may be cheap and plentiful, but performance should
>> always be foremost in the DBA mind.  2 bytes in a single column pf
>> storage may not be much, but you also need to write queries that
>> retrieve that extra 2 bytes.  It's not just disk space; it's also
>> memory and CPU.
>
>How much performance difference does it really make? Keep in mind that
>in a typical table you can declare char a couple of boolean (Y/N)
>columns at most.

Hi Mikito,

The extra performance cost of declaring Y/N columns varchar instead of
char is trivial compared to the extra cost of doing so for foreign key
columns.

Most databases have lots of so-called "lookup tables". In the main
table, the state is stored as a 2-letter code, country as a 3-letter ISO
code and currency as another 3-letter ISO code. These codes are foreign
keys into the States, Countries and Currencies tables, that have the
primary key 2- or 3-letter code, a unique long name/description and
possibly some other columns as well.

All these 2- and 3-letter codes will be in indexes, and these indexes
will be heavily used during inserts and updates (to verify the foreign
key constraint) and in queries (because users generally prefer to see
the full name of the state/country/currency instead of the code).

Using varchar for a 2-letter code means that the space taken is doubled.
This halves the number of rows that fit on one leaf page of the index.
The result will be: more logical reads, lower cache hit ration, more
physical reads --> slower performance.


>> 2.  Failing to appreciate the differences between varchar and char sets
>> us on a path of lazy design.  If it doesn't matter if I use char(10) vs
>> varchar(10), then what's the matter with varchar(50)?  How about
>> varchar(51) etc?  Eventually we could get away with "oh screw it; disks
>> are cheap, memory is cheap, CPU's are powerful, let's just put
>> everything in an text column and parse it on the fly".
>
>Failing to appreciate the differences between varchar and char sets is
>an ability to raise the level of abstraction. (BTW, the skill many DBAs
>lack.)

It's not a prime skill for DBAs. The abstraction level where the actual
data type is irrelevant is where you'll find the information analyst,
functional designer, conceptual modeler or whatever the current name of
those people is. It's not the DBA's task to check if they did their job
properly. The DBA should ensure that the database runs smoothly, that as
blocking and deadlocks are minimized and that queries perform as fast as
possible. Especially the latter is impossible to do without appreciating
the difference between char, nchar, varchar, nvarchar, text, and ntext.


> With low level of abstraction you would never quit chaising
>perceived problems (eg. trying to figure out the "optimal" block size,

No SQL Server DBA will ever try tio figure out a block size. (Remember
that this thread started in a SQL Server group and was crossposted to
..theory later!)

>pondering if
>
>select * from table
>
>is faster than
>
>select col1, col2, ... from table

No need to ponder that -- all SQL Server DBAs (and presumably all DBAs
for all serious RDBMS's) know that SELECT * should never be used in
production code (except in a EXISTS(..) subquery).



Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
18 Aug 2005 9:17 PM
Mikito Harakiri
Hugo Kornelis wrote:
Show quote
> On 18 Aug 2005 13:17:37 -0700, Mikito Harakiri wrote:
>
> >Stu wrote:
> >> Not that I totally disagree with the concept of deprecating char vs
> >> varchar, but I see this argument used a lot, and there's two flaws with
> >> it.
> >>
> >> 1.  Storage size may be cheap and plentiful, but performance should
> >> always be foremost in the DBA mind.  2 bytes in a single column pf
> >> storage may not be much, but you also need to write queries that
> >> retrieve that extra 2 bytes.  It's not just disk space; it's also
> >> memory and CPU.
> >
> >How much performance difference does it really make? Keep in mind that
> >in a typical table you can declare char a couple of boolean (Y/N)
> >columns at most.
>
> Hi Mikito,
>
> The extra performance cost of declaring Y/N columns varchar instead of
> char is trivial compared to the extra cost of doing so for foreign key
> columns.
>
> Most databases have lots of so-called "lookup tables". In the main
> table, the state is stored as a 2-letter code, country as a 3-letter ISO
> code and currency as another 3-letter ISO code. These codes are foreign
> keys into the States, Countries and Currencies tables, that have the
> primary key 2- or 3-letter code, a unique long name/description and
> possibly some other columns as well.
>
> All these 2- and 3-letter codes will be in indexes, and these indexes
> will be heavily used during inserts and updates (to verify the foreign
> key constraint) and in queries (because users generally prefer to see
> the full name of the state/country/currency instead of the code).
>
> Using varchar for a 2-letter code means that the space taken is doubled.
> This halves the number of rows that fit on one leaf page of the index.
> The result will be: more logical reads, lower cache hit ration, more
> physical reads --> slower performance.

Now, we are talking! Verifying the foreign key constraint requires a
unique index scan. If the index grows in size by factor of two, the
number of levels might go up by 1, but typically would stay the same.
In short, verifying a foreign key constraint would be 2  or 3 logical
reads, in either case.

Normally, upper levels of index are cached. I guess for country codes,
the whole index is cached, so you are correct about country code PK
index taking twice the space. Let's see 200 countries multipled by 2
bytes. 800 bytes versus 400 bytes!

Show quote
> >> 2.  Failing to appreciate the differences between varchar and char sets
> >> us on a path of lazy design.  If it doesn't matter if I use char(10) vs
> >> varchar(10), then what's the matter with varchar(50)?  How about
> >> varchar(51) etc?  Eventually we could get away with "oh screw it; disks
> >> are cheap, memory is cheap, CPU's are powerful, let's just put
> >> everything in an text column and parse it on the fly".
> >
> >Failing to appreciate the differences between varchar and char sets is
> >an ability to raise the level of abstraction. (BTW, the skill many DBAs
> >lack.)
>
> It's not a prime skill for DBAs. The abstraction level where the actual
> data type is irrelevant is where you'll find the information analyst,
> functional designer, conceptual modeler or whatever the current name of
> those people is. It's not the DBA's task to check if they did their job
> properly. The DBA should ensure that the database runs smoothly, that as
> blocking and deadlocks are minimized and that queries perform as fast as
> possible. Especially the latter is impossible to do without appreciating
> the difference between char, nchar, varchar, nvarchar, text, and ntext.

Well, operating a machine indeed doesn't require abstract thinking. I
never said it does. I'm just implying that without critical analysis of
what you are doing, you would never be able to quit fighting the mess
of real world.

BTW, in my (admittedly very limited) SQL tuning experience, I have yet
to see the case there the size of a datatype make any difference.
Author
18 Aug 2005 9:50 PM
Hugo Kornelis
On 18 Aug 2005 14:17:54 -0700, Mikito Harakiri wrote:

(snip quoteback)
>Now, we are talking! Verifying the foreign key constraint requires a
>unique index scan. If the index grows in size by factor of two, the
>number of levels might go up by 1, but typically would stay the same.
>In short, verifying a foreign key constraint would be 2  or 3 logical
>reads, in either case.
>
>Normally, upper levels of index are cached. I guess for country codes,
>the whole index is cached, so you are correct about country code PK
>index taking twice the space. Let's see 200 countries multipled by 2
>bytes. 800 bytes versus 400 bytes!

Hi Mikito,

Ah, cynicism - I love it.

I chose these examples because they are well known by everyone. My bad;
I should have chosen more convincing (though less well-known) examples.

Many companies use a short mnemonic code for their customers. The use of
a short mnemonic code for products is not uncommon either. Imagine a
customers table with 10,000 customers, a products table with 5,000
products and an orders table with a few million rows. Now would you
prefer char(6) or varchar(6) for CustomerCode? And char(5) or varchar(5)
for ProductCode?

Another example: ticker symbols. Take a look at http://finance.yahoo.com
To be able to generate all the graphs they offer, they have to have
quite a few rows of historic quotes in their DB. How would you rate the
overhead of char(5) vs varchar(5) for ticker symbol in a row that only
has three columns: ticker, date/time and quote.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
18 Aug 2005 10:02 PM
Mikito Harakiri
Hugo Kornelis wrote:
> Many companies use a short mnemonic code for their customers. The use of
> a short mnemonic code for products is not uncommon either. Imagine a
> customers table with 10,000 customers, a products table with 5,000
> products and an orders table with a few million rows. Now would you
> prefer char(6) or varchar(6) for CustomerCode? And char(5) or varchar(5)
> for ProductCode?

But if lookup table cardinality goes up, then, the storage factor
char(n) vs varchar(n) goes down! Perhaps, you can convince me that the
effect of the two trailing bytes is not miniscule, as I previously
thought, but it just can't be significant. Any benchmark demonstrating
that the performance degradation is not in single percentage digits is
welcome.
Author
18 Aug 2005 10:14 PM
-CELKO-
>> Any benchmark demonstrating that the performance degradation is not in single percentage digits is welcome. <<

I am more worried about various numbers of extra blanks in the
VARCHAR(n) making display problems.  Trimming and padding are costly.
Author
19 Aug 2005 2:49 PM
JT
Perhaps I'm misreading your point, but it's data in a Char column that
typically contains trailing blanks that need a Trim. A VarChar won't contain
extraneous blanks unless the developer explicitly includes them in the
value.

Show quote
"-CELKO-" <jcelko***@earthlink.net> wrote in message
news:1124403285.842828.168210@g43g2000cwa.googlegroups.com...
>>> Any benchmark demonstrating that the performance degradation is not in
>>> single percentage digits is welcome. <<
>
> I am more worried about various numbers of extra blanks in the
> VARCHAR(n) making display problems.  Trimming and padding are costly.
>
Author
19 Aug 2005 9:37 PM
-CELKO-
>> A VARCHAR(n)  won't contain extraneous blanks unless the developer explicitly includes them in the value. <<

Not the developer, the data entry person who is probably an end user
these days.  The DB guy's job is to add a constaint like
"CHECK (TRIM (BOTH  foobar) = UPPER (foobar))" to the column on his
side.  Now how often have you seen Newbies go to that trouble?

Instead you see a lot of VARCHAR(50)  -- notice the magic number fifty
from ACCESS programmers -- columns which will eventually collect
garbage, like 40 spaces and a period.  Even worse,  NVARCHAR(50) whcih
collect garbage in Chinese!
Author
22 Aug 2005 7:29 PM
Mike Labosh
> Not the developer, the data entry person who is probably an end user
> these days.  The DB guy's job is to add a constaint like
> "CHECK (TRIM (BOTH  foobar) = UPPER (foobar))" to the column on his
> side.  Now how often have you seen Newbies go to that trouble?

Quite.

Last week I discovered a record in our "City" table called "A** HOLE" linked
to a "CompanyLocation" record called "F*** YOU" and had a contact at that
location named after some 1970's British speed-punk band, who was in charge
of IBM mainframe products like "EAT SH**"

I have also come across contacts with names like "[TYPE CONTACT NAME HERE]"
and "#Error" (That's the funny one.  #Error is what you get when you're in
MS Access and there's a data binding issue)

Working here is like living inside a Salvador Dali or MC Escher painting.

--
Peace & happy computing,

Mike Labosh, MCSD
"Musha ring dum a doo dum a da!" -- James Hetfield
Author
22 Aug 2005 7:46 PM
--CELKO--
My favorite was a Data Warehousing project migration with the part
number "I hate my job" repeated 9000 or so times.  Jack Nicholson does
data entry!!

>> Working here is like living inside a Salvador Dali or MC Escher painting. <<

Nah!   Escher and Dali had a pattern to their works -- projective
geometery and math for Escher, Freudian Psychology for Dali. we are in
Hieronymus Bosch! Crude, brutal views of Hell from the Middle Ages
mixed with stupidity.
Author
22 Aug 2005 8:31 PM
Mike Labosh
> Nah!   Escher and Dali had a pattern to their works -- projective
> geometery and math for Escher, Freudian Psychology for Dali. we are in
> Hieronymus Bosch! Crude, brutal views of Hell from the Middle Ages
> mixed with stupidity.

I had never heard of this Hieronymus Bosch person so I Googled the name.  He
is definately in need of some psychological help  :)

--
Peace & happy computing,

Mike Labosh, MCSD
"Musha ring dum a doo dum a da!" -- James Hetfield
Author
22 Aug 2005 11:44 PM
Mike Hodgson
Mike Labosh wrote:

>>Nah!   Escher and Dali had a pattern to their works -- projective
>>geometery and math for Escher, Freudian Psychology for Dali. we are in
>>Hieronymus Bosch! Crude, brutal views of Hell from the Middle Ages
>>mixed with stupidity.
>>   
>>
>
>I had never heard of this Hieronymus Bosch person so I Googled the name.  He
>is definately in need of some psychological help  :)

>
I think a 490 year old corpse is a little beyond help at this stage (but
I agree that in his time he could have done with a good psychiatrist or 10).

--
*mike hodgson*
blog: http://sqlnerd.blogspot.com
Author
6 Sep 2005 1:32 AM
Darius
> Instead you see a lot of VARCHAR(50)  -- notice the magic number fifty
> from ACCESS programmers -- columns which will eventually collect
> garbage, like 40 spaces and a period.  Even worse,  NVARCHAR(50) whcih
> collect garbage in Chinese!

Would you prefer 42 or some other magic number for data of undeterminable
length?  And how does the data type prevent users from entering garbage data?

I've lost count of the number of times I've seen big corporate databases
designed by those who hate Access with wierd problems, such as the assumption
that a surname will always be less than an arbitrary number which forces many
names to be entered as abbrieviations, or odd constructs such as 'Address
Line 1', 'Address Line 2'... (confusing formatting with data and forever
obstructing the possibility of analysis by region), or the assumption that
all phone numbers must have a 2 digit area code and will never be mobiile or
international numbers.  Heck, there are people out there who think phone
numbers are INTEGERS instead of STRINGS.

Of course, it's not easy - say if you allow 50 for a locality, the longest
possible placename is at least 3 times that (though probably not used in most
cases).  From the business perspective, you can't have the attitude that "n%
of customers are lost to our business because their data won't fit in our
database" (think the infamous example of the database designer who decided
surnames must be at least 3 characters long and therefore prevented people
with single letter surnames from obtaining drivers licenses).

All of us need to be learning and a little more humble, methinks (and
looking at some of the knowlege floating around in this group, that goes
double for me).
Author
6 Sep 2005 1:04 PM
AK
>From the business perspective, you can't have the attitude >that "n%
>of customers are lost to our business because their data won't fit >in our
>database" (think the infamous example of the database designer >who decided
>surnames must be at least 3 characters long and therefore >prevented people
>with single letter surnames from obtaining drivers licenses).

>All of us need to be learning and a little more humble, methinks
>(and
>looking at some of the knowlege floating around in this group, that >goes
>double for me).

Bravo!
Author
19 Aug 2005 10:43 PM
Hugo Kornelis
On Fri, 19 Aug 2005 10:49:59 -0400, JT wrote:

>Perhaps I'm misreading your point, but it's data in a Char column that
>typically contains trailing blanks that need a Trim. A VarChar won't contain
>extraneous blanks unless the developer explicitly includes them in the
>value.

Hi JT,

I think that Joe refers to the ANSI comparison rules for char and
varchar:

* Comparing two char (assume same length): Do character by character
compare.

* Comparing two varchar: First, find if actual length (not maximum
length!) is different. Then, pad shortest string with spaces until it's
just as long as the longest string. Finally, do character by character
compare of longest string to padded version of shortest string.

For comparison of char with different length, padding is also required,
but this padding is independent of the actual data; it's the same for
each row and the amount of padding can be determined in the compile
phase. (But in joins, you should not compare two different-length char
columns anyway!)

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
18 Aug 2005 10:25 PM
Colin Dawson
Show quote
"Mikito Harakiri" <mikharakiri_nosp***@yahoo.com> wrote in message
news:1124402576.470238.211960@o13g2000cwo.googlegroups.com...
> Hugo Kornelis wrote:
>> Many companies use a short mnemonic code for their customers. The use of
>> a short mnemonic code for products is not uncommon either. Imagine a
>> customers table with 10,000 customers, a products table with 5,000
>> products and an orders table with a few million rows. Now would you
>> prefer char(6) or varchar(6) for CustomerCode? And char(5) or varchar(5)
>> for ProductCode?
>
> But if lookup table cardinality goes up, then, the storage factor
> char(n) vs varchar(n) goes down! Perhaps, you can convince me that the
> effect of the two trailing bytes is not miniscule, as I previously
> thought, but it just can't be significant. Any benchmark demonstrating
> that the performance degradation is not in single percentage digits is
> welcome.
>

Surely, when developing a database application and performance tuning it
sqeezing every single performance point out of the database is worthwhile.
Everytime that you make a decision which costs a percent of two of
performance you degrade your database performance.  Given enough of these
compomises the database performance will drop significatly, I've seen this
happen for real on databases with hundreds of thousands of rows, the
performance degraded so much that the application became unusable.   A few
minor tweeks here and there and the program was running faster than it had
ever run before.

It may seem that it's a very academic kind of argument but in real terms the
more pedantic and performance oriented the DBA is the better the final
product will function.

Yes, you can always throw more kit at a problem, but when you can litterally
flick a switch or two and boost the performance it's worth doing.  To do
otherwise could be contrude as neglegence!

Regards

Colin Dawson
www.cjdawson.com
Author
19 Aug 2005 10:16 AM
Paul
Mikito Harakiri wrote:
>>Many companies use a short mnemonic code for their customers. The use of
>>a short mnemonic code for products is not uncommon either. Imagine a
>>customers table with 10,000 customers, a products table with 5,000
>>products and an orders table with a few million rows. Now would you
>>prefer char(6) or varchar(6) for CustomerCode? And char(5) or varchar(5)
>>for ProductCode?
>
> But if lookup table cardinality goes up, then, the storage factor
> char(n) vs varchar(n) goes down! Perhaps, you can convince me that the
> effect of the two trailing bytes is not miniscule, as I previously
> thought, but it just can't be significant. Any benchmark demonstrating
> that the performance degradation is not in single percentage digits is
> welcome.

I guess the char() is really just being used as a hint to the DBMS so it
can make an informed decision of what physical data structures to use.

Maybe a better solution would be to have a single type ("string" or
whatever) but then have a check constraint like len(column) < 6.

So you could regard the char datatype as shorthand for a varchar
datatype with a check constraint. The two ways of looking at it are
functionally identical. I think in practice certain DBMSs will
internally store a varchar(n) as a char(n) for sufficiently low values
of n anyway.

In theory, check constraints should be helpful to the DBMS, both for
queries and for deciding physical storage structures.

So there are three possibilities:
1) have char(n) and varchar(n) datatypes
2) scrap char(n) and just have varchar(n). The DBMS can decide to use
different internal storage methods if n is small enough.
3) have varchar with no maximum length specified. Now if you forget to
specify a maximum length via a check constraint, the DBMS isn't able to
optimize things so well.

You might have a similar argument with tinyint, smallint, int, bigint types.

tinyint is really just an int column with a check constaint of
0 <= column <= 255

The question is: should the length constraint be part of the type or
part of the database? Does it matter even?

Paul.

Paul.
Author
19 Aug 2005 10:32 PM
Hugo Kornelis
On 18 Aug 2005 15:02:56 -0700, Mikito Harakiri wrote:

Show quote
>Hugo Kornelis wrote:
>> Many companies use a short mnemonic code for their customers. The use of
>> a short mnemonic code for products is not uncommon either. Imagine a
>> customers table with 10,000 customers, a products table with 5,000
>> products and an orders table with a few million rows. Now would you
>> prefer char(6) or varchar(6) for CustomerCode? And char(5) or varchar(5)
>> for ProductCode?
>
>But if lookup table cardinality goes up, then, the storage factor
>char(n) vs varchar(n) goes down! Perhaps, you can convince me that the
>effect of the two trailing bytes is not miniscule, as I previously
>thought, but it just can't be significant. Any benchmark demonstrating
>that the performance degradation is not in single percentage digits is
>welcome.

Hi Mikito,

Below is the script I used to compare performance. I compared three
tasks:

1. Performing lots of single inserts (speed of inserts matters in
databases that have to process thousands of new rows per second)
2. Adding an index (not an everyday task, but there are situations where
a process can be sped up by adding an index, doing the process, then
dropping the index again)
3. Generating a typical report that involves some joins (the CASE
expressions I used may look silly, but imagine that I am using a payment
status instead - also note that I store the results in a temporary table
first, to make sure that network speed won't influence the results)

You'll find the full code further below. But first the results:

1. For the inserts, the elapsed time on my machine was 118156 ms with
char, 124406 ms with varchar. A degradation of over 5 %.

2. Adding the index took 2840 ms with char, 3236 ms with varchar. A
performance degradation of almost 14 %.

3a. Generating the report with the index created in step 2 took 224 ms
on average for char; 234 ms on average for varchar (I used three
consecutive executions for both tests, each time starting with a clean
cache). Performance degradation: 4.5 %.
3b. Generating the same report after dropping the supporting index
showed an even bigger difference: avg 2281 for char; avg 3250 for
varchar. Degradation: over 40 %.

1a. To top it off, I repeated the insertion test, but this time AFTER
creating the extra index. For char, the time taken was 227936 ms; for
varchar 313110 ms. Performance degradation: 37 %.

So here's your choice: either you use the index, degrade the queries by
"only" 4.5% but at the cost of slowing inserts down by 37%, or you
discard the index to reverse these figures (inserts slowed down 5%;
queries slowed down 40%).

Or you choose char instead of varchar when you know that the length is
more or less fixed. :-)


Here's the code I used for this benchmark:

-- Set up the tables
CREATE TABLE Clients
        (ClientID char(6) NOT NULL,
         ClientName varchar(36) NOT NULL,
         PRIMARY KEY (ClientID)
        )
CREATE TABLE Products
        (ProdID char(5) NOT NULL,
         ProdName varchar(30) NOT NULL,
         PRIMARY KEY (ProdID)
        )
CREATE TABLE Orders
        (ClientID char(6) NOT NULL,
         ProdID char(5) NOT NULL,
         PRIMARY KEY (ClientID, ProdID),
         FOREIGN KEY (ClientID) REFERENCES Clients,
         FOREIGN KEY (ProdID) REFERENCES Products
        )
go
-- Put some rows in the Clients table
DECLARE @i int, @r int, @NumClients int
SET @NumClients = 10000
SET @i = 0
WHILE @i < @NumClients
  BEGIN
    SET @r = CAST(@i AS bigint) * 308915776 / @NumClients
    INSERT Clients (ClientID, ClientName)
    VALUES(CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26)) + CHAR(65
+ ((@r / 676) % 26))
         + CHAR(65 + ((@r / 17576) % 26)) + CHAR(65 + ((@r / 456976) %
26))
         + CHAR(65 + ((@r / 11881376) % 26)),
           LEFT(CAST(NewID() as VARCHAR(36)), 10 + RAND() * 26))
    SET @i = @i + 1
  END
select count(*) from Clients
go
-- Put some rows in the Products table
DECLARE @i int, @r int, @NumProds int
SET @NumProds = 5000
SET @i = 0
WHILE @i < @NumProds
  BEGIN
    SET @r = CAST(@i AS bigint) * 11881376 / @NumProds
    INSERT Products (ProdID, ProdName)
    VALUES(CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26)) + CHAR(65
+ ((@r / 676) % 26))
         + CHAR(65 + ((@r / 17576) % 26)) + CHAR(65 + ((@r / 456976) %
26)),
           LEFT(CAST(NewID() as VARCHAR(36)), 5 + RAND() * 25))
    SET @i = @i + 1
  END
select count(*) from Products
go
-- Have each client order up to 10 products
DECLARE @Start datetime, @End datetime
SET @Start = CURRENT_TIMESTAMP
DECLARE @c int, @i int, @p int, @r int, @NumClients int, @NumProds int,
@ClientID char(6), @ProdID char(5), @Amt tinyint
SET @NumProds = 5000
SET @NumClients = 10000
SET @c = 0
WHILE @c < @NumClients
  BEGIN
    SET @r = CAST(@c AS bigint) * 308915776 / @NumClients
    SET @ClientID =
           CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26)) + CHAR(65
+ ((@r / 676) % 26))
         + CHAR(65 + ((@r / 17576) % 26)) + CHAR(65 + ((@r / 456976) %
26))
         + CHAR(65 + ((@r / 11881376) % 26))
    SET @Amt = RAND() * 20 + 20
    SET @i = 0
    WHILE @i < @Amt
      BEGIN
        SET @r = CAST(RAND() * @NumProds AS bigint) * 11881376 /
@NumProds
        SET @ProdID =
               CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26)) +
CHAR(65 + ((@r / 676) % 26))
             + CHAR(65 + ((@r / 17576) % 26)) + CHAR(65 + ((@r / 456976)
% 26))
        INSERT Orders (ClientID, ProdID)
        SELECT @ClientID, @ProdID
        WHERE  NOT EXISTS (SELECT *
                           FROM   Orders
                           WHERE  ClientID = @ClientID
                           AND    ProdID   = @ProdID)
        SET @i = @i + 1
      END
    SET @c = @c + 1
    IF @c % 100 = 0
      PRINT @c
  END
SET @End = CURRENT_TIMESTAMP
SELECT 'Generating orders', @Start, @End, DATEDIFF(ms, @Start, @End) AS
Elapsed
select count(*) from Orders
go
-- Add an index
DECLARE @Start datetime, @End datetime
SET @Start = CURRENT_TIMESTAMP
CREATE INDEX Orders_ProdID ON Orders(ProdID)
SET @End = CURRENT_TIMESTAMP
SELECT 'Adding an index', @Start, @End, DATEDIFF(ms, @Start, @End) AS
Elapsed
go
-- DROP INDEX Orders.Orders_ProdID
-- Flush buffers
CHECKPOINT
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
go
-- Perform a typical query
DECLARE @Start datetime, @End datetime
SET @Start = CURRENT_TIMESTAMP
SELECT     p.ProdID,
           COUNT(CASE WHEN LEN(c.ClientName) < 20 THEN 1 END) AS
ShortNameOrders,
           COUNT(CASE WHEN LEN(c.ClientName) > 30 THEN 1 END) AS
LongNameOrders,
           COUNT(c.ClientName) AS TotalOrders
INTO       #Results
FROM       Products AS p
INNER JOIN Orders   AS o
      ON   o.ProdID = p.ProdID
INNER JOIN Clients  AS c
      ON   c.ClientID = o.ClientID
WHERE      p.ProdID LIKE 'D%'
GROUP BY   p.ProdID
SET @End = CURRENT_TIMESTAMP
SELECT 'Query with join', @Start, @End, DATEDIFF(ms, @Start, @End) AS
Elapsed
SELECT * FROM #Results
DROP TABLE #Results
go
-- Clean up the mess
DROP TABLE Orders
DROP TABLE Products
DROP TABLE Clients
go


Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
19 Aug 2005 11:18 PM
Mikito Harakiri
Hugo Kornelis wrote:
Show quote
> On 18 Aug 2005 15:02:56 -0700, Mikito Harakiri wrote:
>
> >Hugo Kornelis wrote:
> >> Many companies use a short mnemonic code for their customers. The use of
> >> a short mnemonic code for products is not uncommon either. Imagine a
> >> customers table with 10,000 customers, a products table with 5,000
> >> products and an orders table with a few million rows. Now would you
> >> prefer char(6) or varchar(6) for CustomerCode? And char(5) or varchar(5)
> >> for ProductCode?
> >
> >But if lookup table cardinality goes up, then, the storage factor
> >char(n) vs varchar(n) goes down! Perhaps, you can convince me that the
> >effect of the two trailing bytes is not miniscule, as I previously
> >thought, but it just can't be significant. Any benchmark demonstrating
> >that the performance degradation is not in single percentage digits is
> >welcome.
>
> Hi Mikito,
>
> Below is the script I used to compare performance. I compared three
> tasks:
>
> 1. Performing lots of single inserts (speed of inserts matters in
> databases that have to process thousands of new rows per second)
> 2. Adding an index (not an everyday task, but there are situations where
> a process can be sped up by adding an index, doing the process, then
> dropping the index again)
> 3. Generating a typical report that involves some joins (the CASE
> expressions I used may look silly, but imagine that I am using a payment
> status instead - also note that I store the results in a temporary table
> first, to make sure that network speed won't influence the results)
>
> You'll find the full code further below. But first the results:
>
> 1. For the inserts, the elapsed time on my machine was 118156 ms with
> char, 124406 ms with varchar. A degradation of over 5 %.
>
> 2. Adding the index took 2840 ms with char, 3236 ms with varchar. A
> performance degradation of almost 14 %.
>
> 3a. Generating the report with the index created in step 2 took 224 ms
> on average for char; 234 ms on average for varchar (I used three
> consecutive executions for both tests, each time starting with a clean
> cache). Performance degradation: 4.5 %.
> 3b. Generating the same report after dropping the supporting index
> showed an even bigger difference: avg 2281 for char; avg 3250 for
> varchar. Degradation: over 40 %.
>
> 1a. To top it off, I repeated the insertion test, but this time AFTER
> creating the extra index. For char, the time taken was 227936 ms; for
> varchar 313110 ms. Performance degradation: 37 %.
>
> So here's your choice: either you use the index, degrade the queries by
> "only" 4.5% but at the cost of slowing inserts down by 37%, or you
> discard the index to reverse these figures (inserts slowed down 5%;
> queries slowed down 40%).
>
> Or you choose char instead of varchar when you know that the length is
> more or less fixed. :-)
>
>
> Here's the code I used for this benchmark:
>
> -- Set up the tables
> CREATE TABLE Clients
>         (ClientID char(6) NOT NULL,
>          ClientName varchar(36) NOT NULL,
>          PRIMARY KEY (ClientID)
>         )
> CREATE TABLE Products
>         (ProdID char(5) NOT NULL,
>          ProdName varchar(30) NOT NULL,
>          PRIMARY KEY (ProdID)
>         )
> CREATE TABLE Orders
>         (ClientID char(6) NOT NULL,
>          ProdID char(5) NOT NULL,
>          PRIMARY KEY (ClientID, ProdID),
>          FOREIGN KEY (ClientID) REFERENCES Clients,
>          FOREIGN KEY (ProdID) REFERENCES Products
>         )
> go
> -- Put some rows in the Clients table
> DECLARE @i int, @r int, @NumClients int
> SET @NumClients = 10000
> SET @i = 0
> WHILE @i < @NumClients
>   BEGIN
>     SET @r = CAST(@i AS bigint) * 308915776 / @NumClients
>     INSERT Clients (ClientID, ClientName)
>     VALUES(CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26)) + CHAR(65
> + ((@r / 676) % 26))
>          + CHAR(65 + ((@r / 17576) % 26)) + CHAR(65 + ((@r / 456976) %
> 26))
>          + CHAR(65 + ((@r / 11881376) % 26)),
>            LEFT(CAST(NewID() as VARCHAR(36)), 10 + RAND() * 26))
>     SET @i = @i + 1
>   END
> select count(*) from Clients
> go
> -- Put some rows in the Products table
> DECLARE @i int, @r int, @NumProds int
> SET @NumProds = 5000
> SET @i = 0
> WHILE @i < @NumProds
>   BEGIN
>     SET @r = CAST(@i AS bigint) * 11881376 / @NumProds
>     INSERT Products (ProdID, ProdName)
>     VALUES(CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26)) + CHAR(65
> + ((@r / 676) % 26))
>          + CHAR(65 + ((@r / 17576) % 26)) + CHAR(65 + ((@r / 456976) %
> 26)),
>            LEFT(CAST(NewID() as VARCHAR(36)), 5 + RAND() * 25))
>     SET @i = @i + 1
>   END
> select count(*) from Products
> go
> -- Have each client order up to 10 products
> DECLARE @Start datetime, @End datetime
> SET @Start = CURRENT_TIMESTAMP
> DECLARE @c int, @i int, @p int, @r int, @NumClients int, @NumProds int,
> @ClientID char(6), @ProdID char(5), @Amt tinyint
> SET @NumProds = 5000
> SET @NumClients = 10000
> SET @c = 0
> WHILE @c < @NumClients
>   BEGIN
>     SET @r = CAST(@c AS bigint) * 308915776 / @NumClients
>     SET @ClientID =
>            CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26)) + CHAR(65
> + ((@r / 676) % 26))
>          + CHAR(65 + ((@r / 17576) % 26)) + CHAR(65 + ((@r / 456976) %
> 26))
>          + CHAR(65 + ((@r / 11881376) % 26))
>     SET @Amt = RAND() * 20 + 20
>     SET @i = 0
>     WHILE @i < @Amt
>       BEGIN
>         SET @r = CAST(RAND() * @NumProds AS bigint) * 11881376 /
> @NumProds
>         SET @ProdID =
>                CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26)) +
> CHAR(65 + ((@r / 676) % 26))
>              + CHAR(65 + ((@r / 17576) % 26)) + CHAR(65 + ((@r / 456976)
> % 26))
>         INSERT Orders (ClientID, ProdID)
>         SELECT @ClientID, @ProdID
>         WHERE  NOT EXISTS (SELECT *
>                            FROM   Orders
>                            WHERE  ClientID = @ClientID
>                            AND    ProdID   = @ProdID)
>         SET @i = @i + 1
>       END
>     SET @c = @c + 1
>     IF @c % 100 = 0
>       PRINT @c
>   END
> SET @End = CURRENT_TIMESTAMP
> SELECT 'Generating orders', @Start, @End, DATEDIFF(ms, @Start, @End) AS
> Elapsed
> select count(*) from Orders
> go
> -- Add an index
> DECLARE @Start datetime, @End datetime
> SET @Start = CURRENT_TIMESTAMP
> CREATE INDEX Orders_ProdID ON Orders(ProdID)
> SET @End = CURRENT_TIMESTAMP
> SELECT 'Adding an index', @Start, @End, DATEDIFF(ms, @Start, @End) AS
> Elapsed
> go
> -- DROP INDEX Orders.Orders_ProdID
> -- Flush buffers
> CHECKPOINT
> DBCC DROPCLEANBUFFERS
> DBCC FREEPROCCACHE
> go
> -- Perform a typical query
> DECLARE @Start datetime, @End datetime
> SET @Start = CURRENT_TIMESTAMP
> SELECT     p.ProdID,
>            COUNT(CASE WHEN LEN(c.ClientName) < 20 THEN 1 END) AS
> ShortNameOrders,
>            COUNT(CASE WHEN LEN(c.ClientName) > 30 THEN 1 END) AS
> LongNameOrders,
>            COUNT(c.ClientName) AS TotalOrders
> INTO       #Results
> FROM       Products AS p
> INNER JOIN Orders   AS o
>       ON   o.ProdID = p.ProdID
> INNER JOIN Clients  AS c
>       ON   c.ClientID = o.ClientID
> WHERE      p.ProdID LIKE 'D%'
> GROUP BY   p.ProdID
> SET @End = CURRENT_TIMESTAMP
> SELECT 'Query with join', @Start, @End, DATEDIFF(ms, @Start, @End) AS
> Elapsed
> SELECT * FROM #Results
> DROP TABLE #Results
> go
> -- Clean up the mess
> DROP TABLE Orders
> DROP TABLE Products
> DROP TABLE Clients
> go

Hugo,

I was going to perform this test in oracle, but then I came across the
following passage:

Tom...
I was told once that if you only need a single char
use CHAR(1) since using VARCHAR2(1) has overhead
due to the Oracle having to maintain how long
the value is in a VARCHAR2 field

Is this a load of rubbish or is there some truth
to it?

Followup:

that is rubbish.

A char(n) is a varchar2(n) that is blank padded to it's maximum length.
The
implementation of a char physically on disk is identical to a varchar2
-- there is a length byte for a char(1) just as there is for a
varchar2(1).

http://www.jlcomp.demon.co.uk/faq/char_vs_varchar2.html
3. As opposed to a wide-spread misunderstanding, char (1) does *not*
use less space in the database than varchar2 (1) when the value 'a' is
stored because also char (1) has a length field. This can be proved by
the dump () function.

I guess this storage quirk doesn't really affect TPC-C benchmarks:)


It might be interesting, however, to dig down into the insert
performance difference. As I mentioned, 2 extra bytes shouldn't really
affect the speed of index unique scan.

BTW, I always use INTEGERs for id columns.
Author
22 Aug 2005 10:32 PM
Hugo Kornelis
On 19 Aug 2005 16:18:03 -0700, Mikito Harakiri wrote:

(snip)
>Hugo,
>
>I was going to perform this test in oracle, but then I came across the
>following passage:
(snip)
>I guess this storage quirk doesn't really affect TPC-C benchmarks:)

Hi Mikito,

I've never worked with Oracle, so I'll just have to take your word for
it, I guess.

>It might be interesting, however, to dig down into the insert
>performance difference. As I mentioned, 2 extra bytes shouldn't really
>affect the speed of index unique scan.

I think the culprint for the inserts is the extra index. The size of one
entry in the index grows from 11 bytes to 15 bytes (a large percentage).
And that will mean that less entries fit into one page - and since the
inserts are sequential for the PRIMARY KEY, but non-sequential for the
UNIQUE, the supporting index for the UNIQUE constraint will face lots of
page splits.

I include these two statements in the main loop for the inserts to have
some indication that the system is still busy:
>>     IF @c % 100 = 0
>>       PRINT @c
If you run the script, you'll see that the speed at which the counter
increase will graudually slow down as the number of index pages grows.

>BTW, I always use INTEGERs for id columns.

Yeah, I guess I should have named the key columns ClientCode and
ProdCode instead of ClientID and ProdID. :-)

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
18 Aug 2005 10:13 PM
Mikito Harakiri
Hugo Kornelis wrote:
> On 18 Aug 2005 13:17:37 -0700, Mikito Harakiri wrote:
> >pondering if
> >
> >select * from table
> >
> >is faster than
> >
> >select col1, col2, ... from table
>
> No need to ponder that -- all SQL Server DBAs (and presumably all DBAs
> for all serious RDBMS's) know that SELECT * should never be used in
> production code (except in a EXISTS(..) subquery).

My bad. I meant

select count(1) from table

vs.

select count(*) from table

BTW, you triggered the other example: is EXISTS or IN faster? This
question could come up only from somebody who is completely unaware of
SQL expression equivalency and query rewrite. Well, making sure the
extents and segments are layed out on disk properly, leaves little room
for education and abstract thinking.
Author
19 Aug 2005 7:53 AM
Mike Hodgson
Mikito Harakiri wrote:

Show quote
>Hugo Kornelis wrote:

>
>>On 18 Aug 2005 13:17:37 -0700, Mikito Harakiri wrote:
>>   
>>
>>>pondering if
>>>
>>>select * from table
>>>
>>>is faster than
>>>
>>>select col1, col2, ... from table
>>>     
>>>
>>No need to ponder that -- all SQL Server DBAs (and presumably all DBAs
>>for all serious RDBMS's) know that SELECT * should never be used in
>>production code (except in a EXISTS(..) subquery).
>>   
>>
>
>My bad. I meant
>
>select count(1) from table
>
>vs.
>
>select count(*) from table

>

"select count(1) from table" and "select count(*) from table" will both
come up with the same execution plan.  They will both count the number
of entries in the narrowest index on that table.  That's, perhaps, a bad
example as the optimiser is specifically designed to deal with that case
I believe.  See example below (both cases require 227 logical reads and
return 121371 rows in this example):


StmtText                                  
-------------------------------------------
select count(1) from dbo.SalesOrderDetail

(1 row(s) affected)

StmtText                                                                                                          

------------------------------------------------------------------------------------------------------------------

  |--Compute Scalar(DEFINE:([Expr1002]=Convert([Expr1003])))
       |--Stream Aggregate(DEFINE:([Expr1003]=Count(*)))
            |--Index
Scan(OBJECT:([AdventureWorks2000].[dbo].[SalesOrderDetail].[IX_SalesOrderDetail_ProductID]))

(3 row(s) affected)

StmtText                                      
-----------------------------------------------

select count(*) from dbo.SalesOrderDetail

(1 row(s) affected)

StmtText                                                                                                          

------------------------------------------------------------------------------------------------------------------

  |--Compute Scalar(DEFINE:([Expr1002]=Convert([Expr1003])))
       |--Stream Aggregate(DEFINE:([Expr1003]=Count(*)))
            |--Index
Scan(OBJECT:([AdventureWorks2000].[dbo].[SalesOrderDetail].[IX_SalesOrderDetail_ProductID]))

(3 row(s) affected)


>BTW, you triggered the other example: is EXISTS or IN faster? This
>question could come up only from somebody who is completely unaware of
>SQL expression equivalency and query rewrite. Well, making sure the
>extents and segments are layed out on disk properly, leaves little room
>for education and abstract thinking.

>
The EXISTS() predicate is typically a fairly efficient predicate because
it only needs to scan until it gets a match, at which time it returns. 
The worst case scenario (it finds a match on the last physical row, or
it doesn't find any matching row) is the same I/O as the IN() predicate
case because IN() will evaluate the entire subquery.

--
*mike hodgson*
blog: http://sqlnerd.blogspot.com
Author
19 Aug 2005 10:20 AM
Paul
Mike Hodgson wrote:
> The EXISTS() predicate is typically a fairly efficient predicate because
> it only needs to scan until it gets a match, at which time it returns.
> The worst case scenario (it finds a match on the last physical row, or
> it doesn't find any matching row) is the same I/O as the IN() predicate
> case because IN() will evaluate the entire subquery.

Why does IN() need to evaluate the entire subquery? Couldn't it in
theory work exactly the same as EXISTS() at the physical level?

Paul.
Author
19 Aug 2005 11:25 AM
Tony Andrews
Paul wrote:
> Mike Hodgson wrote:
> > The EXISTS() predicate is typically a fairly efficient predicate because
> > it only needs to scan until it gets a match, at which time it returns.
> > The worst case scenario (it finds a match on the last physical row, or
> > it doesn't find any matching row) is the same I/O as the IN() predicate
> > case because IN() will evaluate the entire subquery.
>
> Why does IN() need to evaluate the entire subquery? Couldn't it in
> theory work exactly the same as EXISTS() at the physical level?

Yes it could, and indeed does (Oracle 9i).  In the following example,
IN and EXISTS are processed the same way, and DO NOT evaluate the
entire subquery:

SQL> create table t1 as select object_id, object_name from all_objects;

Table created.

SQL> alter table t1 add constraint t1_pk primary key (object_id);

Table altered.

SQL> create table t2 as select object_id, object_name from all_objects
where rownum=1;

Table created.

SQL> alter table t2 add constraint t2_pk primary key (object_id);

Table altered.

SQL> analyze table t1 compute statistics;

Table analyzed.

SQL> analyze table t2 compute statistics;

Table analyzed.

SQL> select count(*) from t1;

  COUNT(*)
----------
     47355

SQL> select count(*) from t2;

  COUNT(*)
----------
         1

SQL> set autotrace on
SQL> select * from t2 where object_id in (select object_id from t1);

OBJECT_ID OBJECT_NAME
---------- ------------------------------
     18164 /1005bd30_LnkdConstant


Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=1 Card=1 Bytes=30)
   1    0   NESTED LOOPS (Cost=1 Card=1 Bytes=30)
   2    1     TABLE ACCESS (FULL) OF 'T2' (Cost=1 Card=1 Bytes=26)
   3    1     INDEX (UNIQUE SCAN) OF 'T1_PK' (UNIQUE)




Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          5  consistent gets
          3  physical reads
          0  redo size
        227  bytes sent via SQL*Net to client
        314  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL> select * from t2 where exists (select null from t1 where
t1.object_id = t2.object_id);

OBJECT_ID OBJECT_NAME
---------- ------------------------------
     18164 /1005bd30_LnkdConstant


Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=1 Card=1 Bytes=26)
   1    0   FILTER
   2    1     TABLE ACCESS (FULL) OF 'T2' (Cost=1 Card=1 Bytes=26)
   3    1     INDEX (UNIQUE SCAN) OF 'T1_PK' (UNIQUE) (Cost=1 Card=1 B
          ytes=4)





Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          5  consistent gets
          0  physical reads
          0  redo size
        227  bytes sent via SQL*Net to client
        314  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed
Author
19 Aug 2005 12:44 PM
Mike Hodgson
Perhaps it's an implementation detail of Oracle then (I'm not very
familiar with how Oracle performs the various operations).

What would happen if you had slightly more realistic data (like both
tables with more than 1 row in them)?  For example, say you had:
InsurancePolicies {policy_id, broker_id, inception_date, ...} - (80000
rows) clustered index on policy_id (don't know what Oracle parlance for
"clustered index" is)
InsuranceBrokers {broker_id, broker_name, ...} - (200 rows) clustered
index on broker_id

then you said:

select * from InsuranceBrokers b
where broker_id in
    (
    select distinct broker_id from InsurancePolicies p
    where p.inception_date > '20050101'
    )

select * from InsuranceBrokers b
where exists
    (
    select * from InsurancePolicies p
    where p.broker_id = b.broker_id
    and p.inception_date > '20050101'
    )

Both queries should return the same data, namely all the brokers who own
1 or more policies  that started this year.  Now, with the IN() query,
if the physical data in InsurancePolicies is sorted by policy_id, then
how does the query engine know it's got all of the policies that started
this year unless it goes through every single row in InsurancePolicies? 
With the EXISTS() version, as soon as the query engine finds that the
broker in question owns a single policy that started this year it would
stop trawling through the 80000 row policy table.

Best case scenario for EXISTS(), the first policy row for that broker
started this year so that broker is included in the result set (scan 1
row out of 80000); worst case scenario, the only policy the broker owns
that started this year was created yesterday (and so has the greatest
policy_id and so is last in the physical order of rows in the table -
ie. full index scan; scan 80000 rows out of 80000).  For all cases for
IN() the query engine needs to go through every policy row that (that
started this year) to compile the distinct list to present back to the
outer query - i.e. full index scan.  Perhaps Oracle have done some
particular optimisations in that area, but I believe that's the way
Microsoft deal with it.

Bit of a dumb example really because an inner join would be the best way
to write that query anyway (well it would in SQL Server - I assume the
same would hold true for Oracle) but it's the simplest example my poor
tired brain would come up with at 10:30 on a Friday night.

--
*mike hodgson*
blog: http://sqlnerd.blogspot.com



Tony Andrews wrote:

Show quote
>Paul wrote:

>
>>Mike Hodgson wrote:
>>   
>>
>>>The EXISTS() predicate is typically a fairly efficient predicate because
>>>it only needs to scan until it gets a match, at which time it returns.
>>>The worst case scenario (it finds a match on the last physical row, or
>>>it doesn't find any matching row) is the same I/O as the IN() predicate
>>>case because IN() will evaluate the entire subquery.
>>>     
>>>
>>Why does IN() need to evaluate the entire subquery? Couldn't it in
>>theory work exactly the same as EXISTS() at the physical level?
>>   
>>
>
>Yes it could, and indeed does (Oracle 9i).  In the following example,
>IN and EXISTS are processed the same way, and DO NOT evaluate the
>entire subquery:
>
>SQL> create table t1 as select object_id, object_name from all_objects;
>
>Table created.
>
>SQL> alter table t1 add constraint t1_pk primary key (object_id);
>
>Table altered.
>
>SQL> create table t2 as select object_id, object_name from all_objects
>where rownum=1;
>
>Table created.
>
>SQL> alter table t2 add constraint t2_pk primary key (object_id);
>
>Table altered.
>
>SQL> analyze table t1 compute statistics;
>
>Table analyzed.
>
>SQL> analyze table t2 compute statistics;
>
>Table analyzed.
>
>SQL> select count(*) from t1;
>
>  COUNT(*)
>----------
>     47355
>
>SQL> select count(*) from t2;
>
>  COUNT(*)
>----------
>         1
>
>SQL> set autotrace on
>SQL> select * from t2 where object_id in (select object_id from t1);
>
> OBJECT_ID OBJECT_NAME
>---------- ------------------------------
>     18164 /1005bd30_LnkdConstant
>
>
>Execution Plan
>----------------------------------------------------------
>   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=1 Card=1 Bytes=30)
>   1    0   NESTED LOOPS (Cost=1 Card=1 Bytes=30)
>   2    1     TABLE ACCESS (FULL) OF 'T2' (Cost=1 Card=1 Bytes=26)
>   3    1     INDEX (UNIQUE SCAN) OF 'T1_PK' (UNIQUE)
>
>
>
>
>Statistics
>----------------------------------------------------------
>          0  recursive calls
>          0  db block gets
>          5  consistent gets
>          3  physical reads
>          0  redo size
>        227  bytes sent via SQL*Net to client
>        314  bytes received via SQL*Net from client
>          2  SQL*Net roundtrips to/from client
>          0  sorts (memory)
>          0  sorts (disk)
>          1  rows processed
>
>SQL> select * from t2 where exists (select null from t1 where
>t1.object_id = t2.object_id);
>
> OBJECT_ID OBJECT_NAME
>---------- ------------------------------
>     18164 /1005bd30_LnkdConstant
>
>
>Execution Plan
>----------------------------------------------------------
>   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=1 Card=1 Bytes=26)
>   1    0   FILTER
>   2    1     TABLE ACCESS (FULL) OF 'T2' (Cost=1 Card=1 Bytes=26)
>   3    1     INDEX (UNIQUE SCAN) OF 'T1_PK' (UNIQUE) (Cost=1 Card=1 B
>          ytes=4)
>
>
>
>
>
>Statistics
>----------------------------------------------------------
>          0  recursive calls
>          0  db block gets
>          5  consistent gets
>          0  physical reads
>          0  redo size
>        227  bytes sent via SQL*Net to client
>        314  bytes received via SQL*Net from client
>          2  SQL*Net roundtrips to/from client
>          0  sorts (memory)
>          0  sorts (disk)
>          1  rows processed
>

>
Author
19 Aug 2005 1:27 PM
Paul
Mike Hodgson wrote:
Show quote
> select * from InsuranceBrokers b
> where broker_id in
>    (
>    select distinct broker_id from InsurancePolicies p
>    where p.inception_date > '20050101'
>    )
....
> Best case scenario for EXISTS(), the first policy row for that broker
> started this year so that broker is included in the result set (scan 1
> row out of 80000); worst case scenario, the only policy the broker owns
> that started this year was created yesterday (and so has the greatest
> policy_id and so is last in the physical order of rows in the table -
> ie. full index scan; scan 80000 rows out of 80000).  For all cases for
> IN() the query engine needs to go through every policy row that (that
> started this year) to compile the distinct list to present back to the
> outer query - i.e. full index scan.

The optimizer should be able to recognise that the distinct is
unnecessary, so it wouldn't need to go through the whole table.

Just because the subquery uses "distinct" it doesn't mean that the DBMS
must materialise that internally if it's not necessary. Though I guess
index statistics may cause it to choose this option if it thinks it
would be faster.

If you think about it, both queries are logically identical so it would
be possible (at least in theory) for them to use identical query plans,
if the query optimizer is clever enough.

Paul.
Author
19 Aug 2005 2:16 PM
Tony Andrews
Mike Hodgson wrote:
Show quote
> Perhaps it's an implementation detail of Oracle then (I'm not very
> familiar with how Oracle performs the various operations).
>
> What would happen if you had slightly more realistic data (like both
> tables with more than 1 row in them)?  For example, say you had:
> InsurancePolicies {policy_id, broker_id, inception_date, ...} - (80000
> rows) clustered index on policy_id (don't know what Oracle parlance for
> "clustered index" is)
> InsuranceBrokers {broker_id, broker_name, ...} - (200 rows) clustered
> index on broker_id
>
> then you said:
>
> select * from InsuranceBrokers b
> where broker_id in
>     (
>     select distinct broker_id from InsurancePolicies p
>     where p.inception_date > '20050101'
>     )
>
> select * from InsuranceBrokers b
> where exists
>     (
>     select * from InsurancePolicies p
>     where p.broker_id = b.broker_id
>     and p.inception_date > '20050101'
>     )
>
> Both queries should return the same data, namely all the brokers who own
> 1 or more policies  that started this year.  Now, with the IN() query,
> if the physical data in InsurancePolicies is sorted by policy_id, then
> how does the query engine know it's got all of the policies that started
> this year unless it goes through every single row in InsurancePolicies?
> With the EXISTS() version, as soon as the query engine finds that the
> broker in question owns a single policy that started this year it would
> stop trawling through the 80000 row policy table.
>
> Best case scenario for EXISTS(), the first policy row for that broker
> started this year so that broker is included in the result set (scan 1
> row out of 80000); worst case scenario, the only policy the broker owns
> that started this year was created yesterday (and so has the greatest
> policy_id and so is last in the physical order of rows in the table -
> ie. full index scan; scan 80000 rows out of 80000).  For all cases for
> IN() the query engine needs to go through every policy row that (that
> started this year) to compile the distinct list to present back to the
> outer query - i.e. full index scan.  Perhaps Oracle have done some
> particular optimisations in that area, but I believe that's the way
> Microsoft deal with it.
>
> Bit of a dumb example really because an inner join would be the best way
> to write that query anyway (well it would in SQL Server - I assume the
> same would hold true for Oracle) but it's the simplest example my poor
> tired brain would come up with at 10:30 on a Friday night.

Here is the best simulation of your example I can come up with at the
moment.  It shows that Oracle is indeed choosing different plans this
time, though the elapsed time is similar either way (EXISTS slightly
faster, 0.3 secs rather than 0.4)

>From the Oracle docs, it appears that "where x in (select PKCOL from
y)" is a special case, that can be optimized into a join.

SQL> select count(*) from insurance_brokers;

  COUNT(*)
----------
        67

SQL> select count(*), count(distinct broker_id) from
insurance_policies;

  COUNT(*) COUNT(DISTINCTBROKER_ID)
---------- ------------------------
     30881                       30

SQL> select * from Insurance_Brokers b
  2  where broker_id in
  3     (
  4     select distinct broker_id from Insurance_Policies p
  5     where p.inception_date > date '2005-01-01'
  6     );

BROKER_ID BROKER_NAME
---------- ------------------------------
         0 SYS
         5 SYSTEM
        11 OUTLN
        18 DBSNMP
        20 WMSYS
        41 AURORA$JIS$UTILITY$
        29 ORDSYS
        30 ORDPLUGINS
        31 MDSYS
        32 CTXSYS
        34 XDB
        38 FSC
        39 RB
        42 OSE$HTTP$ADMIN
        56 PORTAL30
       121 FLOWS_010600
       120 FLOWS_FILES
        58 PORTAL30_SSO
        60 PORTAL30_SSO_PS
        61 PORTAL30_DEMO
        62 SCOTT
       118 AFOSTER
        71 WORKFLOW
       122 SX3COM
       128 TOAD
       132 BMS1
       134 DESDIR
       141 JREED
       145 NARROW
       146 ARROW

30 rows selected.


Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=94 Card=30 Bytes=720
          )

   1    0   HASH JOIN (Cost=94 Card=30 Bytes=720)
   2    1     VIEW OF 'VW_NSO_1' (Cost=92 Card=30 Bytes=390)
   3    2       SORT (UNIQUE) (Cost=92 Card=30 Bytes=270)
   4    3         TABLE ACCESS (FULL) OF 'INSURANCE_POLICIES' (Cost=9
          Card=30881 Bytes=277929)

   5    1     TABLE ACCESS (FULL) OF 'INSURANCE_BROKERS' (Cost=1 Card=
          67 Bytes=737)

SQL> select * from Insurance_Brokers b
  2  where exists
  3      (
  4      select * from Insurance_Policies p
  5      where p.broker_id = b.broker_id
  6      and p.inception_date > date '2005-01-01'
  7      );

BROKER_ID BROKER_NAME
---------- ------------------------------
         0 SYS
         5 SYSTEM
        11 OUTLN
        18 DBSNMP
        20 WMSYS
        41 AURORA$JIS$UTILITY$
        29 ORDSYS
        30 ORDPLUGINS
        31 MDSYS
        32 CTXSYS
        34 XDB
        38 FSC
        39 RB
        42 OSE$HTTP$ADMIN
        56 PORTAL30
       121 FLOWS_010600
       120 FLOWS_FILES
        58 PORTAL30_SSO
        60 PORTAL30_SSO_PS
        61 PORTAL30_DEMO
        62 SCOTT
       118 AFOSTER
        71 WORKFLOW
       122 SX3COM
       128 TOAD
       132 BMS1
       134 DESDIR
       141 JREED
       145 NARROW
       146 ARROW

30 rows selected.


Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=1 Card=3 Bytes=33)
   1    0   FILTER
   2    1     TABLE ACCESS (FULL) OF 'INSURANCE_BROKERS' (Cost=1 Card=
          3 Bytes=33)

   3    1     TABLE ACCESS (BY INDEX ROWID) OF 'INSURANCE_POLICIES' (C
          ost=6 Card=1029 Bytes=9261)

   4    3       INDEX (RANGE SCAN) OF 'IP_IB' (NON-UNIQUE) (Cost=2 Car
          d=1029)
Author
19 Aug 2005 10:46 PM
Hugo Kornelis
On 18 Aug 2005 15:13:28 -0700, Mikito Harakiri wrote:

Show quote
>Hugo Kornelis wrote:
>> On 18 Aug 2005 13:17:37 -0700, Mikito Harakiri wrote:
>> >pondering if
>> >
>> >select * from table
>> >
>> >is faster than
>> >
>> >select col1, col2, ... from table
>>
>> No need to ponder that -- all SQL Server DBAs (and presumably all DBAs
>> for all serious RDBMS's) know that SELECT * should never be used in
>> production code (except in a EXISTS(..) subquery).
>
>My bad. I meant
>
>select count(1) from table
>
>vs.
>
>select count(*) from table

Hi Mikito,

As Mike said: absolutely no difference between the two.


>BTW, you triggered the other example: is EXISTS or IN faster? This
>question could come up only from somebody who is completely unaware of
>SQL expression equivalency and query rewrite.

SQL Server will often produce the same execution for both versions. If
performance is really critical, always test all versions. If performance
is important but not criticat, use EXISTS - I've seen cases where it's
faster than IN, but I haven't seen the reverse yet.


> Well, making sure the
>extents and segments are layed out on disk properly, leaves little room
>for education and abstract thinking.

Huh? I don't know what your DMBS of choice is, but SQL Server doesn't
bother the DBA with extents and segments.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
20 Aug 2005 2:45 PM
Anith Sen
[...removed comp.databases.theory...]

>> As Mike said: absolutely no difference between the two.

However, COUNT comes in different flavors, esp. with NULLs, it is
inconsistent at best:

COUNT( column )
COUNT( * )  -- for the whole resultset
COUNT( constant )  --- this is silly
COUNT( some column expression )
COUNT( some column expression involving NULLs )
         -- weird results, changes with datatypes!

For instance with numeric expressions & values.

SELECT COUNT(*), COUNT( 2 ),
       COUNT( c * 2 + 1 ), COUNT( c * 2 + NULL ), COUNT( c )
  FROM ( SELECT 1 UNION SELECT NULL UNION SELECT 2 ) N ( c ) ;

>> SQL Server will often produce the same execution for both versions.

First of all, not all statements with EXISTS() can be re-written using IN()
or vice versa. For simple one-liners and straight forward queries involving
a few tables, the plans are often comparable but in certain complex
scenarios, with deeply nested SQL expressions this varies significantly.

>> If performance is really critical, always test all versions. If
>> performance is important but not criticat, use EXISTS - I've seen cases
>> where it's faster than IN, but I haven't seen the reverse yet.

I have seen it, though rarely esp. with small tables, when there are
correlated subqueries nested deeply. Another case is a derived table with
hard-wired values can sometimes generate plans which favors IN() over
EXISTS(). So as you said, testing & finding it out is the only solution.

>> Huh? I don't know what your DMBS of choice is, but SQL Server doesn't
>> bother the DBA with extents and segments.

True; perhaps DBCC commands, but in reality it can be a benefit rather than
botheration.

--
Anith
Author
19 Aug 2005 10:56 PM
Hugo Kornelis
On 18 Aug 2005 15:13:28 -0700, Mikito Harakiri wrote:

(snip)
>BTW, you triggered the other example: is EXISTS or IN faster?

Woops - forgot to add this to my previous reply:

Before thinking about speed, you should think about the difference
between the two. IN can result in UNKNOWN; EXISTS can only result in
TRUE or FALSE. Since UNKNOWN and FALSE are treated the same in a WHERE,
WHEN, or HAVING clause, many people doon't notice the difference - until
they start combining IN with NOT and getting unexpected results!

Only choose the faster version if they truly are equal!!

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
19 Aug 2005 11:35 PM
Mikito Harakiri
Hugo Kornelis wrote:
Show quote
> On 18 Aug 2005 15:13:28 -0700, Mikito Harakiri wrote:
>
> (snip)
> >BTW, you triggered the other example: is EXISTS or IN faster?
>
> Woops - forgot to add this to my previous reply:
>
> Before thinking about speed, you should think about the difference
> between the two. IN can result in UNKNOWN; EXISTS can only result in
> TRUE or FALSE. Since UNKNOWN and FALSE are treated the same in a WHERE,
> WHEN, or HAVING clause, many people doon't notice the difference - until
> they start combining IN with NOT and getting unexpected results!
>
> Only choose the faster version if they truly are equal!!

This tiny difference is really a pain in the butt. This UNNKNOWN based
logic is rubbish, and any query involving NULLs still could give a
surprising result. Therefore, the life would be much simpler if SQL
standard just defined  NOT IN and NOT EXIST to be identical. After all,
they both are just dumbed down aggregate scalar subquery expressions:

--IN, EXIST
select * from dept d
where 0<(select count(*) from emp
         where dept.deptno=emp.deptno)

--NOT IN, NOT EXIST
select * from dept d
where 0=(select count(*) from emp
         where dept.deptno=emp.deptno)


(both also known as semijoin and antijoin). Aggregate scalar subquery
syntax is more general, and generic solution always rule.
Author
19 Aug 2005 5:01 PM
mike
On Thu, 18 Aug 2005 22:50:33 +0200, Hugo Kornelis
<hugo@pe_NO_rFact.in_SPAM_fo> wrote:

[snip]
>The abstraction level where the actual
>data type is irrelevant is where you'll find the information analyst,
>functional designer, conceptual modeler or whatever the current name of
>those people is.
[snip]

The current name appears to be "Java programmer".  <cough>

--
mike sherrill
Author
19 Aug 2005 5:22 PM
AK
>> You have varchar2 limited to 4000 bytes.
Then you have text to cover 4K to 2G range. Then you have to rely on
some other option to be able store data bigger than 2G (split it into
chunks in your application????).

If you suggested such a design to programmic language community, you
would be laughed at.
<<

oh really? then how comes C# has 8 types to store integers only, not
counting decimal type? Is it funny too?


>> In fact, those ugly length
limited datatypes are just artifacts of the early SQL days,
<<

on the contrary, it is a very powerful tool in enforsing data integrity
Author
19 Aug 2005 7:00 PM
Gene Wirchenko
Show quote
On 19 Aug 2005 10:22:41 -0700, "AK" <AK_TIREDOFSPAM@hotmail.COM>
wrote:

>>> You have varchar2 limited to 4000 bytes.
>Then you have text to cover 4K to 2G range. Then you have to rely on
>some other option to be able store data bigger than 2G (split it into
>chunks in your application????).
>
>If you suggested such a design to programmic language community, you
>would be laughed at.
><<
>
>oh really? then how comes C# has 8 types to store integers only, not
>counting decimal type? Is it funny too?

     Is there anything different between a string 4096 long and a
string 4097 long?  Are there any different machine instructions that
are required to handle them?  No?

     What about with integers?  Yes?

     There is your difference.

>>> In fact, those ugly length
>limited datatypes are just artifacts of the early SQL days,
><<
>
>on the contrary, it is a very powerful tool in enforsing data integrity

     Oh?  How?  Data values have very little to do the difference
between 4096 and 4097 long strings.

Sincerely,

Gene Wirchenko
Author
19 Aug 2005 7:31 PM
AK
>>
>>> In fact, those ugly length
>limited datatypes are just artifacts of the early SQL days,
><<

>on the contrary, it is a very powerful tool in enforsing data integrity



     Oh?  How?
<<

by declaring STATE_CODE CHAR(2) I don't let old style abbreviations
'Fla' and 'Ill' in

by declaring COUNTRY_CODE CHAR(3), etc.
Author
19 Aug 2005 8:04 PM
Gene Wirchenko
On 19 Aug 2005 12:31:15 -0700, "AK" <AK_TIREDOFSPAM@hotmail.COM>
wrote:

[??? wrote:]
>>>> In fact, those ugly length
>>limited datatypes are just artifacts of the early SQL days,

[AK wrote:]

>>on the contrary, it is a very powerful tool in enforsing data integrity

[Gene Wirchenko wrote:]

>     Oh?  How?

>by declaring STATE_CODE CHAR(2) I don't let old style abbreviations
>'Fla' and 'Ill' in
>
>by declaring COUNTRY_CODE CHAR(3), etc.

     True, but that was not the point.  The point was that datatypes
limited to arbirary characters are artifacts.  (Note: "datatypes" not
"columns".  Your usage is good.)

     (Please do not strip attributions when you quote someone.)

Sincerely,

Gene Wirchenko
Author
19 Aug 2005 10:11 PM
Mikito Harakiri
AK wrote:
Show quote
> >>
> >>> In fact, those ugly length
> >limited datatypes are just artifacts of the early SQL days,
> ><<
>
> >on the contrary, it is a very powerful tool in enforsing data integrity
>
>
>
>      Oh?  How?
> <<
>
> by declaring STATE_CODE CHAR(2) I don't let old style abbreviations
> 'Fla' and 'Ill' in
>
> by declaring COUNTRY_CODE CHAR(3), etc.

Why not

CHECK CONSTRAINT(LENGTH(COUNTRY_CODE)<=2)

BTW, this way you could also declare that COUNTRY_CODE has exactly 2
characters.
Author
19 Aug 2005 10:49 PM
Hugo Kornelis
On Fri, 19 Aug 2005 12:00:15 -0700, Gene Wirchenko wrote:

Show quote
>On 19 Aug 2005 10:22:41 -0700, "AK" <AK_TIREDOFSPAM@hotmail.COM>
>wrote:
>
>>>> You have varchar2 limited to 4000 bytes.
>>Then you have text to cover 4K to 2G range. Then you have to rely on
>>some other option to be able store data bigger than 2G (split it into
>>chunks in your application????).
>>
>>If you suggested such a design to programmic language community, you
>>would be laughed at.
>><<
>>
>>oh really? then how comes C# has 8 types to store integers only, not
>>counting decimal type? Is it funny too?
>
>     Is there anything different between a string 4096 long and a
>string 4097 long?  Are there any different machine instructions that
>are required to handle them?  No?
>
>     What about with integers?  Yes?
>
>     There is your difference.

Hi Gene,

So what exactly is the difference between a number that might reach
30,000 or a number that might reach 40,000? And how exactly is that
difference different from the difference between a 4096 character string
and 4097 character string?

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
22 Aug 2005 9:00 PM
Gene Wirchenko
On Sat, 20 Aug 2005 00:49:14 +0200, Hugo Kornelis
<hugo@pe_NO_rFact.in_SPAM_fo> wrote:

[snip]

>So what exactly is the difference between a number that might reach
>30,000 or a number that might reach 40,000? And how exactly is that

     Different machine instructions.

>difference different from the difference between a 4096 character string
>and 4097 character string?

     This is just iterating a little more on the longer string.

Sincerely,

Gene Wirchenko
Author
22 Aug 2005 10:38 PM
Hugo Kornelis
On Mon, 22 Aug 2005 14:00:27 -0700, Gene Wirchenko wrote:

>On Sat, 20 Aug 2005 00:49:14 +0200, Hugo Kornelis
><hugo@pe_NO_rFact.in_SPAM_fo> wrote:
>
>[snip]
>
>>So what exactly is the difference between a number that might reach
>>30,000 or a number that might reach 40,000? And how exactly is that
>
>     Different machine instructions.

Hi Gene,

Isn't an RDBMS all about abstracting away from implementation details?

>>difference different from the difference between a 4096 character string
>>and 4097 character string?
>
>     This is just iterating a little more on the longer string.

Toppling the size needed just over the maximum amount reserved for it in
specific internal structures. Or, put more simply, it takes other
machine instructions to handle.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
23 Aug 2005 4:57 PM
Gene Wirchenko
On Tue, 23 Aug 2005 00:38:01 +0200, Hugo Kornelis
<hugo@pe_NO_rFact.in_SPAM_fo> wrote:

>On Mon, 22 Aug 2005 14:00:27 -0700, Gene Wirchenko wrote:
>
>>On Sat, 20 Aug 2005 00:49:14 +0200, Hugo Kornelis
>><hugo@pe_NO_rFact.in_SPAM_fo> wrote:
>>
>>[snip]
>>
>>>So what exactly is the difference between a number that might reach
>>>30,000 or a number that might reach 40,000? And how exactly is that
>>
>>     Different machine instructions.

>Isn't an RDBMS all about abstracting away from implementation details?

     And here I thought it was about storing/retrieving checked data.

     Whether one uses, for example, a string or an integer for a
particular column is an implementation detail.

>>>difference different from the difference between a 4096 character string
>>>and 4097 character string?
>>
>>     This is just iterating a little more on the longer string.
>
>Toppling the size needed just over the maximum amount reserved for it in
>specific internal structures. Or, put more simply, it takes other
>machine instructions to handle.

     "more" not "other".

Sincerely,

Gene Wirchenko
Author
23 Aug 2005 5:30 PM
Mike Labosh
>>Toppling the size needed just over the maximum amount reserved for it in
>>specific internal structures. Or, put more simply, it takes other
>>machine instructions to handle.
>
>     "more" not "other".

SHEESH!

I meant to pass on some amusement last week.  I cannot believe you people
are still arguing about the miniscule details.

--
Peace & happy computing,

Mike Labosh, MCSD
"Musha ring dum a doo dum a da!" -- James Hetfield
Author
23 Aug 2005 7:04 PM
Raymond D'Anjou
"Mike Labosh" <mlab***@hotmail.com> wrote in message
news:%2387KehAqFHA.1136@TK2MSFTNGP12.phx.gbl...
> SHEESH!
>
> I meant to pass on some amusement last week.  I cannot believe you people
> are still arguing about the miniscule details.

SQL experts... give them a bone and they'll chew on it for days.
Author
23 Aug 2005 9:02 PM
Gene Wirchenko
On Tue, 23 Aug 2005 13:30:31 -0400, "Mike Labosh"
<mlab***@hotmail.com> wrote:

>>>Toppling the size needed just ove