Home All Groups Group Topic Archive Search About
Author
4 May 2005 7:07 PM
js
Hi, I define a field as auto number field, usually how people deal with if
data is growing near 2147483647? Thanks.

Author
4 May 2005 8:15 PM
--CELKO--
1) Change the data type to something bigger.
2) Put some stuff in an archives.
3) Re-use previous numbers.  And try not toscrew up your data
warehouse.
4) If this is declared as a key, then re-think your whole design and
come back with a relational schema that does not use sequential
numbering for exposed physical locators.
Author
4 May 2005 8:30 PM
js
Thanks CELKO,
if I declare the key as GUID string, compare to identity number, the query
will a little bit slow?

Show quote
"--CELKO--" <jcelko***@earthlink.net> wrote in message
news:1115237750.148780.316040@f14g2000cwb.googlegroups.com...
> 1) Change the data type to something bigger.
> 2) Put some stuff in an archives.
> 3) Re-use previous numbers.  And try not toscrew up your data
> warehouse.
> 4) If this is declared as a key, then re-think your whole design and
> come back with a relational schema that does not use sequential
> numbering for exposed physical locators.
>
Author
4 May 2005 9:11 PM
--CELKO--
>> if I declare the key as GUID string, compare to identity number, the
query will a little bit slow? <<

Again, I think you are missing the basics here.  Ask yourself what is
the logical meaning of a GUID in the data model?  Where does it occur
in the reality you are trying to capture?  The answer is that it lives
only in the hardware that stores the data.  By definition, it cannot be
a key.  Number (4) was the irght answer; you do not have a relational
design and you are fishing for kludges to patch it up.
Author
4 May 2005 9:30 PM
Thomas Coleman
> in the reality you are trying to capture?  The answer is that it [a guid]
> lives
> only in the hardware that stores the data.

I don't buy this statement. Clearly Guids can be generated on machines other
than the database server on which they are stored. So what exactly do you mean
by it "lives" only in the hardware that stores the data?


Thomas
Author
5 May 2005 10:31 AM
--CELKO--
>> Clearly Guids can be generated on machines other than the database
server on which they are stored. So what exactly do you mean by it
"lives" only in the hardware that stores the data? <<

Go to the reality of the model.  Try to verify it as an identifier.
That is, given an entity, can anyone tell you the GUID by looking at
the entity?  I can read the VIN off of a car, the UPC code from a
retail package, etc. but the GUID is not an attribute of the entity.
The best I can do is validate its format; I used to be able to get some
hardware location data, but that is gone now.  By definition, a key is
a subset of attributes, so these exposed phsyical locators cannot be
keys in an RDBMS. 
..
Author
5 May 2005 1:53 PM
Thomas Coleman
>>> Clearly Guids can be generated on machines other than the database
> server on which they are stored. So what exactly do you mean by it
> "lives" only in the hardware that stores the data? <<
>
> Go to the reality of the model.  Try to verify it as an identifier.
> That is, given an entity, can anyone tell you the GUID by looking at
> the entity?  I can read the VIN off of a car,

If the Guid is stamped on the car you can do the same.

>  UPC code from a
> retail package, etc.

If the Guid is stamped on the package or encoded as a bar code you can do the
same.

> but the GUID is not an attribute of the entity.

It is if the database defines it to be. Just as the system defines a UPC or a
VIN to be part of the entity's attributes.

> The best I can do is validate its format; I used to be able to get some
> hardware location data, but that is gone now.

How the number was generated is irrelevant. You are assuming that a key must
contain cosmic meaning about the entity. Remember, for that to work, the user
must **understand** how to decypher the cosmic meaning from the key for it to be
useful. You are further assuming that the Guid is designed to have cosmic
meaning and it is not. It is simply a series of alpha numeric characters than
uniquely identifies the entity.

> By definition, a key is
> a subset of attributes, so these exposed phsyical locators cannot be
> keys in an RDBMS.

A key *can* be a subset of attributes *or* simply a means to uniquely identify
an entity (meaning it can be both). In the scenarios you mentioned above, the
"key" was simply a bunch of gobbledegook that was stamped on something physical
that a user could write down, go to the computer system and look up. Whether the
user was able to discern any cosmic meaning from gobbledegook is secondary to
its purpose.


Thomas
Author
17 May 2005 4:26 PM
--CELKO--
>> If the GUID is stamped on the car you can do the same. ...If the
GUID is stamped on the package or encoded as a bar code you can do the
same. <<

Know any company that is doing that?  What industry standards group
enforces these GUIDs to allow me to place orders and interface with
someone else in the retail industry? What information does the GUID
give me about the entity?  The manufacturer? product line?  Anything?

>> How the number was generated is irrelevant. <<

I want validation and verfication, there the generation rules are VERY
irrelevant.

>>You are assuming that a key must contain cosmic meaning about the
entity. <<

When you think that you can pull an identifier out of the air, totally
unrelated to the entity and by decree in your local village make it an
identifier, then you are dealing with "Cosmic Meaning" and mystical
powers.  Do you also beleive that God put a 17 digit hebrew number on
the bottom of everything in creation and all we have to do is find it?

>>  key *can* be a subset of attributes *or* simply a means to uniquely
identify an entity (meaning it can be both). <<

Gee. I guess poor old  Dr. Codd was wrong.  He said a key HAD TO BE
subset of attributes which uniquely identifies an entity. You keep
confusing exposed physical locators or system surrogates (GUID,
IDENTITY, ROWID, pointer chains, hash tables, bit vecrtors, B-Tree
indexdes, et al) with a relational key.
Author
17 May 2005 6:53 PM
Thomas Coleman
>>> If the GUID is stamped on the car you can do the same. ...If the
> GUID is stamped on the package or encoded as a bar code you can do the
> same. <<
>
> Know any company that is doing that?  What industry standards group
> enforces these GUIDs to allow me to place orders and interface with
> someone else in the retail industry?

Firstly, there is a specification before IEEE for a standard Guid (called a
Uniform Resource Identifier) implementation. I'm not sure of its status.
Secondly, your argument sounds like a feature. If this feature is specifically
requested, then I'd agree with you. But 99% of the PKs in an enterprise system
do not have this requirement. They (users) simply want a unique way of
identifying a given entity. An important aspect to understand is *how* they want
to locate or get information about an entity. Do they want to be able to look a
key value and know something about that entity or is it more likely that they'll
plug the value into a computer system and have the computer system tell them far
more information about the entity than could possibly be learned by a few meager
bits of data buried in a key?

> What information does the GUID
> give me about the entity?  The manufacturer? product line?  Anything?

Again, why does it need to? Outside customers and most internal users won't have
the foggiest idea what cosmic meaning is buried in the key value. What does the
value 1234567 mean to you? There very well may be cosmic meaning buried in that
key that all of three people in the universe might understand and use. However,
the cost of integrating cosmic meaning into a key value is not insignificant.
Further, think of a simpler problem: customers. There is no international
standard way of identifying a customer and you can't use their driver's license
or SSN numbers for privacy reasons. The customers scenario, IMO is more common
than the VIN scenario where you have a massive monolithic organization setting
the key specification in stone.

>>> How the number was generated is irrelevant. <<
>
> I want validation and verfication, there the generation rules are VERY
> irrelevant.

If that feature is specifically desired, then I'd agree. However, where we
clearly disagree is how often this is actually required. My argument is that
this sort of check digit type validation is far less common than you clearly
think.


>>>You are assuming that a key must contain cosmic meaning about the
> entity. <<
>
> When you think that you can pull an identifier out of the air, totally
> unrelated to the entity and by decree in your local village make it an
> identifier, then you are dealing with "Cosmic Meaning" and mystical
> powers.  Do you also beleive that God put a 17 digit hebrew number on
> the bottom of everything in creation and all we have to do is find it?

Actually the ten commandments DO have an identifier. They are numbered! The
number, by itself tells you nothing about the commandment unless you know the
rest of the commandment. If you know nothing of the ten commandments and an
evangelist tells you to obey the sixth commandment, you won't have any idea what
that means without looking it up. Besides, the Hebrews lead us to believe that
the commandments were created in order. Thus, "God" must have used the
equivalent of an identity column. ;->

There are thousands of examples of key values for which no one has any clue as
to the cosmic meaning inherit to them if there is any. As I said, not all keys
need cosmic meaning. Forcing all keys to have cosmic meaning ignores how the key
will actually be used. If users want the ability to look at a key and know
something about the entity, then clearly having meaning in the key is important.
However, if, as I suspect that no key can provide all the information that a
user desires and that instead they will lean heavily towards simply plugging the
key into a computer system to get more information, then creating a key with
meaning is frar more costly in development time than it is worth it.

>>>  key *can* be a subset of attributes *or* simply a means to uniquely
> identify an entity (meaning it can be both). <<
>
> Gee. I guess poor old  Dr. Codd was wrong.  He said a key HAD TO BE
> subset of attributes which uniquely identifies an entity. You keep
> confusing exposed physical locators or system surrogates (GUID,
> IDENTITY, ROWID, pointer chains, hash tables, bit vecrtors, B-Tree
> indexdes, et al) with a relational key.

Firstly, Codd is not infallible. Secondly, at the time Codd made his assumptions
computer systems were not nearly sophisticated enough nor fast enough to keep up
with users. Thus, having meaningful keys improved overall performance because
users could, at times, get information from a key and avoid having to query the
system itself. However, as systems have gotten far more sophisticated and far
more powerful, this is less true. Most keys cannot possible contain enough
information to be worthwhile. It is faster today to simply plug the value into a
computer system.

I'm not suggesting *all* keys should be surrogate or "exposed locators" as you
call them. However, I am suggesting that not *all* keys need be natural.


Thomas
Author
17 May 2005 8:09 PM
--CELKO--
>> However, if, as I suspect that no key can provide all the
information that a user desires and that instead they will lean heavily
towards simply plugging the key into a computer system to get more
information, then creating a key with meaning is frar more costly in
development time than it is worth <<

A key is supposed to identify an entity, not provide ALL the data about
it.  I do not expect my VIN number to include my gas mileage.  What a
user wants in a good key:

Uniqueness:

The first property is that key be unique.  This is the most basic
property it can have since without uniqueness it cannot be a key by
definition.  Uniqueness is necessary, but not sufficient.

Uniqueness has a context.  An identifier can be unique in the local
database, in the enterprise across databases or unique universally.  We
would prefer the last of those three options.

We can often get universal uniqueness with industry standard codes such
the VIN numbers.  We can get enterprise uniqueness with things like
telephone extensions and email addresses.  An identifier that is unique
only in a single database is workable, but pretty much useless because
it will lack the other desired properties.

Stability:

The second property we want is stability or invariance.  The first kind
of stability is within the schema itself and this applies to both key
and non-key columns.  The same data element should have the same
representation wherever it appears in the schema.  It should not be
CHAR(n) in one place and INTEGER in another.  The same basic set of
constraints should apply to it.  That is, if we use the VIN number as
an identifier, then we can constraint it to be only for vehicles form
Ford Motors; we cannot change the format of the VIN in one table and
not in all others.

The next kind of stability is over time.  You do not want keys changing
frequently or in unpredictable ways.  Contrary to a popular myth, this
does not mean that keys cannot ever change.  As the scope of their
context grows, they should be able to change.

On 2005 January 01, the United States added one more digit to the UPC
bar codes used in the retail industry.  The reason was globalization
and erosion of American industrial domination.  The global bar code
standard will be the European Article Number (EAN) Code.  The American
Universal Product Code (UPC) turned 30 years old in 2004 and was never
so universal after all.

The EAN was set up in 1977 and uses 13 digits, while the UPC has 12
digits of which you see 10 broken into two groups of five digits on a
label.  The Uniform Code Council, which sets the standards in North
America, has the details for the conversion worked out.  The conversion
is stable because of this central, trusted authority.  More than five
billion bar-coded products are scanned every day on Earth.  It has made
data mining in retail possible and saved millions of man-hours.

For the neo-Luddite programmers who think that changing a key is going
to kill you, let me quote John Metzger, chief information officer of
A&P.  The grocery chain had 630 stores in 2004 and the grocery industry
works 1% to 3% profit margins -- the smallest margins of any industry
that is not taking a loss.  A&P has handled the new barcode problem as
part of a modernization of its technology systems.  "It is important,"
Mr. Metzger said, "but it is not a shut-the-company-down kind of
issue."

Familiarity:

It helps if the users know something about the data.  This is not quite
the same as validation, but it is related.  Validation can tell you if
the code is properly formed via some process; familiarity can tell you
if it "feels right" because you know something about the context.
Thus, ICD codes for disease would confuse a patient, but not a medical
record clerk.

Validation:

Can you look at the data value and tell that it is wrong, without using
an external source?  For example, I know that '2004-02-30' is not a
valid date because no such day exists on the Common Era calendar.
Check digits and fixed format codes are one way of obtaining this.

Verifiability:

How do I verify a key?  This also comes in context and in levels of
trust.  When I cash a check at the supermarket, they are willing to
believe that the photo on the driver's license I present is really me,
no matter how ugly it is.  Or rather, they used to believe it was I;
Kroger's grocery store chain is putting an ink-less fingerprinting
system in place, just like many banks have done.

When I get a passport, I need a birth certificate and fingerprinting.
There is a little less trust here.  When I get a security clearance, I
also need to be investigated.  There is a lot less trust.

A key without a verification method has no data integrity and will lead
the accumulation of bad data.

Simplicity:

A key should be as simple as possible, but no simpler.  People, reports
and other systems will use the keys.  One man's simple is another man's
complex.  For an example of a horribly complex code that is in common
international usage, look up the IBAN (International Standard Bank
Number).  But it is minimal in context.

>> Firstly, Codd is not infallible.<<

Heresy! :)

>>  Secondly, at the time Codd made his assumptions computer systems
were not nearly sophisticated enough nor fast enough to keep up with
users. Thus, having meaningful keys improved overall performance
because users could, at times, get information from a key and avoid
having to query the system itself. <<

Dead wrong.  In the early days, the file systems and early DBMS used
pointers, row numbers, record numbering (identity)  and other hardware
level things to locate data because performance and storage was a huge
issue.  A longer meaningful key between two data sets would have slowed
down the systems.  Get out a book on IDMS, IMS, Total, etc. and look at
their enternals.

>> Most keys cannot possible contain enough information to be
worthwhile. <<

I'll remember that when I register my next car with a VIN, buy a  book
with an ISBN or make a bar-coded retail purchase.  All of those
transactions completed without a GUID, IDENTITY or Mystical Hebrew
Number.
Author
17 May 2005 9:48 PM
Thomas Coleman
<snip>

> A key is supposed to identify an entity, not provide ALL the data about
> it.  I do not expect my VIN number to include my gas mileage.  What a
> user wants in a good key:

So, where do you draw the line. Suppose 50% of the users want the color encoded
in a PK and 50% of the users want the name and address encoded in the PK. Trying
make this type of prediction is tantamount to voodoo. Far simpler to have a
number that they use to look up the value in their computer system.

> Uniqueness:

Obivously
<snip>

Show quote
> Stability:
>
> The second property we want is stability or invariance.  The first kind
> of stability is within the schema itself and this applies to both key
> and non-key columns.  The same data element should have the same
> representation wherever it appears in the schema.  It should not be
> CHAR(n) in one place and INTEGER in another.  The same basic set of
> constraints should apply to it.  That is, if we use the VIN number as
> an identifier, then we can constraint it to be only for vehicles form
> Ford Motors; we cannot change the format of the VIN in one table and
> not in all others.
>
> The next kind of stability is over time.  You do not want keys changing
> frequently or in unpredictable ways.  Contrary to a popular myth, this
> does not mean that keys cannot ever change.  As the scope of their
> context grows, they should be able to change.
>
> On 2005 January 01, the United States added one more digit to the UPC
> bar codes used in the retail industry.  The reason was globalization
> and erosion of American industrial domination.  The global bar code
> standard will be the European Article Number (EAN) Code.  The American
> Universal Product Code (UPC) turned 30 years old in 2004 and was never
> so universal after all.
>
> The EAN was set up in 1977 and uses 13 digits, while the UPC has 12
> digits of which you see 10 broken into two groups of five digits on a
> label.  The Uniform Code Council, which sets the standards in North
> America, has the details for the conversion worked out.  The conversion
> is stable because of this central, trusted authority.  More than five
> billion bar-coded products are scanned every day on Earth.  It has made
> data mining in retail possible and saved millions of man-hours.
>
> For the neo-Luddite programmers who think that changing a key is going
> to kill you, let me quote John Metzger, chief information officer of
> A&P.  The grocery chain had 630 stores in 2004 and the grocery industry
> works 1% to 3% profit margins -- the smallest margins of any industry
> that is not taking a loss.  A&P has handled the new barcode problem as
> part of a modernization of its technology systems.  "It is important,"
> Mr. Metzger said, "but it is not a shut-the-company-down kind of
> issue."
>

This is where you and vehementally disagree. Having a key with cosmic data
embedded in the value is far from stable. If all keys are strings the change is
lessened but still painful. Far more painful than it should. Further, it does
not account for the attribute change scenario. Suppose we have a key value of
SMITH01 and the person's name changes. We are left with three possibilites: 1.
Leave the key and accept it is wrong. 2. Change the key and everywhere in the
system and deal with the fact that printed versions of the key will no longer be
traceable. 3. Deprecate the key and create a new and have to deal with lost
histories. All three solutions are painful.

Here's something else to consider, you keep talking about keys where the schema
is set down by a standards board. In most database systems I have seen over the
past ten years or so,  a majority of entities cannot be tied to this type of
key. Customers are good example. You have no standards body that designates a
key that you can use. You can't use driver's licenses and you can't use SSN for
security and privacy reasons. There many other examples.

Yet another problem with this concept is that the information must be readily
available at time of entry. For example, let's suppose you have a table of
Companies and let's also suppose that companies are being entered hundreds every
day. The cost for users to lookup a Dun and Bradstreet number is too costly and
may not even be known nor relevant to the users.

Thus, related to stability is that all of the constituent parts have to be
knowable at time of entry.

> Familiarity:
>
> It helps if the users know something about the data.  This is not quite
> the same as validation, but it is related.  Validation can tell you if
> the code is properly formed via some process; familiarity can tell you
> if it "feels right" because you know something about the context.
> Thus, ICD codes for disease would confuse a patient, but not a medical
> record clerk.

How many people, I wonder that work at UPS, fully know *and use* the schema by
which tracking numbers are assembled? How many users at Card Services know how
transaction numbers are assembled? I'll bet the number in both cases is
extraordinarily small. People do not look at key values in a vacuum. They have
the context with which the key is associated (e.g. a *customer's* number, a
*postal code* value etc.)  That is enough to know what and where to look up
additional information.

Your ICD codes is another example of a key designated by a large monolithic
body. You can depend on a certain degree of stability with this type of key.
However, there are many other types of keys in the universe that are not
designated by a standards body.

> Validation:
>
> Can you look at the data value and tell that it is wrong, without using
> an external source?  For example, I know that '2004-02-30' is not a
> valid date because no such day exists on the Common Era calendar.
> Check digits and fixed format codes are one way of obtaining this.

Suppose I showed you this number 20050511. Is that a date? Maybe. Maybe not.
Without *context* the key *in and of  itself* is useless. Furthermore, this is
feature that can be added later. If specifically requested up front, then I can
understand integrating this into your key schema. However most keys do not
require this type of validation.


Show quote
> Verifiability:
>
> How do I verify a key?  This also comes in context and in levels of
> trust.  When I cash a check at the supermarket, they are willing to
> believe that the photo on the driver's license I present is really me,
> no matter how ugly it is.  Or rather, they used to believe it was I;
> Kroger's grocery store chain is putting an ink-less fingerprinting
> system in place, just like many banks have done.
>
> When I get a passport, I need a birth certificate and fingerprinting.
> There is a little less trust here.  When I get a security clearance, I
> also need to be investigated.  There is a lot less trust.
>
> A key without a verification method has no data integrity and will lead
> the accumulation of bad data.

In what context will this sort of verifiability be used? There are other ways to
"verify" the data than to integrate a bunch of meaning in the key. How do you
verify that Customer 12345 is Bob Jones? You would look up customer 12345 and if
the result was Bob Jones, you would know that someone gave you an invalid key.

Not every key requires DOD levels of trust. Some keys need this type of
verification. I would argue that most keys do not.


> Simplicity:
<snip>

What could be simpler than a number? ;->

>>>  Secondly, at the time Codd made his assumptions computer systems
> were not nearly sophisticated enough nor fast enough to keep up with
> users. Thus, having meaningful keys improved overall performance
> because users could, at times, get information from a key and avoid
> having to query the system itself. <<
>
> Dead wrong.  In the early days, the file systems and early DBMS used
> pointers, row numbers, record numbering (identity)  and other hardware
> level things to locate data because performance and storage was a huge
> issue.  A longer meaningful key between two data sets would have slowed
> down the systems.  Get out a book on IDMS, IMS, Total, etc. and look at
> their enternals.

Did Codd envision that users would be innundated with thousands of keys (with
context)? Did he envision that most users have no desire to learn 80 bazillion
key schemas? Did he envision the ease with which users would enter these keys
using copy-paste and bar codes? Did he envision millions of database systems
with millions more tables across the globe that track abstract concepts for
which there is no standards body and no use outside of the enterprise? Doubtful.
Codd was using mathematical theory to solve specific design problems given the
context of a paper-ful world.

I agree longer keys are less desirable than shorter keys. However, there are
many examples of long, "natural" keys (e.g. part numbers, credit card numbers).

>>> Most keys cannot possible contain enough information to be
> worthwhile. <<
>
> I'll remember that when I register my next car with a VIN, buy a  book
> with an ISBN or make a bar-coded retail purchase.  All of those
> transactions completed without a GUID, IDENTITY or Mystical Hebrew
> Number.

Without looking it up, pull out a VIN and tell me what the 10th digit
represents. I would bet that 90% or more of the population have no idea.

Besides, once again the VIN is a key whose schema is set by standards body. What
about the thousands of other abstract entities where there is no such key or
such a key is not available at time of entry?

By your recokening, you might as well encode all entity information into the
key. After all, someone may need it one day.



Thomas
Author
18 May 2005 4:26 PM
Alex Papadimoulis
I have to point something out here ...

> This is where you and I vehementally disagree.

This is not a YOU vs CELKO. This is a YOU vs {every standards organization
from MISMO to ACORD}. It's like arguing that, despite all the troubles
everyone has faced using it, asbestos is, in fact, a safe product to use just
because YOU have never had a problem with it. And since the Asbestos studies
are a few years old, things have changed since then (better resperators,
etc), so there's really no need to go with anything else.



Author
18 May 2005 5:09 PM
Thomas Coleman
>> This is where you and I vehementally disagree.
> This is not a YOU vs CELKO.

Correct

> This is a YOU vs {every standards organization
> from MISMO to ACORD}.

Incorrect.

Looking beyond your strawman abestos argument, let's make sure we understand the
context of my statement. I made this comment in reference to the stability of
keys that have meaning buried within them and are NOT defined by a standards
board. I would argue that any key made of attribute values is by definition not
stable.

This is not me against Celko nor any standards organization. Clearly, if a
standards body declares a schema for a key it has a degree of stability.
However, as I said, this is NOT the case for the vast majority of keys that are
required for systems. They do not have the luxury of a standards body to mandate
the make up of a key. I have personally worked on projects where I had to
correct problems created by the original designer who was as arrogant as you and
Mr. Celko in regards to the "stability" of their magic key The original designer
could not get their head around the possibility that certain information
required by the key would not be available at time of entry or that the
attributes that make up the key might change or that the attributes that make up
the key would not be unique in some circumstances or that a mistake in data
entry might occur.

This is not the theory of gravity here. This is simple observation. If you have
a key made of attribute values and one or more of those attribute values change
you have three choices:

1. Leave it and accept that the data is wrong.  (e.g. the customer SMITH01
changes their last name from Smith to Jones. Users now have to accept that some
keys are accurate and some keys "look" accurate but are not and this breaks the
usefulness of having magic meaning as users will always need to lookup the
values to find the accurate version of the information).

2. Change the key value everywhere in the system. This means that any permanent
stamp of the original key value will have no equivalent value in your system.
(e.g. When you look up SMITH01 from an invoice which has now changed to JONES01,
you won't find anything)

3. Deprecate the key and issue a new one. This is what banks do with account
numbers. However, now you have to deal with the fact that the histories are
misplaced. You have to either combine the histories or tie the histories of the
two instances together. Again, this is similar to how banks handle the
situation. This is much more effort and now imagine that this is a possibility
on any entity in your design and you come to realize what a cost in development
time and testing this kind of design creates.

As I said, but that you failed to grasp, keys that have cosmic meaning buried
within them do have a purpose. However, not every key in every system (nor even
a majority IMO) that is not mimicing something dictated by a standards board
requires this type key and in fact would be a mistake to do so and the cost of
getting the key makeup wrong can be significant.

Further, even if there is a standards body definition for a key in question, it
does not always make sense to use it (e.g. the use of SSN outside of a
government organization)


Thomas
Author
18 May 2005 6:28 PM
Alex Papadimoulis
> > This is a YOU vs {every standards organization
> > from MISMO to ACORD}.
>
> I made this comment in reference to the stability of
> keys that have meaning buried within them and are NOT defined by a standards
> board. I would argue that any key made of attribute values is by definition not
> stable.

Okay ... but, in other posts you have specifically argued against ISBN, UPC,
VIN, PH#, etc. as being no longer necessary in today's world because things
have changed so much since their inception.


> However, as I said, this is NOT the case for the vast majority of keys that are
> required for systems. They do not have the luxury of a standards body to mandate
> the make up of a key.

Don't be so sure about that; industries standardize things of important to
them. Take a look at (man-made) things around you, do some research, and
you'll find that there is a standards body or classification for just about
every item you find.

It's ironic that, with how incredibly easy it is to do reasearch these days,
there are so many system designers who don't even know such standards exist,
let alone spend a few hours researching them. They just dive in and make a
"tblCar" with "intCarId" (or "guidCarId").


> (e.g. the customer SMITH01 changes their last name from Smith to Jones

Good designers would never design such a key; that's as stupid as using a
GUID for an account number. Unchangable verifiable physical attributes are
great key candidates but rarely can one be found on a customer, unless you're
lucky enough to serve the physical location (elec. company). That's where
system-generated verifiable (i.e., with checksums, etc) keys come into play.
But since you also disagree with the rest of the world on verifiability,
there's no point in going over that again.

Author
18 May 2005 7:00 PM
Thomas Coleman
<snip>
>> I made this comment in reference to the stability of
>> keys that have meaning buried within them and are NOT defined by a standards
>> board. I would argue that any key made of attribute values is by definition
>> not
>> stable.
>
> Okay ... but, in other posts you have specifically argued against ISBN, UPC,
> VIN, PH#, etc. as being no longer necessary in today's world because things
> have changed so much since their inception.

That's an exaggeration. I have argued that they are not necessary for every
situation. Clearly, for the library of Congress, ISBN numbers were important but
for customer numbers and some types of account numbers they are not.

>> However, as I said, this is NOT the case for the vast majority of keys that
>> are
>> required for systems. They do not have the luxury of a standards body to
>> mandate
>> the make up of a key.
>
> Don't be so sure about that; industries standardize things of important to
> them. Take a look at (man-made) things around you, do some research, and
> you'll find that there is a standards body or classification for just about
> every item you find.

Off the top of my head I can think of two: Films and Customers. I can certainly
come up with scheme for both of those. However, there is no standards body that
designates an encoding for movies for example. Movies are "uniquely" identified
in the loosest way by the title, year of production, producer, director etc. It
is very possible (even probable) for a movie title to change just before
release. Further, movie titles differ from one country to the next even though
it is the same movie. Each labs has its own encoding scheme for films. In the
case of customers, we have the problem of dealing with timliness of information.
The customer may not have a driver's license nor government identification at
time of entry.

I fully agree that there are many standardized schemes for many keys. However, I
have found that because of abstraction of the real world as well as timeliness
of the information (which affects completeness) that standardized keys cannot
always be used.

If there is a standards body designating a given key *and* it is available at
time of entry, then I agree that it should be used. World currenncies are a good
example. However, it's that last part that makes the use of standardized keys,
at times, untenable.

<snip>
>> (e.g. the customer SMITH01 changes their last name from Smith to Jones
>
> Good designers would never design such a key; that's as stupid as using a
> GUID for an account number. Unchangable verifiable physical attributes are
> great key candidates but rarely can one be found on a customer, unless you're
> lucky enough to serve the physical location (elec. company). That's where
> system-generated verifiable (i.e., with checksums, etc) keys come into play.
> But since you also disagree with the rest of the world on verifiability,
> there's no point in going over that again.

Alex, I'd agree that SMITH 01 is a bad key. Yet, the Northwind database (granted
not an ideal example of db design) uses a related scheme based on customer
information that would produce the same problems although perhaps not in the
same degree.

Actually, my stance on verifiability is in line with most development shops:
don't add features that were not specifically requested. That means evaluating
on an entity by entity basis whether a given key schema requires the need for
visual verification of some sort and if so what type. Not every project needs to
meet DOD standards. Not every key needs this level of sophistication.



Thomas
Author
17 May 2005 8:30 PM
--CELKO--
>> They (users) simply want a unique way of identifying a given entity.
An important aspect to understand is *how* they want
to locate or get information about an entity. Do they want to be able
to look a key value and know something about that entity or is it more
likely that they'll plug the value into a computer system and have the
computer system tell them far more information about the entity than
could possibly be learned by a few meager bits of data buried in a key?
<<

How many people do you know that like to write out a full GUID in a
query?  How many of them find it fart easier than a familar encoding,
like a phone number?  A little checklist of desirable properties for a
key is good way to do a design inspection.

Uniqueness:

The first property is that key be unique.  This is the most basic
property it can have since without uniqueness it cannot be a key by
definition.  Uniqueness is necessary, but not sufficient.

Uniqueness has a context.  An identifier can be unique in the local
database, in the enterprise across databases or unique universally.  We
would prefer the last of those three options.

We can often get universal uniqueness with industry standard codes such
the VIN numbers.  We can get enterprise uniqueness with things like
telephone extensions and email addresses.  An identifier that is unique
only in a single database is workable, but pretty much useless because
it will lack the other desired properties.

Stability:

The second property we want is stability or invariance.  The first kind
of stability is within the schema itself and this applies to both key
and non-key columns.  The same data element should have the same
representation wherever it appears in the schema.  It should not be
CHAR(n) in one place and INTEGER in another.  The same basic set of
constraints should apply to it.  That is, if we use the VIN number as
an identifier, then we can constraint it to be only for vehicles form
Ford Motors; we cannot change the format of the VIN in one table and
not in all others.

The next kind of stability is over time.  You do not want keys changing
frequently or in unpredictable ways.  Contrary to a popular myth, this
does not mean that keys cannot ever change.  As the scope of their
context grows, they should be able to change in a uniform manner for
everyone at the same time.

On 2005 January 01, the United States added one more digit to the UPC
bar codes used in the retail industry.  The reason was globalization
and erosion of American industrial domination.  The global bar code
standard will be the European Article Number (EAN) Code.  The American
Universal Product Code (UPC) turned 30 years old in 2004 and was never
so universal after all.

For the neo-Luddite programmers who think that changing a key is going
to kill you, let me quote John Metzger, chief information officer of
A&P.  The grocery chain had 630 stores in 2004 and the grocery industry
works 1% to 3% profit margins -- the smallest margins of any industry
that is not taking a loss.  A&P has handled the new barcode problem as
part of a modernization of its technology systems.  "It is important,"
Mr. Metzger said, "but it is not a shut-the-company-down kind of
issue."

Familiarity:

It helps if the users know something about the data.  This is not quite
the same as validation, but it is related.  Validation can tell you if
the code is properly formed via some process; familiarity can tell you
if it "feels right" because you know something about the context.
Thus, ICD codes for disease would confuse a patient, but not a medical
record clerk.

Validation:

Can you look at the data value and tell that it is wrong, without using
an external source?  For example, I know that '2004-02-30' is not a
valid date because no such day exists on the Common Era calendar.
Check digits and fixed format codes are one way of obtaining this.

Verifiability:

How do I verify a key?  This also comes in context and in levels of
trust.  When I cash a check at the supermarket, they are willing to
believe that the photo on the driver's license I present is really me,
no matter how ugly it is.  Or rather, they used to believe it was I;
Kroger's grocery store chain is putting an ink-less fingerprinting
system in place, just like many banks have done.

When I get a passport, I need a birth certificate and fingerprinting.
There is a little less trust here.  When I get a security clearance, I
also need to be investigated.  There is a lot less trust.

A key without a verification method has no data integrity and will lead
the accumulation of bad data.

Simplicity:

A key should be as simple as possible, but no simpler.  People, reports
and other systems will use the keys.  Long, complex keys are more
subject to error.  One man's simple is another man's complex.  For an
example of a horribly complex code that is in common international
usage, look up the IBAN (International Standard Bank Number).

>> at the time Codd made his assumptions computer systems were not
nearly sophisticated enough nor fast enough to keep up with users.
Thus, having meaningful keys improved overall performance because users
could, at times, get information from a key and avoid having to query
the system itself. However, as systems have gotten far more
sophisticated and far more powerful, this is less true. <<

That is just dead wrong. RDBMS was considered impractical in Codd's
day!  Read the history.  One of the complaints was that the longer
relational keys could not keep up with the pointer chains and indexing
used at the time.

Look at the internals of IDMS, IMS, Total, et al in the 1970's.  The
file systems and early DBMS systems used record numbering, pointer
chains, indexes, hash tables and all kinds of physical locators instead
of meaningful keys.  All the linkages for traversals are done by the
machine, not in the data model.  Increased word size, processor speeds
and paralellism are what made RDBMS possible.

>> Most keys cannot possible contain enough information to be
worthwhile. It is faster today to simply plug the value into a computer
system. <<

I'll remember that when I use my VIN to register my car, an ISBN to buy
a book and try to complete a retail transaction with a standard bar
code invovled.  Somehow, even without being able to compute my gas
mileage from my VIN, I am able to conduct business.
Author
17 May 2005 9:54 PM
Thomas Coleman
<snip>
> How many people do you know that like to write out a full GUID in a
> query?  How many of them find it fart easier than a familar encoding,
> like a phone number?  A little checklist of desirable properties for a
> key is good way to do a design inspection.

They don't know they're typing in a guid. They're just typing in a bunch
characters. I've seen part numbers and credit card transaction numbers that
rival guids in length.

Phone numbers present a good example. There was obviously a method behind the
madness of the original phone number designations (I understand the particulars
so no need to elaborate). But now that meaning is lost because of cell phones. I
lived for three years in the Bay Area in CA with a Southern CA area code on my
cell phone number. That "meaning" is now lost for the better IMO. What people
really want from the number is not cosmic meaning, but stability. Most people
would prefer a number that never changes over one that tells them where their
neighbors are located. In fact, I would prefer that my number provide *no*
indication of location but that is obviously is not going to happen soon.

<snip> (Answered in a previous post)



Thomas
Author
18 May 2005 3:29 AM
trillionaire
GUID is too long. Most of the time no problem generating unique numbers.
Morever, just spend some a few time to re-generate another new GUID if
another GUID already exists in the database, although these might/may not
happen.

asteroid collider



Show quote
"Thomas Coleman" wrote:

> <snip>
> > How many people do you know that like to write out a full GUID in a
> > query?  How many of them find it fart easier than a familar encoding,
> > like a phone number?  A little checklist of desirable properties for a
> > key is good way to do a design inspection.
>
> They don't know they're typing in a guid. They're just typing in a bunch
> characters. I've seen part numbers and credit card transaction numbers that
> rival guids in length.
>
> Phone numbers present a good example. There was obviously a method behind the
> madness of the original phone number designations (I understand the particulars
> so no need to elaborate). But now that meaning is lost because of cell phones. I
> lived for three years in the Bay Area in CA with a Southern CA area code on my
> cell phone number. That "meaning" is now lost for the better IMO. What people
> really want from the number is not cosmic meaning, but stability. Most people
> would prefer a number that never changes over one that tells them where their
> neighbors are located. In fact, I would prefer that my number provide *no*
> indication of location but that is obviously is not going to happen soon.
>
> <snip> (Answered in a previous post)
>
>
>
> Thomas
>
>
>
Author
17 May 2005 4:54 AM
trillionaire
I use SQL Server "decimal" data type of "precision 14".
It can reach max, "99,999,999,999,999" thats 99 trillion and
can fit into a (precision 15) "double" data type on a C++ client application.


Show quote
"Thomas Coleman" wrote:

> >>> Clearly Guids can be generated on machines other than the database
> > server on which they are stored. So what exactly do you mean by it
> > "lives" only in the hardware that stores the data? <<
> >
> > Go to the reality of the model.  Try to verify it as an identifier.
> > That is, given an entity, can anyone tell you the GUID by looking at
> > the entity?  I can read the VIN off of a car,
>
> If the Guid is stamped on the car you can do the same.
>
> >  UPC code from a
> > retail package, etc.
>
> If the Guid is stamped on the package or encoded as a bar code you can do the
> same.
>
> > but the GUID is not an attribute of the entity.
>
> It is if the database defines it to be. Just as the system defines a UPC or a
> VIN to be part of the entity's attributes.
>
> > The best I can do is validate its format; I used to be able to get some
> > hardware location data, but that is gone now.
>
> How the number was generated is irrelevant. You are assuming that a key must
> contain cosmic meaning about the entity. Remember, for that to work, the user
> must **understand** how to decypher the cosmic meaning from the key for it to be
> useful. You are further assuming that the Guid is designed to have cosmic
> meaning and it is not. It is simply a series of alpha numeric characters than
> uniquely identifies the entity.
>
> > By definition, a key is
> > a subset of attributes, so these exposed phsyical locators cannot be
> > keys in an RDBMS.
>
> A key *can* be a subset of attributes *or* simply a means to uniquely identify
> an entity (meaning it can be both). In the scenarios you mentioned above, the
> "key" was simply a bunch of gobbledegook that was stamped on something physical
> that a user could write down, go to the computer system and look up. Whether the
> user was able to discern any cosmic meaning from gobbledegook is secondary to
> its purpose.
>
>
> Thomas
>
>
>
Author
17 May 2005 5:57 AM
trillionaire
I was thinking... and have just found a flaw. If each number in these 99
trillion numbers takes up only 1 byte of storage, I will need about 233
harddisks each capacity of 400 GB and a new shelf.

trillionaire


Show quote
"trillionaire" wrote:

> I use SQL Server "decimal" data type of "precision 14".
> It can reach max, "99,999,999,999,999" thats 99 trillion and
> can fit into a (precision 15) "double" data type on a C++ client application.
>
>
> "Thomas Coleman" wrote:
>
> > >>> Clearly Guids can be generated on machines other than the database
> > > server on which they are stored. So what exactly do you mean by it
> > > "lives" only in the hardware that stores the data? <<
> > >
> > > Go to the reality of the model.  Try to verify it as an identifier.
> > > That is, given an entity, can anyone tell you the GUID by looking at
> > > the entity?  I can read the VIN off of a car,
> >
> > If the Guid is stamped on the car you can do the same.
> >
> > >  UPC code from a
> > > retail package, etc.
> >
> > If the Guid is stamped on the package or encoded as a bar code you can do the
> > same.
> >
> > > but the GUID is not an attribute of the entity.
> >
> > It is if the database defines it to be. Just as the system defines a UPC or a
> > VIN to be part of the entity's attributes.
> >
> > > The best I can do is validate its format; I used to be able to get some
> > > hardware location data, but that is gone now.
> >
> > How the number was generated is irrelevant. You are assuming that a key must
> > contain cosmic meaning about the entity. Remember, for that to work, the user
> > must **understand** how to decypher the cosmic meaning from the key for it to be
> > useful. You are further assuming that the Guid is designed to have cosmic
> > meaning and it is not. It is simply a series of alpha numeric characters than
> > uniquely identifies the entity.
> >
> > > By definition, a key is
> > > a subset of attributes, so these exposed phsyical locators cannot be
> > > keys in an RDBMS.
> >
> > A key *can* be a subset of attributes *or* simply a means to uniquely identify
> > an entity (meaning it can be both). In the scenarios you mentioned above, the
> > "key" was simply a bunch of gobbledegook that was stamped on something physical
> > that a user could write down, go to the computer system and look up. Whether the
> > user was able to discern any cosmic meaning from gobbledegook is secondary to
> > its purpose.
> >
> >
> > Thomas
> >
> >
> >
Author
17 May 2005 1:42 PM
Thomas Coleman
I'm not sure I understand your. At precision 14, a decimal column will take up 9
bytes per row. If you mean that by using a data type this sufficiently large you
can allow for more rows than can be stored with modern equipment, that's true.
That would be true of a BigInt too. The problem is not storage capacity. The
(only IMO) downside to identity columns is that you have to make a round trip to
generate them. Granted, that can be mitigated through stored procedures.

Capacity is not really Joe's argument against identities and Guids. His
arguments are related to general design issues.



Thomas
Author
17 May 2005 2:06 PM
trillionaire
The other way would be to add another column "Time" of data type datetime.
Primary key is Time Column + Identity Column. When Identity reached maximum,
reset it back to 0. Then there is no need to generate unique id.

zillionaire



Show quote
"Thomas Coleman" wrote:

> I'm not sure I understand your. At precision 14, a decimal column will take up 9
> bytes per row. If you mean that by using a data type this sufficiently large you
> can allow for more rows than can be stored with modern equipment, that's true.
> That would be true of a BigInt too. The problem is not storage capacity. The
> (only IMO) downside to identity columns is that you have to make a round trip to
> generate them. Granted, that can be mitigated through stored procedures.
>
> Capacity is not really Joe's argument against identities and Guids. His
> arguments are related to general design issues.
>
>
>
> Thomas
>
>
>
Author
17 May 2005 3:46 PM
Thomas Coleman
A Guid has a couple of advantages over an identity column. The primary advantage
is that it can be generated outside of the database system. That means you save
a round trip to the database in order to generate a new key value. Another
advantage of their somewhat universally unique nature is that it is easy to make
certain types of database changes. For example, suppose you decide to combine
two entities into a single table. With identity values, you will likely have to
resequence many of the key values which means resequencing their child values.
With guids, this is not necessary. You can combine the values into the table and
not worry about resequencing or have to worry about child relationships. You can
simply recreate the DRI to the new table and be done. Another advantage with
using them as primary keys is that you can designate them as the RowGuid for
replication purposes and save the automatic addition of a Guid by SQL Server's
replication setup process.

I'm not suggesting that Guids do not have their problems. The first is their
size. At 16 bytes, they're on the large side. A second problem is their
readability although as surrogate keys (ones hidden from the user) they are
fine. That said, if a key is going to be 18 to 20 characters or more anyway, I
don't see that you lose much by going with a guid. Another issue is the fact
that "pure" guids are of course somewhat random and unique. That makes it
difficult to index against them. However, there are some solutions to sacrafice
some uniqueness probability for better indexing (google Nillison's COMB
algorithm).


Thomas



Show quote
"trillionaire" <trilliona***@discussions.microsoft.com> wrote in message
news:A65428C1-989D-4925-BE2F-852112258B3E@microsoft.com...
> The other way would be to add another column "Time" of data type datetime.
> Primary key is Time Column + Identity Column. When Identity reached maximum,
> reset it back to 0. Then there is no need to generate unique id.
>
> zillionaire
>
>
>
> "Thomas Coleman" wrote:
>
>> I'm not sure I understand your. At precision 14, a decimal column will take
>> up 9
>> bytes per row. If you mean that by using a data type this sufficiently large
>> you
>> can allow for more rows than can be stored with modern equipment, that's
>> true.
>> That would be true of a BigInt too. The problem is not storage capacity. The
>> (only IMO) downside to identity columns is that you have to make a round trip
>> to
>> generate them. Granted, that can be mitigated through stored procedures.
>>
>> Capacity is not really Joe's argument against identities and Guids. His
>> arguments are related to general design issues.
>>
>>
>>
>> Thomas
>>
>>
>>

AddThis Social Bookmark Button