Home All Groups Group Topic Archive Search About
Author
30 Sep 2005 9:33 PM
Frankie
What is an accurate definition of a "surrogate key" and what would be a
straight-forward example of when it would be a good idea (if ever) to use
one?

I ask because I think "surrogate key" is sometimes misused and/or is not the
correct term for what is being communicated. For example some say that an
IDENTITY property is a surrogate key - but others argue that IDENTITY
doesn't meet the formal or correct definition of "surrogate key".

BOL talks about surrogate keys in a few places but assumes that the reader
knows what one is. Not what I'm looking for.

Also, I googled this and got a bunch of hits: This description is
representative:
<<A surrogate key is frequently a sequential number (e.g. a Sybase "identity
column") but doesn't have to be.>>

That doesn't clarify ANYTHING. All it provides is an example which is
heavily argued against by Celko and others. I'd sure appreciate a clear
definition or description of what a surrogate key IS.

Maybe nobody really knows and they just all use it casually to mean
"IDENTITY" or AutoNumber (in Access).

Thanks!

Author
30 Sep 2005 10:05 PM
Jerry Spivey
Oh you're going to get a slew of responses on this one ;-)

Show quote
"Frankie" <BeansAndTa***@TacoTime.net> wrote in message
news:undxMagxFHA.4080@TK2MSFTNGP10.phx.gbl...
> What is an accurate definition of a "surrogate key" and what would be a
> straight-forward example of when it would be a good idea (if ever) to use
> one?
>
> I ask because I think "surrogate key" is sometimes misused and/or is not
> the correct term for what is being communicated. For example some say that
> an IDENTITY property is a surrogate key - but others argue that IDENTITY
> doesn't meet the formal or correct definition of "surrogate key".
>
> BOL talks about surrogate keys in a few places but assumes that the reader
> knows what one is. Not what I'm looking for.
>
> Also, I googled this and got a bunch of hits: This description is
> representative:
> <<A surrogate key is frequently a sequential number (e.g. a Sybase
> "identity column") but doesn't have to be.>>
>
> That doesn't clarify ANYTHING. All it provides is an example which is
> heavily argued against by Celko and others. I'd sure appreciate a clear
> definition or description of what a surrogate key IS.
>
> Maybe nobody really knows and they just all use it casually to mean
> "IDENTITY" or AutoNumber (in Access).
>
> Thanks!
>
Author
30 Sep 2005 10:40 PM
Frankie
I'm just looking for a more-or-less academic definition; not asking about
the merits.
Author
1 Oct 2005 2:52 AM
Anith Sen
>> I'm just looking for a more-or-less academic definition; not asking about
>> the merits.

The only authoritative answer is the quote by Date in Brian's post. Most of
the other addenda are editorializations based on products and systems one
sees around.

--
Anith
Author
30 Sep 2005 10:41 PM
Hugo Kornelis
On Fri, 30 Sep 2005 14:33:08 -0700, Frankie wrote:

>What is an accurate definition of a "surrogate key"

Hi Frankie,

I don't think you'll find any formal, generally-accepted definition. The
community is still in the holy war phase on this.

My take is this: a surrogate key is a key that acts as a surrogate for
another key. So the basic ingredients that you need before you can add a
surrogate key are:
1. A table with a key (the business key, presumably), and
2. One or more other places where this key is used (foreign key columns
in other tables).

Now, you can introduce a surrogate key. You introduce an extra column to
the table, BUT PRESERVE THE EXISTING KEY!! Make sure that both the
surrogate key and the business key are constrained to be either PRIMARY
KEY or UNIQUE. All other places where the business key is used, you can
now replace it with the surrogate key.
Some other considerations: the surrogate key should be generated by the
database, and never exposed to the end user. It should only be used
internally, as a surrogate for the business key.

So for example: let's say that persons are uniquely identified in your
business by (FirstName, LastName, DOB). Here's how the tables would look
without surrogate key:

CREATE TABLE Persons
          (FirstName varchar(20) NOT NULL,
           LastName varchar(30) NOT NULL,
           DOB datetime NOT NULL,
        -- other columns,
           PRIMARY KEY (FirstName, LastName, DOB),
          )
CREATE TABLE OtherTable
          (OtherKey int NOT NULL,
           FirstName varchar(20) NOT NULL,
           LastName varchar(30) NOT NULL,
           DOB datetime NOT NULL,
        -- other columns,
           PRIMARY KEY (OtherKey)
           FOREIGN KEY (FirstName, LastName, DOB)
             REFERENCES Persons (FirstName, LastName, DOB)
             ON UPDATE CASCADE,
          )

And here's the same design after introducing the surrogate key:

CREATE TABLE Persons
          (FirstName varchar(20) NOT NULL,
           LastName varchar(30) NOT NULL,
           DOB datetime NOT NULL,
           PersonID int NOT NULL IDENTITY,
        -- other columns,
           PRIMARY KEY (PeopleID),
           UNIQUE (FirstName, LastName, DOB),
          )
CREATE TABLE OtherTable
          (OtherKey int NOT NULL,
           PersonID int NOT NULL,
        -- other columns,
           PRIMARY KEY (OtherKey)
           FOREIGN KEY (PersonID) REFERENCES Persons (PersonID),
          )

> and what would be a
>straight-forward example of when it would be a good idea (if ever) to use
>one?

Should be considered oin a case by case basis. In the above table, joins
on just the integer PersonID will be faster than joins on (FirstName,
LastName, DOB). And a row in OtherTable takes less bytes, which means
more bytes per page --> less read/writes while using this table --> more
speed.
On the other hand, the rows in Persons get bigger. And some queries
might need an extra join (imagine a report on OtherTable that includes
FirstName and LastName but no other information from the Persons table;
can be done without join in the surrogate-less table, but requires a
join in the version with surrogate key).

>I ask because I think "surrogate key" is sometimes misused and/or is not the
>correct term for what is being communicated. For example some say that an
>IDENTITY property is a surrogate key - but others argue that IDENTITY
>doesn't meet the formal or correct definition of "surrogate key".

IDENTITY is one possible way to generate values for a surrogate key. The
uniqueidentifier datatype and the NEWID() function is another way. And
you can design and implement your own methods - but there shouldn't be
any need for that, since the actual value of the surrogate key is not
important. Remeber that the surrogate key values should not be exposed!


>BOL talks about surrogate keys in a few places but assumes that the reader
>knows what one is. Not what I'm looking for.
>
>Also, I googled this and got a bunch of hits: This description is
>representative:
><<A surrogate key is frequently a sequential number (e.g. a Sybase "identity
>column") but doesn't have to be.>>
>
>That doesn't clarify ANYTHING. All it provides is an example which is
>heavily argued against by Celko and others. I'd sure appreciate a clear
>definition or description of what a surrogate key IS.

Joe Celko's ideas about surrogate keys are different from mine. The main
difference is that I think that surrogate key values should be invisible
to end users, whereas Joe says they should be invisible to all people
that deal with the DB (including developers, DBA's, etc.). Of course,
Joe's interpretation of surrogate keys means that they can only be used
as a RDBMS internal optimizing trick, and only the techs in Redmont need
to know about them.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
1 Oct 2005 2:58 AM
Frankie
Thanks or all the reasonable responses (Hugo et al) ... very helpful.

-F
Author
30 Sep 2005 10:47 PM
Brian Selzer
The concept of surrogate keys has been around since the mid '70s.  From
Introduction to Database Systems, Seventh Edition, by C.J. Date.
<<
Surrogate keys are keys in the usual relational sense but have the following
specific properties:

-- They always involve exactly one attribute

-- Their values serve solely as surrogates (hence the name) for the entities
they stand for.  In other words, such values serve merely to represent the
fact that the corresponding entities exist--they carry no additional
information whatsoever.

-- When a new entity is inserted into the database, it is given a surrogate
key value that has never been used before and will never be used again, even
if the entity in question is subsequently deleted.

Ideally, surrogate key values would be system-generated, but whether they
are system- or user-generated has nothing to do with the basic idea of
surrogate keys as such
>>

IDENTITY values are not surrogate key values in the strict sense, because a
surrogate key value is unique within the entire database, not just within a
single table; however, IDENTITY values are commonly used as surrogates
because there is usually a declarative referential integrity constraint
associated with a column that contains IDENTITY values which identifies the
table associated with the IDENTITY value.  The combination of the table and
the IDENTITY value makes the meaning of an IDENTITY value unique within the
database, even though that meaning is strictly internal--that is, it doesn't
add any meaning to the entity it represents.

I could give many examples of when you should use them.  In fact, I would
argue that you should always use them, because there are inherent problems
with natural primary keys that place limitations on what you can do with a
database.  These stem from two properties.  (1) there is inherent redundancy
with natural primary keys.  That's why the cascading update kludge exists.
This complicates locking and reduces scalability.  (2) the key value CAN
change.  This makes using disconnected datasets, message queues, and
replication much more difficult to do, and requires additional code in every
application to detect and manage key value changes.

The main benefit of using a surrogate key is that it is immutable.  Within a
database, a surrogate key value cannot be changed once it's been assigned.
Remember, a surrogate key value indicates only that a distinct entity
exists.  A change to that value would indicate that the entity for which it
is a surrogate no longer exists.

One final point.  If a table contains any extrinsic information at all, then
an additional unique constraint must be defined on the table.  In other
words, if any of the columns in a table contains values that originated
outside the database, then there must be an additional candidate key defined
on the table and it must be enforced with a unique constraint--even if that
key includes every column other than the surrogate primary key.

Show quote
"Frankie" <BeansAndTa***@TacoTime.net> wrote in message
news:undxMagxFHA.4080@TK2MSFTNGP10.phx.gbl...
> What is an accurate definition of a "surrogate key" and what would be a
> straight-forward example of when it would be a good idea (if ever) to use
> one?
>
> I ask because I think "surrogate key" is sometimes misused and/or is not
> the correct term for what is being communicated. For example some say that
> an IDENTITY property is a surrogate key - but others argue that IDENTITY
> doesn't meet the formal or correct definition of "surrogate key".
>
> BOL talks about surrogate keys in a few places but assumes that the reader
> knows what one is. Not what I'm looking for.
>
> Also, I googled this and got a bunch of hits: This description is
> representative:
> <<A surrogate key is frequently a sequential number (e.g. a Sybase
> "identity column") but doesn't have to be.>>
>
> That doesn't clarify ANYTHING. All it provides is an example which is
> heavily argued against by Celko and others. I'd sure appreciate a clear
> definition or description of what a surrogate key IS.
>
> Maybe nobody really knows and they just all use it casually to mean
> "IDENTITY" or AutoNumber (in Access).
>
> Thanks!
>
Author
30 Sep 2005 11:26 PM
DanR
> What is an accurate definition of a "surrogate key" and what would be a
> straight-forward example of when it would be a good idea (if ever) to use
> one?

A surrogate key is a primary key that is not relevant to the data.

> I ask because I think "surrogate key" is sometimes misused and/or is not the
> correct term for what is being communicated. For example some say that an
> IDENTITY property is a surrogate key - but others argue that IDENTITY
> doesn't meet the formal or correct definition of "surrogate key".

An IDENTITY column can in some cases be relevant to the data.
CREATE TABLE Orders(
  OrderID int IDENTITY NOT NULL PRIMARY KEY,
  OrderDate datetime default getdate())
If one were to say "Hey Jorge, what is the status of Order 2351?" then the
OrderID becomes relevant. Although it behaves like one and has the advantages
and pitfalls of a surrogate key, OrderID is no longer a surrogate! Here is
one test you can use, if the value is printed on an invoice or is referenced
verbally it is no longer a surrogate.

If the value never leaves the confines of a computer then it's a surrogate.
A totally meaningless pointer.

Dan
Author
1 Oct 2005 12:33 AM
ML
I think you've just opened Pandora's sceptic tank. :)

As the name suggests the origin of 'surrogate' (in contrast to 'natural')
keys is not the data they represent. They are declared either by an authority
(e.g. ISBN/ISSN numbers) or an impartial machine (autoincrement, unique
identifier) or through the use of a similar method.


ML
Author
1 Oct 2005 12:44 AM
Michael C#
Not to get into the "whether or not it's a good idea to use surrogate keys
or not" argument, people do use them; that's just a fact of life.  If you do
use surrogate keys, keep in mind that:

1)  The surrogate key should never be exposed to the end user; and it should
not be able to be changed by the user.  It is strictly for internal use by
the system and should be generated and managed internally.

2)  The surrogate key itself should not be the only column that makes a row
unique; i.e., the Natural Key should make the row unique as well.



Show quote
"Frankie" <BeansAndTa***@TacoTime.net> wrote in message
news:undxMagxFHA.4080@TK2MSFTNGP10.phx.gbl...
> What is an accurate definition of a "surrogate key" and what would be a
> straight-forward example of when it would be a good idea (if ever) to use
> one?
>
> I ask because I think "surrogate key" is sometimes misused and/or is not
> the correct term for what is being communicated. For example some say that
> an IDENTITY property is a surrogate key - but others argue that IDENTITY
> doesn't meet the formal or correct definition of "surrogate key".
>
> BOL talks about surrogate keys in a few places but assumes that the reader
> knows what one is. Not what I'm looking for.
>
> Also, I googled this and got a bunch of hits: This description is
> representative:
> <<A surrogate key is frequently a sequential number (e.g. a Sybase
> "identity column") but doesn't have to be.>>
>
> That doesn't clarify ANYTHING. All it provides is an example which is
> heavily argued against by Celko and others. I'd sure appreciate a clear
> definition or description of what a surrogate key IS.
>
> Maybe nobody really knows and they just all use it casually to mean
> "IDENTITY" or AutoNumber (in Access).
>
> Thanks!
>
Author
1 Oct 2005 3:38 PM
--CELKO--
an appeal to authority, with a quote from Dr. Codd: "..Database users
may cause the system to generate or delete a surrogate, but they have
no control over its value, nor is its value ever displayed to them
...."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
the database relational model to capture more meaning.  ACM
Transactions on Database Systems, 4(4).  pp. 397-434.

This means that a surrogate ought to act like an index; created or
dropped by the user, managed by the system and NEVER seen by a user.
That means never used in queries, DRI or anything else that a user
does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.

(1)  The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies
merge, the two employee databases might be combined with the result
that some or all of the serial numbers might be changed.).

(2)  Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.

(3)  It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it has
ceased to have one (e.g. and applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities.  A solution - proposed in part [4] and more fully in [14] -
is to introduce entity domains which contain system-assigned
surrogates.  Database users may cause the system to generate or delete
a surrogate, but they have no control over its value, nor is its value
ever displayed to them....." (Codd in ACM TODS, pp 409-410).

References

Codd, E. (1979), Extending the database relational model to capture
more meaning.  ACM Transactions on Database Systems, 4(4).  pp. 397-434
Author
1 Oct 2005 4:09 PM
Frankie
Great! Now how do we correctly pronounce your name? "Sellko" or "Chelko"?

Your name does come up in discussions and it's just annoying to hear it
differently based on opinion.

- Just curious.



Show quote
"--CELKO--" <jcelko***@earthlink.net> wrote in message
news:1128181135.482306.281960@g44g2000cwa.googlegroups.com...
> an appeal to authority, with a quote from Dr. Codd: "..Database users
> may cause the system to generate or delete a surrogate, but they have
> no control over its value, nor is its value ever displayed to them
> ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
> the database relational model to capture more meaning.  ACM
> Transactions on Database Systems, 4(4).  pp. 397-434.
>
> This means that a surrogate ought to act like an index; created or
> dropped by the user, managed by the system and NEVER seen by a user.
> That means never used in queries, DRI or anything else that a user
> does.
>
> Codd also wrote the following:
>
> "There are three difficulties in employing user-controlled keys as
> permanent surrogates for entities.
>
> (1)  The actual values of user-controlled keys are determined by users
> and must therefore be subject to change by them (e.g. if two companies
> merge, the two employee databases might be combined with the result
> that some or all of the serial numbers might be changed.).
>
> (2)  Two relations may have user-controlled keys defined on distinct
> domains (e.g. one uses social security, while the other uses employee
> serial numbers) and yet the entities denoted are the same.
>
> (3)  It may be necessary to carry information about an entity either
> before it has been assigned a user-controlled key value or after it has
> ceased to have one (e.g. and applicant for a job and a retiree).
>
> These difficulties have the important consequence that an equi-join on
> common key values may not yield the same result as a join on common
> entities.  A solution - proposed in part [4] and more fully in [14] -
> is to introduce entity domains which contain system-assigned
> surrogates.  Database users may cause the system to generate or delete
> a surrogate, but they have no control over its value, nor is its value
> ever displayed to them....." (Codd in ACM TODS, pp 409-410).
>
> References
>
> Codd, E. (1979), Extending the database relational model to capture
> more meaning.  ACM Transactions on Database Systems, 4(4).  pp. 397-434
>
Author
1 Oct 2005 7:00 PM
--CELKO--
>> Great! Now how do we correctly pronounce your name? "Sellko" or "Chelko"?<<

SELL-ko.  It was originally CHELL-ko, when it was spelled in Czech with
a C-hychek (looks like an upside circumflex).

--CELKO--
Author
2 Oct 2005 6:01 AM
Brian Selzer
You should probably read the paper.  Celko obviously didn't, because the
paper enumerates several problems with using natural keys (Codd uses the
phrase "user-controlled" instead of "natural."), and indicates that the
surrogate key should be used as the primary key--in other words, in DRI.  I,
like C. J. Date, take issue with the notion of hiding the surrogate key
values from ALL database users.  I believe that the values should be exposed
to database developers and administrators, but hidden from database
operators.  While surrogate key values should never be seen on a data entry
screen or a report, they are extremely useful in an application program
because they cannot change.  Therefore, I think that the visibility of
surrogate key values is best handled through the use of security and/or
views.

Show quote
"Frankie" <BeansAndTa***@TacoTime.net> wrote in message
news:%23XIyGKqxFHA.4080@TK2MSFTNGP10.phx.gbl...
> Great! Now how do we correctly pronounce your name? "Sellko" or "Chelko"?
>
> Your name does come up in discussions and it's just annoying to hear it
> differently based on opinion.
>
> - Just curious.
>
>
>
> "--CELKO--" <jcelko***@earthlink.net> wrote in message
> news:1128181135.482306.281960@g44g2000cwa.googlegroups.com...
>> an appeal to authority, with a quote from Dr. Codd: "..Database users
>> may cause the system to generate or delete a surrogate, but they have
>> no control over its value, nor is its value ever displayed to them
>> ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
>> the database relational model to capture more meaning.  ACM
>> Transactions on Database Systems, 4(4).  pp. 397-434.
>>
>> This means that a surrogate ought to act like an index; created or
>> dropped by the user, managed by the system and NEVER seen by a user.
>> That means never used in queries, DRI or anything else that a user
>> does.
>>
>> Codd also wrote the following:
>>
>> "There are three difficulties in employing user-controlled keys as
>> permanent surrogates for entities.
>>
>> (1)  The actual values of user-controlled keys are determined by users
>> and must therefore be subject to change by them (e.g. if two companies
>> merge, the two employee databases might be combined with the result
>> that some or all of the serial numbers might be changed.).
>>
>> (2)  Two relations may have user-controlled keys defined on distinct
>> domains (e.g. one uses social security, while the other uses employee
>> serial numbers) and yet the entities denoted are the same.
>>
>> (3)  It may be necessary to carry information about an entity either
>> before it has been assigned a user-controlled key value or after it has
>> ceased to have one (e.g. and applicant for a job and a retiree).
>>
>> These difficulties have the important consequence that an equi-join on
>> common key values may not yield the same result as a join on common
>> entities.  A solution - proposed in part [4] and more fully in [14] -
>> is to introduce entity domains which contain system-assigned
>> surrogates.  Database users may cause the system to generate or delete
>> a surrogate, but they have no control over its value, nor is its value
>> ever displayed to them....." (Codd in ACM TODS, pp 409-410).
>>
>> References
>>
>> Codd, E. (1979), Extending the database relational model to capture
>> more meaning.  ACM Transactions on Database Systems, 4(4).  pp. 397-434
>>
>
>
Author
2 Oct 2005 3:38 PM
--CELKO--
>> (Codd uses the phrase "user-controlled" instead of "natural."), <<

No, a natural key is one that appears in nature; just what the name
says.  It is not "user controlled" but is discovered in the reality of
the data model.  What he was worried about was users designing encoding
schemes that were ambigous, limited range, etc. so they would have to
be corrected over time.

>> Therefore, I think that the visibility of surrogate key values is best handled through the use of security and/or views. <<

This is a job for the database engne.  Ever work with IDMS, TOTAL or
IMS in the pre-SQL days?  Pointer chains between parent and child
records would get broken and have to be restored.  You did this with a
utility program, but that is as close as you got to handling the
locators.
Author
3 Oct 2005 1:50 PM
Brian Selzer
Celko, you're stuck on stupid!

> What he was worried about was users designing encoding
> schemes that were ambigous, limited range, etc. so they would have to
> be corrected over time.

The text of the paper doesn't support this assertion.  Look at the example
given for one of the problems with user-controlled keys that Codd
enumerated:

<<
(2) Two relations may have user-controlled keys defined on distinct domains
(e.g., one uses social security, while the other uses employee serial
number) and yet the entities denoted are the same.
>>

A social security number is assigned by the government, yet Codd uses it as
an example of a user-controlled key.

Surrogate keys are NOT pointers any more than natural keys are.  They have
nothing to do with the physical location or organization of the data.


Show quote
"--CELKO--" <jcelko***@earthlink.net> wrote in message
news:1128267497.665652.176350@g43g2000cwa.googlegroups.com...
>>> (Codd uses the phrase "user-controlled" instead of "natural."), <<
>
> No, a natural key is one that appears in nature; just what the name
> says.  It is not "user controlled" but is discovered in the reality of
> the data model.  What he was worried about was users designing encoding
> schemes that were ambigous, limited range, etc. so they would have to
> be corrected over time.
>
>>> Therefore, I think that the visibility of surrogate key values is best
>>> handled through the use of security and/or views. <<
>
> This is a job for the database engne.  Ever work with IDMS, TOTAL or
> IMS in the pre-SQL days?  Pointer chains between parent and child
> records would get broken and have to be restored.  You did this with a
> utility program, but that is as close as you got to handling the
> locators.
>

AddThis Social Bookmark Button