|
database
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
Surrogate Keysstraight-forward example of when it would be a good idea (if ever) to use one? I ask because I think "surrogate key" is sometimes misused and/or is not the correct term for what is being communicated. For example some say that an IDENTITY property is a surrogate key - but others argue that IDENTITY doesn't meet the formal or correct definition of "surrogate key". BOL talks about surrogate keys in a few places but assumes that the reader knows what one is. Not what I'm looking for. Also, I googled this and got a bunch of hits: This description is representative: <<A surrogate key is frequently a sequential number (e.g. a Sybase "identity column") but doesn't have to be.>> That doesn't clarify ANYTHING. All it provides is an example which is heavily argued against by Celko and others. I'd sure appreciate a clear definition or description of what a surrogate key IS. Maybe nobody really knows and they just all use it casually to mean "IDENTITY" or AutoNumber (in Access). Thanks! Oh you're going to get a slew of responses on this one ;-)
Show quote "Frankie" <BeansAndTa***@TacoTime.net> wrote in message news:undxMagxFHA.4080@TK2MSFTNGP10.phx.gbl... > What is an accurate definition of a "surrogate key" and what would be a > straight-forward example of when it would be a good idea (if ever) to use > one? > > I ask because I think "surrogate key" is sometimes misused and/or is not > the correct term for what is being communicated. For example some say that > an IDENTITY property is a surrogate key - but others argue that IDENTITY > doesn't meet the formal or correct definition of "surrogate key". > > BOL talks about surrogate keys in a few places but assumes that the reader > knows what one is. Not what I'm looking for. > > Also, I googled this and got a bunch of hits: This description is > representative: > <<A surrogate key is frequently a sequential number (e.g. a Sybase > "identity column") but doesn't have to be.>> > > That doesn't clarify ANYTHING. All it provides is an example which is > heavily argued against by Celko and others. I'd sure appreciate a clear > definition or description of what a surrogate key IS. > > Maybe nobody really knows and they just all use it casually to mean > "IDENTITY" or AutoNumber (in Access). > > Thanks! > I'm just looking for a more-or-less academic definition; not asking about
the merits. >> I'm just looking for a more-or-less academic definition; not asking about The only authoritative answer is the quote by Date in Brian's post. Most of >> the merits. the other addenda are editorializations based on products and systems one sees around. -- Anith On Fri, 30 Sep 2005 14:33:08 -0700, Frankie wrote:
>What is an accurate definition of a "surrogate key" Hi Frankie,I don't think you'll find any formal, generally-accepted definition. The community is still in the holy war phase on this. My take is this: a surrogate key is a key that acts as a surrogate for another key. So the basic ingredients that you need before you can add a surrogate key are: 1. A table with a key (the business key, presumably), and 2. One or more other places where this key is used (foreign key columns in other tables). Now, you can introduce a surrogate key. You introduce an extra column to the table, BUT PRESERVE THE EXISTING KEY!! Make sure that both the surrogate key and the business key are constrained to be either PRIMARY KEY or UNIQUE. All other places where the business key is used, you can now replace it with the surrogate key. Some other considerations: the surrogate key should be generated by the database, and never exposed to the end user. It should only be used internally, as a surrogate for the business key. So for example: let's say that persons are uniquely identified in your business by (FirstName, LastName, DOB). Here's how the tables would look without surrogate key: CREATE TABLE Persons (FirstName varchar(20) NOT NULL, LastName varchar(30) NOT NULL, DOB datetime NOT NULL, -- other columns, PRIMARY KEY (FirstName, LastName, DOB), ) CREATE TABLE OtherTable (OtherKey int NOT NULL, FirstName varchar(20) NOT NULL, LastName varchar(30) NOT NULL, DOB datetime NOT NULL, -- other columns, PRIMARY KEY (OtherKey) FOREIGN KEY (FirstName, LastName, DOB) REFERENCES Persons (FirstName, LastName, DOB) ON UPDATE CASCADE, ) And here's the same design after introducing the surrogate key: CREATE TABLE Persons (FirstName varchar(20) NOT NULL, LastName varchar(30) NOT NULL, DOB datetime NOT NULL, PersonID int NOT NULL IDENTITY, -- other columns, PRIMARY KEY (PeopleID), UNIQUE (FirstName, LastName, DOB), ) CREATE TABLE OtherTable (OtherKey int NOT NULL, PersonID int NOT NULL, -- other columns, PRIMARY KEY (OtherKey) FOREIGN KEY (PersonID) REFERENCES Persons (PersonID), ) > and what would be a Should be considered oin a case by case basis. In the above table, joins>straight-forward example of when it would be a good idea (if ever) to use >one? on just the integer PersonID will be faster than joins on (FirstName, LastName, DOB). And a row in OtherTable takes less bytes, which means more bytes per page --> less read/writes while using this table --> more speed. On the other hand, the rows in Persons get bigger. And some queries might need an extra join (imagine a report on OtherTable that includes FirstName and LastName but no other information from the Persons table; can be done without join in the surrogate-less table, but requires a join in the version with surrogate key). >I ask because I think "surrogate key" is sometimes misused and/or is not the IDENTITY is one possible way to generate values for a surrogate key. The>correct term for what is being communicated. For example some say that an >IDENTITY property is a surrogate key - but others argue that IDENTITY >doesn't meet the formal or correct definition of "surrogate key". uniqueidentifier datatype and the NEWID() function is another way. And you can design and implement your own methods - but there shouldn't be any need for that, since the actual value of the surrogate key is not important. Remeber that the surrogate key values should not be exposed! >BOL talks about surrogate keys in a few places but assumes that the reader Joe Celko's ideas about surrogate keys are different from mine. The main>knows what one is. Not what I'm looking for. > >Also, I googled this and got a bunch of hits: This description is >representative: ><<A surrogate key is frequently a sequential number (e.g. a Sybase "identity >column") but doesn't have to be.>> > >That doesn't clarify ANYTHING. All it provides is an example which is >heavily argued against by Celko and others. I'd sure appreciate a clear >definition or description of what a surrogate key IS. difference is that I think that surrogate key values should be invisible to end users, whereas Joe says they should be invisible to all people that deal with the DB (including developers, DBA's, etc.). Of course, Joe's interpretation of surrogate keys means that they can only be used as a RDBMS internal optimizing trick, and only the techs in Redmont need to know about them. Best, Hugo -- (Remove _NO_ and _SPAM_ to get my e-mail address) The concept of surrogate keys has been around since the mid '70s. From
Introduction to Database Systems, Seventh Edition, by C.J. Date. << Surrogate keys are keys in the usual relational sense but have the following specific properties: -- They always involve exactly one attribute -- Their values serve solely as surrogates (hence the name) for the entities they stand for. In other words, such values serve merely to represent the fact that the corresponding entities exist--they carry no additional information whatsoever. -- When a new entity is inserted into the database, it is given a surrogate key value that has never been used before and will never be used again, even if the entity in question is subsequently deleted. Ideally, surrogate key values would be system-generated, but whether they are system- or user-generated has nothing to do with the basic idea of surrogate keys as such >> IDENTITY values are not surrogate key values in the strict sense, because a surrogate key value is unique within the entire database, not just within a single table; however, IDENTITY values are commonly used as surrogates because there is usually a declarative referential integrity constraint associated with a column that contains IDENTITY values which identifies the table associated with the IDENTITY value. The combination of the table and the IDENTITY value makes the meaning of an IDENTITY value unique within the database, even though that meaning is strictly internal--that is, it doesn't add any meaning to the entity it represents. I could give many examples of when you should use them. In fact, I would argue that you should always use them, because there are inherent problems with natural primary keys that place limitations on what you can do with a database. These stem from two properties. (1) there is inherent redundancy with natural primary keys. That's why the cascading update kludge exists. This complicates locking and reduces scalability. (2) the key value CAN change. This makes using disconnected datasets, message queues, and replication much more difficult to do, and requires additional code in every application to detect and manage key value changes. The main benefit of using a surrogate key is that it is immutable. Within a database, a surrogate key value cannot be changed once it's been assigned. Remember, a surrogate key value indicates only that a distinct entity exists. A change to that value would indicate that the entity for which it is a surrogate no longer exists. One final point. If a table contains any extrinsic information at all, then an additional unique constraint must be defined on the table. In other words, if any of the columns in a table contains values that originated outside the database, then there must be an additional candidate key defined on the table and it must be enforced with a unique constraint--even if that key includes every column other than the surrogate primary key. Show quote "Frankie" <BeansAndTa***@TacoTime.net> wrote in message news:undxMagxFHA.4080@TK2MSFTNGP10.phx.gbl... > What is an accurate definition of a "surrogate key" and what would be a > straight-forward example of when it would be a good idea (if ever) to use > one? > > I ask because I think "surrogate key" is sometimes misused and/or is not > the correct term for what is being communicated. For example some say that > an IDENTITY property is a surrogate key - but others argue that IDENTITY > doesn't meet the formal or correct definition of "surrogate key". > > BOL talks about surrogate keys in a few places but assumes that the reader > knows what one is. Not what I'm looking for. > > Also, I googled this and got a bunch of hits: This description is > representative: > <<A surrogate key is frequently a sequential number (e.g. a Sybase > "identity column") but doesn't have to be.>> > > That doesn't clarify ANYTHING. All it provides is an example which is > heavily argued against by Celko and others. I'd sure appreciate a clear > definition or description of what a surrogate key IS. > > Maybe nobody really knows and they just all use it casually to mean > "IDENTITY" or AutoNumber (in Access). > > Thanks! > > What is an accurate definition of a "surrogate key" and what would be a A surrogate key is a primary key that is not relevant to the data.> straight-forward example of when it would be a good idea (if ever) to use > one? > I ask because I think "surrogate key" is sometimes misused and/or is not the An IDENTITY column can in some cases be relevant to the data.> correct term for what is being communicated. For example some say that an > IDENTITY property is a surrogate key - but others argue that IDENTITY > doesn't meet the formal or correct definition of "surrogate key". CREATE TABLE Orders( OrderID int IDENTITY NOT NULL PRIMARY KEY, OrderDate datetime default getdate()) If one were to say "Hey Jorge, what is the status of Order 2351?" then the OrderID becomes relevant. Although it behaves like one and has the advantages and pitfalls of a surrogate key, OrderID is no longer a surrogate! Here is one test you can use, if the value is printed on an invoice or is referenced verbally it is no longer a surrogate. If the value never leaves the confines of a computer then it's a surrogate. A totally meaningless pointer. Dan I think you've just opened Pandora's sceptic tank. :)
As the name suggests the origin of 'surrogate' (in contrast to 'natural') keys is not the data they represent. They are declared either by an authority (e.g. ISBN/ISSN numbers) or an impartial machine (autoincrement, unique identifier) or through the use of a similar method. ML Not to get into the "whether or not it's a good idea to use surrogate keys
or not" argument, people do use them; that's just a fact of life. If you do use surrogate keys, keep in mind that: 1) The surrogate key should never be exposed to the end user; and it should not be able to be changed by the user. It is strictly for internal use by the system and should be generated and managed internally. 2) The surrogate key itself should not be the only column that makes a row unique; i.e., the Natural Key should make the row unique as well. Show quote "Frankie" <BeansAndTa***@TacoTime.net> wrote in message news:undxMagxFHA.4080@TK2MSFTNGP10.phx.gbl... > What is an accurate definition of a "surrogate key" and what would be a > straight-forward example of when it would be a good idea (if ever) to use > one? > > I ask because I think "surrogate key" is sometimes misused and/or is not > the correct term for what is being communicated. For example some say that > an IDENTITY property is a surrogate key - but others argue that IDENTITY > doesn't meet the formal or correct definition of "surrogate key". > > BOL talks about surrogate keys in a few places but assumes that the reader > knows what one is. Not what I'm looking for. > > Also, I googled this and got a bunch of hits: This description is > representative: > <<A surrogate key is frequently a sequential number (e.g. a Sybase > "identity column") but doesn't have to be.>> > > That doesn't clarify ANYTHING. All it provides is an example which is > heavily argued against by Celko and others. I'd sure appreciate a clear > definition or description of what a surrogate key IS. > > Maybe nobody really knows and they just all use it casually to mean > "IDENTITY" or AutoNumber (in Access). > > Thanks! > an appeal to authority, with a quote from Dr. Codd: "..Database users
may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them ...."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434. This means that a surrogate ought to act like an index; created or dropped by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does. Codd also wrote the following: "There are three difficulties in employing user-controlled keys as permanent surrogates for entities. (1) The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.). (2) Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same. (3) It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. and applicant for a job and a retiree). These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution - proposed in part [4] and more fully in [14] - is to introduce entity domains which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them....." (Codd in ACM TODS, pp 409-410). References Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434 Great! Now how do we correctly pronounce your name? "Sellko" or "Chelko"?
Your name does come up in discussions and it's just annoying to hear it differently based on opinion. - Just curious. Show quote "--CELKO--" <jcelko***@earthlink.net> wrote in message news:1128181135.482306.281960@g44g2000cwa.googlegroups.com... > an appeal to authority, with a quote from Dr. Codd: "..Database users > may cause the system to generate or delete a surrogate, but they have > no control over its value, nor is its value ever displayed to them > ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending > the database relational model to capture more meaning. ACM > Transactions on Database Systems, 4(4). pp. 397-434. > > This means that a surrogate ought to act like an index; created or > dropped by the user, managed by the system and NEVER seen by a user. > That means never used in queries, DRI or anything else that a user > does. > > Codd also wrote the following: > > "There are three difficulties in employing user-controlled keys as > permanent surrogates for entities. > > (1) The actual values of user-controlled keys are determined by users > and must therefore be subject to change by them (e.g. if two companies > merge, the two employee databases might be combined with the result > that some or all of the serial numbers might be changed.). > > (2) Two relations may have user-controlled keys defined on distinct > domains (e.g. one uses social security, while the other uses employee > serial numbers) and yet the entities denoted are the same. > > (3) It may be necessary to carry information about an entity either > before it has been assigned a user-controlled key value or after it has > ceased to have one (e.g. and applicant for a job and a retiree). > > These difficulties have the important consequence that an equi-join on > common key values may not yield the same result as a join on common > entities. A solution - proposed in part [4] and more fully in [14] - > is to introduce entity domains which contain system-assigned > surrogates. Database users may cause the system to generate or delete > a surrogate, but they have no control over its value, nor is its value > ever displayed to them....." (Codd in ACM TODS, pp 409-410). > > References > > Codd, E. (1979), Extending the database relational model to capture > more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434 > >> Great! Now how do we correctly pronounce your name? "Sellko" or "Chelko"?<< SELL-ko. It was originally CHELL-ko, when it was spelled in Czech witha C-hychek (looks like an upside circumflex). --CELKO-- You should probably read the paper. Celko obviously didn't, because the
paper enumerates several problems with using natural keys (Codd uses the phrase "user-controlled" instead of "natural."), and indicates that the surrogate key should be used as the primary key--in other words, in DRI. I, like C. J. Date, take issue with the notion of hiding the surrogate key values from ALL database users. I believe that the values should be exposed to database developers and administrators, but hidden from database operators. While surrogate key values should never be seen on a data entry screen or a report, they are extremely useful in an application program because they cannot change. Therefore, I think that the visibility of surrogate key values is best handled through the use of security and/or views. Show quote "Frankie" <BeansAndTa***@TacoTime.net> wrote in message news:%23XIyGKqxFHA.4080@TK2MSFTNGP10.phx.gbl... > Great! Now how do we correctly pronounce your name? "Sellko" or "Chelko"? > > Your name does come up in discussions and it's just annoying to hear it > differently based on opinion. > > - Just curious. > > > > "--CELKO--" <jcelko***@earthlink.net> wrote in message > news:1128181135.482306.281960@g44g2000cwa.googlegroups.com... >> an appeal to authority, with a quote from Dr. Codd: "..Database users >> may cause the system to generate or delete a surrogate, but they have >> no control over its value, nor is its value ever displayed to them >> ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending >> the database relational model to capture more meaning. ACM >> Transactions on Database Systems, 4(4). pp. 397-434. >> >> This means that a surrogate ought to act like an index; created or >> dropped by the user, managed by the system and NEVER seen by a user. >> That means never used in queries, DRI or anything else that a user >> does. >> >> Codd also wrote the following: >> >> "There are three difficulties in employing user-controlled keys as >> permanent surrogates for entities. >> >> (1) The actual values of user-controlled keys are determined by users >> and must therefore be subject to change by them (e.g. if two companies >> merge, the two employee databases might be combined with the result >> that some or all of the serial numbers might be changed.). >> >> (2) Two relations may have user-controlled keys defined on distinct >> domains (e.g. one uses social security, while the other uses employee >> serial numbers) and yet the entities denoted are the same. >> >> (3) It may be necessary to carry information about an entity either >> before it has been assigned a user-controlled key value or after it has >> ceased to have one (e.g. and applicant for a job and a retiree). >> >> These difficulties have the important consequence that an equi-join on >> common key values may not yield the same result as a join on common >> entities. A solution - proposed in part [4] and more fully in [14] - >> is to introduce entity domains which contain system-assigned >> surrogates. Database users may cause the system to generate or delete >> a surrogate, but they have no control over its value, nor is its value >> ever displayed to them....." (Codd in ACM TODS, pp 409-410). >> >> References >> >> Codd, E. (1979), Extending the database relational model to capture >> more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434 >> > > >> (Codd uses the phrase "user-controlled" instead of "natural."), << No, a natural key is one that appears in nature; just what the namesays. It is not "user controlled" but is discovered in the reality of the data model. What he was worried about was users designing encoding schemes that were ambigous, limited range, etc. so they would have to be corrected over time. >> Therefore, I think that the visibility of surrogate key values is best handled through the use of security and/or views. << This is a job for the database engne. Ever work with IDMS, TOTAL orIMS in the pre-SQL days? Pointer chains between parent and child records would get broken and have to be restored. You did this with a utility program, but that is as close as you got to handling the locators. Celko, you're stuck on stupid!
> What he was worried about was users designing encoding The text of the paper doesn't support this assertion. Look at the example > schemes that were ambigous, limited range, etc. so they would have to > be corrected over time. given for one of the problems with user-controlled keys that Codd enumerated: << (2) Two relations may have user-controlled keys defined on distinct domains (e.g., one uses social security, while the other uses employee serial number) and yet the entities denoted are the same. >> A social security number is assigned by the government, yet Codd uses it as an example of a user-controlled key. Surrogate keys are NOT pointers any more than natural keys are. They have nothing to do with the physical location or organization of the data. Show quote "--CELKO--" <jcelko***@earthlink.net> wrote in message news:1128267497.665652.176350@g43g2000cwa.googlegroups.com... >>> (Codd uses the phrase "user-controlled" instead of "natural."), << > > No, a natural key is one that appears in nature; just what the name > says. It is not "user controlled" but is discovered in the reality of > the data model. What he was worried about was users designing encoding > schemes that were ambigous, limited range, etc. so they would have to > be corrected over time. > >>> Therefore, I think that the visibility of surrogate key values is best >>> handled through the use of security and/or views. << > > This is a job for the database engne. Ever work with IDMS, TOTAL or > IMS in the pre-SQL days? Pointer chains between parent and child > records would get broken and have to be restored. You did this with a > utility program, but that is as close as you got to handling the > locators. >
Other interesting topics
|
|||||||||||||||||||||||