|
database
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
3 Questions on db Theory/DesignI'm what some would call a "lower tier" programmer in terms of education (only have an AAS). As is typical of one who has learnt more on the job than in the classroom, there are certain aspects of programming in which my *depth* of knowledge and insight would surprise you. However, in equal measure, there are certain aspects of programming in which my *lack* of knowledge would likewise surprise you as well. I'm about to try to use you to rectify some of my lacking. :-) 1.) To qualify for 1NF status a table must adhere to a couple ofstandards, one of which is that all columns must be the same datatype. How does one reconcile this in SQL Server with date/time fields? For instance, lets say I have 2 tables, the first in which all fields are varchar (including the PK "UID"). Then I have a second table which has the same varchar "UID" field as the PK, but all the remaining fields are smalldatetime? That would seem to violate 1NF? Would the work around to this be to CAST all the smalldatetimes into varchar? 2.) In a related vein... Why is a "relational" database good? Specificly, using the above example, what benefit am I garnering by splitting one large table into 2 separate tables? Is a relational db's primary benefit that it helps prevent data errors or that it makes data accessing more efficient? Or is it both? 3.) Unrelated but, frequently I see people requesting "DDL" in this ng... WTH is DDL?! :-) Thanks much! Roy > 1.) To qualify for 1NF status a table must adhere to a couple of I think you're misunderstanding 1NF. There is no requirement for all> standards, one of which is that all columns must be the same datatype. > How does one reconcile this in SQL Server with date/time fields? For > instance, lets say I have 2 tables, the first in which all fields are > varchar (including the PK "UID"). Then I have a second table which has > the same varchar "UID" field as the PK, but all the remaining fields > are smalldatetime? That would seem to violate 1NF? Would the work > around to this be to CAST all the smalldatetimes into varchar? columns in a table to be of the same datatype. The requirement is that every row in the table must have the same columns, of the same datatype, as every other row in the table. The following table is perfectly acceptable: CREATE TABLE X ( Col1 INT NOT NULL PRIMARY KEY, Col2 VARCHAR(50), Col3 DATETIME, Col4 VARCHAR(500), Col5 TINYINT ) Keep in mind that a datatype is more than just the SQL Server datatype. It's also the logical type of the attribute. So, for instance, the following would not be a good idea: CREATE TABLE X ( XType VARCHAR(50), XThing VARCHAR(100) ) INSERT X VALUES ('Date', '20050629') INSERT X VALUES ('Description', 'This is bad') Now XThing has two datatypes -- It can be a date in some cases, and a description in others. > 2.) In a related vein... Why is a "relational" database good? It's all about data integrity. The most efficient database in the> Specificly, using the above example, what benefit am I garnering by > splitting one large table into 2 separate tables? Is a relational db's > primary benefit that it helps prevent data errors or that it makes data > accessing more efficient? Or is it both? universe would be useless if you couldn't trust the data therein. > 3.) Unrelated but, frequently I see people requesting "DDL" in this Data Definition Language -- CREATE TABLE statements, etc.> ng... WTH is DDL?! :-) 1) 1NF: each field in a record can contain only one value. A comma
separated list of values stored in a single field violates this. 2) There are many benefits to using a relational database. Relational databases are flexible, in that you can create queries against the database that you didn't know you'd need prior to implementation. Relational databases reduce redundancy--in other words, in a fully normalized relational database, each atomic value is stored in only one place, and can always be retrieved if you know (1) the table name, (2) the primary key value, and (3) the column name. Relational databases are scalable, and resilient to change. You can add tables while a system is on-line. In some systems, you can even add columns to existing tables while a system is on-line. Normalization is used to eliminate redundancy, and by so doing eliminate update problems. The benefit you get from splitting a table depends on how you do it. If you're storing department-specific information in the Employee table, such that the same information is stored for each employee in the same department, then you can create a new Department table that contains the department specific information and reference the department key in the Employee table, thereby eliminating the redundancy associated with having the department-specific information in each record. Updates are more reliable. Since there is only one record for each department, instead of having to update every employee record that is in that department with the new information, you can simply update the single department record. There are six normal forms, and this is not the place to go over them. I recommend buying a book. "An Introduction to Database Systems" by C. J. Date is really good. It covers all the gory details about normalization, including 4NF and 5NF, and the update problems eliminated by normalizing. It talks about relational algebra--the theory behind joins, unions, projection, restriction, etc. Be warned, however: it is not a baby book. <roy.ander***@gmail.com> wrote in message Show quote news:1120069866.083221.167760@f14g2000cwb.googlegroups.com... > Hey all, > I'm what some would call a "lower tier" programmer in terms of > education (only have an AAS). As is typical of one who has learnt more > on the job than in the classroom, there are certain aspects of > programming in which my *depth* of knowledge and insight would surprise > you. However, in equal measure, there are certain aspects of > programming in which my *lack* of knowledge would likewise surprise you > as well. I'm about to try to use you to rectify some of my lacking. > :-) > > 1.) To qualify for 1NF status a table must adhere to a couple of > standards, one of which is that all columns must be the same datatype. > How does one reconcile this in SQL Server with date/time fields? For > instance, lets say I have 2 tables, the first in which all fields are > varchar (including the PK "UID"). Then I have a second table which has > the same varchar "UID" field as the PK, but all the remaining fields > are smalldatetime? That would seem to violate 1NF? Would the work > around to this be to CAST all the smalldatetimes into varchar? > > 2.) In a related vein... Why is a "relational" database good? > Specificly, using the above example, what benefit am I garnering by > splitting one large table into 2 separate tables? Is a relational db's > primary benefit that it helps prevent data errors or that it makes data > accessing more efficient? Or is it both? > > 3.) Unrelated but, frequently I see people requesting "DDL" in this > ng... WTH is DDL?! :-) > > Thanks much! > Roy > "Normalization is used to eliminate redundancy, and by so doing
eliminate update problems. The benefit you get from splitting a table depends on how you do it. If you're storing department-specific information in the Employee table, such that the same information is stored for each employee" <snip> Thanks for the informative posts you guys. A follow-up question: So there is truly no inherent benefit (in and of itself) to splitting up one large table in sql server into 2 or 3 smaller ones? For instance, lets say that we have one table with a "UID" PK and 99 other columns. Assuming it's a relation, there is no benefit to dividing the table into 2 tables with a matching "UID" PK and 50 columns a piece? In other words, the overhead processing is the same either way and in fact, might be more with 2 tables given that the "UID" is repeated in both? Hypothetical: Let's say I'm constructing a database intended to track what a group of dogs do every day. Each record would essentially be an indication of what the specific animal is doing at any given time. Each record would look something like this: "Dog_Owner_ID" "Dog_Name_ID" "Action_Begin_DateTime" "Action_End_DateTime" "Action_ID" "Action_Begin_Location" "Action_End_Location" "Encountered_Other_Dogs(Y/N)" "Number_Of_Dogs_Encountered" "Number_Of_Humans_Encountered" "Number_Of_Other_Creatures_Encountered" "Other_Creature_ID_1" "Other_Creature_ID_2" "Other_Creature_ID_3" "DateTime_Record_Appended" "DateTime_Record_Edited" Obviously this is a flaky example, but it's pretty indicative of the bizaare, heaped-together databases I've encountered in my time. Assuming that this is the dynamic table of the database that gets most frequently updated, eventually it would grow to an obscene number of records. Does splitting it up into multiple tables (say, for instance, by "Action_ID" so that there'd be a table specific to each action do more harm than good? Thanks again! *** Sent via Developersdex http://www.developersdex.com *** "Roy Anderson" <roy.andersonSPAMHATER@gmail.com> wrote in message That table might not be very well designed, but from the standpoint ofnews:%23N9uX8PfFHA.1304@TK2MSFTNGP10.phx.gbl... > > Hypothetical: > Let's say I'm constructing a database intended to track what a group of > dogs do every day. Each record would essentially be an indication of > what the specific animal is doing at any given time. Each record would > look something like this: relational databases and normalization there's nothing wrong with it that I can see -- other than, perhaps, the lack of an (obvious) primary key. Owner/Dog/Action/Date, I guess? There is redundancy in your example. What do you do if a dog encountered
more than three other creatures? How would you query which other creatures a dog encountered? I cringe every time I see a table like this. I worked on a system that tracked inventory for several retail outlets and a main warehouse. One table was designed like this: ITEM, WHQTY, ST1QTY, ST2QTY, ST3QTY, WHMIN, ST1MIN, ST2MIN, ST3MIN, WHMAX, ST1MAX, ST2MAX, ST3MAX, where QTY refered to the quantity on hand, MIN referred to the reorder threshold, and MAX referred to the max stock level. They added a fourth store.... I added a location table with 5 rows, ('WH'), ('ST1'), ('ST2'), ('ST3'), ('ST4'), and changed the inventory table thus: ITEM, LOCATION, QTY, MIN, MAX Having a separate table for each ActionID would only make queries more difficult. It's similar to the example above, where I combined the are four repeating groups of columns into a single group. By the same token, you wouldn't want to create a repeating group by separating the information into several tables. What happens when you add a new ActionID? Do you add an additional table? You shouldn't separate information just for the sake of separating it. There should be a reason, either to reduce redundancy, to overcome a physical limitation (only 8000 chars per row), to make it easier to enforce security, etc. There may be performance considerations, for example, if only one or two columns are updated with any frequency, you may want to separate them into their own table to reduce lock contention on the rest of the row. Show quote "Roy Anderson" <roy.andersonSPAMHATER@gmail.com> wrote in message news:#N9uX8PfFHA.1304@TK2MSFTNGP10.phx.gbl... > "Normalization is used to eliminate redundancy, and by so doing > eliminate update problems. The benefit you get from splitting a table > depends on how you do it. If you're storing department-specific > information in the Employee table, such that the same information is > stored for each employee" > > <snip> > Thanks for the informative posts you guys. > A follow-up question: So there is truly no inherent benefit (in and of > itself) to splitting up one large table in sql server into 2 or 3 > smaller ones? For instance, lets say that we have one table with a "UID" > PK and 99 other columns. Assuming it's a relation, there is no benefit > to dividing the table into 2 tables with a matching "UID" PK and 50 > columns a piece? In other words, the overhead processing is the same > either way and in fact, might be more with 2 tables given that the "UID" > is repeated in both? > > Hypothetical: > Let's say I'm constructing a database intended to track what a group of > dogs do every day. Each record would essentially be an indication of > what the specific animal is doing at any given time. Each record would > look something like this: > "Dog_Owner_ID" > "Dog_Name_ID" > "Action_Begin_DateTime" > "Action_End_DateTime" > "Action_ID" > "Action_Begin_Location" > "Action_End_Location" > "Encountered_Other_Dogs(Y/N)" > "Number_Of_Dogs_Encountered" > "Number_Of_Humans_Encountered" > "Number_Of_Other_Creatures_Encountered" > "Other_Creature_ID_1" > "Other_Creature_ID_2" > "Other_Creature_ID_3" > "DateTime_Record_Appended" > "DateTime_Record_Edited" > > Obviously this is a flaky example, but it's pretty indicative of the > bizaare, heaped-together databases I've encountered in my time. Assuming > that this is the dynamic table of the database that gets most frequently > updated, eventually it would grow to an obscene number of records. Does > splitting it up into multiple tables (say, for instance, by "Action_ID" > so that there'd be a table specific to each action do more harm than > good? > > Thanks again! > > *** Sent via Developersdex http://www.developersdex.com *** >> 1) 1NF: each field [sic] in a record [sic] can contain only one value. A comma separated list of values stored in a single field [sic] violates this. << First things first -- learn why rows are not records, tables are notfiles and columns are not fields. Until you have the right mental model, you will not understand SQL or RDBMS. The answer is that this is true 99% of the time in the real world. But a data element is atomic when it has one and only one value. Thus an enumerated path represetned as a CSV list could be a single data element, or a pair of (longtitude, latitude) columns could be a single data element Just for your information: I do understand what an attribute is. I do
understand what a tuple is. I do understand what a relation is. I understand the mathematical foundation of the relational model. What difference does it make if I use the informal term "field" instead of your informal term "column?" What difference does it make if I use the informal term "record" instead of your informal term "row?" I don't believe I have ever used the term "file," except in relation to physical database and transaction log files. I believe that I have a pretty good understanding of SQL and RDBMS. I understand intimately the limitations of file-based databases, both from a technical support standpoint and from a development standpoint. Except for a short stint in the U.S. Air Force, I've been working with databases all of my adult life. I even rolled my own in 8080 assembler. I have experience with DBase, Btrieve, FoxPro, Progress, Informix, SqlBase, DB2, Oracle, and SQL Server. I've been certified on SQL Server since 6.5 came out. I take offense at your insinuation that I don't know what I'm talking about and can't understand SQL or RDBMS. The answer is that this is true 100% of the time in the real world. Longitude and lattitude are two distinct numbers. Someone at some time may want to perform a query using one or both of those two numbers, for example, how many locations are within a twenty-five mile radius of location X. If you use a comma separated character field, you have to write code to extract each number and convert them to numeric values so that they can be compared (if it is even possible, since the database designer probably didn't bother to put a check constraint on the column to prevent garbage from being recorded). The same is true with the other example. One of the main benefits of the relational model is that it allows you to construct queries that you didn't anticipate when you designed the system. By violating the Information Rule, you incur increased query complexity and increased potential for error. "--CELKO--" <jcelko***@earthlink.net> wrote in message A comma separated list of values stored in a single field [sic] violatesnews:1120087943.695348.305560@g43g2000cwa.googlegroups.com... > >> 1) 1NF: each field [sic] in a record [sic] can contain only one value. this. << Show quote > > First things first -- learn why rows are not records, tables are not > files and columns are not fields. Until you have the right mental > model, you will not understand SQL or RDBMS. > > The answer is that this is true 99% of the time in the real world. But > a data element is atomic when it has one and only one value. Thus an > enumerated path represetned as a CSV list could be a single data > element, or a pair of (longtitude, latitude) columns could be a single > data element > >> What difference does it make if I use the informal term "field" instead of your informal term "column?" What difference does it make if I use the informal term "record" instead of your informal term "row?" << Column and row are terms in SQL while field and record are the propertermd for physical file systems. Like most new ideas, the hard part of understanding what the relational model is comes in un-learning what you know about file systems. As Artemus Ward (William Graham Sumner, 1840-1910) put it, "It ain't so much the things we don't know that get us into trouble. It's the things we know that just ain't so." If you already have a background in data processing with traditional file systems, the first things to un-learn are: (0) Databases are not file sets. (1) Tables are not files. (2) Rows are not records. (3) Columns are not fields. Modern data processing began with punch cards, or Hollerith cards used by the Bureau of the Census. Their original size was that of a United States Dollar bill. This was set by their inventor, Herman Hollerith, because he could get furniture to store the cards from the United States Treasury Department, just across the street. Likewise, physical constraints limited each card to 80 columns of holes in which to record a symbol. The influence of the punch card lingered on long after the invention of magnetic tapes and disk for data storage. This is why early video display terminals were 80 columns across. Even today, files which were migrated from cards to magnetic tape files or disk storage still use 80 column records. But the influence was not just on the physical side of data processing. The methods for handling data from the prior media were imitated in the new media. Data processing first consisted of sorting and merging decks of punch cards (later, sequential magnetic tape files) in a series of distinct steps. The result of each step feed into the next step in the process. Relational databases do not work that way. Each user connects to the entire database all at once, not to one file at time in a sequence of steps. The users might not all have the same database access rights once they are connected, however. Magnetic tapes could not be shared among users at the same time, but shared data is the point of a database. Tables versus Files A file is closely related to its physical storage media. A table may or may not be a physical file. DB2 from IBM uses one file per table, while Sybase puts several entire databases inside one file. A table is a <i>set<i> of rows of the same kind of thing. A set has no ordering and it makes no sense to ask for the first or last row. A deck of punch cards is sequential, and so are magnetic tape files. Therefore, a <i>physical<i> file of ordered sequential records also became the <i>mental<i> model for data processing and it is still hard to shake. Anytime you look at data, it is in some physical ordering. The various access methods for disk storage system came later, but even these access methods could not shake the mental model. Another conceptual difference is that a file is usually data that deals with a whole business process. A file has to have enough data in itself to support applications for that business process. Files tend to be "mixed" data which can be described by the name of the business process, such as "The Payroll file" or something like that. Tables can be either entities or relationships within a business process. This means that the data which was held in one file is often put into several tables. Tables tend to be "pure" data which can be described by single words. The payroll would now have separate tables for timecards, employees, projects and so forth. Tables as Entities An entity is physical or conceptual "thing" which has meaning be itself. A person, a sale or a product would be an example. In a relational database, an entity is defined by its attributes, which are shown as values in columns in rows in a table. To remind users that tables are sets of entities, I like to use plural or collective nouns that describe the function of the entities within the system for the names of tables. Thus "Employee" is a bad name because it is singular; "Employees" is a better name because it is plural; "Personnel" is best because it is collective and does not summon up a mental picture of individual persons. If you have tables with exactly the same structure, then they are sets of the same kind of elements. But you should have only one set for each kind of data element! Files, on the other hand, were PHYSICALLY separate units of storage which coudl be alike -- each tape or disk file represents a step in the PROCEDURE , such as moving from raw data, to edited data, and finally to archived data. In SQL, this should be a status flag in a table. Tables as Relationships A relationship is shown in a table by columns which reference one or more entity tables. Without the entities, the relationship has no meaning, but the relationship can have attributes of its own. For example, a show business contract might have an agent, an employer and a talent. The method of payment is an attribute of the contract itself, and not of any of the three parties. Rows versus Records Rows are not records. A record is defined in the application program which reads it; a row is defined in the database schema and not by a program at all. The name of the field in the READ or INPUT statements of the application; a row is named in the database schema. All empty files look alike; they are a directory entry in the operating system with a name and a length of zero bytes of storage. Empty tables still have columns, constraints, security privileges and other structures, even tho they have no rows. This is in keeping with the set theoretical model, in which the empty set is a perfectly good set. The difference between SQL's set model and standard mathematical set theory is that set theory has only one empty set, but in SQL each table has a different structure, so they cannot be used in places where non-empty versions of themselves could not be used. Another characteristic of rows in a table is that they are all alike in structure and they are all the "same kind of thing" in the model. In a file system, records can vary in size, datatypes and structure by having flags in the data stream that tell the program reading the data how to interpret it. The most common examples are Pascal's variant record, C's struct syntax and Cobol's OCCURS clause. The OCCURS keyword in Cobol and the Variant records in Pascal have a number which tells the program how many time a record structure is to be repeated in the current record. Unions in 'C' are not variant records, but variant mappings for the same physical memory. For example: union x {int ival; char j[4];} myStuff; defines myStuff to be either an integer (which are 4 bytes on most modern C compilers, but this code is non-portable) or an array of 4 bytes, depending on whether you say myStuff.ival or myStuff.j[0]; But even more than that, files often contained records which were summaries of subsets of the other records -- so called control break reports. There is no requirement that the records in a file be related in any way -- they are literally a stream of binary data whose meaning is assigned by the program reading them. Columns versus Fields A field within a record is defined by the application program that reads it. A column in a row in a table is defined by the database schema. The datatypes in a column are always scalar. The order of the application program variables in the READ or INPUT statements is important because the values are read into the program variables in that order. In SQL, columns are referenced only by their names. Yes, there are shorthands like the SELECT * clause and INSERT INTO <table name> statements which expand into a list of column names in the physical order in which the column names appear within their table declaration, but these are shorthands which resolve to named lists. The use of NULLs in SQL is also unique to the language. Fields do not support a missing data marker as part of the field, record or file itself. Nor do fields have constraints which can be added to them in the record, like the DEFAULT and CHECK() clauses in SQL. Relationships among tables within a database Files are pretty passive creatures and will take whatever an application program throws at them without much objection. Files are also independent of each other simply because they are connected to one application program at a time and therefore have no idea what other files looks like. A database actively seeks to maintain the correctness of all its data. The methods used are triggers, constraints and declarative referential integrity. Declarative referential integrity (DRI) says, in effect, that data in one table has a particular relationship with data in a second (possibly the same) table. It is also possible to have the database change itself via referential actions associated with the DRI. For example, a business rule might be that we do not sell products which are not in inventory. This rule would be enforce by a REFERENCES clause on the Orders table which references the Inventory table and a referential action of ON DELETE CASCADE Triggers are a more general way of doing much the same thing as DRI. A trigger is a block of procedural code which is executed before, after or instead of an INSERT INTO or UPDATE statement. You can do anything with a trigger that you can do with DRI and more. However, there are problems with TRIGGERs. While there is a standard syntax for them in the SQL-92 standard, most vendors have not implemented it. What they have is very proprietary syntax instead. Secondly, a trigger cannot pass information to the optimizer like DRI. In the example in this section, I know that for every product number in the Orders table, I have that same product number in the Inventory table. The optimizer can use that information in setting up EXISTS() predicates and JOINs in the queries. There is no reasonable way to parse procedural trigger code to determine this relationship. The CREATE ASSERTION statement in SQL-92 will allow the database to enforce conditions on the entire database as a whole. An ASSERTION is not like a CHECK() clause, but the difference is subtle. A CHECK() clause is executed when there are rows in the table to which it is attached. If the table is empty then all CHECK() clauses are effectively TRUE. Thus, if we wanted to be sure that the Inventory table is never empty, and we wrote: CREATE TABLE Inventory ( ... CONSTRAINT inventory_not_empty CHECK ((SELECT COUNT(*) FROM Inventory) > 0), ... ); it would not work. However, we could write: CREATE ASSERTION Inventory_not_empty CHECK ((SELECT COUNT(*) FROM Inventory) > 0); and we would get the desired results. The assertion is checked at the schema level and not at the table level. >> The answer is that this is true 100% of the time in the real world. Longitude and lattitude are two distinct numbers. << See how easy it is to confuse a data element with its representation!They are scalar only when they are together. This is one thing Chris Date and i agree on :) >> If you use a comma separated character field [sic], you have to write code to extract each number and convert them to numeric values so that they can be compared << Actually, Moreau and Ben-Gan's path enumeration model can be handlednicely with string operrtions and grep() patterns that do not require extraction. Go over to www.dbdebunk.com and look up Date's paper on 1NF. Un-learn. Hmm. Sounds pretty stupid to me. Those who fail to learn from
history are doomed to repeat it. Knowledge of how transaction files are applied to a master file gives insight into how a merge join operates. Knowledge of B-Trees gives insight into how indexes are organized and into how a loop join operates Knowledge of hashing algorithms and hash tables gives insight into how a hash join operates. In SQL Server, understanding the execution plan generated by the query optimizer is critical to improving the performance of your queries. Is this an excerpt from one of your books? Do you have a section on nondestructive readout core memory? I still think that there are two problems with the relational model. (1) it allows volatile primary key values, and (2) it only has three dimensions. First of all, the Information Rule SHOULD read, "Every datum (atomic value) in the database is located in only one place and can be accessed using (1) the table name, (2) the primary key value, and (3) the column name." Using natural keys in DRI relationships introduces redundancy into the database which increases the complexity of managing locks and security. It is also increases the complexity of the applications that must interact with the database because code must be added to detect a primary key value change and deal with it. Second, a relational database has only three dimensions: table name, primary key value, and column name. I think that there should be a fourth dimension: time. You should be able to create queries like: what was the price of this part on such-n-such a date? Who changed what where, when and how? In other words, which user changed which column in which row of which table at what time and with which application? You should be able to query what the state of the database was at a specified point in time. I have a workaround for both problems, but I think it should be a built-in feature. It appears that you and C. J. Date don't agree on very much. I just read one of his articles on www.dbdebunk.com, "There's Only One Relational Model!," and he didn't appear too pleased about your "misleading, inaccurate, or just plain wrong" assertions, nor about taking his name in vain. (Perhaps you shouldn't have sent me there.) "--CELKO--" <jcelko***@earthlink.net> wrote in message of your informal term "column?" What difference does it make if I use thenews:1120103085.567144.288200@g14g2000cwa.googlegroups.com... > >> What difference does it make if I use the informal term "field" instead informal term "record" instead of your informal term "row?" << Show quote > Longitude and lattitude are two distinct numbers. <<> Column and row are terms in SQL while field and record are the proper > termd for physical file systems. Like most new ideas, the hard part of > understanding what the relational model is comes in un-learning what > you know about file systems. As Artemus Ward (William Graham Sumner, > 1840-1910) put it, "It ain't so much the things we don't know that get > us into trouble. It's the things we know that just ain't so." > > If you already have a background in data processing with traditional > file systems, the first things to un-learn are: > > (0) Databases are not file sets. > (1) Tables are not files. > (2) Rows are not records. > (3) Columns are not fields. > > Modern data processing began with punch cards, or Hollerith cards used > by the Bureau of the Census. Their original size was that of a United > States Dollar bill. This was set by their inventor, Herman Hollerith, > because he could get furniture to store the cards from the United > States Treasury Department, just across the street. Likewise, physical > constraints limited each card to 80 columns of holes in which to record > a symbol. > > The influence of the punch card lingered on long after the invention of > magnetic tapes and disk for data storage. This is why early video > display terminals were 80 columns across. Even today, files which > were migrated from cards to magnetic tape files or disk storage still > use 80 column records. > > But the influence was not just on the physical side of data processing. > The methods for handling data from the prior media were imitated in > the new media. > > Data processing first consisted of sorting and merging decks of punch > cards (later, sequential magnetic tape files) in a series of distinct > steps. The result of each step feed into the next step in the process. > > > Relational databases do not work that way. Each user connects to the > entire database all at once, not to one file at time in a sequence of > steps. The users might not all have the same database access rights > once they are connected, however. Magnetic tapes could not be shared > among users at the same time, but shared data is the point of a > database. > > Tables versus Files > > A file is closely related to its physical storage media. A table may > or may not be a physical file. DB2 from IBM uses one file per table, > while Sybase puts several entire databases inside one file. A table is > a <i>set<i> of rows of the same kind of thing. A set has no ordering > and it makes no sense to ask for the first or last row. > > A deck of punch cards is sequential, and so are magnetic tape files. > Therefore, a <i>physical<i> file of ordered sequential records also > became the <i>mental<i> model for data processing and it is still hard > to shake. Anytime you look at data, it is in some physical ordering. > > The various access methods for disk storage system came later, but even > these access methods could not shake the mental model. > > Another conceptual difference is that a file is usually data that deals > with a whole business process. A file has to have enough data in > itself to support applications for that business process. Files tend > to be "mixed" data which can be described by the name of the business > process, such as "The Payroll file" or something like that. > > Tables can be either entities or relationships within a business > process. This means that the data which was held in one file is often > put into several tables. Tables tend to be "pure" data which can be > described by single words. The payroll would now have separate tables > for timecards, employees, projects and so forth. > > Tables as Entities > > An entity is physical or conceptual "thing" which has meaning be > itself. A person, a sale or a product would be an example. In a > relational database, an entity is defined by its attributes, which are > shown as values in columns in rows in a table. > > To remind users that tables are sets of entities, I like to use plural > or collective nouns that describe the function of the entities within > the system for the names of tables. Thus "Employee" is a bad name > because it is singular; "Employees" is a better name because it is > plural; "Personnel" is best because it is collective and does not > summon up a mental picture of individual persons. > > If you have tables with exactly the same structure, then they are sets > of the same kind of elements. But you should have only one set for > each kind of data element! Files, on the other hand, were PHYSICALLY > separate units of storage which coudl be alike -- each tape or disk > file represents a step in the PROCEDURE , such as moving from raw data, > to edited data, and finally to archived data. In SQL, this should be a > status flag in a table. > > Tables as Relationships > > A relationship is shown in a table by columns which reference one or > more entity tables. Without the entities, the relationship has no > meaning, but the relationship can have attributes of its own. For > example, a show business contract might have an agent, an employer and > a talent. The method of payment is an attribute of the contract > itself, and not of any of the three parties. > > Rows versus Records > > Rows are not records. A record is defined in the application program > which reads it; a row is defined in the database schema and not by a > program at all. The name of the field in the READ or INPUT statements > of the application; a row is named in the database schema. > > All empty files look alike; they are a directory entry in the operating > system with a name and a length of zero bytes of storage. Empty tables > still have columns, constraints, security privileges and other > structures, even tho they have no rows. > > This is in keeping with the set theoretical model, in which the empty > set is a perfectly good set. The difference between SQL's set model > and standard mathematical set theory is that set theory has only one > empty set, but in SQL each table has a different structure, so they > cannot be used in places where non-empty versions of themselves could > not be used. > > Another characteristic of rows in a table is that they are all alike in > structure and they are all the "same kind of thing" in the model. In a > file system, records can vary in size, datatypes and structure by > having flags in the data stream that tell the program reading the data > how to interpret it. The most common examples are Pascal's variant > record, C's struct syntax and Cobol's OCCURS clause. > > The OCCURS keyword in Cobol and the Variant records in Pascal have a > number which tells the program how many time a record structure is to > be repeated in the current record. > > Unions in 'C' are not variant records, but variant mappings for the > same physical memory. For example: > > union x {int ival; char j[4];} myStuff; > > defines myStuff to be either an integer (which are 4 bytes on most > modern C compilers, but this code is non-portable) or an array of 4 > bytes, depending on whether you say myStuff.ival or myStuff.j[0]; > > But even more than that, files often contained records which were > summaries of subsets of the other records -- so called control break > reports. There is no requirement that the records in a file be related > in any way -- they are literally a stream of binary data whose meaning > is assigned by the program reading them. > > Columns versus Fields > > A field within a record is defined by the application program that > reads it. A column in a row in a table is defined by the database > schema. The datatypes in a column are always scalar. > > The order of the application program variables in the READ or INPUT > statements is important because the values are read into the program > variables in that order. In SQL, columns are referenced only by their > names. Yes, there are shorthands like the SELECT * clause and INSERT > INTO <table name> statements which expand into a list of column names > in the physical order in which the column names appear within their > table declaration, but these are shorthands which resolve to named > lists. > > The use of NULLs in SQL is also unique to the language. Fields do not > support a missing data marker as part of the field, record or file > itself. Nor do fields have constraints which can be added to them in > the record, like the DEFAULT and CHECK() clauses in SQL. > > Relationships among tables within a database > > Files are pretty passive creatures and will take whatever an > application program throws at them without much objection. Files are > also independent of each other simply because they are connected to one > application program at a time and therefore have no idea what other > files looks like. > > A database actively seeks to maintain the correctness of all its data. > The methods used are triggers, constraints and declarative referential > integrity. > > Declarative referential integrity (DRI) says, in effect, that data in > one table has a particular relationship with data in a second (possibly > the same) table. It is also possible to have the database change > itself via referential actions associated with the DRI. > > For example, a business rule might be that we do not sell products > which are not in inventory. This rule would be enforce by a REFERENCES > clause on the Orders table which references the Inventory table and a > referential action of ON DELETE CASCADE > > Triggers are a more general way of doing much the same thing as DRI. A > trigger is a block of procedural code which is executed before, after > or instead of an INSERT INTO or UPDATE statement. You can do anything > with a trigger that you can do with DRI and more. > > However, there are problems with TRIGGERs. While there is a standard > syntax for them in the SQL-92 standard, most vendors have not > implemented it. What they have is very proprietary syntax instead. > Secondly, a trigger cannot pass information to the optimizer like DRI. > In the example in this section, I know that for every product number in > the Orders table, I have that same product number in the Inventory > table. The optimizer can use that information in setting up EXISTS() > predicates and JOINs in the queries. There is no reasonable way to > parse procedural trigger code to determine this relationship. > > The CREATE ASSERTION statement in SQL-92 will allow the database to > enforce conditions on the entire database as a whole. An ASSERTION is > not like a CHECK() clause, but the difference is subtle. A CHECK() > clause is executed when there are rows in the table to which it is > attached. If the table is empty then all CHECK() clauses are > effectively TRUE. Thus, if we wanted to be sure that the Inventory > table is never empty, and we wrote: > > CREATE TABLE Inventory > ( ... > CONSTRAINT inventory_not_empty > CHECK ((SELECT COUNT(*) FROM Inventory) > 0), ... ); > > it would not work. However, we could write: > > CREATE ASSERTION Inventory_not_empty > CHECK ((SELECT COUNT(*) FROM Inventory) > 0); > > and we would get the desired results. The assertion is checked at the > schema level and not at the table level. > > >> The answer is that this is true 100% of the time in the real world. > code to extract each number and convert them to numeric values so that they> See how easy it is to confuse a data element with its representation! > They are scalar only when they are together. This is one thing Chris > Date and i agree on :) > > >> If you use a comma separated character field [sic], you have to write can be compared << Show quote > > Actually, Moreau and Ben-Gan's path enumeration model can be handled > nicely with string operrtions and grep() patterns that do not require > extraction. > > Go over to www.dbdebunk.com and look up Date's paper on 1NF. > >> I think that there should be a fourth dimension: time. You should be able to create queries like: what was the price of this part on such-n-such a date? Who changed what where, when and how? << Read Rick Snodgrass's temporal proposals at his website at theUniversity of Arizona. They never made it into the Standard, but he has a ton of SQL for "faking it". >> It appears that you and C. J. Date don't agree on very much. << Actually we do. In that particular exchange, he was the one whoprovided the referencs to the papers by Codd that had some of the various versions of RM, RM II, and now he has his own version. You might want to look at other implementation of SQL which do not physically repeat the same values in PK=fk relationships (Sand, SQL Anywhere, etc.) >> I still think that there are two problems with the relational model. (1) It is precisely due to the potential volatility of key values, did >> it allows volatile primary key values, and (2) it only has three >> dimensions. First of all, the Information Rule SHOULD read, "Every datum >> (atomic value) in the database is located in only one place and can be >> accessed using (1) the table name, (2) the primary key value, and (3) the >> column name." Using natural keys in DRI relationships introduces >> redundancy into the database which increases the complexity of managing >> locks and security. relational proponents consider stability as a desirable criteria while choosing a primary key. Lock management is a physical issue, and has nothing to do with the logical model. Date's work on inclusion dependencies(INCD) in early 80s partly deals with this very issue where declarative constraints on volatile keys. Some of his findings and proposals can be found under the chapters "Referential Integrity and Foreign Keys" - Parts I, II & III in his book Relational Database Writings 85-89. >> I think that there should be a fourth dimension: time. You should be True, facilities for temporal(semi- as well as bi- temporal) expressions >> able to create queries like: what was the price of this part on >> such-n-such a date? Who changed what where, when and how? In other >> words, which user changed which column in which row of which table at >> what time and with which application? You should be able to query what >> the state of the database was at a specified point in time. I have a >> workaround for both problems, but I think it should be a built-in >> feature. should be a built-in feature, but that is not a relational drawback. Relational approaches to temporal databases have been around for quite some time now. Since no SQL vendors have implemented any of the proposals and no useful temporal type other types to represent discrete points in time exist in current products, even the existence of such approaches is not widely known. Snodgrass' work is high quality but some areas of his work are suspect, like the suggestion of hidden timestamps. The other popular one ( by Darwen, Lorenzos & Date), though addresses some implementation aspects, is strictly from a logical perspective. The fundamentals on temporal constraints and granularity issues are well addressed and can be found in their book: Temporal Data & the Relational Model. You can also find a preview at: http://web.onetel.com/~hughdarwen/TheThirdManifesto/TemporalData.Warwick.pdf You will also find Date's 6NF ( "irreducible relation" : one with a key and at most one other attribute ) is relation to fully temporal databases, though it is not mandatory. >> It appears that you and C. J. Date don't agree on very much. Perhaps on some issues, but Joe actually suggested you look up his paper on 1NF at: http://www.dbdebunk.com/page/page/629796.htm It is simple yet an outstanding work especially since so many freelance gurus these days offer 1NF suggestions based on ill conceived notions of repeated groups, "look-a-like" columns etc. The proposed 1NF requirements have been detailed in Date's recent version of Intro Book as the properties of a relation as well which can be paraphrased as: A table is said to be in 1NF, if it directly and faithfully represents some relation. This means, the following properties must be necessary & sufficient for a table to be in 1NF 1. There is no ordering to the rows or columns. 2. There are no duplicate rows, so that a key exists by definition 3. Every row-column intersection contains exactly one value from the applicable type. 4. All columns are regular ( no hidden columns, object ids etc. ) And strictly speaking, Nulls would invalidate 1NF since it is not a value ( other implications like information rule violation etc. ) but it is often discounted. Thus without reference to the underlying type claiming a CSV list violates 1NF is moot. If the system supports a CSV list type( with CSV list values and operators on those values), having a CSV list as an attribute value declared on that type would not violate the 1NF requirement. -- Anith > Thus without reference to the underlying type claiming a CSV list violates My primary objection to implementing a CSV list domain and defining> 1NF is moot. If the system supports a CSV list type( with CSV list values > and operators on those values), having a CSV list as an attribute value > declared on that type would not violate the 1NF requirement. relations containing attributes with a CSV list domain is that a CSV list is by definition not atomic. A CSV list is an ordered set of zero or more values, separated by commas. Every datum (atomic value) in a relational database can be accessed if you know the table name, the primary key value, and the column name. How can you write a query to access the third element in a CSV list column using only the table name, the primary key value, and the column name. The additional dimension (element number) makes this impossible. It follows then that a database that implements a CSV list domain is by definition not a relational database. Show quote "Anith Sen" <an***@bizdatasolutions.com> wrote in message http://web.onetel.com/~hughdarwen/TheThirdManifesto/TemporalData.Warwick.pdfnews:eLMC23YfFHA.3944@TK2MSFTNGP10.phx.gbl... > >> I still think that there are two problems with the relational model. (1) > >> it allows volatile primary key values, and (2) it only has three > >> dimensions. First of all, the Information Rule SHOULD read, "Every datum > >> (atomic value) in the database is located in only one place and can be > >> accessed using (1) the table name, (2) the primary key value, and (3) the > >> column name." Using natural keys in DRI relationships introduces > >> redundancy into the database which increases the complexity of managing > >> locks and security. > > It is precisely due to the potential volatility of key values, did > relational proponents consider stability as a desirable criteria while > choosing a primary key. Lock management is a physical issue, and has nothing > to do with the logical model. > > Date's work on inclusion dependencies(INCD) in early 80s partly deals with > this very issue where declarative constraints on volatile keys. Some of his > findings and proposals can be found under the chapters "Referential > Integrity and Foreign Keys" - Parts I, II & III in his book Relational > Database Writings 85-89. > > >> I think that there should be a fourth dimension: time. You should be > >> able to create queries like: what was the price of this part on > >> such-n-such a date? Who changed what where, when and how? In other > >> words, which user changed which column in which row of which table at > >> what time and with which application? You should be able to query what > >> the state of the database was at a specified point in time. I have a > >> workaround for both problems, but I think it should be a built-in > >> feature. > > True, facilities for temporal(semi- as well as bi- temporal) expressions > should be a built-in feature, but that is not a relational drawback. > > Relational approaches to temporal databases have been around for quite some > time now. Since no SQL vendors have implemented any of the proposals and no > useful temporal type other types to represent discrete points in time exist > in current products, even the existence of such approaches is not widely > known. > > Snodgrass' work is high quality but some areas of his work are suspect, like > the suggestion of hidden timestamps. > > The other popular one ( by Darwen, Lorenzos & Date), though addresses some > implementation aspects, is strictly from a logical perspective. The > fundamentals on temporal constraints and granularity issues are well > addressed and can be found in their book: Temporal Data & the Relational > Model. You can also find a preview at: > Show quote > > You will also find Date's 6NF ( "irreducible relation" : one with a key and > at most one other attribute ) is relation to fully temporal databases, > though it is not mandatory. > > >> It appears that you and C. J. Date don't agree on very much. > > Perhaps on some issues, but Joe actually suggested you look up his paper on > 1NF at: http://www.dbdebunk.com/page/page/629796.htm > > It is simple yet an outstanding work especially since so many freelance > gurus these days offer 1NF suggestions based on ill conceived notions of > repeated groups, "look-a-like" columns etc. > > The proposed 1NF requirements have been detailed in Date's recent version of > Intro Book as the properties of a relation as well which can be paraphrased > as: A table is said to be in 1NF, if it directly and faithfully represents > some relation. This means, the following properties must be necessary & > sufficient for a table to be in 1NF > 1. There is no ordering to the rows or columns. > 2. There are no duplicate rows, so that a key exists by definition > 3. Every row-column intersection contains exactly one value from the > applicable type. > 4. All columns are regular ( no hidden columns, object ids etc. ) > And strictly speaking, Nulls would invalidate 1NF since it is not a value > ( other implications like information rule violation etc. ) but it is often > discounted. > > Thus without reference to the underlying type claiming a CSV list violates > 1NF is moot. If the system supports a CSV list type( with CSV list values > and operators on those values), having a CSV list as an attribute value > declared on that type would not violate the 1NF requirement. > > -- > Anith > > >> My primary objection to implementing a CSV ist domain and defining Whether some value is atomic or not is purely subjective and contextual; In >> relations containing attributes with a CSV list domain is that a CSV list >> is by definition not atomic. some contexts the "Bryan" is an atomic value while in some other context it may be a list of letters. In some contexts the literal "2005/06/30" defines a single date value while in some other contexts it is comprised of three distinct values representing year, month and time. Thus a definition of 1NF on a loose term like "atomicity" alone is neither precise nor formal. For a formal definition one would need a reference to something which is non-contextual and unambiguous. A type or domain provides that. A type\ domain is a set of values along with associated operators applicable on those values. Thus in a good type system, one could have a type of arbitrary complexity and values are scalar with respect to the type. >> A CSV list is an ordered set of zero or more values, separated by commas. In some contexts it is. In some other contexts, it is just as scalar as an int or datetime or char(10) value. A well defined type could precisely capture the context and represent the values in a relational database appropriately. >> Every datum (atomic value) in a relational database can be accessed if Nobody disagrees with that.>> you know the table name, the primary key value, and the column name. >> How can you write a query to access the third element in a CSV list Using the operators applicable to the CSV list type.>> column using only the table name, the primary key value, and the column >> name. To use an analogy, how do you extract the month from a datetime value? Using DATEPART operator, which is an operator applicable to the DATETIME type. >> The additional dimension (element number) makes this impossible. Well, that could be an argument to the operator, if an element needs to be extracted based on its index or position in the list. >> It follows then that a database that implements a CSV list domain is by No, it does not follow anything of that sort. The link on 1NF provided early >> definition not a relational database in this thread should clarify such misconceptions. -- Anith Thanks for the informative posts all! I can practically feel my brain
growing... Now time to regulate. :-) CELKO wrote: ">> The answer is that this is true 100% of the time in the real world. Longitude and lattitude are two distinct numbers. << See how easy it is to confuse a data element with its representation! They are scalar only when they are together. This is one thing Chris Date and i agree on :) " You are incorrect if you are you implying that longitude/latitude together constitue a scalar. A tell-tale indicator of a scalar is that it cannot be broken down into a smaller logical part. A longitude would be a scalar, a longitude+latitude would not. Anith wrote: ">> How can you write a query to access the third element in a CSV list >> column using only the table name, the primary key value, and the column >> name. Using the operators applicable to the CSV list type.To use an analogy, how do you extract the month from a datetime value? Using DATEPART operator, which is an operator applicable to the DATETIME type. " Semantics. You muddy a pure theorectical concept with a human's interference. Using that same logic, one could stuff any crap into a database, call it a data element, and code an operator applicable to the CRAP datatype. Then you could point at your database and claim it still adhered to 1NF, no? ">> It follows then that a database that implements a CSV list domain is by >> definition not a relational database No, it does not follow anything of that sort. The link on 1NF provided early in this thread should clarify such misconceptions. " Boo! Hsss! No points for you. I checked out the link. It's pay-per-view. You cannot use something to prove your case that someone else must pay to see. Enter the cult leader who offers to sell you the mysteries of life in 3 easy installments of 19.95 and if you don't pay up, you're clearly one of the unenlightened masses. ;-) Let me insert some of my (admittedly limited) depth of knowledge into this discussion as I think the core philosophy of relational db's is being forgotten. As Celko points out, current databases have evolved from the flat, 2-dimensional constructs of the past into the robust, 3-dimensional constructs of today, HOWEVER, the point of them remains the same: storing and accessing data. As we get better and better at that, we've come to understand that nature has been doing a better job of it all along. Our fumbling attempts at relational db's are, in fact, a conscious (or subconscious) attempt to mimic nature. No longer do we have to rely on static hardcopy, now we have the electronic ether of our PC's as our toolbox. An entire virtual universe sits humming under our desks and we have a 17-inch window into it. When we have these discussions of what does or does not constitute a "relational" system we must keep the above in mind and focus not on the letter of the laws (ok, yes, you can argue that CSV data could be incorporated as a single data element adhering to 1NF) but on the *spirit* of the laws. We're trying to mimic nature. We're trying to create a robust, self-correcting, self-enforcing, multi-dimensional construct that will not allow itself to be compromised. If you adhere to the spirit of the concept well, I should be able to peruse your database from any angle and find the same logical atomic setup. I should be able to drill down to the most basic elements, not worrying about custom datatypes (and associated operators) that individual programmers created willy-nilly for their own short-sighted purposes. Cheers, Roy <roy.ander***@gmail.com> wrote in message
news:1120180243.548210.152550@g49g2000cwa.googlegroups.com... It all depends on the application. In the types of geocoding> > You are incorrect if you are you implying that longitude/latitude > together constitue a scalar. A tell-tale indicator of a scalar is that > it cannot be broken down into a smaller logical part. A longitude would > be a scalar, a longitude+latitude would not. applications I work with, a longitude without a lattitude is utterly meaningless. Combine the two and we have something. The two, in these applications, are not at all seperable. See Anith's post about atomicity. > Semantics. You muddy a pure theorectical concept with a human's Yes. 1NF is really defined by the needs and semantics of each> interference. Using that same logic, one could stuff any crap into a > database, call it a data element, and code an operator applicable to > the CRAP datatype. Then you could point at your database and claim it > still adhered to 1NF, no? particular database. > When we have these discussions of what does or does not constitute a You need to take a course in first order logic. It doesn't matter what> "relational" system we must keep the above in mind and focus not on the > letter of the laws (ok, yes, you can argue that CSV data could be > incorporated as a single data element adhering to 1NF) but on the > *spirit* of the laws. We're trying to mimic nature. We're trying to the "spirit" is, as long as you can prove it. 1NF is not about some idealistic view of data management. It's a way of logically describing how to avoid redundancy. Nothing more -- and nothing less. > It all depends on the application. In the types of geocoding The fact that they may be meaningless when separated does not change the> applications I work with, a longitude without a lattitude is utterly > meaningless. Combine the two and we have something. The two, in these > applications, are not at all seperable. See Anith's post about atomicity. fact that each component element means something different. If you isolate the attribute AreaCode from the Employee relation, it is also meaningless. The AreaCode for what phone number for whom? If the position or existence of an element of a data type imparts meaning to instances of that element, then those instances are atomic values with respect to the database. An instance of a data type cannot be atomic with respect to the database if any component element value is atomic with respect to the database. Show quote "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message news:udGcTWefFHA.3448@TK2MSFTNGP12.phx.gbl... > <roy.ander***@gmail.com> wrote in message > news:1120180243.548210.152550@g49g2000cwa.googlegroups.com... > > > > You are incorrect if you are you implying that longitude/latitude > > together constitue a scalar. A tell-tale indicator of a scalar is that > > it cannot be broken down into a smaller logical part. A longitude would > > be a scalar, a longitude+latitude would not. > > It all depends on the application. In the types of geocoding > applications I work with, a longitude without a lattitude is utterly > meaningless. Combine the two and we have something. The two, in these > applications, are not at all seperable. See Anith's post about atomicity. > > > > Semantics. You muddy a pure theorectical concept with a human's > > interference. Using that same logic, one could stuff any crap into a > > database, call it a data element, and code an operator applicable to > > the CRAP datatype. Then you could point at your database and claim it > > still adhered to 1NF, no? > > Yes. 1NF is really defined by the needs and semantics of each > particular database. > > > > When we have these discussions of what does or does not constitute a > > "relational" system we must keep the above in mind and focus not on the > > letter of the laws (ok, yes, you can argue that CSV data could be > > incorporated as a single data element adhering to 1NF) but on the > > *spirit* of the laws. We're trying to mimic nature. We're trying to > > You need to take a course in first order logic. It doesn't matter what > the "spirit" is, as long as you can prove it. 1NF is not about some > idealistic view of data management. It's a way of logically describing how > to avoid redundancy. Nothing more -- and nothing less. > > > -- > Adam Machanic > SQL Server MVP > http://www.datamanipulation.net > -- > > "Brian Selzer" <br***@selzer-software.com> wrote in message By that argument, no column can EVER be said to be 'atomic'. Fornews:%23gmZCRifFHA.3940@TK2MSFTNGP14.phx.gbl... > > The fact that they may be meaningless when separated does not change the > fact that each component element means something different. If you isolate > the attribute AreaCode from the Employee relation, it is also meaningless. > The AreaCode for what phone number for whom? instance, I can split up any character string into its component characters, and further decompose those characters into bits. They all have meaning to someone, somewhere, right? > By that argument, no column can EVER be said to be 'atomic'. For That is absurd. An attribute must add semantic context to its relation, or> instance, I can split up any character string into its component characters, > and further decompose those characters into bits. They all have meaning to > someone, somewhere, right? it wouldn't be included in the logical data model. A date value in the HireDate column is not only a date, it represents the date that a specific employee was hired. A character string derives its value from the permutation of its component character values. Regardless of the underlying physical representation of a character, the value is the same whether its Unicode, ASCII or EBCDIC. Each individual character in the permutation does not represent anything other than its own value. In the same way an integer derives its value from the permutation of its component bits, which do not represent anything other than their own values. Show quote "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message news:OCqLCqifFHA.3436@tk2msftngp13.phx.gbl... > "Brian Selzer" <br***@selzer-software.com> wrote in message > news:%23gmZCRifFHA.3940@TK2MSFTNGP14.phx.gbl... > > > > The fact that they may be meaningless when separated does not change the > > fact that each component element means something different. If you > isolate > > the attribute AreaCode from the Employee relation, it is also meaningless. > > The AreaCode for what phone number for whom? > > By that argument, no column can EVER be said to be 'atomic'. For > instance, I can split up any character string into its component characters, > and further decompose those characters into bits. They all have meaning to > someone, somewhere, right? > > > -- > Adam Machanic > SQL Server MVP > http://www.datamanipulation.net > -- > > > You've hit on exactly the point I made in my response. The decision
about whether a column is atomic or not is based on *semantics* and therefore isn't subject to a formal definition. The Relational Model is concerned with representation and that's why it doesn't have anything to say on this point. You can formulate a rule of thumb if you like but it's only that, and it won't change the meaning if 1NF. -- David Portas SQL Server MVP -- The decision about whether or not a functional dependency exists is based on
*semantics* as well, yet it has been formally defined. I've seen so many databases that have tables with repeating groups or columns with comma-delimited lists, and every time I ended up with a migraine trying to work with them. Theorists who advocate their use should be burned at the stake for heresy. Show quote "David Portas" <REMOVE_BEFORE_REPLYING_dpor***@acm.org> wrote in message news:1120248812.141505.51340@g49g2000cwa.googlegroups.com... > You've hit on exactly the point I made in my response. The decision > about whether a column is atomic or not is based on *semantics* and > therefore isn't subject to a formal definition. The Relational Model is > concerned with representation and that's why it doesn't have anything > to say on this point. You can formulate a rule of thumb if you like but > it's only that, and it won't change the meaning if 1NF. > > -- > David Portas > SQL Server MVP > -- > The difference is that Functional Dependencies can be represented in theory
by the notional existence of a function - a transformation that maps one set of values onto another. That is a mathematical concept. It's very hard to define such a formal concept for atomicity that's useful for every possible type. You want to forbid certain domains because you regard them as non-atomic but is it really sufficient, as Codd originally suggested, to say that non-atomic means that there exists some transformation to some smaller more "primitive" domain of values? I don't think so. Here's an example. How do you define the atomicity of a digital graphic image? We have types for digital images so an RDBMS may also come equipped with operators to transform images into other images - by slicing them into smaller images for example. So when is an image non-atomic? Is an atomic image one that contains only a single pixel? What use would such a restriction have in theory or in practice? Or maybe you want to require that an image is only non-atomic if smaller transformations of it have some meaning in themselves. Again, this seems like an arbitrary restriction. Smaller images may well be useful and meaningful to the user. With digital images we frequently want to render detailed clips of much larger graphics. Would you really want to disallow a domain from containing an image that was a clip of a larger one in the same domain? If so, how do you define which images are permitted in that domain and which aren't? Apparently your definition may have to be self-referential which poses some serious problems in defining constraints and other operations on that domain. -- David Portas SQL Server MVP -- > ... to say that non-atomic means that there exists some I would use the term resolution instead of transformation, because> transformation to some smaller more "primitive" domain of values? transformation implies change whereas resolution implies separation, but yes, I would agree with that. > ... How do you define the atomicity of a digital An image data type would be atomic. An image is a two-dimensional visual> graphic image? ... representation of something. The mechanism that renders the image for human consumption determines the value of the image, so both must be implementation-specific. The range of each dimension in an image may be continuous (a vector drawing) or not (a raster drawing), which makes it impossible to decompose a conceptual image into its individual component parts without losing information. I would argue that in order to maintain integrity, each image that was generated by transforming another image must be related to the original image, and the transformation operations must be performed any time the original image changes. It would be preferable, theoretically, to perform the transformations when an image is rendered for human consumption. That way only one copy of the image exists in the database. Show quote "David Portas" <REMOVE_BEFORE_REPLYING_dpor***@acm.org> wrote in message news:aPqdneoM7YgwBlvfRVn-vQ@giganews.com... > The difference is that Functional Dependencies can be represented in theory > by the notional existence of a function - a transformation that maps one set > of values onto another. That is a mathematical concept. > > It's very hard to define such a formal concept for atomicity that's useful > for every possible type. You want to forbid certain domains because you > regard them as non-atomic but is it really sufficient, as Codd originally > suggested, to say that non-atomic means that there exists some > transformation to some smaller more "primitive" domain of values? I don't > think so. Here's an example. How do you define the atomicity of a digital > graphic image? We have types for digital images so an RDBMS may also come > equipped with operators to transform images into other images - by slicing > them into smaller images for example. So when is an image non-atomic? Is an > atomic image one that contains only a single pixel? What use would such a > restriction have in theory or in practice? > > Or maybe you want to require that an image is only non-atomic if smaller > transformations of it have some meaning in themselves. Again, this seems > like an arbitrary restriction. Smaller images may well be useful and > meaningful to the user. With digital images we frequently want to render > detailed clips of much larger graphics. Would you really want to disallow a > domain from containing an image that was a clip of a larger one in the same > domain? If so, how do you define which images are permitted in that domain > and which aren't? Apparently your definition may have to be self-referential > which poses some serious problems in defining constraints and other > operations on that domain. > > -- > David Portas > SQL Server MVP > -- > > >> ... How do you define the atomicity of a digital Fine, but my point was do you have a rigorous theoretical basis for that >> graphic image? ... > > An image data type would be atomic. statement? I don't see it. Intuitively you might feel that an image type is atomic but this is where we came in: Atomicity isn't a formally defined property of a domain or value, rather it's a subjective decision we make based on our knowledge of the reality we intend to model and of the capabilities of our RDBMS. When stating your formal definition of why an image type is atomic, assume that the digital image type is only defined as a simple bitmap that may therefore be subdivided into multiple smaller bitmaps. The implication of this is that if you add attributes to define how the bitmap pieces fit together you can always decompose any bitmap larger than one pixel without loss of information. As far as I can see, this is exactly analogous to the example of an ordered delimited list. -- David Portas SQL Server MVP -- A couple of better examples to further your point David, are phone numbers and
street addresses. In some systems, the phone number stored as a single value is sufficiently atomic for the purposes of the problem domain. However, for the phone company, it would be important to break up the number into its constituent pieces. Street addresses pose the same problem. In most systems, "123 N. Main St." is sufficiently encapsulated for any all queries against the system. However, if you are building a system to be used by a delivery company, this is insufficient and must be broken into it's constituent pieces (123, N, Main, St). The determination of atomic is wholly dependent on the problem domain. There are many such examples, including delimited lists, where a value can be considered atomic in its raw form. Said another way, the determination of whether a value is sufficiently atomic is wholly dependent on how it is used in the system. Thomas Show quote "David Portas" <REMOVE_BEFORE_REPLYING_dpor***@acm.org> wrote in message news:qOWdnVHVyuyUvFXfRVn-gw@giganews.com... >>> ... How do you define the atomicity of a digital >>> graphic image? ... >> >> An image data type would be atomic. > > Fine, but my point was do you have a rigorous theoretical basis for that > statement? I don't see it. Intuitively you might feel that an image type is > atomic but this is where we came in: Atomicity isn't a formally defined > property of a domain or value, rather it's a subjective decision we make based > on our knowledge of the reality we intend to model and of the capabilities of > our RDBMS. > > When stating your formal definition of why an image type is atomic, assume > that the digital image type is only defined as a simple bitmap that may > therefore be subdivided into multiple smaller bitmaps. The implication of this > is that if you add attributes to define how the bitmap pieces fit together you > can always decompose any bitmap larger than one pixel without loss of > information. As far as I can see, this is exactly analogous to the example of > an ordered delimited list. > > -- > David Portas > SQL Server MVP > -- > > "Brian Selzer" <br***@selzer-software.com> wrote in message And, in my databases, a longitude without a latitude is nothing but anews:OM%23eJFnfFHA.1948@TK2MSFTNGP12.phx.gbl... > > employee was hired. A character string derives its value from the > permutation of its component character values. Regardless of the underlying > physical representation of a character, the value is the same whether its > Unicode, ASCII or EBCDIC. Each individual character in the permutation does > not represent anything other than its own value. In the same way an integer > derives its value from the permutation of its component bits, which do not > represent anything other than their own values. meaningless decimal. Add a latitude to that longitude and we suddenly get a datatype -- a point. We can do various operations on the point. We can compute distances to other points, we can add offsets based on direction, etc. We can do none of those operations on JUST a longitude. The longitude itself, within my databases, just like a single bit of an integer, is meaningless. But in some other database, perhaps some computation is possible based on longitude, and some other computations are possible based on latitude. And in THAT database, the combination of longitude and latitude would not be atomic. Likewise with characters. For instance, consider the following string: BRIAN Is that string 'atomic'? I don't know! In a given database, there might be rules defined to perform logic based on the 5 characters in the string. Perhaps each character represents an access level to some resource. In such a case, based on your previous posts to this thread, I'm assuming that you would probably expect to see: B,R,I,A,N But the commas are not what makes this string problematic, because as we've seen above, commas are not required for the access control scheme. We can parse the comma-less version and get the same net result. The problem is, in both cases the string is not being treated as merely a string -- it's being treated as a collection of characters. And that's a matter of semantics, because we cas also insert that same string into a column called 'FirstName' -- and, assuming that column is used the way we understand first names to be used, the string suddenly can be considered 'atomic'. > And, in my databases, a longitude without a latitude is nothing but a You're wrong. Because it is a longitude, it can only be used as a> meaningless decimal. Add a latitude to that longitude and we suddenly get a > datatype -- a point. longitude. It cannot be used as a lattitude. It cannot be used as a price. It carries with it the definition and constraints associated with its domain. (You may not have defined any constraints, but if you didn't, you probably should have.) It has a predefined range of possible values, and represents the distance of a place east or west of an imaginary line from the top to the bottom of the Earth. It is NOT just a meaningless decimal. > But in some other database, perhaps some computation is possible based What happens when THAT database is combined with your database.> on longitude, and some other computations are possible based on latitude. > And in THAT database, the combination of longitude and latitude would not be > atomic. Your other example should not merit comment, because stuffing flags in a character column is a clear violation of 1NF. Each flag has a separate purpose, therefore each flag value in a row is an atomic value with respect to the relational model. The fact that during implementation you chose to stuff the flags into a single character column is an example of denormalizing a relation for some implementation-specific reason. Show quote "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message news:uKx4vMwfFHA.2700@TK2MSFTNGP15.phx.gbl... > "Brian Selzer" <br***@selzer-software.com> wrote in message > news:OM%23eJFnfFHA.1948@TK2MSFTNGP12.phx.gbl... > > > > employee was hired. A character string derives its value from the > > permutation of its component character values. Regardless of the > underlying > > physical representation of a character, the value is the same whether its > > Unicode, ASCII or EBCDIC. Each individual character in the permutation > does > > not represent anything other than its own value. In the same way an > integer > > derives its value from the permutation of its component bits, which do not > > represent anything other than their own values. > > And, in my databases, a longitude without a latitude is nothing but a > meaningless decimal. Add a latitude to that longitude and we suddenly get a > datatype -- a point. We can do various operations on the point. We can > compute distances to other points, we can add offsets based on direction, > etc. We can do none of those operations on JUST a longitude. The longitude > itself, within my databases, just like a single bit of an integer, is > meaningless. > > But in some other database, perhaps some computation is possible based > on longitude, and some other computations are possible based on latitude. > And in THAT database, the combination of longitude and latitude would not be > atomic. > > Likewise with characters. For instance, consider the following string: > > BRIAN > > Is that string 'atomic'? I don't know! In a given database, there > might be rules defined to perform logic based on the 5 characters in the > string. Perhaps each character represents an access level to some resource. > In such a case, based on your previous posts to this thread, I'm assuming > that you would probably expect to see: > > B,R,I,A,N > > But the commas are not what makes this string problematic, because as > we've seen above, commas are not required for the access control scheme. We > can parse the comma-less version and get the same net result. The problem > is, in both cases the string is not being treated as merely a string -- it's > being treated as a collection of characters. And that's a matter of > semantics, because we cas also insert that same string into a column called > 'FirstName' -- and, assuming that column is used the way we understand first > names to be used, the string suddenly can be considered 'atomic'. > > > -- > Adam Machanic > SQL Server MVP > http://www.datamanipulation.net > -- > > > > "Brian Selzer" <br***@selzer-software.com> wrote in message EXACTLY!news:errQq62fFHA.3316@TK2MSFTNGP14.phx.gbl... > > to the relational model. The fact that during implementation you chose to > stuff the flags into a single character column is an example of > denormalizing a relation for some --- implementation-specific --- reason. Sorry, Brian, I disagree. Atomicity always depends on the system. For example,
suppose we have a mapping program that takes a location in the form (Long, Lat). Further suppose, that there is *never* a need to query on longitude or latitude. Rather, searches are against other attributes and at the end we map the location from the coordinates of the given entity found in the search. Thus, the only time the value will ever be used is to plug into this mapping software to map a geographical location. It will never be queried against. In that case, it is perfectly acceptable to create a column called "Coordinates" and store these two values as single point. The values individually have no meaning in the context of the system being built. They only have context together in the form (Long, Lat). If we did not accept that atomic was based on the system being designed, we would have no end to debates on whether any given value is sufficiently atomic. VINs could never be stored a single value because it is made of individual attributes. Storage of phone numbers would always require three or four columns. Storage of addresses would always require seven or more columns. Atomicity must always be checked against the needs and requirements of the system being built lest we get mired in "analysis-paralysis." Thomas > Semantics. You muddy a pure theorectical concept with a human's Amen brother. I don't know how many times I've seen columns filled with > interference. Using that same logic, one could stuff any crap into a > database, call it a data element, and code an operator applicable to > the CRAP datatype. Then you could point at your database and claim it > still adhered to 1NF, no? crap like '1~3~0~9~2' or '1,,3,5,,6'. Who cares whether it might technically qualify as 1NF? Note, none of the databases I saw with that had any operator for those fields, all hardcoded in the application layer. What if I want to add another element after the ~2? How unnecessarily complicated does this then become? Do you need to update your operator first? Still think you have a scalable and reliable system? How are you going to link a description in another table to ~3 with a foreign key when it's not a column? At the very least, the first part of your organization's best practices should be 'DO NOT implement the CRAP datatype or you will be sacked.' wbrianwhite wrote:
> Amen brother. I don't know how many times I've seen columns filled with As opposed to my current job:> crap like '1~3~0~9~2' or '1,,3,5,,6'. Who cares whether it might technically > qualify as 1NF? Note, none of the databases I saw with that had any operator > for those fields, all hardcoded in the application layer. What if I want to > add another element after the ~2? How unnecessarily complicated does this > then become? Do you need to update your operator first? Still think you > have a scalable and reliable system? How are you going to link a description > in another table to ~3 with a foreign key when it's not a column? At the > very least, the first part of your organization's best practices should be > 'DO NOT implement the CRAP datatype or you will be sacked.' "You MUST implement the CRAP datatype or you will be sacked." And let's not forget the ever-infamous contractual job. Company hires shmuck for 3-6 month project to get them off the ground. He does, but it's clearly been done with a minimum amount of design and a maximum amount of ad-hoc quick fixes. It worked just long enough for his paychecks to cash and any contractual obligations to expire. Enter you: You: "Well, it appears that your system sucks. I'll need to virtually start from scratch. My first clue was the CRAP datatype here." Boss, blustery: "What do you mean it sucks? It's been working fine for the past couple months! Are you sure you know what you're talking about?" You, blinking: "Oooooh. Ok, I see what you're saying now. You're right, I was confused there for a minute. It's become clear to me that your real problem are the widgets in the perforationator, I betcha dimes to dollars they've crossed wires with the gibson regulators. Shouldn't take me any time at all to clear up once we get the contract nailed down. The contract expires after a month, you say?" An absurd example--let's say you have the following table:
Orders: Customer CSV, Item CSV, Quantity INT, OrderDate DATETIME 'Joe Smith, 16473 Wills Ave., Apt. 153, Atlanta, Georgia', 'Candle, Red, 1 ft., $1.99', 5, '2005-04-30 11:15:00' 'Bill Black, P.O Box 1535, Arlington, Virginia', 'Candle, Blue, 6 in., $1.29', 3, '2005-05-02 16:18:00' 'Bill Black, P.O Box 1535, Arlington, Virginia', 'Candle, White, 3 in., $0.79', 2, '2005-05-15 12:32:00' In the strict sense, this table is in 1NF, because each attribute contains only one value. In my opinion, there must be some definitive criteria that separates the ridiculous example above and the equally ridiculous but opposite characterization that a string is not atomic because it is a list of character values. (A date can be thought of as the number of days since some arbitrary point in time, and as such, year, month and day are derived values; therefore, a date value is always scalar.) Let's explore this. The permutation of character values determines the meaning of a string. Each individual character has no significance other than its own value, nor does its inclusion or position in a string impart any semantic context to that value (the character value, not the string value). In most lists, each value represents something tangible, and the entry in the list is a handle, or shorthand for the tangible entity. For example, each item on a to-do list represents some task that must be performed. In other lists, like an ordered pair, the position of the value imparts meaning to the value. For example, in an ordered pair, the first number is the x coordinate, and the second is the y coordinate. Let's generalize a bit more. Given the following definitions for the different kinds of composite values: a collection is an unordered group of zero or more values, not necessarily distinct, that share the same base type; a set is an unordered group of zero or more distinct values that share the same base type; a list is an ordered group of zero or more values, not necessarily distinct, that share the same base type; an n-tuple is an ordered group of n values, whose position determines both type and context. I propose the following formal definition: a composite data type is atomic if and only if: (1) each distinct value is determined solely on the presence and/or order of its component elements' values, (2) all of its componenent elements are atomic, and (3) none of its component elements is a surrogate for a tangible entity. An n-tuple would never be atomic, because each element is a surrogate--that is, each element represents something. A string would always be atomic because its value is determined solely by the permutation of its component character values. A set of integers could be considered atomic if the integer values did not carry any additional meaning. A list of sale dates for item X could never be considered atomic because each date entry carries additional meaning aside from the date value, and consequently each date value is a surrogate for a sale date for item X. Show quote "Anith Sen" <an***@bizdatasolutions.com> wrote in message news:ek3$8VbfFHA.3988@TK2MSFTNGP10.phx.gbl... > >> My primary objection to implementing a CSV ist domain and defining > >> relations containing attributes with a CSV list domain is that a CSV list > >> is by definition not atomic. > > Whether some value is atomic or not is purely subjective and contextual; In > some contexts the "Bryan" is an atomic value while in some other context it > may be a list of letters. In some contexts the literal "2005/06/30" defines > a single date value while in some other contexts it is comprised of three > distinct values representing year, month and time. > > Thus a definition of 1NF on a loose term like "atomicity" alone is neither > precise nor formal. For a formal definition one would need a reference to > something which is non-contextual and unambiguous. A type or domain provides > that. A type\ domain is a set of values along with associated operators > applicable on those values. Thus in a good type system, one could have a > type of arbitrary complexity and values are scalar with respect to the type. > > >> A CSV list is an ordered set of zero or more values, separated by commas. > > In some contexts it is. In some other contexts, it is just as scalar as an > int or datetime or char(10) value. A well defined type could precisely > capture the context and represent the values in a relational database > appropriately. > > >> Every datum (atomic value) in a relational database can be accessed if > >> you know the table name, the primary key value, and the column name. > > Nobody disagrees with that. > > >> How can you write a query to access the third element in a CSV list > >> column using only the table name, the primary key value, and the column > >> name. > > Using the operators applicable to the CSV list type. > > To use an analogy, how do you extract the month from a datetime value? Using > DATEPART operator, which is an operator applicable to the DATETIME type. > > >> The additional dimension (element number) makes this impossible. > > Well, that could be an argument to the operator, if an element needs to be > extracted based on its index or position in the list. > > >> It follows then that a database that implements a CSV list domain is by > >> definition not a relational database > > No, it does not follow anything of that sort. The link on 1NF provided early > in this thread should clarify such misconceptions. > > -- > Anith > > Your proposals may make sense for some real-world products but they are
not consistent with the Relational Model. In fact you are proposing exactly the kind of constraints that some people incorrectly identify and criticize as being the limitations of the model - namely, that it can only cope with certain primitive types of data. These are product limitations, not theoretical ones. Relation-valued attributes, CSV list types, XML types and other "exotic" types are perfectly respectable in theory. Given that one is provided with operators for every built-in datatype, the quality of atomicity of values in that type is primarily a subjective and pragmatic issue - i.e. does your RDBMS support operators and other features that make particular representations of data a practical and useful design. Database professionals understand the features and constraints of the products they use and make judgements about non-atomic values based on their knowledge and experience of what will work well in those products. I'm as guilty as many others of instinctively labelling delimited lists or repeating groups of columns as a violation of 1NF. It's very convenient to roll out that argument sometimes but it's also a tiny bit lazy and prejudiced. It is shaped purely by my experience of the products and situations I've come across. -- David Portas SQL Server MVP -- David,
>> I'm as guilty as many others of instinctively labelling delimited lists Don't be hard on yourself :-) All of us have done it; even some of the early >> or repeating groups of columns as a violation of 1NF. It's very >> convenient to roll out that argument sometimes but it's also a tiny bit >> lazy and prejudiced. It is shaped purely by my experience of the products >> and situations I've come across. papers by Codd has similar suggestions. In fact, till some good research on type inheritance and most specific types (MSTs) became fruitful, most folks familiar with relational databases were stating the same. The true nature of first normal form, at least from the perspective of scalar (encapsulated) types has not been well understood in the early years of relational databases. But then common sense generally tells us, when someone in a public forum requests a CSV to be crammed into a single VARCHAR column it is most likely an informal violation of 1NF. -- Anith >> An absurd example--let's say you have the following table: [snipped for Provided each of those values is drawn from its corresponding well defined >> brevity] In the strict sense, this table is in 1NF, because each >> attribute contains only one value. type. >> In my opinion, there must be some definitive criteria that separates the It is already mentioned that such a definitive criteria can be provided by a >> ridiculous example above and the equally ridiculous but opposite >> characterization that a string is not atomic because it is a list of >> character values. precisely defined type. For a simple introduction to this concept see: http://www.cs.ust.hk/vldb2002/VLDB2002-proceedings/slides/S01P02slides.pdf However, as David said that the discretion regarding type support in a RDBMS is not strictly in the realm of relational theory, which occasionally results in subjective perceptions even within the logical model. >> (A date can be [snipped for brevity] I propose the following formal None of us here need to propose anything new, since qualified technical >> definition: literature is out there on value atomicity, t | |||||||||||||||||||||||