Home All Groups Group Topic Archive Search About

3 Questions on db Theory/Design

Author
29 Jun 2005 6:31 PM
roy.anderson@gmail.com
Hey all,
I'm what some would call a "lower tier" programmer in terms of
education (only have an AAS). As is typical of one who has learnt more
on the job than in the classroom, there are certain aspects of
programming in which my *depth* of knowledge and insight would surprise
you. However, in equal measure, there are certain aspects of
programming in which my *lack* of knowledge would likewise surprise you
as well. I'm about to try to use you to rectify some of my lacking.
:-)

1.) To qualify for 1NF status a table must adhere to a couple of
standards, one of which is that all columns must be the same datatype.
How does one reconcile this in SQL Server with date/time fields? For
instance, lets say I have 2 tables, the first in which all fields are
varchar (including the PK "UID"). Then I have a second table which has
the same varchar "UID" field as the PK, but all the remaining fields
are smalldatetime? That would seem to violate 1NF? Would the work
around to this be to CAST all the smalldatetimes into varchar?

2.) In a related vein... Why is a "relational" database good?
Specificly, using the above example, what benefit am I garnering by
splitting one large table into 2 separate tables? Is a relational db's
primary benefit that it helps prevent data errors or that it makes data
accessing more efficient? Or is it both?

3.) Unrelated but, frequently I see people requesting "DDL" in this
ng... WTH is DDL?! :-)

Thanks much!
Roy

Author
29 Jun 2005 6:38 PM
Adam Machanic
> 1.) To qualify for 1NF status a table must adhere to a couple of
> standards, one of which is that all columns must be the same datatype.
> How does one reconcile this in SQL Server with date/time fields? For
> instance, lets say I have 2 tables, the first in which all fields are
> varchar (including the PK "UID"). Then I have a second table which has
> the same varchar "UID" field as the PK, but all the remaining fields
> are smalldatetime? That would seem to violate 1NF? Would the work
> around to this be to CAST all the smalldatetimes into varchar?

    I think you're misunderstanding 1NF.  There is no requirement for all
columns in a table to be of the same datatype.  The requirement is that
every row in the table must have the same columns, of the same datatype, as
every other row in the table.  The following table is perfectly acceptable:

CREATE TABLE X
(
    Col1 INT NOT NULL PRIMARY KEY,
    Col2 VARCHAR(50),
    Col3 DATETIME,
    Col4 VARCHAR(500),
    Col5 TINYINT
)

Keep in mind that a datatype is more than just the SQL Server datatype.
It's also the logical type of the attribute.  So, for instance, the
following would not be a good idea:

CREATE TABLE X
(
    XType VARCHAR(50),
    XThing VARCHAR(100)
)

INSERT X VALUES ('Date', '20050629')
INSERT X VALUES ('Description', 'This is bad')

Now XThing has two datatypes -- It can be a date in some cases, and a
description in others.


> 2.) In a related vein... Why is a "relational" database good?
> Specificly, using the above example, what benefit am I garnering by
> splitting one large table into 2 separate tables? Is a relational db's
> primary benefit that it helps prevent data errors or that it makes data
> accessing more efficient? Or is it both?

    It's all about data integrity.  The most efficient database in the
universe would be useless if you couldn't trust the data therein.


> 3.) Unrelated but, frequently I see people requesting "DDL" in this
> ng... WTH is DDL?! :-)

    Data Definition Language -- CREATE TABLE statements, etc.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
29 Jun 2005 8:00 PM
Brian Selzer
1) 1NF: each field in a record can contain only one value.  A comma
separated list of values stored in a single field violates this.

2) There are many benefits to using a relational database.  Relational
databases are flexible, in that you can create queries against the database
that you didn't know you'd need prior to implementation.  Relational
databases reduce redundancy--in other words, in a fully normalized
relational database, each atomic value is stored in only one place, and can
always be retrieved if you know (1) the table name, (2) the primary key
value, and (3) the column name.  Relational databases are scalable, and
resilient to change.  You can add tables while a system is on-line.  In some
systems, you can even add columns to existing tables while a system is
on-line.

Normalization is used to eliminate redundancy, and by so doing eliminate
update problems.  The benefit you get from splitting a table depends on how
you do it.  If you're storing department-specific information in the
Employee table, such that the same information is stored for each employee
in the same department, then you can create a new Department table that
contains the department specific information and reference the department
key in the Employee table, thereby eliminating the redundancy associated
with having the department-specific information in each record.  Updates are
more reliable.  Since there is only one record for each department, instead
of having to update every employee record that is in that department with
the new information, you can simply update the single department record.

There are six normal forms, and this is not the place to go over them.  I
recommend buying a book.  "An Introduction to Database Systems" by C. J.
Date is really good.  It covers all the gory details about normalization,
including 4NF and 5NF, and the update problems eliminated by normalizing.
It talks about relational algebra--the theory behind joins, unions,
projection, restriction, etc.  Be warned, however: it is not a baby book.


<roy.ander***@gmail.com> wrote in message
Show quote
news:1120069866.083221.167760@f14g2000cwb.googlegroups.com...
> Hey all,
> I'm what some would call a "lower tier" programmer in terms of
> education (only have an AAS). As is typical of one who has learnt more
> on the job than in the classroom, there are certain aspects of
> programming in which my *depth* of knowledge and insight would surprise
> you. However, in equal measure, there are certain aspects of
> programming in which my *lack* of knowledge would likewise surprise you
> as well. I'm about to try to use you to rectify some of my lacking.
> :-)
>
> 1.) To qualify for 1NF status a table must adhere to a couple of
> standards, one of which is that all columns must be the same datatype.
> How does one reconcile this in SQL Server with date/time fields? For
> instance, lets say I have 2 tables, the first in which all fields are
> varchar (including the PK "UID"). Then I have a second table which has
> the same varchar "UID" field as the PK, but all the remaining fields
> are smalldatetime? That would seem to violate 1NF? Would the work
> around to this be to CAST all the smalldatetimes into varchar?
>
> 2.) In a related vein... Why is a "relational" database good?
> Specificly, using the above example, what benefit am I garnering by
> splitting one large table into 2 separate tables? Is a relational db's
> primary benefit that it helps prevent data errors or that it makes data
> accessing more efficient? Or is it both?
>
> 3.) Unrelated but, frequently I see people requesting "DDL" in this
> ng... WTH is DDL?! :-)
>
> Thanks much!
> Roy
>
Author
29 Jun 2005 11:05 PM
Roy Anderson
"Normalization is used to eliminate redundancy, and by so doing
eliminate update problems. The benefit you get from splitting a table
depends on how you do it. If you're storing department-specific
information in the Employee table, such that the same information is
stored for each employee"

<snip>
Thanks for the informative posts you guys.
A follow-up question: So there is truly no inherent benefit (in and of
itself) to splitting up one large table in sql server into 2 or 3
smaller ones? For instance, lets say that we have one table with a "UID"
PK and 99 other columns. Assuming it's a relation, there is no benefit
to dividing the table into 2 tables with a matching "UID" PK and 50
columns a piece? In other words, the overhead processing is the same
either way and in fact, might be more with 2 tables given that the "UID"
is repeated in both?

Hypothetical:
Let's say I'm constructing a database intended to track what a group of
dogs do every day. Each record would essentially be an indication of
what the specific animal is doing at any given time. Each record would
look something like this:
"Dog_Owner_ID"
"Dog_Name_ID"
"Action_Begin_DateTime"
"Action_End_DateTime"
"Action_ID"
"Action_Begin_Location"
"Action_End_Location"
"Encountered_Other_Dogs(Y/N)"
"Number_Of_Dogs_Encountered"
"Number_Of_Humans_Encountered"
"Number_Of_Other_Creatures_Encountered"
"Other_Creature_ID_1"
"Other_Creature_ID_2"
"Other_Creature_ID_3"
"DateTime_Record_Appended"
"DateTime_Record_Edited"

Obviously this is a flaky example, but it's pretty indicative of the
bizaare, heaped-together databases I've encountered in my time. Assuming
that this is the dynamic table of the database that gets most frequently
updated, eventually it would grow to an obscene number of records. Does
splitting it up into multiple tables (say, for instance, by "Action_ID"
so that there'd be a table specific to each action do more harm than
good?

Thanks again!

*** Sent via Developersdex http://www.developersdex.com ***
Author
30 Jun 2005 2:07 AM
Adam Machanic
"Roy Anderson" <roy.andersonSPAMHATER@gmail.com> wrote in message
news:%23N9uX8PfFHA.1304@TK2MSFTNGP10.phx.gbl...
>
> Hypothetical:
> Let's say I'm constructing a database intended to track what a group of
> dogs do every day. Each record would essentially be an indication of
> what the specific animal is doing at any given time. Each record would
> look something like this:

    That table might not be very well designed, but from the standpoint of
relational databases and normalization there's nothing wrong with it that I
can see -- other than, perhaps, the lack of an (obvious) primary key.
Owner/Dog/Action/Date, I guess?


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
30 Jun 2005 4:39 AM
Brian Selzer
There is redundancy in your example.  What do you do if a dog encountered
more than three other creatures?  How would you query which other creatures
a dog encountered?  I cringe every time I see a table like this.  I worked
on a system that tracked inventory for several retail outlets and a main
warehouse.  One table was designed like this:
ITEM, WHQTY, ST1QTY, ST2QTY, ST3QTY, WHMIN, ST1MIN, ST2MIN, ST3MIN, WHMAX,
ST1MAX, ST2MAX, ST3MAX, where QTY refered to the quantity on hand, MIN
referred to the reorder threshold, and MAX referred to the max stock level.
They added a fourth store....

I added a location table with 5 rows, ('WH'), ('ST1'), ('ST2'), ('ST3'),
('ST4'), and  changed the inventory table thus:

ITEM, LOCATION, QTY, MIN, MAX

Having a separate table for each ActionID would only make queries more
difficult.  It's similar to the example above, where I combined the are four
repeating groups of columns into a single group.  By the same token, you
wouldn't want to create a repeating group by separating the information into
several tables.  What happens when you add a new ActionID?  Do you add an
additional table?

You shouldn't separate information just for the sake of separating it.
There should be a reason, either to reduce redundancy, to overcome a
physical limitation (only 8000 chars per row), to make it easier to enforce
security, etc.  There may be performance considerations, for example, if
only one or two columns are updated with any frequency, you may want to
separate them into their own table to reduce lock contention on the rest of
the row.



Show quote
"Roy Anderson" <roy.andersonSPAMHATER@gmail.com> wrote in message
news:#N9uX8PfFHA.1304@TK2MSFTNGP10.phx.gbl...
> "Normalization is used to eliminate redundancy, and by so doing
> eliminate update problems. The benefit you get from splitting a table
> depends on how you do it. If you're storing department-specific
> information in the Employee table, such that the same information is
> stored for each employee"
>
> <snip>
> Thanks for the informative posts you guys.
> A follow-up question: So there is truly no inherent benefit (in and of
> itself) to splitting up one large table in sql server into 2 or 3
> smaller ones? For instance, lets say that we have one table with a "UID"
> PK and 99 other columns. Assuming it's a relation, there is no benefit
> to dividing the table into 2 tables with a matching "UID" PK and 50
> columns a piece? In other words, the overhead processing is the same
> either way and in fact, might be more with 2 tables given that the "UID"
> is repeated in both?
>
> Hypothetical:
> Let's say I'm constructing a database intended to track what a group of
> dogs do every day. Each record would essentially be an indication of
> what the specific animal is doing at any given time. Each record would
> look something like this:
> "Dog_Owner_ID"
> "Dog_Name_ID"
> "Action_Begin_DateTime"
> "Action_End_DateTime"
> "Action_ID"
> "Action_Begin_Location"
> "Action_End_Location"
> "Encountered_Other_Dogs(Y/N)"
> "Number_Of_Dogs_Encountered"
> "Number_Of_Humans_Encountered"
> "Number_Of_Other_Creatures_Encountered"
> "Other_Creature_ID_1"
> "Other_Creature_ID_2"
> "Other_Creature_ID_3"
> "DateTime_Record_Appended"
> "DateTime_Record_Edited"
>
> Obviously this is a flaky example, but it's pretty indicative of the
> bizaare, heaped-together databases I've encountered in my time. Assuming
> that this is the dynamic table of the database that gets most frequently
> updated, eventually it would grow to an obscene number of records. Does
> splitting it up into multiple tables (say, for instance, by "Action_ID"
> so that there'd be a table specific to each action do more harm than
> good?
>
> Thanks again!
>
> *** Sent via Developersdex http://www.developersdex.com ***
Author
29 Jun 2005 11:32 PM
--CELKO--
>> 1) 1NF: each field [sic] in a record [sic] can contain only one value.  A comma separated list of values stored in a single field [sic] violates this. <<

First things first -- learn why rows are not records, tables are not
files and columns are not fields.  Until you have the right mental
model, you will not understand SQL or RDBMS.

The answer is that this is true 99% of the time in the real world.  But
a data element is atomic when it has one and only one value.  Thus an
enumerated path represetned as a CSV list could be a single data
element, or a pair of (longtitude, latitude) columns could be a single
data element
Author
30 Jun 2005 3:07 AM
Brian Selzer
Just for your information:  I do understand what an attribute is.  I do
understand what a tuple is.  I do understand what a relation is.  I
understand the mathematical foundation of the relational model.  What
difference does it make if I use the informal term "field" instead of your
informal term "column?"   What difference does it make if I use the informal
term "record" instead of your informal term "row?"  I don't believe I have
ever used the term "file," except in relation to physical database and
transaction log files.  I believe that I have a pretty good understanding of
SQL and RDBMS.  I understand intimately the limitations of file-based
databases, both from a technical support standpoint and from a development
standpoint.  Except for a short stint in the U.S. Air Force, I've been
working with databases all of my adult life.  I even rolled my own in 8080
assembler.  I have experience with DBase, Btrieve, FoxPro, Progress,
Informix, SqlBase, DB2, Oracle, and SQL Server.  I've been certified on SQL
Server since 6.5 came out.   I take offense at your insinuation that I don't
know what I'm talking about and can't understand SQL or RDBMS.

The answer is that this is true 100% of the time in the real world.
Longitude and lattitude are two distinct numbers.  Someone at some time may
want to perform a query using one or both of those two numbers, for example,
how many locations are within a twenty-five mile radius of location X.  If
you use a comma separated character field, you have to write code to extract
each number and convert them to numeric values so that they can be compared
(if it is even possible, since the database designer probably didn't bother
to put a check constraint on the column to prevent garbage from being
recorded).  The same is true with the other example.  One of the main
benefits of the relational model is that it allows you to construct queries
that you didn't anticipate when you designed the system.  By violating the
Information Rule, you incur increased query complexity and increased
potential for error.


"--CELKO--" <jcelko***@earthlink.net> wrote in message
news:1120087943.695348.305560@g43g2000cwa.googlegroups.com...
> >> 1) 1NF: each field [sic] in a record [sic] can contain only one value.
A comma separated list of values stored in a single field [sic] violates
this. <<
Show quote
>
> First things first -- learn why rows are not records, tables are not
> files and columns are not fields.  Until you have the right mental
> model, you will not understand SQL or RDBMS.
>
> The answer is that this is true 99% of the time in the real world.  But
> a data element is atomic when it has one and only one value.  Thus an
> enumerated path represetned as a CSV list could be a single data
> element, or a pair of (longtitude, latitude) columns could be a single
> data element
>
Author
30 Jun 2005 3:44 AM
--CELKO--
>> What difference does it make if I use the informal term "field" instead of your informal term "column?"   What difference does it make if I use the informal term "record" instead of your informal term "row?"  <<

Column and row are terms in SQL while  field and record are the proper
termd for physical file systems. Like most new ideas, the hard part of
understanding what the relational model is comes in un-learning what
you know about file systems.  As Artemus Ward (William Graham Sumner,
1840-1910) put it, "It ain't so much the things we don't know that get
us into trouble. It's the things we know that just ain't so."

If you already have a background in data processing with traditional
file systems, the first things to un-learn are:

(0) Databases are not file sets.
(1) Tables are not files.
(2) Rows are not records.
(3) Columns are not fields.

Modern data processing began with punch cards, or Hollerith cards used
by the Bureau of the Census.  Their original size was that of a United
States Dollar bill.  This was set by their inventor, Herman Hollerith,
because he could get furniture to store the cards from the United
States Treasury Department, just across the street.  Likewise, physical
constraints limited each card to 80 columns of holes in which to record
a symbol.

The influence of the punch card lingered on long after the invention of
magnetic tapes and disk for data storage.  This is why early video
display  terminals were 80 columns across.  Even today, files which
were migrated from cards to magnetic tape files or disk storage still
use 80 column records.

But the influence was not just on the physical side of data processing.
The methods for handling data from the prior media were imitated in
the new  media.

Data processing first consisted of sorting and merging decks of punch
cards (later, sequential magnetic tape files) in a series of distinct
steps.  The result of each step feed into the next step in the process.


Relational databases do not work that way.  Each user connects to the
entire database all at once, not to one file at time in a sequence of
steps.  The users might not all have the same database access rights
once they are connected, however.  Magnetic tapes could not be shared
among users at the same time, but shared data is the point of a
database.

Tables versus Files

A file is closely related to its physical storage media.  A table may
or may  not be a physical file.  DB2 from IBM uses one file per table,
while Sybase puts several entire databases inside one file.  A table is
a <i>set<i> of rows of the same kind of thing.  A set has no ordering
and it makes no sense to ask for the first or last row.

A deck of punch cards is sequential, and so are magnetic tape files.
Therefore, a <i>physical<i> file of ordered sequential records also
became the <i>mental<i> model for data processing and it is still hard
to shake.  Anytime you look at data, it is in some physical ordering.

The various access methods for disk storage system came later, but even
these access methods could not shake the mental model.

Another conceptual difference is that a file is usually data that deals
with  a whole business process.  A file has to have enough data in
itself to support applications for that business process.  Files tend
to be "mixed"  data which can be described by the name of the business
process, such as "The Payroll file" or something like that.

Tables can be either entities or relationships within a business
process.  This means that the data which was held in one file is often
put into several tables.  Tables tend to be "pure" data which can be
described by single words.  The payroll would now have separate tables
for timecards, employees, projects and so forth.

Tables as Entities

An entity is physical or conceptual "thing" which has meaning be
itself.  A  person, a sale or a product would be an example.  In a
relational database, an entity is defined by its attributes, which are
shown as values in columns in rows in a table.

To remind users that tables are sets of entities, I like to use plural
or collective nouns that describe the function of the entities within
the system for the names of tables.  Thus "Employee" is a bad name
because it is singular; "Employees" is a better name because it is
plural; "Personnel" is best because it is collective and does not
summon up a mental picture of individual persons.

If you have tables with exactly the same structure, then they are sets
of the same kind of elements.  But you should have only one set for
each kind of data element!  Files, on the other hand, were PHYSICALLY
separate units of storage which coudl be alike -- each tape or disk
file represents a step in the PROCEDURE , such as moving from raw data,
to edited data, and finally to archived data.  In SQL, this should be a
status flag in a table.

Tables as Relationships

A relationship is shown in a table by columns which reference one or
more entity tables.  Without the entities, the relationship has no
meaning, but the relationship can have attributes of its own.  For
example, a show  business contract might have an agent, an employer and
a talent.  The method of payment is an attribute of the contract
itself, and not of any of the three parties.

Rows versus Records

Rows are not records.  A record is defined in the application program
which  reads it; a row is defined in the database schema and not by a
program at all.  The name of the field in the READ or INPUT statements
of the application; a row is named in the database schema.

All empty files look alike; they are a directory entry in the operating
system with a name and a length of zero bytes of storage.  Empty tables
still have columns, constraints, security privileges and other
structures, even tho they have no rows.

This is in keeping with the set theoretical model, in which the empty
set is a perfectly good set.  The difference between SQL's set model
and standard mathematical set theory is that set theory has only one
empty set, but in SQL  each table has a different structure, so they
cannot be used in places where non-empty versions of themselves could
not be used.

Another characteristic of rows in a table is that they are all alike in
structure and they are all the "same kind of thing" in the model.  In a
file system, records can vary in size, datatypes and structure by
having flags in the data stream that tell the program reading the data
how to interpret it.  The most common examples are Pascal's variant
record, C's struct syntax and Cobol's OCCURS clause.

The OCCURS keyword in Cobol and the Variant records in Pascal have a
number which tells the program how many time a record structure is to
be repeated in the current record.

Unions in 'C' are not variant records, but variant mappings for the
same physical memory. For example:

union x {int ival; char j[4];} myStuff;

defines myStuff to be either an integer (which are 4 bytes on most
modern C  compilers, but this code is non-portable) or an array of 4
bytes, depending on whether you say myStuff.ival or myStuff.j[0];

But even more than that, files often contained records which were
summaries of subsets of the other records -- so called control break
reports.  There is no requirement that the records in a file be related
in any way -- they are literally a stream of binary data whose meaning
is assigned by the program reading them.

Columns versus Fields

A field within a record is defined by the application program that
reads it.  A column in a row in a table is defined by the database
schema.  The datatypes in a column are always scalar.

The order of the application program variables in the READ or INPUT
statements is important because the values are read into the program
variables in that order.  In SQL, columns are referenced only by their
names.  Yes, there are shorthands like the SELECT * clause and INSERT
INTO <table name> statements which expand into a list of column names
in the physical order in which the column names appear within their
table declaration, but these are shorthands which resolve to named
lists.

The use of NULLs in SQL is also unique to the language.  Fields do not
support a missing data marker as part of the field, record or file
itself.  Nor do fields have constraints which can be added to them in
the record, like the DEFAULT and CHECK() clauses in SQL.

Relationships among tables within a database

Files are pretty passive creatures and will take whatever an
application program throws at them without much objection.  Files are
also independent of each other simply because they are connected to one
application program at a time and therefore have no idea what other
files looks like.

A database actively seeks to maintain the correctness of all its data.
The methods used are triggers, constraints and declarative referential
integrity.

Declarative referential integrity (DRI) says, in effect, that data in
one table has a particular relationship with data in a second (possibly
the same)  table.  It is also possible to have the database change
itself via referential actions associated with the DRI.

For example, a business rule might be that we do not sell products
which are not in inventory.  This rule would be enforce by a REFERENCES
clause on the Orders table which references the Inventory table and a
referential action of ON DELETE CASCADE

Triggers are a more general way of doing much the same thing as DRI.  A
trigger is a block of procedural code which is executed before, after
or instead of an INSERT INTO or UPDATE statement.  You can do anything
with a trigger that you can do with DRI and more.

However, there are problems with TRIGGERs.  While there is a standard
syntax  for them in the SQL-92 standard, most vendors have not
implemented it.  What they have is very proprietary syntax instead.
Secondly, a trigger cannot pass information to the optimizer like DRI.
In the example in this section, I know that for every product number in
the Orders table, I have that same product number in the Inventory
table.  The optimizer can use that information in setting up EXISTS()
predicates and JOINs in the queries.  There is no reasonable way to
parse procedural trigger code to determine this relationship.

The CREATE ASSERTION statement in SQL-92 will allow the database to
enforce conditions on the entire database as a whole.  An ASSERTION is
not like a CHECK() clause, but the difference is subtle.  A CHECK()
clause is executed when there are rows in the table to which it is
attached.  If the table is empty then all CHECK() clauses are
effectively TRUE.  Thus, if we wanted to be sure that the Inventory
table is never empty, and we wrote:

CREATE TABLE Inventory
( ...
  CONSTRAINT inventory_not_empty
       CHECK ((SELECT COUNT(*) FROM Inventory) > 0), ... );

it would not work.  However, we could write:

CREATE ASSERTION Inventory_not_empty
        CHECK ((SELECT COUNT(*) FROM Inventory) > 0);

and we would get the desired results.  The assertion is checked at the
schema level and not at the table level.

>> The answer is that this is true 100% of the time in the real world. Longitude and lattitude are two distinct numbers. <<

See how easy it is to confuse a data element with its representation!
They are scalar only when they are together.  This is one thing Chris
Date and i agree on :)

>> If you use a comma separated character field [sic], you have to write code to extract each number and convert them to numeric values so that they can be compared  <<

Actually, Moreau and Ben-Gan's path enumeration model can be handled
nicely with string operrtions and grep() patterns that do not require
extraction.

Go over to www.dbdebunk.com and look up Date's paper on 1NF.
Author
30 Jun 2005 7:25 AM
Brian Selzer
Un-learn.  Hmm.  Sounds pretty stupid to me.  Those who fail to learn from
history are doomed to repeat it.  Knowledge of how transaction files are
applied to a master file gives insight into how a merge join operates.
Knowledge of B-Trees gives insight into how indexes are organized and into
how a loop join operates  Knowledge of hashing algorithms and hash tables
gives insight into how a hash join operates.  In SQL Server, understanding
the execution plan generated by the query optimizer is critical to improving
the performance of your queries.

Is this an excerpt from one of your books?  Do you have a section on
nondestructive readout core memory?

I still think that there are two problems with the relational model.  (1) it
allows volatile primary key values, and (2) it only has three dimensions.
First of all, the Information Rule SHOULD read, "Every datum (atomic value)
in the database is located in only one place and can be accessed using (1)
the table name, (2) the primary key value, and (3) the column name."  Using
natural keys in DRI relationships introduces redundancy into the database
which increases the complexity of managing locks and security.  It is also
increases the complexity of the applications that must interact with the
database because code must be added to detect a primary key value change and
deal with it.  Second, a relational database has only three dimensions:
table name, primary key value, and column name.  I think that there should
be a fourth dimension: time.  You should be able to create queries like:
what was the price of this part on such-n-such a date? Who changed what
where, when and how?  In other words, which user changed which column in
which row of which table at what time and with which application?  You
should be able to query what the state of the database was at a specified
point in time.  I have a workaround for both problems, but I think it should
be a built-in feature.

It appears that you and C. J. Date don't agree on very much.  I just read
one of his articles on www.dbdebunk.com, "There's Only One Relational
Model!," and he didn't appear too pleased about your "misleading,
inaccurate, or just plain wrong" assertions, nor about taking his name in
vain.  (Perhaps you shouldn't have sent me there.)

"--CELKO--" <jcelko***@earthlink.net> wrote in message
news:1120103085.567144.288200@g14g2000cwa.googlegroups.com...
> >> What difference does it make if I use the informal term "field" instead
of your informal term "column?"   What difference does it make if I use the
informal term "record" instead of your informal term "row?"  <<
Show quote
>
> Column and row are terms in SQL while  field and record are the proper
> termd for physical file systems. Like most new ideas, the hard part of
> understanding what the relational model is comes in un-learning what
> you know about file systems.  As Artemus Ward (William Graham Sumner,
> 1840-1910) put it, "It ain't so much the things we don't know that get
> us into trouble. It's the things we know that just ain't so."
>
> If you already have a background in data processing with traditional
> file systems, the first things to un-learn are:
>
>  (0) Databases are not file sets.
>  (1) Tables are not files.
>  (2) Rows are not records.
>  (3) Columns are not fields.
>
> Modern data processing began with punch cards, or Hollerith cards used
> by the Bureau of the Census.  Their original size was that of a United
> States Dollar bill.  This was set by their inventor, Herman Hollerith,
> because he could get furniture to store the cards from the United
> States Treasury Department, just across the street.  Likewise, physical
> constraints limited each card to 80 columns of holes in which to record
> a symbol.
>
> The influence of the punch card lingered on long after the invention of
> magnetic tapes and disk for data storage.  This is why early video
> display  terminals were 80 columns across.  Even today, files which
> were migrated from cards to magnetic tape files or disk storage still
> use 80 column records.
>
> But the influence was not just on the physical side of data processing.
>  The methods for handling data from the prior media were imitated in
> the new  media.
>
> Data processing first consisted of sorting and merging decks of punch
> cards (later, sequential magnetic tape files) in a series of distinct
> steps.  The result of each step feed into the next step in the process.
>
>
> Relational databases do not work that way.  Each user connects to the
> entire database all at once, not to one file at time in a sequence of
> steps.  The users might not all have the same database access rights
> once they are connected, however.  Magnetic tapes could not be shared
> among users at the same time, but shared data is the point of a
> database.
>
> Tables versus Files
>
> A file is closely related to its physical storage media.  A table may
> or may  not be a physical file.  DB2 from IBM uses one file per table,
> while Sybase puts several entire databases inside one file.  A table is
> a <i>set<i> of rows of the same kind of thing.  A set has no ordering
> and it makes no sense to ask for the first or last row.
>
> A deck of punch cards is sequential, and so are magnetic tape files.
> Therefore, a <i>physical<i> file of ordered sequential records also
> became the <i>mental<i> model for data processing and it is still hard
> to shake.  Anytime you look at data, it is in some physical ordering.
>
> The various access methods for disk storage system came later, but even
> these access methods could not shake the mental model.
>
> Another conceptual difference is that a file is usually data that deals
> with  a whole business process.  A file has to have enough data in
> itself to support applications for that business process.  Files tend
> to be "mixed"  data which can be described by the name of the business
> process, such as "The Payroll file" or something like that.
>
> Tables can be either entities or relationships within a business
> process.  This means that the data which was held in one file is often
> put into several tables.  Tables tend to be "pure" data which can be
> described by single words.  The payroll would now have separate tables
> for timecards, employees, projects and so forth.
>
> Tables as Entities
>
> An entity is physical or conceptual "thing" which has meaning be
> itself.  A  person, a sale or a product would be an example.  In a
> relational database, an entity is defined by its attributes, which are
> shown as values in columns in rows in a table.
>
> To remind users that tables are sets of entities, I like to use plural
> or collective nouns that describe the function of the entities within
> the system for the names of tables.  Thus "Employee" is a bad name
> because it is singular; "Employees" is a better name because it is
> plural; "Personnel" is best because it is collective and does not
> summon up a mental picture of individual persons.
>
> If you have tables with exactly the same structure, then they are sets
> of the same kind of elements.  But you should have only one set for
> each kind of data element!  Files, on the other hand, were PHYSICALLY
> separate units of storage which coudl be alike -- each tape or disk
> file represents a step in the PROCEDURE , such as moving from raw data,
> to edited data, and finally to archived data.  In SQL, this should be a
> status flag in a table.
>
> Tables as Relationships
>
> A relationship is shown in a table by columns which reference one or
> more entity tables.  Without the entities, the relationship has no
> meaning, but the relationship can have attributes of its own.  For
> example, a show  business contract might have an agent, an employer and
> a talent.  The method of payment is an attribute of the contract
> itself, and not of any of the three parties.
>
> Rows versus Records
>
> Rows are not records.  A record is defined in the application program
> which  reads it; a row is defined in the database schema and not by a
> program at all.  The name of the field in the READ or INPUT statements
> of the application; a row is named in the database schema.
>
> All empty files look alike; they are a directory entry in the operating
> system with a name and a length of zero bytes of storage.  Empty tables
> still have columns, constraints, security privileges and other
> structures, even tho they have no rows.
>
> This is in keeping with the set theoretical model, in which the empty
> set is a perfectly good set.  The difference between SQL's set model
> and standard mathematical set theory is that set theory has only one
> empty set, but in SQL  each table has a different structure, so they
> cannot be used in places where non-empty versions of themselves could
> not be used.
>
> Another characteristic of rows in a table is that they are all alike in
> structure and they are all the "same kind of thing" in the model.  In a
> file system, records can vary in size, datatypes and structure by
> having flags in the data stream that tell the program reading the data
> how to interpret it.  The most common examples are Pascal's variant
> record, C's struct syntax and Cobol's OCCURS clause.
>
> The OCCURS keyword in Cobol and the Variant records in Pascal have a
> number which tells the program how many time a record structure is to
> be repeated in the current record.
>
> Unions in 'C' are not variant records, but variant mappings for the
> same physical memory. For example:
>
> union x {int ival; char j[4];} myStuff;
>
> defines myStuff to be either an integer (which are 4 bytes on most
> modern C  compilers, but this code is non-portable) or an array of 4
> bytes, depending on whether you say myStuff.ival or myStuff.j[0];
>
> But even more than that, files often contained records which were
> summaries of subsets of the other records -- so called control break
> reports.  There is no requirement that the records in a file be related
> in any way -- they are literally a stream of binary data whose meaning
> is assigned by the program reading them.
>
> Columns versus Fields
>
> A field within a record is defined by the application program that
> reads it.  A column in a row in a table is defined by the database
> schema.  The datatypes in a column are always scalar.
>
> The order of the application program variables in the READ or INPUT
> statements is important because the values are read into the program
> variables in that order.  In SQL, columns are referenced only by their
> names.  Yes, there are shorthands like the SELECT * clause and INSERT
> INTO <table name> statements which expand into a list of column names
> in the physical order in which the column names appear within their
> table declaration, but these are shorthands which resolve to named
> lists.
>
> The use of NULLs in SQL is also unique to the language.  Fields do not
> support a missing data marker as part of the field, record or file
> itself.  Nor do fields have constraints which can be added to them in
> the record, like the DEFAULT and CHECK() clauses in SQL.
>
> Relationships among tables within a database
>
> Files are pretty passive creatures and will take whatever an
> application program throws at them without much objection.  Files are
> also independent of each other simply because they are connected to one
> application program at a time and therefore have no idea what other
> files looks like.
>
> A database actively seeks to maintain the correctness of all its data.
> The methods used are triggers, constraints and declarative referential
> integrity.
>
> Declarative referential integrity (DRI) says, in effect, that data in
> one table has a particular relationship with data in a second (possibly
> the same)  table.  It is also possible to have the database change
> itself via referential actions associated with the DRI.
>
> For example, a business rule might be that we do not sell products
> which are not in inventory.  This rule would be enforce by a REFERENCES
> clause on the Orders table which references the Inventory table and a
> referential action of ON DELETE CASCADE
>
> Triggers are a more general way of doing much the same thing as DRI.  A
>  trigger is a block of procedural code which is executed before, after
> or instead of an INSERT INTO or UPDATE statement.  You can do anything
> with a trigger that you can do with DRI and more.
>
> However, there are problems with TRIGGERs.  While there is a standard
> syntax  for them in the SQL-92 standard, most vendors have not
> implemented it.  What they have is very proprietary syntax instead.
> Secondly, a trigger cannot pass information to the optimizer like DRI.
> In the example in this section, I know that for every product number in
> the Orders table, I have that same product number in the Inventory
> table.  The optimizer can use that information in setting up EXISTS()
> predicates and JOINs in the queries.  There is no reasonable way to
> parse procedural trigger code to determine this relationship.
>
> The CREATE ASSERTION statement in SQL-92 will allow the database to
> enforce conditions on the entire database as a whole.  An ASSERTION is
> not like a CHECK() clause, but the difference is subtle.  A CHECK()
> clause is executed when there are rows in the table to which it is
> attached.  If the table is empty then all CHECK() clauses are
> effectively TRUE.  Thus, if we wanted to be sure that the Inventory
> table is never empty, and we wrote:
>
>  CREATE TABLE Inventory
>  ( ...
>   CONSTRAINT inventory_not_empty
>        CHECK ((SELECT COUNT(*) FROM Inventory) > 0), ... );
>
> it would not work.  However, we could write:
>
>  CREATE ASSERTION Inventory_not_empty
>         CHECK ((SELECT COUNT(*) FROM Inventory) > 0);
>
> and we would get the desired results.  The assertion is checked at the
> schema level and not at the table level.
>
> >> The answer is that this is true 100% of the time in the real world.
Longitude and lattitude are two distinct numbers. <<
>
> See how easy it is to confuse a data element with its representation!
> They are scalar only when they are together.  This is one thing Chris
> Date and i agree on :)
>
> >> If you use a comma separated character field [sic], you have to write
code to extract each number and convert them to numeric values so that they
can be compared  <<
Show quote
>
> Actually, Moreau and Ben-Gan's path enumeration model can be handled
> nicely with string operrtions and grep() patterns that do not require
> extraction.
>
> Go over to www.dbdebunk.com and look up Date's paper on 1NF.
>
Author
30 Jun 2005 8:34 AM
--CELKO--
>> I think that there should be a fourth dimension: time.  You should be able to create queries like: what was the price of this part on such-n-such a date? Who changed what where, when and how?  <<

Read Rick Snodgrass's temporal proposals at his website at the
University of Arizona.  They never made it into the Standard, but he
has a ton of SQL for "faking it".

>> It appears that you and C. J. Date don't agree on very much. <<

Actually we do.  In that particular exchange, he was the one who
provided the referencs to the papers by Codd that had some of the
various versions of RM, RM II, and now he has his own version.

You might want to look at other implementation of SQL which do not
physically repeat the same values in PK=fk relationships  (Sand, SQL
Anywhere, etc.)
Author
30 Jun 2005 4:06 PM
Anith Sen
>> I still think that there are two problems with the relational model.  (1)
>> it allows volatile primary key values, and (2) it only has three
>> dimensions. First of all, the Information Rule SHOULD read, "Every datum
>> (atomic value) in the database is located in only one place and can be
>> accessed using (1) the table name, (2) the primary key value, and (3) the
>> column name."  Using natural keys in DRI relationships introduces
>> redundancy into the database which increases the complexity of managing
>> locks and security.

It is precisely due to the potential volatility of key values, did
relational proponents consider stability as a desirable criteria while
choosing a primary key. Lock management is a physical issue, and has nothing
to do with the logical model.

Date's work on inclusion dependencies(INCD) in early 80s partly deals with
this very issue where declarative constraints on volatile keys. Some of his
findings and proposals can be found under the chapters "Referential
Integrity and Foreign Keys" - Parts I, II & III in his book Relational
Database Writings 85-89.

>> I think that there should be a fourth dimension: time.  You should be
>> able to create queries like: what was the price of this part on
>> such-n-such a date? Who changed what where, when and how?  In other
>> words, which user changed which column in which row of which table at
>> what time and with which application?  You should be able to query what
>> the state of the database was at a specified point in time.  I have a
>> workaround for both problems, but I think it should be a built-in
>> feature.

True, facilities for temporal(semi- as well as bi- temporal) expressions
should be a built-in feature, but that is not a relational drawback.

Relational approaches to temporal databases have been around for quite some
time now. Since no SQL vendors have implemented any of the proposals and no
useful temporal type other types to represent discrete points in time exist
in current products, even the existence of such approaches is not widely
known.

Snodgrass' work is high quality but some areas of his work are suspect, like
the suggestion of hidden timestamps.

The other popular one ( by Darwen, Lorenzos & Date), though addresses some
implementation aspects, is strictly from a logical perspective. The
fundamentals on temporal constraints and granularity issues are well
addressed and can be found in their book: Temporal Data & the Relational
Model. You can also find a preview at:
http://web.onetel.com/~hughdarwen/TheThirdManifesto/TemporalData.Warwick.pdf

You will also find Date's 6NF ( "irreducible relation" : one with a key and
at most one other attribute ) is relation to fully temporal databases,
though it is not mandatory.

>> It appears that you and C. J. Date don't agree on very much.

Perhaps on some issues, but Joe actually suggested you look up his paper on
1NF at: http://www.dbdebunk.com/page/page/629796.htm

It is simple yet an outstanding work especially since so many freelance
gurus these days offer 1NF suggestions based on ill conceived notions of
repeated groups, "look-a-like" columns etc.

The proposed 1NF requirements have been detailed in Date's recent version of
Intro Book as the properties of a relation as well which can be paraphrased
as: A table is said to be in 1NF, if it directly and faithfully represents
some relation. This means, the following properties must be necessary &
sufficient for a table to be in 1NF
1. There is no ordering to the rows or columns.
2. There are no duplicate rows, so that a key exists by definition
3. Every row-column intersection contains exactly one value from the
applicable type.
4. All columns are regular ( no hidden columns, object ids etc. )
And strictly speaking, Nulls would invalidate 1NF since it is not a value
( other implications like information rule violation etc. ) but it is often
discounted.

Thus without reference to the underlying type claiming a CSV list violates
1NF is moot. If the system supports a CSV list type( with CSV list values
and operators on those values), having a CSV list as an attribute value
declared on that type would not violate the 1NF requirement.

--
Anith
Author
30 Jun 2005 8:05 PM
Brian Selzer
> Thus without reference to the underlying type claiming a CSV list violates
> 1NF is moot. If the system supports a CSV list type( with CSV list values
> and operators on those values), having a CSV list as an attribute value
> declared on that type would not violate the 1NF requirement.

My primary objection to implementing a CSV list domain and defining
relations containing attributes with a CSV list domain is that a CSV list is
by definition not atomic.  A CSV list is an ordered set of zero or more
values, separated by commas.  Every datum (atomic value) in a relational
database can be accessed if you know the table name, the primary key value,
and the column name.  How can you write a query to access the third element
in a CSV list column using only the table name, the primary key value, and
the column name.  The additional dimension (element number) makes this
impossible.  It follows then that a database that implements a CSV list
domain is by definition not a relational database.




Show quote
"Anith Sen" <an***@bizdatasolutions.com> wrote in message
news:eLMC23YfFHA.3944@TK2MSFTNGP10.phx.gbl...
> >> I still think that there are two problems with the relational model.
(1)
> >> it allows volatile primary key values, and (2) it only has three
> >> dimensions. First of all, the Information Rule SHOULD read, "Every
datum
> >> (atomic value) in the database is located in only one place and can be
> >> accessed using (1) the table name, (2) the primary key value, and (3)
the
> >> column name."  Using natural keys in DRI relationships introduces
> >> redundancy into the database which increases the complexity of managing
> >> locks and security.
>
> It is precisely due to the potential volatility of key values, did
> relational proponents consider stability as a desirable criteria while
> choosing a primary key. Lock management is a physical issue, and has
nothing
> to do with the logical model.
>
> Date's work on inclusion dependencies(INCD) in early 80s partly deals with
> this very issue where declarative constraints on volatile keys. Some of
his
> findings and proposals can be found under the chapters "Referential
> Integrity and Foreign Keys" - Parts I, II & III in his book Relational
> Database Writings 85-89.
>
> >> I think that there should be a fourth dimension: time.  You should be
> >> able to create queries like: what was the price of this part on
> >> such-n-such a date? Who changed what where, when and how?  In other
> >> words, which user changed which column in which row of which table at
> >> what time and with which application?  You should be able to query what
> >> the state of the database was at a specified point in time.  I have a
> >> workaround for both problems, but I think it should be a built-in
> >> feature.
>
> True, facilities for temporal(semi- as well as bi- temporal) expressions
> should be a built-in feature, but that is not a relational drawback.
>
> Relational approaches to temporal databases have been around for quite
some
> time now. Since no SQL vendors have implemented any of the proposals and
no
> useful temporal type other types to represent discrete points in time
exist
> in current products, even the existence of such approaches is not widely
> known.
>
> Snodgrass' work is high quality but some areas of his work are suspect,
like
> the suggestion of hidden timestamps.
>
> The other popular one ( by Darwen, Lorenzos & Date), though addresses some
> implementation aspects, is strictly from a logical perspective. The
> fundamentals on temporal constraints and granularity issues are well
> addressed and can be found in their book: Temporal Data & the Relational
> Model. You can also find a preview at:
>
http://web.onetel.com/~hughdarwen/TheThirdManifesto/TemporalData.Warwick.pdf
Show quote
>
> You will also find Date's 6NF ( "irreducible relation" : one with a key
and
> at most one other attribute ) is relation to fully temporal databases,
> though it is not mandatory.
>
> >> It appears that you and C. J. Date don't agree on very much.
>
> Perhaps on some issues, but Joe actually suggested you look up his paper
on
> 1NF at: http://www.dbdebunk.com/page/page/629796.htm
>
> It is simple yet an outstanding work especially since so many freelance
> gurus these days offer 1NF suggestions based on ill conceived notions of
> repeated groups, "look-a-like" columns etc.
>
> The proposed 1NF requirements have been detailed in Date's recent version
of
> Intro Book as the properties of a relation as well which can be
paraphrased
> as: A table is said to be in 1NF, if it directly and faithfully represents
> some relation. This means, the following properties must be necessary &
> sufficient for a table to be in 1NF
> 1. There is no ordering to the rows or columns.
> 2. There are no duplicate rows, so that a key exists by definition
> 3. Every row-column intersection contains exactly one value from the
> applicable type.
> 4. All columns are regular ( no hidden columns, object ids etc. )
> And strictly speaking, Nulls would invalidate 1NF since it is not a value
> ( other implications like information rule violation etc. ) but it is
often
> discounted.
>
> Thus without reference to the underlying type claiming a CSV list violates
> 1NF is moot. If the system supports a CSV list type( with CSV list values
> and operators on those values), having a CSV list as an attribute value
> declared on that type would not violate the 1NF requirement.
>
> --
> Anith
>
>
Author
30 Jun 2005 8:49 PM
Anith Sen
>> My primary objection to implementing a CSV ist domain and defining
>> relations containing attributes with a CSV list domain is that a CSV list
>> is by definition not atomic.

Whether some value is atomic or not is purely subjective and contextual; In
some contexts the "Bryan" is an atomic value while in some other context it
may be a list of letters. In some contexts the literal "2005/06/30" defines
a single date value while in some other contexts it is comprised of three
distinct values representing year, month and time.

Thus a definition of 1NF on a loose term like "atomicity" alone is neither
precise nor formal. For a formal definition one would need a reference to
something which is non-contextual and unambiguous. A type or domain provides
that. A type\ domain is a set of values along with associated operators
applicable on those values. Thus in a good type system, one could have a
type of arbitrary complexity and values are scalar with respect to the type.

>> A CSV list is an ordered set of zero or more values, separated by commas.

In some contexts it is. In some other contexts, it is just as scalar as an
int or datetime or char(10) value. A well defined type could precisely
capture the context and represent the values in a relational database
appropriately.

>> Every datum (atomic value) in a relational database can be accessed if
>> you know the table name, the primary key value, and the column name.

Nobody disagrees with that.

>> How can you write a query to access the third element in a CSV list
>> column using only the table name, the primary key value, and the column
>> name.

Using the operators applicable to the CSV list type.

To use an analogy, how do you extract the month from a datetime value? Using
DATEPART operator, which is an operator applicable to the DATETIME type.

>>  The additional dimension (element number) makes this impossible.

Well, that could be an argument to the operator, if an element needs to be
extracted based on its index or position in the list.

>> It follows then that a database that implements a CSV list domain is by
>> definition not a relational database

No, it does not follow anything of that sort. The link on 1NF provided early
in this thread should clarify such misconceptions.

--
Anith
Author
1 Jul 2005 1:10 AM
roy.anderson@gmail.com
Thanks for the informative posts all! I can practically feel my brain
growing...   Now time to regulate.  :-)

CELKO wrote:
">> The answer is that this is true 100% of the time in the real world.
Longitude and lattitude are two distinct numbers. <<

See how easy it is to confuse a data element with its representation!
They are scalar only when they are together.  This is one thing Chris
Date and i agree on :) "

You are incorrect if you are you implying that longitude/latitude
together constitue a scalar. A tell-tale indicator of a scalar is that
it cannot be broken down into a smaller logical part. A longitude would
be a scalar, a longitude+latitude would not.

Anith wrote:
">> How can you write a query to access the third element in a CSV list
>> column using only the table name, the primary key value, and the column >> name.

Using the operators applicable to the CSV list type.
To use an analogy, how do you extract the month from a datetime value?
Using DATEPART operator, which is an operator applicable to the
DATETIME type. "

Semantics. You muddy a pure theorectical concept with a human's
interference. Using that same logic, one could stuff any crap into a
database, call it a data element, and code an operator applicable to
the CRAP datatype. Then you could point at your database and claim it
still adhered to 1NF, no?

">> It follows then that a database that implements a CSV list domain
is by >> definition not a relational database

No, it does not follow anything of that sort. The link on 1NF provided
early in this thread should clarify such misconceptions. "

Boo! Hsss! No points for you. I checked out the link. It's
pay-per-view. You cannot use something to prove your case that someone
else must pay to see. Enter the cult leader who offers to sell you the
mysteries of life in 3 easy installments of 19.95 and if you don't pay
up, you're clearly one of the unenlightened masses.  ;-)

Let me insert some of my (admittedly limited) depth of knowledge into
this discussion as I think the core philosophy of relational db's is
being forgotten. As Celko points out, current databases have evolved
from the flat, 2-dimensional constructs of the past into the robust,
3-dimensional constructs of today, HOWEVER, the point of them remains
the same: storing and accessing data.

As we get better and better at that, we've come to understand that
nature has been doing a better job of it all along. Our fumbling
attempts at relational db's are, in fact, a conscious (or subconscious)
attempt to mimic nature. No longer do we have to rely on static
hardcopy, now we have the electronic ether of our PC's as our toolbox.
An entire virtual universe sits humming under our desks and we have a
17-inch window into it.

When we have these discussions of what does or does not constitute a
"relational" system we must keep the above in mind and focus not on the
letter of the laws (ok, yes, you can argue that CSV data could be
incorporated as a single data element adhering to 1NF) but on the
*spirit* of the laws. We're trying to mimic nature. We're trying to
create a robust, self-correcting, self-enforcing, multi-dimensional
construct that will not allow itself to be compromised. If you adhere
to the spirit of the concept well, I should be able to peruse your
database from any angle and find the same logical atomic setup. I
should be able to drill down to the most basic elements, not worrying
about custom datatypes (and associated operators) that individual
programmers created willy-nilly for their own short-sighted purposes.

Cheers,
Roy
Author
1 Jul 2005 2:35 AM
Adam Machanic
<roy.ander***@gmail.com> wrote in message
news:1120180243.548210.152550@g49g2000cwa.googlegroups.com...
>
> You are incorrect if you are you implying that longitude/latitude
> together constitue a scalar. A tell-tale indicator of a scalar is that
> it cannot be broken down into a smaller logical part. A longitude would
> be a scalar, a longitude+latitude would not.

    It all depends on the application.  In the types of geocoding
applications I work with, a longitude without a lattitude is utterly
meaningless.  Combine the two and we have something.  The two, in these
applications, are not at all seperable.  See Anith's post about atomicity.


> Semantics. You muddy a pure theorectical concept with a human's
> interference. Using that same logic, one could stuff any crap into a
> database, call it a data element, and code an operator applicable to
> the CRAP datatype. Then you could point at your database and claim it
> still adhered to 1NF, no?

    Yes.  1NF is really defined by the needs and semantics of each
particular database.


> When we have these discussions of what does or does not constitute a
> "relational" system we must keep the above in mind and focus not on the
> letter of the laws (ok, yes, you can argue that CSV data could be
> incorporated as a single data element adhering to 1NF) but on the
> *spirit* of the laws. We're trying to mimic nature. We're trying to

    You need to take a course in first order logic.  It doesn't matter what
the "spirit" is, as long as you can prove it.  1NF is not about some
idealistic view of data management.  It's a way of logically describing how
to avoid redundancy.  Nothing more -- and nothing less.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
1 Jul 2005 10:04 AM
Brian Selzer
>     It all depends on the application.  In the types of geocoding
> applications I work with, a longitude without a lattitude is utterly
> meaningless.  Combine the two and we have something.  The two, in these
> applications, are not at all seperable.  See Anith's post about atomicity.

The fact that they may be meaningless when separated does not change the
fact that each component element means something different.  If you isolate
the attribute AreaCode from the Employee relation, it is also meaningless.
The AreaCode for what phone number for whom?

If the position or existence of an element of a data type imparts meaning to
instances of that element, then those instances are atomic values with
respect to the database.   An instance of a data type cannot be atomic with
respect to the database if any component element value is atomic with
respect to the database.



Show quote
"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:udGcTWefFHA.3448@TK2MSFTNGP12.phx.gbl...
> <roy.ander***@gmail.com> wrote in message
> news:1120180243.548210.152550@g49g2000cwa.googlegroups.com...
> >
> > You are incorrect if you are you implying that longitude/latitude
> > together constitue a scalar. A tell-tale indicator of a scalar is that
> > it cannot be broken down into a smaller logical part. A longitude would
> > be a scalar, a longitude+latitude would not.
>
>     It all depends on the application.  In the types of geocoding
> applications I work with, a longitude without a lattitude is utterly
> meaningless.  Combine the two and we have something.  The two, in these
> applications, are not at all seperable.  See Anith's post about atomicity.
>
>
> > Semantics. You muddy a pure theorectical concept with a human's
> > interference. Using that same logic, one could stuff any crap into a
> > database, call it a data element, and code an operator applicable to
> > the CRAP datatype. Then you could point at your database and claim it
> > still adhered to 1NF, no?
>
>     Yes.  1NF is really defined by the needs and semantics of each
> particular database.
>
>
> > When we have these discussions of what does or does not constitute a
> > "relational" system we must keep the above in mind and focus not on the
> > letter of the laws (ok, yes, you can argue that CSV data could be
> > incorporated as a single data element adhering to 1NF) but on the
> > *spirit* of the laws. We're trying to mimic nature. We're trying to
>
>     You need to take a course in first order logic.  It doesn't matter
what
> the "spirit" is, as long as you can prove it.  1NF is not about some
> idealistic view of data management.  It's a way of logically describing
how
> to avoid redundancy.  Nothing more -- and nothing less.
>
>
> --
> Adam Machanic
> SQL Server MVP
> http://www.datamanipulation.net
> --
>
>
Author
1 Jul 2005 10:49 AM
Adam Machanic
"Brian Selzer" <br***@selzer-software.com> wrote in message
news:%23gmZCRifFHA.3940@TK2MSFTNGP14.phx.gbl...
>
> The fact that they may be meaningless when separated does not change the
> fact that each component element means something different.  If you
isolate
> the attribute AreaCode from the Employee relation, it is also meaningless.
> The AreaCode for what phone number for whom?

    By that argument, no column can EVER be said to be 'atomic'.  For
instance, I can split up any character string into its component characters,
and further decompose those characters into bits.  They all have meaning to
someone, somewhere, right?


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
1 Jul 2005 7:15 PM
Brian Selzer
>     By that argument, no column can EVER be said to be 'atomic'.  For
> instance, I can split up any character string into its component
characters,
> and further decompose those characters into bits.  They all have meaning
to
> someone, somewhere, right?

That is absurd.  An attribute must add semantic context to its relation, or
it wouldn't be included in the logical data model.  A date value in the
HireDate column is not only a date, it represents the date that a specific
employee was hired.  A character string derives its value from the
permutation of its component character values.  Regardless of the underlying
physical representation of a character, the value is the same whether its
Unicode, ASCII or EBCDIC.  Each individual character in the permutation does
not represent anything other than its own value.  In the same way an integer
derives its value from the permutation of its component bits, which do not
represent anything other than their own values.


Show quote
"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:OCqLCqifFHA.3436@tk2msftngp13.phx.gbl...
> "Brian Selzer" <br***@selzer-software.com> wrote in message
> news:%23gmZCRifFHA.3940@TK2MSFTNGP14.phx.gbl...
> >
> > The fact that they may be meaningless when separated does not change the
> > fact that each component element means something different.  If you
> isolate
> > the attribute AreaCode from the Employee relation, it is also
meaningless.
> > The AreaCode for what phone number for whom?
>
>     By that argument, no column can EVER be said to be 'atomic'.  For
> instance, I can split up any character string into its component
characters,
> and further decompose those characters into bits.  They all have meaning
to
> someone, somewhere, right?
>
>
> --
> Adam Machanic
> SQL Server MVP
> http://www.datamanipulation.net
> --
>
>
>
Author
1 Jul 2005 8:13 PM
David Portas
You've hit on exactly the point I made in my response. The decision
about whether a column is atomic or not is based on *semantics* and
therefore isn't subject to a formal definition. The Relational Model is
concerned with representation and that's why it doesn't have anything
to say on this point. You can formulate a rule of thumb if you like but
it's only that, and it won't change the meaning if 1NF.

--
David Portas
SQL Server MVP
--
Author
2 Jul 2005 4:49 AM
Brian Selzer
The decision about whether or not a functional dependency exists is based on
*semantics* as well, yet it has been formally defined.

I've seen so many databases that have tables with repeating groups or
columns with comma-delimited lists, and every time I ended up with a
migraine trying to work with them.  Theorists who advocate their use should
be burned at the stake for heresy.

Show quote
"David Portas" <REMOVE_BEFORE_REPLYING_dpor***@acm.org> wrote in message
news:1120248812.141505.51340@g49g2000cwa.googlegroups.com...
> You've hit on exactly the point I made in my response. The decision
> about whether a column is atomic or not is based on *semantics* and
> therefore isn't subject to a formal definition. The Relational Model is
> concerned with representation and that's why it doesn't have anything
> to say on this point. You can formulate a rule of thumb if you like but
> it's only that, and it won't change the meaning if 1NF.
>
> --
> David Portas
> SQL Server MVP
> --
>
Author
2 Jul 2005 1:50 PM
David Portas
The difference is that Functional Dependencies can be represented in theory
by the notional existence of a function - a transformation that maps one set
of values onto another. That is a mathematical concept.

It's very hard to define such a formal concept for atomicity that's useful
for every possible type. You want to forbid certain domains because you
regard them as non-atomic but is it really sufficient, as Codd originally
suggested, to say that non-atomic means that there exists some
transformation to some smaller more "primitive" domain of values? I don't
think so. Here's an example. How do you define the atomicity of a digital
graphic image? We have types for digital images so an RDBMS may also come
equipped with operators to transform images into other images - by slicing
them into smaller images for example. So when is an image non-atomic? Is an
atomic image one that contains only a single pixel? What use would such a
restriction have in theory or in practice?

Or maybe you want to require that an image is only non-atomic if smaller
transformations of it have some meaning in themselves. Again, this seems
like an arbitrary restriction. Smaller images may well be useful and
meaningful to the user. With digital images we frequently want to render
detailed clips of much larger graphics. Would you really want to disallow a
domain from containing an image that was a clip of a larger one in the same
domain? If so, how do you define which images are permitted in that domain
and which aren't? Apparently your definition may have to be self-referential
which poses some serious problems in defining constraints and other
operations on that domain.

--
David Portas
SQL Server MVP
--
Author
2 Jul 2005 5:50 PM
Brian Selzer
> ... to say that non-atomic means that there exists some
> transformation to some smaller more "primitive" domain of values?

I would use the term resolution instead of transformation, because
transformation implies change whereas resolution implies separation, but
yes, I would agree with that.

> ... How do you define the atomicity of a digital
> graphic image? ...

An image data type would be atomic.  An image is a two-dimensional visual
representation of something.  The mechanism that renders the image for human
consumption determines the value of the image, so both must be
implementation-specific.    The range of each dimension in an image may be
continuous (a vector drawing) or not (a raster drawing), which makes it
impossible to decompose a conceptual image into its individual component
parts without losing information.

I would argue that in order to maintain integrity, each image that was
generated by transforming another image must be related to the original
image, and the transformation operations must be performed any time the
original image changes.  It would be preferable, theoretically, to perform
the transformations when an image is rendered for human consumption.  That
way only one copy of the image exists in the database.


Show quote
"David Portas" <REMOVE_BEFORE_REPLYING_dpor***@acm.org> wrote in message
news:aPqdneoM7YgwBlvfRVn-vQ@giganews.com...
> The difference is that Functional Dependencies can be represented in
theory
> by the notional existence of a function - a transformation that maps one
set
> of values onto another. That is a mathematical concept.
>
> It's very hard to define such a formal concept for atomicity that's useful
> for every possible type. You want to forbid certain domains because you
> regard them as non-atomic but is it really sufficient, as Codd originally
> suggested, to say that non-atomic means that there exists some
> transformation to some smaller more "primitive" domain of values? I don't
> think so. Here's an example. How do you define the atomicity of a digital
> graphic image? We have types for digital images so an RDBMS may also come
> equipped with operators to transform images into other images - by slicing
> them into smaller images for example. So when is an image non-atomic? Is
an
> atomic image one that contains only a single pixel? What use would such a
> restriction have in theory or in practice?
>
> Or maybe you want to require that an image is only non-atomic if smaller
> transformations of it have some meaning in themselves. Again, this seems
> like an arbitrary restriction. Smaller images may well be useful and
> meaningful to the user. With digital images we frequently want to render
> detailed clips of much larger graphics. Would you really want to disallow
a
> domain from containing an image that was a clip of a larger one in the
same
> domain? If so, how do you define which images are permitted in that domain
> and which aren't? Apparently your definition may have to be
self-referential
> which poses some serious problems in defining constraints and other
> operations on that domain.
>
> --
> David Portas
> SQL Server MVP
> --
>
>
Author
3 Jul 2005 5:31 PM
David Portas
>> ... How do you define the atomicity of a digital
>> graphic image? ...
>
> An image data type would be atomic.

Fine, but my point was do you have a rigorous theoretical basis for that
statement? I don't see it. Intuitively you might feel that an image type is
atomic but this is where we came in: Atomicity isn't a formally defined
property of a domain or value, rather it's a subjective decision we make
based on our knowledge of the reality we intend to model and of the
capabilities of our RDBMS.

When stating your formal definition of why an image type is atomic, assume
that the digital image type is only defined as a simple bitmap that may
therefore be subdivided into multiple smaller bitmaps. The implication of
this is that if you add attributes to define how the bitmap pieces fit
together you can always decompose any bitmap larger than one pixel without
loss of information. As far as I can see, this is exactly analogous to the
example of an ordered delimited list.

--
David Portas
SQL Server MVP
--
Author
5 Jul 2005 3:54 PM
Thomas Coleman
A couple of better examples to further your point David, are phone numbers and
street addresses. In some systems, the phone number stored as a single value is
sufficiently atomic for the purposes of the problem domain. However, for the
phone company, it would be important to break up the number into its constituent
pieces. Street addresses pose the same problem. In most systems, "123 N. Main
St." is sufficiently encapsulated for any all queries against the system.
However, if you are building a system to be used by a delivery company, this is
insufficient and must be broken into it's constituent pieces (123, N, Main, St).

The determination of atomic is wholly dependent on the problem domain. There are
many such examples, including delimited lists, where a value can be considered
atomic in its raw form. Said another way, the determination of whether a value
is sufficiently atomic is wholly dependent on how it is used in the system.



Thomas


Show quote
"David Portas" <REMOVE_BEFORE_REPLYING_dpor***@acm.org> wrote in message
news:qOWdnVHVyuyUvFXfRVn-gw@giganews.com...
>>> ... How do you define the atomicity of a digital
>>> graphic image? ...
>>
>> An image data type would be atomic.
>
> Fine, but my point was do you have a rigorous theoretical basis for that
> statement? I don't see it. Intuitively you might feel that an image type is
> atomic but this is where we came in: Atomicity isn't a formally defined
> property of a domain or value, rather it's a subjective decision we make based
> on our knowledge of the reality we intend to model and of the capabilities of
> our RDBMS.
>
> When stating your formal definition of why an image type is atomic, assume
> that the digital image type is only defined as a simple bitmap that may
> therefore be subdivided into multiple smaller bitmaps. The implication of this
> is that if you add attributes to define how the bitmap pieces fit together you
> can always decompose any bitmap larger than one pixel without loss of
> information. As far as I can see, this is exactly analogous to the example of
> an ordered delimited list.
>
> --
> David Portas
> SQL Server MVP
> --
>
>
Author
2 Jul 2005 12:40 PM
Adam Machanic
"Brian Selzer" <br***@selzer-software.com> wrote in message
news:OM%23eJFnfFHA.1948@TK2MSFTNGP12.phx.gbl...
>
> employee was hired.  A character string derives its value from the
> permutation of its component character values.  Regardless of the
underlying
> physical representation of a character, the value is the same whether its
> Unicode, ASCII or EBCDIC.  Each individual character in the permutation
does
> not represent anything other than its own value.  In the same way an
integer
> derives its value from the permutation of its component bits, which do not
> represent anything other than their own values.

    And, in my databases, a longitude without a latitude is nothing but a
meaningless decimal.  Add a latitude to that longitude and we suddenly get a
datatype -- a point.  We can do various operations on the point.  We can
compute distances to other points, we can add offsets based on direction,
etc.  We can do none of those operations on JUST a longitude.  The longitude
itself, within my databases, just like a single bit of an integer, is
meaningless.

    But in some other database, perhaps some computation is possible based
on longitude, and some other computations are possible based on latitude.
And in THAT database, the combination of longitude and latitude would not be
atomic.

    Likewise with characters.  For instance, consider the following string:

    BRIAN

    Is that string 'atomic'?  I don't know!  In a given database, there
might be rules defined to perform logic based on the 5 characters in the
string.  Perhaps each character represents an access level to some resource.
In such a case, based on your previous posts to this thread, I'm assuming
that you would probably expect to see:

    B,R,I,A,N

    But the commas are not what makes this string problematic, because as
we've seen above, commas are not required for the access control scheme.  We
can parse the comma-less version and get the same net result.  The problem
is, in both cases the string is not being treated as merely a string -- it's
being treated as a collection of characters.  And that's a matter of
semantics, because we cas also insert that same string into a column called
'FirstName' -- and, assuming that column is used the way we understand first
names to be used, the string suddenly can be considered 'atomic'.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
3 Jul 2005 1:29 AM
Brian Selzer
>     And, in my databases, a longitude without a latitude is nothing but a
> meaningless decimal.  Add a latitude to that longitude and we suddenly get
a
> datatype -- a point.

You're wrong.  Because it is a longitude, it can only be used as a
longitude.  It cannot be used as a lattitude.  It cannot be used as a price.
It carries with it the definition and constraints associated with its
domain.  (You may not have defined any constraints, but if you didn't, you
probably should have.) It has a predefined range of possible values, and
represents the distance of a place east or west of an imaginary line from
the top to the bottom of the Earth.  It is NOT just a meaningless decimal.

>     But in some other database, perhaps some computation is possible based
> on longitude, and some other computations are possible based on latitude.
> And in THAT database, the combination of longitude and latitude would not
be
> atomic.

What happens when THAT database is combined with your database.

Your other example should not merit comment, because stuffing flags in a
character column is a clear violation of 1NF.  Each flag has a separate
purpose, therefore each flag value in a row is an atomic value with respect
to the relational model.  The fact that during implementation you chose to
stuff the flags into a single character column is an example of
denormalizing a relation for some implementation-specific reason.


Show quote
"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:uKx4vMwfFHA.2700@TK2MSFTNGP15.phx.gbl...
> "Brian Selzer" <br***@selzer-software.com> wrote in message
> news:OM%23eJFnfFHA.1948@TK2MSFTNGP12.phx.gbl...
> >
> > employee was hired.  A character string derives its value from the
> > permutation of its component character values.  Regardless of the
> underlying
> > physical representation of a character, the value is the same whether
its
> > Unicode, ASCII or EBCDIC.  Each individual character in the permutation
> does
> > not represent anything other than its own value.  In the same way an
> integer
> > derives its value from the permutation of its component bits, which do
not
> > represent anything other than their own values.
>
>     And, in my databases, a longitude without a latitude is nothing but a
> meaningless decimal.  Add a latitude to that longitude and we suddenly get
a
> datatype -- a point.  We can do various operations on the point.  We can
> compute distances to other points, we can add offsets based on direction,
> etc.  We can do none of those operations on JUST a longitude.  The
longitude
> itself, within my databases, just like a single bit of an integer, is
> meaningless.
>
>     But in some other database, perhaps some computation is possible based
> on longitude, and some other computations are possible based on latitude.
> And in THAT database, the combination of longitude and latitude would not
be
> atomic.
>
>     Likewise with characters.  For instance, consider the following
string:
>
>     BRIAN
>
>     Is that string 'atomic'?  I don't know!  In a given database, there
> might be rules defined to perform logic based on the 5 characters in the
> string.  Perhaps each character represents an access level to some
resource.
> In such a case, based on your previous posts to this thread, I'm assuming
> that you would probably expect to see:
>
>     B,R,I,A,N
>
>     But the commas are not what makes this string problematic, because as
> we've seen above, commas are not required for the access control scheme.
We
> can parse the comma-less version and get the same net result.  The problem
> is, in both cases the string is not being treated as merely a string --
it's
> being treated as a collection of characters.  And that's a matter of
> semantics, because we cas also insert that same string into a column
called
> 'FirstName' -- and, assuming that column is used the way we understand
first
> names to be used, the string suddenly can be considered 'atomic'.
>
>
> --
> Adam Machanic
> SQL Server MVP
> http://www.datamanipulation.net
> --
>
>
>
>
Author
3 Jul 2005 4:30 AM
Adam Machanic
"Brian Selzer" <br***@selzer-software.com> wrote in message
news:errQq62fFHA.3316@TK2MSFTNGP14.phx.gbl...
>
> to the relational model.  The fact that during implementation you chose to
> stuff the flags into a single character column is an example of
> denormalizing a relation for some --- implementation-specific --- reason.

    EXACTLY!


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
5 Jul 2005 4:06 PM
Thomas Coleman
Sorry, Brian, I disagree. Atomicity always depends on the system. For example,
suppose we have a mapping program that takes a location in the form (Long, Lat).
Further suppose, that there is *never* a need to query on longitude or latitude.
Rather, searches are against other attributes and at the end we map the location
from the coordinates of the given entity found in the search. Thus, the only
time the value will ever be used is to plug into this mapping software to map a
geographical location. It will never be queried against. In that case, it is
perfectly acceptable to create a column called "Coordinates" and store these two
values as single point. The values individually have no meaning in the context
of the system being built. They only have context together in the form (Long,
Lat).

If we did not accept that atomic was based on the system being designed, we
would have no end to debates on whether any given value is sufficiently atomic.
VINs could never be stored a single value because it is made of individual
attributes. Storage of phone numbers would always require three or four columns.
Storage of addresses would always require seven or more columns.

Atomicity must always be checked against the needs and requirements of the
system being built lest we get mired in "analysis-paralysis."


Thomas
Author
6 Jul 2005 9:15 PM
wbrianwhite
> Semantics. You muddy a pure theorectical concept with a human's
> interference. Using that same logic, one could stuff any crap into a
> database, call it a data element, and code an operator applicable to
> the CRAP datatype. Then you could point at your database and claim it
> still adhered to 1NF, no?

Amen brother.  I don't know how many times I've seen columns filled with
crap like '1~3~0~9~2' or '1,,3,5,,6'.  Who cares whether it might technically
qualify as 1NF?  Note, none of the databases I saw with that had any operator
for those fields, all hardcoded in the application layer.  What if I want to
add another element after the ~2?  How unnecessarily complicated does this
then become?  Do you need to update your operator first?  Still think you
have a scalable and reliable system?  How are you going to link a description
in another table to ~3 with a foreign key when it's not a column?  At the
very least, the first part of your organization's best practices should be
'DO NOT implement the CRAP datatype or you will be sacked.'
Author
6 Jul 2005 10:12 PM
roy.anderson@gmail.com
wbrianwhite wrote:
> Amen brother.  I don't know how many times I've seen columns filled with
> crap like '1~3~0~9~2' or '1,,3,5,,6'.  Who cares whether it might technically
> qualify as 1NF?  Note, none of the databases I saw with that had any operator
> for those fields, all hardcoded in the application layer.  What if I want to
> add another element after the ~2?  How unnecessarily complicated does this
> then become?  Do you need to update your operator first?  Still think you
> have a scalable and reliable system?  How are you going to link a description
> in another table to ~3 with a foreign key when it's not a column?  At the
> very least, the first part of your organization's best practices should be
> 'DO NOT implement the CRAP datatype or you will be sacked.'

As opposed to my current job:
"You MUST implement the CRAP datatype or you will be sacked."

And let's not forget the ever-infamous contractual job. Company hires
shmuck for 3-6 month project to get them off the ground. He does, but
it's clearly been done with a minimum amount of design and a maximum
amount of ad-hoc quick fixes.
It worked just long enough for his paychecks to cash and any
contractual obligations to expire. Enter you:

You: "Well, it appears that your system sucks. I'll need to virtually
start from scratch. My first clue was the CRAP datatype here."

Boss, blustery: "What do you mean it sucks? It's been working fine for
the past couple months! Are you sure you know what you're talking
about?"

You, blinking: "Oooooh. Ok, I see what you're saying now. You're right,
I was confused there for a minute. It's become clear to me that your
real problem are the widgets in the perforationator, I betcha dimes to
dollars they've crossed wires with the gibson regulators. Shouldn't
take me any time at all to clear up once we get the contract nailed
down. The contract expires after a month, you say?"
Author
1 Jul 2005 2:22 AM
Brian Selzer
An absurd example--let's say you have the following table:

Orders: Customer CSV, Item CSV, Quantity INT, OrderDate DATETIME

'Joe Smith, 16473 Wills Ave., Apt. 153, Atlanta, Georgia', 'Candle, Red, 1
ft., $1.99', 5, '2005-04-30 11:15:00'
'Bill Black, P.O Box 1535, Arlington, Virginia', 'Candle, Blue, 6 in.,
$1.29', 3, '2005-05-02 16:18:00'
'Bill Black, P.O Box 1535, Arlington, Virginia', 'Candle, White, 3 in.,
$0.79', 2, '2005-05-15 12:32:00'

In the strict sense, this table is in 1NF, because each attribute contains
only one value.

In my opinion, there must be some definitive criteria that separates the
ridiculous example above and the equally ridiculous but opposite
characterization that a string is not atomic because it is a list of
character values.  (A date can be thought of as the number of days since
some arbitrary point in time, and as such, year, month and day are derived
values; therefore, a date value is always scalar.)  Let's explore this.  The
permutation of character values determines the meaning of a string.  Each
individual character has no significance other than its own value, nor does
its inclusion or position in a string impart any semantic context to that
value (the character value, not the string value).  In most lists, each
value represents something tangible, and the entry in the list is a handle,
or shorthand for the tangible entity.  For example, each item on a to-do
list represents some task that must be performed.  In other lists, like an
ordered pair, the position of the value imparts meaning to the value.  For
example, in an ordered pair, the first number is the x coordinate, and the
second is the y coordinate.  Let's generalize a bit more.  Given the
following definitions for the different kinds of composite values:  a
collection is an unordered group of zero or more values, not necessarily
distinct, that share the same base type; a set is an unordered group of zero
or more distinct values that share the same base type; a list is an ordered
group of zero or more values, not necessarily distinct, that share the same
base type; an n-tuple is an ordered group of n values, whose position
determines both type and context.  I propose the following formal
definition: a composite data type is atomic if and only if: (1) each
distinct value is determined solely on the presence and/or order of its
component elements' values,  (2) all of its componenent elements are atomic,
and (3) none of its component elements is a surrogate for a tangible entity.
An n-tuple would never be atomic, because each element is a surrogate--that
is, each element represents something.  A string would always be atomic
because its value is determined solely by the permutation of its component
character values.  A set of integers could be considered atomic if the
integer values did not carry any additional meaning.  A list of sale dates
for item X could never be considered atomic because each date entry carries
additional meaning aside from the date value, and consequently each date
value is a surrogate for a sale date for item X.


Show quote
"Anith Sen" <an***@bizdatasolutions.com> wrote in message
news:ek3$8VbfFHA.3988@TK2MSFTNGP10.phx.gbl...
> >> My primary objection to implementing a CSV ist domain and defining
> >> relations containing attributes with a CSV list domain is that a CSV
list
> >> is by definition not atomic.
>
> Whether some value is atomic or not is purely subjective and contextual;
In
> some contexts the "Bryan" is an atomic value while in some other context
it
> may be a list of letters. In some contexts the literal "2005/06/30"
defines
> a single date value while in some other contexts it is comprised of three
> distinct values representing year, month and time.
>
> Thus a definition of 1NF on a loose term like "atomicity" alone is neither
> precise nor formal. For a formal definition one would need a reference to
> something which is non-contextual and unambiguous. A type or domain
provides
> that. A type\ domain is a set of values along with associated operators
> applicable on those values. Thus in a good type system, one could have a
> type of arbitrary complexity and values are scalar with respect to the
type.
>
> >> A CSV list is an ordered set of zero or more values, separated by
commas.
>
> In some contexts it is. In some other contexts, it is just as scalar as an
> int or datetime or char(10) value. A well defined type could precisely
> capture the context and represent the values in a relational database
> appropriately.
>
> >> Every datum (atomic value) in a relational database can be accessed if
> >> you know the table name, the primary key value, and the column name.
>
> Nobody disagrees with that.
>
> >> How can you write a query to access the third element in a CSV list
> >> column using only the table name, the primary key value, and the column
> >> name.
>
> Using the operators applicable to the CSV list type.
>
> To use an analogy, how do you extract the month from a datetime value?
Using
> DATEPART operator, which is an operator applicable to the DATETIME type.
>
> >>  The additional dimension (element number) makes this impossible.
>
> Well, that could be an argument to the operator, if an element needs to be
> extracted based on its index or position in the list.
>
> >> It follows then that a database that implements a CSV list domain is by
> >> definition not a relational database
>
> No, it does not follow anything of that sort. The link on 1NF provided
early
> in this thread should clarify such misconceptions.
>
> --
> Anith
>
>
Author
1 Jul 2005 10:09 AM
David Portas
Your proposals may make sense for some real-world products but they are
not consistent with the Relational Model. In fact you are proposing
exactly the kind of constraints that some people incorrectly identify
and criticize as being the limitations of the model - namely, that it
can only cope with certain primitive types of data. These are product
limitations, not theoretical ones.

Relation-valued attributes, CSV list types, XML types and other
"exotic" types are perfectly respectable in theory. Given that one is
provided with operators for every built-in datatype, the quality of
atomicity of values in that type is primarily a subjective and
pragmatic issue - i.e. does your RDBMS support operators and other
features that make particular representations of data a practical and
useful design. Database professionals understand the features and
constraints of the products they use and make judgements about
non-atomic values based on their knowledge and experience of what will
work well in those products.

I'm as guilty as many others of instinctively labelling delimited lists
or repeating groups of columns as a violation of 1NF. It's very
convenient to roll out that argument sometimes but it's also a tiny bit
lazy and prejudiced. It is shaped purely by my experience of the
products and situations I've come across.

--
David Portas
SQL Server MVP
--
Author
1 Jul 2005 10:28 PM
Anith Sen
David,

>> I'm as guilty as many others of instinctively labelling delimited lists
>> or repeating groups of columns as a violation of 1NF. It's very
>> convenient to roll out that argument sometimes but it's also a tiny bit
>> lazy and prejudiced. It is shaped purely by my experience of the products
>> and situations I've come across.

Don't be hard on yourself :-) All of us have done it; even some of the early
papers by Codd has similar suggestions. In fact, till some good research on
type inheritance and most specific types (MSTs) became fruitful, most folks
familiar with  relational databases were stating the same. The true nature
of first normal form, at least from the perspective of scalar (encapsulated)
types has not been well understood in the early years of relational
databases.

But then common sense generally tells us, when someone in a public forum
requests a CSV to be crammed into a single VARCHAR column it is most likely
an informal violation of 1NF.

--
Anith
Author
1 Jul 2005 10:23 PM
Anith Sen
>> An absurd example--let's say you have the following table: [snipped for
>> brevity] In the strict sense, this table is in 1NF, because each
>> attribute contains only one value.

Provided each of those values is drawn from its corresponding well defined
type.

>> In my opinion, there must be some definitive criteria that separates the
>> ridiculous example above and the equally ridiculous but opposite
>> characterization that a string is not atomic because it is a list of
>> character values.

It is already mentioned that such a definitive criteria can be provided by a
precisely defined type. For a simple introduction to this concept see:
http://www.cs.ust.hk/vldb2002/VLDB2002-proceedings/slides/S01P02slides.pdf

However, as David said that the discretion regarding type support in a RDBMS
is not strictly in the realm of relational theory, which occasionally
results in subjective perceptions even within the logical model.

>> (A date can be [snipped for brevity] I propose the following formal
>> definition:

None of us here need to propose anything new, since qualified technical
literature is out there on value atomicity, t