Home All Groups Group Topic Archive Search About

how to represent very long primary key

Author
15 Dec 2005 8:10 PM
hroussel_at_delphes.com
Hi!

let's say that I have a table that represent files:

CREATE TABLE DocumentFiles (
name ntext not null,
size bigint not null constraint filesize_check check(size >= 0),
lastmodif datetime not null default GetUTCDate()
)

Now I cannot use the name of the file as a primary key.
But I want a way to find a specific file base on it's name
and for the sake of the discussion let assume that the file name can be
longer than 1000 chars

I thought of using a hash on the name and using the SHA function to
compute it.
ex:

CREATE TABLE DocumentFiles (
nameid binary(20) not null primary key,
name ntext not null,
size bigint not null constraint filesize_check check(size >= 0),
lastmodif datetime not null default GetUTCDate()
)

everytime I want to insert some file information, I compute the SHA of
the name
everytime my program want to find a file, it first compute the SHA of
the name and use it  in the WHERE part of the query.

is it a good idea ?

what are the other alternative ?

Thanks

Wonixen

Author
15 Dec 2005 8:45 PM
GregO
Hi ,
I would use the CHECKSUM() function and create a hash index on the column.
Look up hash indinces in the books online.  You can use a computed column on
the name column as suggested in the BOL.


--
kind regards
Greg O
-------
Looking to use CLR in SQL 2005.  Try some pre-build CLR Functions and SP
AGS SQL 2005 Utilities, over 20+ functions
http://www.ag-software.com/?tabid=38
<hroussel_at_delphes.***@hotmail.com> wrote in message
Show quote
news:1134677449.202934.144970@f14g2000cwb.googlegroups.com...
> Hi!
>
> let's say that I have a table that represent files:
>
> CREATE TABLE DocumentFiles (
> name ntext not null,
> size bigint not null constraint filesize_check check(size >= 0),
> lastmodif datetime not null default GetUTCDate()
> )
>
> Now I cannot use the name of the file as a primary key.
> But I want a way to find a specific file base on it's name
> and for the sake of the discussion let assume that the file name can be
> longer than 1000 chars
>
> I thought of using a hash on the name and using the SHA function to
> compute it.
> ex:
>
> CREATE TABLE DocumentFiles (
> nameid binary(20) not null primary key,
> name ntext not null,
> size bigint not null constraint filesize_check check(size >= 0),
> lastmodif datetime not null default GetUTCDate()
> )
>
> everytime I want to insert some file information, I compute the SHA of
> the name
> everytime my program want to find a file, it first compute the SHA of
> the name and use it  in the WHERE part of the query.
>
> is it a good idea ?
>
> what are the other alternative ?
>
> Thanks
>
> Wonixen
>
Author
15 Dec 2005 8:50 PM
JT
If this table is to have a foreign key relationship with other tables, then
you will need to implement a surrogate key, perhaps an integer based
identity column and use that as the primary key. With only a couple of
possible exceptions... you would not find anyone here advocating that a 1000
char filename should be used as a foreign key.

Just becuase FileName is not the primary key, you can still search
DocumentFiles based on FileName. If it were VarChar instead of (2 byte
unicode) ntext, you could just index it. Are the files names really as long
as 1000 chars, perhaps you could store FilePath and FileName in seperate
VarChar columns and index just FileName.

If indexing on FileName is not practical, then you could add an Int column
called FileNameCS that contains the checksum() value of FileName and then
index on that. The stored procedure you use to query the table could accept
@FileName as a parameter let SQL Server calculate @FileNameCS since there is
no equivalent CheckSum() function at the application level. You would need
to search on both FileName and FileNameCS since two different character
values can possibly translate into the same checksum value.

CREATE proc usp_GetDocumentFileInfo
@FileName VarChar(8000)
as

declare @FileNameCS int
select @FileNameCS = checksum(@Name)

select
    *
from
    DocumentFiles
where
    FileNameCS = @FileNameCS and
    FileName = @FileName

GO


<hroussel_at_delphes.***@hotmail.com> wrote in message
Show quote
news:1134677449.202934.144970@f14g2000cwb.googlegroups.com...
> Hi!
>
> let's say that I have a table that represent files:
>
> CREATE TABLE DocumentFiles (
> name ntext not null,
> size bigint not null constraint filesize_check check(size >= 0),
> lastmodif datetime not null default GetUTCDate()
> )
>
> Now I cannot use the name of the file as a primary key.
> But I want a way to find a specific file base on it's name
> and for the sake of the discussion let assume that the file name can be
> longer than 1000 chars
>
> I thought of using a hash on the name and using the SHA function to
> compute it.
> ex:
>
> CREATE TABLE DocumentFiles (
> nameid binary(20) not null primary key,
> name ntext not null,
> size bigint not null constraint filesize_check check(size >= 0),
> lastmodif datetime not null default GetUTCDate()
> )
>
> everytime I want to insert some file information, I compute the SHA of
> the name
> everytime my program want to find a file, it first compute the SHA of
> the name and use it  in the WHERE part of the query.
>
> is it a good idea ?
>
> what are the other alternative ?
>
> Thanks
>
> Wonixen
>
Author
15 Dec 2005 9:30 PM
hroussel_at_delphes.com
Hi!

Thanks for the answers. I'll try checksum with varchar.

JT,
yes the file name can be longer than 1000, I had some that were around
2600 chars
and in theory they can be as long as 32767, that is why I am using
ntext so that i have no practical limits. but the problem is how to
find them once you have the name. So I thought of SHA1 which is almost
collision free.

But then I could not use store proc to find file base on name.

Wonixen

maybe instead of using ntext, I can use varchar with an upper limit and
log errors if i ever encounter a file name that bust that upper limits.

Wonixen
Author
15 Dec 2005 10:02 PM
JT
VarChar can support a maximum of 8000 and NVarchar a maximum of 4000
(unicode). Just out of curiosity, how are file names assigned and what is
the naming convention?

By the way, 32,767 characters is practically a short story or at least a
term paper. Let's hope you never have to actually recite the name of one of
those files.   ;-)

<hroussel_at_delphes.***@hotmail.com> wrote in message
Show quote
news:1134682240.151473.103960@f14g2000cwb.googlegroups.com...
> Hi!
>
> Thanks for the answers. I'll try checksum with varchar.
>
> JT,
> yes the file name can be longer than 1000, I had some that were around
> 2600 chars
> and in theory they can be as long as 32767, that is why I am using
> ntext so that i have no practical limits. but the problem is how to
> find them once you have the name. So I thought of SHA1 which is almost
> collision free.
>
> But then I could not use store proc to find file base on name.
>
> Wonixen
>
> maybe instead of using ntext, I can use varchar with an upper limit and
> log errors if i ever encounter a file name that bust that upper limits.
>
> Wonixen
>
Author
16 Dec 2005 3:33 PM
hroussel_at_delphes.com
Thanks JT!

Well we do not assign filename, we merely "crawl" a customer hard disk,
web site, database, archive to do some text analysis of their data.

so the file name look like this: scheme://some_document_identifier.
that way, with special retrievers, we can do files, zip archived,
database, website etc...
ex of file name.
file://c:\mydata\file1.doc
filezip://c:\archived\whatever.zip
db://database/table/column
http://www.delphes.com/datainfo/showbusinessplan.asp?month=6;year=2006

the document identifier can be very long, and no known structure, each
retriever is responsible for returning file name.
everytime a new name is received, it must be put in the db, unless it
is already there, and each name is unique so I must have a way to
prevent duplicate of file name.

in general names tend to be around 120 chars, but it is an average, but
some customer had some names that were much larger.

Wonixen

AddThis Social Bookmark Button