|
database
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
how to represent very long primary keylet's say that I have a table that represent files: CREATE TABLE DocumentFiles ( name ntext not null, size bigint not null constraint filesize_check check(size >= 0), lastmodif datetime not null default GetUTCDate() ) Now I cannot use the name of the file as a primary key. But I want a way to find a specific file base on it's name and for the sake of the discussion let assume that the file name can be longer than 1000 chars I thought of using a hash on the name and using the SHA function to compute it. ex: CREATE TABLE DocumentFiles ( nameid binary(20) not null primary key, name ntext not null, size bigint not null constraint filesize_check check(size >= 0), lastmodif datetime not null default GetUTCDate() ) everytime I want to insert some file information, I compute the SHA of the name everytime my program want to find a file, it first compute the SHA of the name and use it in the WHERE part of the query. is it a good idea ? what are the other alternative ? Thanks Wonixen Hi ,
I would use the CHECKSUM() function and create a hash index on the column. Look up hash indinces in the books online. You can use a computed column on the name column as suggested in the BOL. -- Show quotekind regards Greg O ------- Looking to use CLR in SQL 2005. Try some pre-build CLR Functions and SP AGS SQL 2005 Utilities, over 20+ functions http://www.ag-software.com/?tabid=38 <hroussel_at_delphes.***@hotmail.com> wrote in message news:1134677449.202934.144970@f14g2000cwb.googlegroups.com... > Hi! > > let's say that I have a table that represent files: > > CREATE TABLE DocumentFiles ( > name ntext not null, > size bigint not null constraint filesize_check check(size >= 0), > lastmodif datetime not null default GetUTCDate() > ) > > Now I cannot use the name of the file as a primary key. > But I want a way to find a specific file base on it's name > and for the sake of the discussion let assume that the file name can be > longer than 1000 chars > > I thought of using a hash on the name and using the SHA function to > compute it. > ex: > > CREATE TABLE DocumentFiles ( > nameid binary(20) not null primary key, > name ntext not null, > size bigint not null constraint filesize_check check(size >= 0), > lastmodif datetime not null default GetUTCDate() > ) > > everytime I want to insert some file information, I compute the SHA of > the name > everytime my program want to find a file, it first compute the SHA of > the name and use it in the WHERE part of the query. > > is it a good idea ? > > what are the other alternative ? > > Thanks > > Wonixen > If this table is to have a foreign key relationship with other tables, then
you will need to implement a surrogate key, perhaps an integer based identity column and use that as the primary key. With only a couple of possible exceptions... you would not find anyone here advocating that a 1000 char filename should be used as a foreign key. Just becuase FileName is not the primary key, you can still search DocumentFiles based on FileName. If it were VarChar instead of (2 byte unicode) ntext, you could just index it. Are the files names really as long as 1000 chars, perhaps you could store FilePath and FileName in seperate VarChar columns and index just FileName. If indexing on FileName is not practical, then you could add an Int column called FileNameCS that contains the checksum() value of FileName and then index on that. The stored procedure you use to query the table could accept @FileName as a parameter let SQL Server calculate @FileNameCS since there is no equivalent CheckSum() function at the application level. You would need to search on both FileName and FileNameCS since two different character values can possibly translate into the same checksum value. CREATE proc usp_GetDocumentFileInfo @FileName VarChar(8000) as declare @FileNameCS int select @FileNameCS = checksum(@Name) select * from DocumentFiles where FileNameCS = @FileNameCS and FileName = @FileName GO <hroussel_at_delphes.***@hotmail.com> wrote in message Show quote news:1134677449.202934.144970@f14g2000cwb.googlegroups.com... > Hi! > > let's say that I have a table that represent files: > > CREATE TABLE DocumentFiles ( > name ntext not null, > size bigint not null constraint filesize_check check(size >= 0), > lastmodif datetime not null default GetUTCDate() > ) > > Now I cannot use the name of the file as a primary key. > But I want a way to find a specific file base on it's name > and for the sake of the discussion let assume that the file name can be > longer than 1000 chars > > I thought of using a hash on the name and using the SHA function to > compute it. > ex: > > CREATE TABLE DocumentFiles ( > nameid binary(20) not null primary key, > name ntext not null, > size bigint not null constraint filesize_check check(size >= 0), > lastmodif datetime not null default GetUTCDate() > ) > > everytime I want to insert some file information, I compute the SHA of > the name > everytime my program want to find a file, it first compute the SHA of > the name and use it in the WHERE part of the query. > > is it a good idea ? > > what are the other alternative ? > > Thanks > > Wonixen > Hi!
Thanks for the answers. I'll try checksum with varchar. JT, yes the file name can be longer than 1000, I had some that were around 2600 chars and in theory they can be as long as 32767, that is why I am using ntext so that i have no practical limits. but the problem is how to find them once you have the name. So I thought of SHA1 which is almost collision free. But then I could not use store proc to find file base on name. Wonixen maybe instead of using ntext, I can use varchar with an upper limit and log errors if i ever encounter a file name that bust that upper limits. Wonixen VarChar can support a maximum of 8000 and NVarchar a maximum of 4000
(unicode). Just out of curiosity, how are file names assigned and what is the naming convention? By the way, 32,767 characters is practically a short story or at least a term paper. Let's hope you never have to actually recite the name of one of those files. ;-) <hroussel_at_delphes.***@hotmail.com> wrote in message Show quote news:1134682240.151473.103960@f14g2000cwb.googlegroups.com... > Hi! > > Thanks for the answers. I'll try checksum with varchar. > > JT, > yes the file name can be longer than 1000, I had some that were around > 2600 chars > and in theory they can be as long as 32767, that is why I am using > ntext so that i have no practical limits. but the problem is how to > find them once you have the name. So I thought of SHA1 which is almost > collision free. > > But then I could not use store proc to find file base on name. > > Wonixen > > maybe instead of using ntext, I can use varchar with an upper limit and > log errors if i ever encounter a file name that bust that upper limits. > > Wonixen > Thanks JT!
Well we do not assign filename, we merely "crawl" a customer hard disk, web site, database, archive to do some text analysis of their data. so the file name look like this: scheme://some_document_identifier. that way, with special retrievers, we can do files, zip archived, database, website etc... ex of file name. file://c:\mydata\file1.doc filezip://c:\archived\whatever.zip db://database/table/column http://www.delphes.com/datainfo/showbusinessplan.asp?month=6;year=2006 the document identifier can be very long, and no known structure, each retriever is responsible for returning file name. everytime a new name is received, it must be put in the db, unless it is already there, and each name is unique so I must have a way to prevent duplicate of file name. in general names tend to be around 120 chars, but it is an average, but some customer had some names that were much larger. Wonixen |
|||||||||||||||||||||||