|
database
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
convert flat file to native file format for bcpHello,
We bulk insert terabytes of data in delimited flat files every day. I recently heard from our DBA that using bcp especially in native format would result in reduction in file size and much faster inserts. Since we transfer these files across the network, I think we would save a lot of bandwidth as well by producing native mode files. Is there an API to convert the files in character mode to native mode/unicode native mode? What are the caveats? I would be grateful for a few pointers to where I should start looking. Thanks, mar Mar,
I don't know of any documentation of the format of native bcp files, nor do I know of an API to convert a text file to one. However, I assume a bcp-able native file has a straightforward structure that matches how SQL Server stores data in table rows (which is documented). If so, it should be a simple matter to write a conversion program in C++ or awk that could be run before the file is pulled down. To do this, I would generate native files from the data you have, then study them in a hex editor to discern their underlying format. You might also want to compare them with a dump of the destination table without any indexes, in the hope they are the same, and you can use the existing documentation about how data is stored in a non-indexed table. You can look at this with DBCC PAGE. Caveats would be any special cases you don't know about (there is probably special handling of text, ntext, and image, for example), and the smallest error in your conversion program would likely corrupt the native file or cause the import to give wrong results. I've posted threads in the past to show how to dump the table data using DBCC PAGE (adapted from Kalen Delaney's Inside Sql Server 2000: http://groups.google.com/groups?hl=en&q=kass+%22dbccpageout%22+m_nextpage Steve Kass Drew University mar wrote: Show quote >Hello, > >We bulk insert terabytes of data in delimited flat files every day. I >recently heard from our DBA that using bcp especially in native format >would result in reduction in file size and much faster inserts. Since >we transfer these files across the network, I think we would save a lot >of bandwidth as well by producing native mode files. > >Is there an API to convert the files in character mode to native >mode/unicode native mode? What are the caveats? I would be grateful for >a few pointers to where I should start looking. > >Thanks, >mar > > > Steve Kass wrote:
Show quote > Mar, Question is whether it's worth the effort. I don't think import> > I don't know of any documentation of the format of native bcp > files, nor do I know of an API to convert a text file to one. > > However, I assume a bcp-able native file has a straightforward > structure that matches how SQL Server stores data in table rows > (which is documented). If so, it should be a simple matter to write > a conversion program in C++ or awk that could be run before the > file is pulled down. To do this, I would generate native files from > the data you have, then study them in a hex editor to discern their > underlying format. You might also want to compare them with > a dump of the destination table without any indexes, in the hope > they are the same, and you can use the existing documentation > about how data is stored in a non-indexed table. You can look > at this with DBCC PAGE. performance will improve if files have to be read, converted to native, written and then read again (as native file). ASCII CSV files usually have quite good compression ratios so I suggest to simply gzip files for transfer and uncompress them prior to bulk loading (or even use a piped version of that if bcp supports reading from stdin). Kind regards robert mar (marvind***@yahoo.com) writes:
> We bulk insert terabytes of data in delimited flat files every day. I You could write a C++ program that that uses the IDataConvert interface> recently heard from our DBA that using bcp especially in native format > would result in reduction in file size and much faster inserts. Since > we transfer these files across the network, I think we would save a lot > of bandwidth as well by producing native mode files. > > Is there an API to convert the files in character mode to native > mode/unicode native mode? What are the caveats? I would be grateful for > a few pointers to where I should start looking. to convert the data to native data types. IDataConvert is described in the OLE DB Programmers Reference, Appendix I. While you might get a way with specifying just native format, you are probably better of writing a format file, to have better control. But I am doubts that this will be worth the pain. If there is a lot of numeric or datetime data, using native format may decrease the number of bytes that crosses the network a bit, but it would not be drastic. A good compression scheme is much more effective. Likewise I don't expect that bulk-load with native format will be that much faster. An easy way to benchmark this, is first load a text file and time that. The bulk out that table in native format, clear the table, and load the native file. If you really see huge gains, you may consider writing that C++ application. -- Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se Books Online for SQL Server 2005 at http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx I am the DBA that am attempting to wring the arm of one of our developers to
look into this. If I knew C++ I would attempt this myself but since I don’t I am bothering anyone that will listen ;) and I appreciate your imput. Native mode is much faster than character mode. BCP runs out of process and therefore requires inter process data copying and parameter marshaling to move data across process memory spaces. Inter process data marshaling is the process of converting parameters of a method call into a stream of bytes and can add significantly to CPU overhead. Bulk Insert skips these steps and can go straight to OLE-DB. Anyhow what good is theory with out a test……Here are my tests and results I created a sample table CREATE TABLE [dbo].[t1] ( [c1] [int] NOT NULL , [c2] [int] NOT NULL , [c3] [datetime] NOT NULL , [c4] [datetime] NOT NULL , [c5] [int] NOT NULL , [c6] [varchar] (48) , [c7] [varchar] (48) ) This table contains 45,010,015 records for a total size of 3,065,744 KB I exported the table out twice once in native mode and once in character mode Native – 2,534,695 KB Character – 4,314,642 KB I then bulked in each file using both bcp and BULK INSERT 1. BULK INSERT t1 FROM 'm:\t1_native.out' WITH (DATAFILETYPE = 'native',TABLOCK ) 2. BULK INSERT t1 FROM 'm:\t1_char.out' WITH (DATAFILETYPE = 'char',TABLOCK ) 3. EXEC master..xp_cmdshell 'bcp "BulkInsert.dbo.t1" in "m:\t1_char.out" -c -h"TABLOCK" -U"sa" -P""' 4. EXEC master..xp_cmdshell 'bcp "BulkInsert.dbo.t1" in "m:\t1_native.out" -n -h"TABLOCK" -U"sa" -P""' Results – I ran each test 3 times and averaged the results BULK INSERT native = 103 seconds BULK INSERT character = 186 seconds BCP native = 216 seconds BCP character = 352 seconds As far as it being worth the effort…I think so. We have some servers that load hundreds of millions of records in a day. We have a farm of servers that currently generate the BCP files this servers could also be responsible for creating the native format files. These servers are not in a 1 to 1 relation to the DB servers so if the conversion process slows these servers down we can just add more. Erland, I am not sure what you are suggestion to do with a format file, could you elaborate? Native files do not use format files…. Thanks Bert Bert (B***@discussions.microsoft.com) writes:
Show quote > Results - I ran each test 3 times and averaged the results While your benchmark shows that native is faster, is not drastic. But> > BULK INSERT native = 103 seconds > BULK INSERT character = 186 seconds > BCP native = 216 seconds > BCP character = 352 seconds > > > As far as it being worth the effort I think so. We have some servers > that load hundreds of millions of records in a day. We have a farm of > servers that currently generate the BCP files this servers could also be > responsible for creating the native format files. These servers are not > in a 1 to 1 relation to the DB servers so if the conversion process > slows these servers down we can just add more. if you think gain is worthwhile, I guess it is. Just a word of warning, though: I don't know from what source these files are generated, but just make sure that the cost for generating binary files in native format does not outweighs the benefit for BCP. > Erland, I am not sure what you are suggestion to do with a format file, It was a while since I wrote that, so I don't recall exactly what I had> could you elaborate? Native files do not use format files. in mind. But I guess the idea was that -n is really a shortcut for a -f with a format file that specifies a file format equal to the table defintion. By the way, have you considered DTS (or SQL Integration Services). I'm not used neither of them myself, but as I understand DTS/SSIS is what you are supposed to used when you load tons of data everyday like you do. -- Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se Books Online for SQL Server 2005 at http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx >>Just a word of warning, though: I don't know from what source these files My understanding is that the conversion needs to happen anyhow. Correct me >>are generated, but just make sure that the cost for generating binary >>files in native format does not outweighs the benefit for BCP. if I am wrong but my understanding is as follows BULK INSERT using a native mode file File is converted into binary TDS stream on the SQL Server and Loaded BULK INSERT using a character mode file File does not need to be converted to binary TDS because that is how the native file is stored. Having the ability to offload the conversion to another server also allows me to offload it to multiple servers. Even if the load is increased on one or more server the load will be decreased on the database server and this is my goal. I took a look at the documentation for format files, one of the only advantages to using this file would be if I only wanted to load a subset of the columns into the database. I did some testing on format files in the past and under heavy load I would have a small percentage of the loads fail. Not only did I have failures but the load using the format file was significantly lower than not using a format file. Our application currently is on SQL200 so I haven’t had much time to look at 2005. Hopefully 2005 SSIS has an in Process mechanism to load data such as the BULK INSERT statement. I will defiantly talk a look at this. Thanks Bert Bert (B***@discussions.microsoft.com) writes:
> Having the ability to offload the conversion to another server also Of course, because that's the server you are responsible for. Wonder> allows me to offload it to multiple servers. Even if the load is > increased on one or more server the load will be decreased on the > database server and this is my goal. what the admins of the other servers think. :-) > Not only did I have failures but the load using the format file was I am correct to guess that "lower" is a typo for "slower"?> significantly lower than not using a format file. -- Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se Books Online for SQL Server 2005 at http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx Haha, no I manage those servers also. We have a bank of about 20 servers
that except external customers firewall and IDS logs. For example a customer might have 100 firewalls that we are monitoring in real time. The raw log files come in and the bank of servers convert them to BCP files (currently in character format) The goal of our application is to load the data in real time but sometimes if the customer has for example a worm outbreak the load can increase for bursts of time. If it last too long we start to get backed up on the SQL Server side. Currently we buy big SQL Server boxes so they can handle these burst in traffic. I feel that we could ride these bursts if we added a few more log processing servers in fro t of the SQL Servers. (We have about 150 SQL Servers) Instead of buying so many heavy duty SQL boxes we add a few more servers to the log processing group. >>I am correct to guess that "lower" is a typo for "slower"? Yep much Slower…. I tested this on the June CTP of SQL 2005 as well, it also was much slower than with no format file. Do you concur that native format files are actually stored as TDS? Thanks again Bert Bert (B***@discussions.microsoft.com) writes:
> Yep much Slower. I tested this on the June CTP of SQL 2005 as well, it I don't know exactly what is going on, but there may be a fast track> also was much slower than with no format file. > > Do you concur that native format files are actually stored as TDS? for -n and with a format file, it always runs the conversions although it may not have to in some cases. Pure speculation on my part. -- Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se Books Online for SQL Server 2005 at http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx |
|||||||||||||||||||||||