Home All Groups Group Topic Archive Search About

convert flat file to native file format for bcp

Author
8 Dec 2005 11:03 PM
mar
Hello,

We bulk insert terabytes of data in delimited flat files every day. I
recently heard from our DBA that using bcp especially in native format
would result in reduction in file size and much faster inserts. Since
we transfer these files across the network, I think we would save a lot
of bandwidth as well by producing native mode files.

Is there an API to convert the files in character mode to native
mode/unicode native mode? What are the caveats? I would be grateful for
a few pointers to where I should start looking.

Thanks,
mar

Author
9 Dec 2005 5:57 AM
Steve Kass
Mar,

I don't know of any documentation of the format of native bcp
files, nor do I know of an API to convert a text file to one.

However, I assume a bcp-able native file has a straightforward
structure that matches how SQL Server stores data in table rows
(which is documented).  If so, it should be a simple matter to write
a conversion program in C++ or awk that could be run before the
file is pulled down.  To do this, I would generate native files from
the data you have, then study them in a hex editor to discern their
underlying format.  You might also want to compare them with
a dump of the destination table without any indexes, in the hope
they are the same, and you can use the existing documentation
about how data is stored in a non-indexed table.  You can look
at this with DBCC PAGE.

Caveats would be any special cases you don't know about
(there is probably special handling of text, ntext, and image,
for example), and the smallest error in your conversion program
would likely corrupt the native file or cause the import to give
wrong results.

I've posted threads in the past to show how to dump the
table data using DBCC PAGE (adapted from Kalen Delaney's
Inside Sql Server 2000:

http://groups.google.com/groups?hl=en&q=kass+%22dbccpageout%22+m_nextpage

Steve Kass
Drew University

mar wrote:

Show quote
>Hello,
>
>We bulk insert terabytes of data in delimited flat files every day. I
>recently heard from our DBA that using bcp especially in native format
>would result in reduction in file size and much faster inserts. Since
>we transfer these files across the network, I think we would save a lot
>of bandwidth as well by producing native mode files.
>
>Is there an API to convert the files in character mode to native
>mode/unicode native mode? What are the caveats? I would be grateful for
>a few pointers to where I should start looking.
>
>Thanks,
>mar
>

>
Author
9 Dec 2005 11:16 AM
Robert Klemme
Steve Kass wrote:
Show quote
> Mar,
>
> I don't know of any documentation of the format of native bcp
> files, nor do I know of an API to convert a text file to one.
>
> However, I assume a bcp-able native file has a straightforward
> structure that matches how SQL Server stores data in table rows
> (which is documented).  If so, it should be a simple matter to write
> a conversion program in C++ or awk that could be run before the
> file is pulled down.  To do this, I would generate native files from
> the data you have, then study them in a hex editor to discern their
> underlying format.  You might also want to compare them with
> a dump of the destination table without any indexes, in the hope
> they are the same, and you can use the existing documentation
> about how data is stored in a non-indexed table.  You can look
> at this with DBCC PAGE.

Question is whether it's worth the effort.  I don't think import
performance will improve if files have to be read, converted to native,
written and then read again (as native file).  ASCII CSV files usually
have quite good compression ratios so I suggest to simply gzip files for
transfer and uncompress them prior to bulk loading (or even use a piped
version of that if bcp supports reading from stdin).

Kind regards

    robert
Author
9 Dec 2005 9:57 PM
Erland Sommarskog
mar (marvind***@yahoo.com) writes:
> We bulk insert terabytes of data in delimited flat files every day. I
> recently heard from our DBA that using bcp especially in native format
> would result in reduction in file size and much faster inserts. Since
> we transfer these files across the network, I think we would save a lot
> of bandwidth as well by producing native mode files.
>
> Is there an API to convert the files in character mode to native
> mode/unicode native mode? What are the caveats? I would be grateful for
> a few pointers to where I should start looking.

You could write a C++ program that that uses the IDataConvert interface
to convert the data to native data types. IDataConvert is described in
the OLE DB Programmers Reference, Appendix I.

While you might get a way with specifying just native format, you are
probably better of writing a format file, to have better control.

But I am doubts that this will be worth the pain. If there is a lot
of numeric or datetime data, using native format may decrease the
number of bytes that crosses the network a bit, but it would not be
drastic. A good compression scheme is much more effective. Likewise I
don't expect that bulk-load with native format will be that much faster.

An easy way to benchmark this, is first load a text file and time that.
The bulk out that table in native format, clear the table, and load the
native file. If you really see huge gains, you may consider writing that
C++ application.


--
Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Author
20 Dec 2005 3:29 PM
mar
Thank you all for input.
Author
24 Jan 2006 10:49 PM
Bert
I am the DBA that am attempting to wring the arm of one of our developers to
look into this.  If I knew C++ I would attempt this myself but since I don’t
I am bothering anyone that will listen ;) and I appreciate your imput.

Native mode is much faster than character mode.  BCP runs out of process and
therefore requires inter process data copying and parameter marshaling to
move data across process memory spaces. Inter process data marshaling is the
process of converting parameters of a method call into a stream of bytes and
can add significantly to CPU overhead. Bulk Insert skips these steps and can
go straight to OLE-DB.

Anyhow what good is theory with out a test……Here are my tests and results

I created a sample table

CREATE TABLE [dbo].[t1] (
    [c1] [int] NOT NULL ,
    [c2] [int] NOT NULL ,
    [c3] [datetime] NOT NULL ,
    [c4] [datetime] NOT NULL ,
    [c5] [int] NOT NULL ,
    [c6] [varchar] (48) ,
    [c7] [varchar] (48)
)

This table contains 45,010,015 records for a total size of 3,065,744 KB

I exported the table out twice once in native mode and once in character mode

Native – 2,534,695 KB
Character – 4,314,642 KB

I then bulked in each file using both bcp and BULK INSERT

1.  BULK INSERT t1 FROM 'm:\t1_native.out'
WITH (DATAFILETYPE = 'native',TABLOCK )

2.  BULK INSERT t1 FROM 'm:\t1_char.out'
WITH (DATAFILETYPE = 'char',TABLOCK )

3.  EXEC master..xp_cmdshell
'bcp "BulkInsert.dbo.t1" in "m:\t1_char.out" -c -h"TABLOCK" -U"sa" -P""'

4.  EXEC master..xp_cmdshell
'bcp "BulkInsert.dbo.t1" in "m:\t1_native.out" -n -h"TABLOCK" -U"sa" -P""'


Results – I ran each test 3 times and averaged the results

BULK INSERT native = 103 seconds
BULK INSERT character = 186 seconds
BCP native = 216 seconds
BCP character = 352 seconds


As far as it being worth the effort…I think so. We have some servers that
load hundreds of millions of records in a day.  We have a farm of servers
that currently generate the BCP files this servers could also be responsible
for creating the native format files.  These servers are not in a 1 to 1
relation to the DB servers so if the conversion process slows these servers
down we can just add more.

Erland, I am not sure what you are suggestion to do with a format file,
could you elaborate?  Native files do not use format files….

Thanks
Bert
Author
24 Jan 2006 11:19 PM
Erland Sommarskog
Bert (B***@discussions.microsoft.com) writes:
Show quote
> Results - I ran each test 3 times and averaged the results
>
> BULK INSERT native = 103 seconds
> BULK INSERT character = 186 seconds
> BCP native = 216 seconds
> BCP character = 352 seconds
>
>
> As far as it being worth the effort I think so. We have some servers
> that load hundreds of millions of records in a day.  We have a farm of
> servers that currently generate the BCP files this servers could also be
> responsible for creating the native format files.  These servers are not
> in a 1 to 1 relation to the DB servers so if the conversion process
> slows these servers down we can just add more.

While your benchmark shows that native is faster, is not drastic. But
if you think gain is worthwhile, I guess it is.

Just a word of warning, though: I don't know from what source these files
are generated, but just make sure that the cost for generating binary
files in native format does not outweighs the benefit for BCP.

> Erland, I am not sure what you are suggestion to do with a format file,
> could you elaborate?  Native files do not use format files.

It was a while since I wrote that, so I don't recall exactly what I had
in mind. But I guess the idea was that -n is really a shortcut for a
-f with a format file that specifies a file format equal to the table
defintion.

By the way, have you considered DTS (or SQL Integration Services). I'm
not used neither of them myself, but as I understand DTS/SSIS is
what you are supposed to used when you load tons of data everyday
like you do.

--
Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Author
25 Jan 2006 9:16 PM
Bert
>>Just a word of warning, though: I don't know from what source these files
>>are generated, but just make sure that the cost for generating binary
>>files in native format does not outweighs the benefit for BCP.

My understanding is that the conversion needs to happen anyhow.  Correct me
if I am wrong but my understanding is as follows

BULK INSERT using a native mode file
    File is converted into binary TDS stream on the SQL Server and Loaded

BULK INSERT using a character mode file
File does not need to be converted to binary TDS because that is how the
native file is stored.

Having the ability to offload the conversion to another server also allows
me to offload it to multiple servers.  Even if the load is increased on one
or more server the load will be decreased on the database server and this is
my goal.

I took a look at the documentation for format files, one of the only 
advantages to using this file would be if I only wanted to load a subset of
the columns into the database.  I did some testing on format files in the
past and  under heavy load I would have a small percentage of the loads fail.
Not only did I have failures but the load using the format file was
significantly lower than not using a format file.

Our application currently is on SQL200 so I haven’t had much time to look at
2005.  Hopefully 2005 SSIS has an in Process mechanism to load data such as
the BULK INSERT statement.  I will defiantly talk a look at this.

Thanks

Bert
Author
25 Jan 2006 11:17 PM
Erland Sommarskog
Bert (B***@discussions.microsoft.com) writes:
> Having the ability to offload the conversion to another server also
> allows me to offload it to multiple servers.  Even if the load is
> increased on one or more server the load will be decreased on the
> database server and this is my goal.

Of course, because that's the server you are responsible for. Wonder
what the admins of the other servers think. :-)

>  Not only did I have failures but the load using the format file was
> significantly lower than not using a format file.

I am correct to guess that "lower" is a typo for "slower"?


--
Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Author
25 Jan 2006 11:45 PM
Bert
Haha, no I manage those servers also.  We have a bank of about 20 servers
that except external customers firewall and IDS logs.  For example a customer
might have 100 firewalls that we are monitoring in real time.  The raw log
files come in and the bank of servers convert them to BCP files (currently in
character format)

The goal of our application is to load the data in real time but sometimes
if the customer has for example a worm outbreak the load can increase for
bursts of time.  If it last too long we start to get backed up on the SQL
Server side.

Currently we buy big SQL Server boxes so they can handle these burst in
traffic.  I feel that we could ride these bursts if we added a few more log
processing servers in fro t of the SQL Servers. (We have about 150 SQL
Servers)  Instead of buying so many heavy duty SQL boxes we add a few more
servers to the log processing group.

>>I am correct to guess that "lower" is a typo for "slower"?
Yep much Slower…. I tested this on the June CTP of SQL 2005 as well, it also
was much slower than with no format file.

Do you concur that native format files are actually stored as TDS?

Thanks again
Bert
Author
26 Jan 2006 11:10 PM
Erland Sommarskog
Bert (B***@discussions.microsoft.com) writes:
> Yep much Slower. I tested this on the June CTP of SQL 2005 as well, it
> also was much slower than with no format file.
>
> Do you concur that native format files are actually stored as TDS?

I don't know exactly what is going on, but there may be a fast track
for -n and with a format file, it always runs the conversions although
it may not have to in some cases. Pure speculation on my part.


--
Erland Sommarskog, SQL Server MVP, esq***@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx

AddThis Social Bookmark Button