Home All Groups Group Topic Archive Search About

How can I do this without a cursor??

Author
1 Sep 2005 8:43 PM
Mike
I have a slow-running stored procedure that imports several million
rows of data from a staging table into a data warehouse table. The
slowness, I know, is because the SP uses a cursor to process these n
million rows, and the reason for the cursor is simply because I can't
figure out any other way to do it.

Basically, we are using IDENTITY columns to generate our primary keys,
so somehow we need to capture each individual identity value as we
insert into a parent table so we can then use it in the next insert
statement on the child table to form the relationship between the two.
Surely this is a common problem.

Any suggestions on how best to get data from the un-normalized staging
table into the normalized warehouse tables would be greatly
appreciated. See DDL statements below; I did not include my SP code
because it is ugly, and as I said, uses a cursor, which I want to
eliminate. Thanks!

-- The relevant tables in my warehouse are structured similar to the
following:

CREATE TABLE Person
(
    Person_ID INT IDENTITY PRIMARY KEY
    ,First_Name VARCHAR(25)
    ,Last_Name VARCHAR(25)
--    ,additional person-specific columns
)

CREATE TABLE Address
(
    Address_ID INT IDENTITY PRIMARY KEY
    ,Street_Address VARCHAR(100)
    ,ZIP_Code CHAR(10) --REFERENCES ZIP_Code (ZIP_Code)
)

CREATE TABLE Person_Address
(
    Person_ID INT REFERENCES Person (Person_ID)
    ,Address_ID INT REFERENCES Address (Address_ID)
    ,Address_Label VARCHAR(25)
    ,CONSTRAINT PK_Person_Address PRIMARY KEY (Person_ID, Address_ID,
Address_Label)
)

CREATE TABLE Customer
(
    Customer_ID INT IDENTITY PRIMARY KEY
    ,Person_ID INT REFERENCES Person (Person_ID)
)

CREATE TABLE Purchase
(
    Purchase_ID INT IDENTITY PRIMARY KEY
    ,Customer_ID INT REFERENCES Customer (Customer_ID)
--    ,additional puchase-specific columns
)

-- My staging table looks like this:

CREATE TABLE Extract1
(
    First_Name VARCHAR(100)
    ,Last_Name VARCHAR(100)
    ,Address_Label VARCHAR(100)
    ,Street_Address VARCHAR(100)
    ,ZIP_Code VARCHAR(100)
--    ,additional purchase-specific extract columns
)

-- Some test data:

INSERT INTO Extract1 VALUES ('Allen', 'Anderson', 'HOME', '123 Main
Street', '12345')
INSERT INTO Extract1 VALUES ('Barbara', 'Brown', 'HOME', '456 Water
Street', '98765')
INSERT INTO Extract1 VALUES ('Carl', 'Carter', 'OFFICE', '789 Church
Street', '11223')
INSERT INTO Extract1 VALUES ('Diana', 'Dickens', 'HOME', '321 Broadway
Avenue', '44556')
INSERT INTO Extract1 VALUES ('Ed', 'Ellison', 'WORK', '999 Elm Street',
'99113')

Author
1 Sep 2005 9:15 PM
David Portas
It's easy if you have the right constraints in place. IDENTITY should never
be the only key of any table (except maybe a staging table).

ALTER TABLE person ALTER COLUMN first_name VARCHAR(25) NOT NULL
ALTER TABLE person ALTER COLUMN last_name VARCHAR(25) NOT NULL
ALTER TABLE person ADD CONSTRAINT ak_person UNIQUE (first_name, last_name)

ALTER TABLE address ALTER COLUMN street_address VARCHAR(100) NOT NULL
ALTER TABLE address ALTER COLUMN zip_code CHAR(10) NOT NULL
ALTER TABLE address ADD CONSTRAINT ak_address UNIQUE (street_address,
zip_code)

INSERT INTO Person (first_name, last_name)
SELECT DISTINCT first_name, last_name
  FROM Extract1

INSERT INTO Address (street_address, zip_code)
SELECT DISTINCT street_address, zip_code
  FROM Extract1

INSERT INTO Person_Address (person_id, address_id, address_label)
SELECT DISTINCT P.person_id, A.address_id, E.address_label
FROM Extract1 AS E
JOIN Person AS P
  ON E.first_name = P.first_name
   AND E.last_name = P.last_name
JOIN Address AS A
  ON E.street_address = A.street_address
   AND E.zip_code = A.zip_code

--
David Portas
SQL Server MVP
--
Author
1 Sep 2005 9:33 PM
--CELKO--
The IDENTITY means that you have forced a  sequence on the insertion of
rows to the tables.  People building a Data Warehouse usually pick keys
that can be inserted in parallel so they would never use IDENTITY .

Next, a DW usually has a strong temporal component wqhich I do not see
in the reduced schema.

Finally, according to the Forester Group, you should not be using SQL
Server.  Can you get Red Brick, Teradata, Sand, or Model 204?  These
are SQL products meant for DW apps.
Author
2 Sep 2005 9:47 AM
David Portas
> Finally, according to the Forester Group, you should not be using SQL
> Server.  Can you get Red Brick, Teradata, Sand, or Model 204?  These
> are SQL products meant for DW apps.

I assume you are referring to this report:
http://www.forrester.com/Research/Document/Excerpt/0,7211,36529,00.html

It's out of date. The report claimed that the 2005 beta hadn't
delivered evidence of performance and scalability benefits and pointed
to the lack of a TCP-C benchmark. Since then the TPC-C results have
been published and SQL Server 2005 is at Number 4:
http://www.tpc.org/tpcc/results/tpcc_perf_results.asp

The TPC-C is a benchmark for OLTP-type performance but 2005 also has
TPC-H benchmarks and these are more relevant to DW. In my opinion
however, it's less useful to draw firm conclusions from generic
benchmarks in a decision-support setting. The user's actual experience
will depend on requirements, complexity and choice of middleware and
other components - things that vary so much between different
installations.

--
David Portas
SQL Server MVP
--
Author
2 Sep 2005 1:34 PM
AK
1. add an identity to the staging table
CREATE TABLE Extract1
(
Extract1_Id ...
        First_Name VARCHAR(100)
        ,Last_Name VARCHAR(100)
        ,Address_Label VARCHAR(100)
        ,Street_Address VARCHAR(100)
        ,ZIP_Code VARCHAR(100)
--      ,additional purchase-specific extract columns
)

then you can either add Extract1_Id to Address, Person, or use one more
layer of staging tables like

CREATE TABLE #Person
(
        Person_ID INT IDENTITY PRIMARY KEY
,Extract1_Id int not null
        ,First_Name VARCHAR(25)
        ,Last_Name VARCHAR(25)
--      ,additional person-specific columns
)

AddThis Social Bookmark Button