|
database
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
How can I do this without a cursor??rows of data from a staging table into a data warehouse table. The slowness, I know, is because the SP uses a cursor to process these n million rows, and the reason for the cursor is simply because I can't figure out any other way to do it. Basically, we are using IDENTITY columns to generate our primary keys, so somehow we need to capture each individual identity value as we insert into a parent table so we can then use it in the next insert statement on the child table to form the relationship between the two. Surely this is a common problem. Any suggestions on how best to get data from the un-normalized staging table into the normalized warehouse tables would be greatly appreciated. See DDL statements below; I did not include my SP code because it is ugly, and as I said, uses a cursor, which I want to eliminate. Thanks! -- The relevant tables in my warehouse are structured similar to the following: CREATE TABLE Person ( Person_ID INT IDENTITY PRIMARY KEY ,First_Name VARCHAR(25) ,Last_Name VARCHAR(25) -- ,additional person-specific columns ) CREATE TABLE Address ( Address_ID INT IDENTITY PRIMARY KEY ,Street_Address VARCHAR(100) ,ZIP_Code CHAR(10) --REFERENCES ZIP_Code (ZIP_Code) ) CREATE TABLE Person_Address ( Person_ID INT REFERENCES Person (Person_ID) ,Address_ID INT REFERENCES Address (Address_ID) ,Address_Label VARCHAR(25) ,CONSTRAINT PK_Person_Address PRIMARY KEY (Person_ID, Address_ID, Address_Label) ) CREATE TABLE Customer ( Customer_ID INT IDENTITY PRIMARY KEY ,Person_ID INT REFERENCES Person (Person_ID) ) CREATE TABLE Purchase ( Purchase_ID INT IDENTITY PRIMARY KEY ,Customer_ID INT REFERENCES Customer (Customer_ID) -- ,additional puchase-specific columns ) -- My staging table looks like this: CREATE TABLE Extract1 ( First_Name VARCHAR(100) ,Last_Name VARCHAR(100) ,Address_Label VARCHAR(100) ,Street_Address VARCHAR(100) ,ZIP_Code VARCHAR(100) -- ,additional purchase-specific extract columns ) -- Some test data: INSERT INTO Extract1 VALUES ('Allen', 'Anderson', 'HOME', '123 Main Street', '12345') INSERT INTO Extract1 VALUES ('Barbara', 'Brown', 'HOME', '456 Water Street', '98765') INSERT INTO Extract1 VALUES ('Carl', 'Carter', 'OFFICE', '789 Church Street', '11223') INSERT INTO Extract1 VALUES ('Diana', 'Dickens', 'HOME', '321 Broadway Avenue', '44556') INSERT INTO Extract1 VALUES ('Ed', 'Ellison', 'WORK', '999 Elm Street', '99113') It's easy if you have the right constraints in place. IDENTITY should never
be the only key of any table (except maybe a staging table). ALTER TABLE person ALTER COLUMN first_name VARCHAR(25) NOT NULL ALTER TABLE person ALTER COLUMN last_name VARCHAR(25) NOT NULL ALTER TABLE person ADD CONSTRAINT ak_person UNIQUE (first_name, last_name) ALTER TABLE address ALTER COLUMN street_address VARCHAR(100) NOT NULL ALTER TABLE address ALTER COLUMN zip_code CHAR(10) NOT NULL ALTER TABLE address ADD CONSTRAINT ak_address UNIQUE (street_address, zip_code) INSERT INTO Person (first_name, last_name) SELECT DISTINCT first_name, last_name FROM Extract1 INSERT INTO Address (street_address, zip_code) SELECT DISTINCT street_address, zip_code FROM Extract1 INSERT INTO Person_Address (person_id, address_id, address_label) SELECT DISTINCT P.person_id, A.address_id, E.address_label FROM Extract1 AS E JOIN Person AS P ON E.first_name = P.first_name AND E.last_name = P.last_name JOIN Address AS A ON E.street_address = A.street_address AND E.zip_code = A.zip_code -- David Portas SQL Server MVP -- The IDENTITY means that you have forced a sequence on the insertion of
rows to the tables. People building a Data Warehouse usually pick keys that can be inserted in parallel so they would never use IDENTITY . Next, a DW usually has a strong temporal component wqhich I do not see in the reduced schema. Finally, according to the Forester Group, you should not be using SQL Server. Can you get Red Brick, Teradata, Sand, or Model 204? These are SQL products meant for DW apps. > Finally, according to the Forester Group, you should not be using SQL I assume you are referring to this report:> Server. Can you get Red Brick, Teradata, Sand, or Model 204? These > are SQL products meant for DW apps. http://www.forrester.com/Research/Document/Excerpt/0,7211,36529,00.html It's out of date. The report claimed that the 2005 beta hadn't delivered evidence of performance and scalability benefits and pointed to the lack of a TCP-C benchmark. Since then the TPC-C results have been published and SQL Server 2005 is at Number 4: http://www.tpc.org/tpcc/results/tpcc_perf_results.asp The TPC-C is a benchmark for OLTP-type performance but 2005 also has TPC-H benchmarks and these are more relevant to DW. In my opinion however, it's less useful to draw firm conclusions from generic benchmarks in a decision-support setting. The user's actual experience will depend on requirements, complexity and choice of middleware and other components - things that vary so much between different installations. -- David Portas SQL Server MVP -- 1. add an identity to the staging table
CREATE TABLE Extract1 ( Extract1_Id ... First_Name VARCHAR(100) ,Last_Name VARCHAR(100) ,Address_Label VARCHAR(100) ,Street_Address VARCHAR(100) ,ZIP_Code VARCHAR(100) -- ,additional purchase-specific extract columns ) then you can either add Extract1_Id to Address, Person, or use one more layer of staging tables like CREATE TABLE #Person ( Person_ID INT IDENTITY PRIMARY KEY ,Extract1_Id int not null ,First_Name VARCHAR(25) ,Last_Name VARCHAR(25) -- ,additional person-specific columns ) |
|||||||||||||||||||||||