Monday, March 19, 2012

Lookup to check values in 13 columns with the same values

Hi,

This is my first post on this forum, so please be patient if I miss any important bit of information.

I am transferring data from a legacy system into MSSQL 2005 using SSIS.

Among the columns of one dataset there are 13, all necessary for operational reasons, for which I need to ensure data consistency.

I believe I could do this check using the Lookup data flow component, but surely there must be a more streamlined way to do it.

Since the column names contain numbers to distinguish them (e.g. col01, col02, col03 .. col13), I thought I could place the Lookup within a loop and use a couple of variables to do the trick, but since I have not done this before, I am asking for guidance from a guru among you folks.

Please let me know if further clarification is necessary.

Regards,

Nicolas

If you are trying to confirm that the 13 values for each row in the source match the corresponding row in the destination, then a lookup is the way to go. A single Lookup with all 13 columns mapped makes sense. Trying to do something dynamic is not going to work, and even if it did, it would mean 13 individual lookup calls, so probably a lot slower due to the extra overhead.

Another method when you have lots of columns to compare is to use a hash or checksum across the columns. You can compute this checksum in the pipeline on the incoming source rows and compare it against a checksum stored in the destination. The lookup then involves only one column, so for wide rows in large volumes it will be faster due to the reduced data being transferred, and it is easier to select columns for checksum generation than to map each one in the Lookup. It does mean that your destination needs to be expanded to store the checksum.
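For illustration, here is a rough T-SQL sketch of the checksum idea; the names (dbo.Destination, dbo.StagingSource, RowChecksum, BusinessKey) are made up for the example, and in the pipeline you would compute the same value with a checksum transform or a Script Component:

    -- Hypothetical names throughout. Store a checksum over the 13 columns
    -- in the destination so the comparison needs only one column.
    ALTER TABLE dbo.Destination ADD RowChecksum AS
        CHECKSUM(col01, col02, col03, col04, col05, col06, col07,
                 col08, col09, col10, col11, col12, col13) PERSISTED;

    -- Compute the same checksum over the incoming rows; matching it against
    -- RowChecksum replaces mapping all 13 columns in the Lookup.
    SELECT s.BusinessKey,
           CHECKSUM(s.col01, s.col02, s.col03, s.col04, s.col05, s.col06,
                    s.col07, s.col08, s.col09, s.col10, s.col11, s.col12,
                    s.col13) AS SourceChecksum
    FROM dbo.StagingSource AS s;

Bear in mind that CHECKSUM can collide; HASHBYTES over the concatenated values is the stricter option if that matters to you.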

A checksum transform is available here (http://www.sqlis.com/default.aspx?21), or you can implement your own in a Script Component.

|||
Darren,

I understand your point about looking up 13 columns being slow, but I am not certain I understand how to hash all those columns to compare their values.

For clarification, I need to map, or rather ensure, that the values in all 13 columns comply with a foreign key constraint; that is, I am ensuring that only values listed in my reference table exist in those columns.

Would this scenario allow me to use a hash for comparing values?

Many thanks.
|||

I may have misunderstood this. If the 13 columns together are the key for one table, then the hash concept will work; the hash serves as a compound key for those 13 columns. If they reference 13 different tables, then I misread you: you would use 13 Lookups, one to check each table, which is probably what you meant the first time around. There is no shortcut for those 13 lookups if that is what you require.

I digress a bit, but a good strategy can be to just load the data and let it fail on the FK violation. Set the destination component to redirect any failed rows to the error output. You then only need to do your lookups, or whatever fix-up work is required, on rows with missing reference data. This can be much faster than doing lookups for every row: if 100 rows are loaded but only 1 has a new key, the error output gets 1 row, which means 1 lookup (fix-up work) as opposed to 100.
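As a rough sketch of the fix-up step, assuming made-up names (dbo.FailedRows for the redirected rows, dbo.RefCodes for the reference table), a single UNPIVOT query can report exactly which of the 13 columns hold missing codes:

    -- Hypothetical names. UNPIVOT turns the 13 code columns into rows so one
    -- anti-join against the reference table finds every missing code.
    -- (UNPIVOT needs the 13 columns to share a data type, and it skips NULLs.)
    SELECT u.BusinessKey, u.CodeColumn, u.CodeValue
    FROM dbo.FailedRows
    UNPIVOT (CodeValue FOR CodeColumn IN
             (col01, col02, col03, col04, col05, col06, col07,
              col08, col09, col10, col11, col12, col13)) AS u
    WHERE NOT EXISTS (SELECT 1 FROM dbo.RefCodes AS r WHERE r.Code = u.CodeValue);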

|||

Hi Darren,

In the example you have assumed that the base tables already contain the foreign key column values. What if we need to look up a secondary table, obtain the primary key from the secondary table, and populate that key in the base table column?

Is it beneficial to handle these relations with SQL JOIN statements while fetching records from the source tables, rather than doing the lookup with Lookup components in the SSIS pipeline (which I believe work at the record level, meaning a bulk lookup is impossible)?
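For instance (with made-up names; dbo.Source holds the business codes and dbo.SecondaryTable maps each code to its primary key), resolving the keys at extraction time might look like this:

    -- Hypothetical names. One join per column resolves the business code
    -- to its primary key before the row ever enters the pipeline.
    SELECT s.BusinessKey,
           k1.PrimaryKeyId AS col01,
           k2.PrimaryKeyId AS col02  -- ... repeat the pattern for col03 .. col13
    FROM dbo.Source AS s
    LEFT JOIN dbo.SecondaryTable AS k1 ON k1.BusinessCode = s.col01
    LEFT JOIN dbo.SecondaryTable AS k2 ON k2.BusinessCode = s.col02;
    -- LEFT JOINs return NULL where no key exists, so unresolved rows can be
    -- filtered out or redirected for fix-up instead of failing the extract.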

Please share your comments.

Thanks,
Loonysan

|||
OK,

Since I have a similar situation with 4 other datasets, I would like to probe this idea further, as well as make sure I clarify it so we understand the situation properly.

The source dataset that I extracted from our legacy data source has these 13 columns (col01 .. col13) among many others.

On our destination system, I created a table (a lookup table) which will enable me to map and transform legacy codes into the relevant new codes in our destination table.

The dataset is intended to be inserted into a table that has 13 foreign keys (one for each of those 13 columns) referencing another table (sketched below), thus ensuring that only valid new codes can appear in any of those 13 columns.
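To make the target structure concrete, here is a trimmed-down sketch of what I mean (names and types are made up):

    -- Hypothetical names, trimmed to 2 of the 13 columns. Each code column
    -- carries its own FK back to the same reference table.
    CREATE TABLE dbo.NewCodes (Code INT NOT NULL PRIMARY KEY);

    CREATE TABLE dbo.Destination (
        BusinessKey INT NOT NULL PRIMARY KEY,
        col01 INT NOT NULL REFERENCES dbo.NewCodes (Code),
        col02 INT NOT NULL REFERENCES dbo.NewCodes (Code)
        -- ... col03 through col13 follow the same pattern
    );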

Having said that, do you believe I should join against the lookup table, try the insert into the destination table, and catch any errors?

Many thanks.
