Friday, March 30, 2012

Losing rows from file to destination table - need troubleshooting help

I am losing data from time to time and cannot figure out where the rows are vanishing to. In essence, I have 5 files that I process daily via one SQL Agent job that calls 5 individual packages, each doing a different piece of work.
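For what it's worth, the first thing I check is that every step of the job actually ran; a sketch of the query I use against msdb is below (the job name is a placeholder for our real one):

    -- Confirm every step of the daily job ran and succeeded.
    -- 'Daily File Load' is a placeholder job name.
    SELECT  h.step_name,
            h.run_date,
            h.run_status      -- 0 = failed, 1 = succeeded, 2 = retry, 3 = canceled
    FROM    msdb.dbo.sysjobs       AS j
    JOIN    msdb.dbo.sysjobhistory AS h ON h.job_id = j.job_id
    WHERE   j.name = 'Daily File Load'
      AND   h.step_id > 0          -- step 0 is the overall job outcome row
    ORDER BY h.run_date DESC, h.run_time DESC, h.step_id;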

I stage each file to a temp table, perform some minor transformations, then union the rowsets from each file before landing them in another temporary table. A subsequent package performs additional transformations that apply to the entire dataset before inserting the rows into their final destination. Along the way, I reject some records based on the business rules applied. Every package in the job is logging (OnError, OnTaskFailed, OnPreExecute and OnPostExecute). There are no errors being generated, and no rows are being rejected to my reject tables either.
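To rule out rows being silently diverted, I also reconcile row counts at each landing point after every run; a rough sketch is below, with placeholder names standing in for my real staging, reject and destination tables:

    -- Reconcile counts at each landing point; a mismatch narrows down
    -- which package is losing rows. Table names are placeholders.
    SELECT 'staged'   AS stage, COUNT(*) AS row_cnt FROM dbo.StageUnioned
    UNION ALL
    SELECT 'rejected',          COUNT(*)            FROM dbo.RejectRows
    UNION ALL
    SELECT 'final',             COUNT(*)            FROM dbo.FinalDestination;
    -- Expect staged = rejected + final; anything else means rows vanished in between.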

Without getting into the specific transforms being used in this complex process, has anyone seen similar unexplained behaviour? I've not been able to identify any pattern, except that it is usually only 1 or 2 of the record types (specific to a source file) that will ever fail to load. There are no patterns around volumes for specific source files. There are some lookups and dedupes; however, I've seen the records 'drop off' before reaching those transforms. I did notice that my final destination load was not using fast load, yet sometimes the records disappear before even reaching my final staging table, which does insert using fast load. I am going to turn on logging of the OnPipelineRowsSent event. Any other suggestions for troubleshooting/tracking down these disappearing records?
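Once that event is enabled (and assuming the SQL Server log provider is in use), this is roughly how I plan to read the counts back out; on SQL 2005 the log table is dbo.sysdtslog90, on later versions dbo.sysssislog:

    -- Pull the rows-sent trail for the most recent execution so the count
    -- at every component is visible, and the point where it drops stands out.
    SELECT  starttime,
            source,       -- the data flow task that logged the event
            message       -- includes the path/component names and the row count
    FROM    dbo.sysdtslog90
    WHERE   event = 'OnPipelineRowsSent'
      AND   executionid = (SELECT TOP 1 executionid
                           FROM dbo.sysdtslog90
                           ORDER BY starttime DESC)
    ORDER BY starttime;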

Thanks

Couple more clarifications:

I have run the same files through manually in debug mode and found that I can watch all the rows make it through the entire process.

We have seen strange behaviour when running packages as scheduled jobs through SQL Agent.

The process utilizes unions, which seem a bit clunky.

|||

Joe,

We have seen similar problems on my current project, so yesterday we turned on OnPipelineRowsSent logging.

Another thing we have done is output the data from each transform component to a file for later examination. The MULTICAST transform is invaluable for this.
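A crude but effective follow-up is to bulk load each tap file into a scratch table and count it, then compare the counts between tap points; the file path below is a placeholder, and it assumes the data contains no tab characters (the default field terminator):

    -- Count the rows in one multicast tap file; repeat per tap point and
    -- compare to find the transform where rows go missing.
    -- Placeholder path; assumes no tabs, so each whole line lands in one column.
    CREATE TABLE #tap (raw_line varchar(8000));

    BULK INSERT #tap
    FROM 'C:\taps\after_union.txt'
    WITH (ROWTERMINATOR = '\n');

    SELECT COUNT(*) AS rows_in_tap FROM #tap;
    DROP TABLE #tap;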

As yet we haven't found out what is going on. It's strange.

-Jamie

|||

I'm not sure I would use the adjective strange, but...

In trying to troubleshoot this process, I first changed the union transform that was taking 6 input streams and busted it out into 5 individual waterfalling unions, each with 2 input streams. No change in behaviour.

I then changed the package that moves this data by adding multicasts to output to a file after every transform along the way, up to the final destination after the 5 unions. Just adding the multicasts into the flow has resulted in no rows vanishing from the daily loads for the past week. Unfortunately, I don't have time to troubleshoot further, but I think this demonstrates that there is indeed a serious bug here. I still suspect it has to do with the union transform.

I am quite scared for anyone else's shop that has decided to standardize ETL on this tool, as we have. As developers, we only have time to test our own code, not to verify that the native tool functionality is behaving as expected. Having to monitor on a regular basis that it is performing properly is not acceptable.

Hoping this problem magically went away with SP1....

JH
