
Friday, March 30, 2012

Losing rows from file to destination table - need troubleshooting help

I am losing data from time to time and cannot figure out where the rows are vanishing to. In essence, I have 5 files that I process. The processing is done daily by one SQLAgent job that calls 5 individual packages, each doing a different piece of work.

I stage each file to a temp table, then perform some minor transformations, then union the rowsets from each file before landing in another temporary table. A subsequent package performs additional transformations that apply to the entire dataset before inserting the rows into their final destination. Along the way, I reject some records based on the business rules applied. Each package in the entire job has logging enabled (OnError, OnTaskFailed, OnPreExecute, and OnPostExecute). There are no errors being generated. No rows are being rejected to my reject tables either.

Without getting into the specific transforms, etc. being used in this complex process, has anyone seen similar unexplained behaviour? I've not been able to identify any pattern, except that it is usually only 1 or 2 of the record types (specific to a source file) that ever fail to load. No patterns around volumes for specific source files. There are some lookups and dedupes; however, I've seen the records 'drop off' before reaching these transforms. I have noticed that my final destination load was not using fast load. However, sometimes the records disappear before even getting to my final staging table, which is inserted using fast load. I am going to turn on logging of the OnPipelineRowsSent event. Any other suggestions for troubleshooting/tracking down these disappearing records?

Thanks

Couple more clarifications:

I have run the same files through manually in debug mode and can watch all the rows flow through the entire process.

We have seen strange behaviour in running packages as scheduled jobs through SQLAgent.

The process utilizes unions, which seem a bit clunky.
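Once OnPipelineRowsSent logging is enabled, something like the following could be used to compare per-path row counts between runs. This is a minimal sketch assuming the SQL Server log provider is writing to its default table (dbo.sysdtslog90 on SQL Server 2005; dbo.sysssislog on 2008 and later); the message column records the path, component, and number of rows sent for each buffer.

-- Read back the row counts logged by the data flow for today's executions
SELECT  starttime,
        source,        -- the data flow task that produced the event
        executionid,   -- one GUID per package execution
        message        -- contains the path, component, and rows sent
FROM    dbo.sysdtslog90
WHERE   event = 'OnPipelineRowsSent'
ORDER BY starttime, source;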

|||

Joe,

We have seen similar problems on my current project, so yesterday we turned on OnPipelineRowsSent logging.

Another thing we have done is output the data from each transform component to a file for later examination. The MULTICAST transform is invaluable for this.

As yet we haven't found out what is going on. It's strange.

-Jamie

|||

I'm not sure I would use the adjective strange, but..

In trying to troubleshoot this process, I first changed the union transform that was taking 6 input streams and broke it out into 5 individual waterfalling unions, each with 2 input streams. No change in behaviour.

I then changed the package that moves this data by adding in multicasts to output to a file after every transform along the way up to the final destination, after the 5 unions. Just adding the multicasts into the flow has resulted in no rows vanishing from the daily loads for the past week. Unfortunately, I don't have time to troubleshoot further, but I think this demonstrates that there is indeed a serious bug here. I still suspect it has to do with the union transform. I am quite scared for anyone else's shop that has decided to standardize ETL on this tool, as we have. As developers, we only have time to test our code, not to test that the native tool functionality is behaving as expected. In addition, having to monitor on a regular basis that it is performing properly is not acceptable.

Hoping this problem magically went away with SP1....

JH

Monday, March 26, 2012

looping record by record and processing

I have 2 tables that are identical. What I need to do is, when I make an update on one table, the other table needs to reflect that same change. So let's say I insert into table A: I need to look in table B; if the record doesn't exist, insert it, and if it does, update it. And if a record exists in table B but doesn't exist in table A, delete it. I have to do this record by record. Can anyone help me with this?

|||

Use a trigger...

You'll probably have to do an initial clean up first...

|||

Please Brett, can you show me how to do this in code? I have figured out I will need a trigger, but I have been trying all kinds of stuff and not getting anywhere code-wise...
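A minimal sketch of the trigger approach Brett suggests, assuming both tables share a key column Id plus one data column Val (the table and column names here are illustrative, not from the original post):

-- Keep dbo.TableB in sync with dbo.TableA (assumed/illustrative names)
CREATE TRIGGER dbo.trg_TableA_Sync
ON dbo.TableA
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Update rows in TableB that already exist for the inserted/updated keys
    UPDATE b
    SET    b.Val = i.Val
    FROM   dbo.TableB AS b
    JOIN   inserted   AS i ON i.Id = b.Id;

    -- Insert rows that are new to TableB
    INSERT INTO dbo.TableB (Id, Val)
    SELECT i.Id, i.Val
    FROM   inserted AS i
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.TableB AS b WHERE b.Id = i.Id);

    -- Delete rows from TableB whose source row was removed from TableA
    DELETE b
    FROM   dbo.TableB AS b
    JOIN   deleted    AS d ON d.Id = b.Id
    WHERE  NOT EXISTS (SELECT 1 FROM inserted AS i WHERE i.Id = d.Id);
END;
GO

For the initial clean up Brett mentions, a one-time insert/update/delete pass (or a comparison query) would bring TableB in line before the trigger takes over.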

Wednesday, March 7, 2012

Looking for suggestions

I have two stored procedures (I'll call them P1 & P2). P1, after a lot of processing, creates a temporary table that is used by P2 after an "exec P1" is done. I've separated the logic into two stored procedures because, ultimately, other sprocs will need the output of P1.

I get an error if I use #tempTable as the output table in P1 because it no longer exists after P1 finishes. ##tempTable works, but I'm concerned about concurrency issues. Any suggestions on what construct(s) I should be using?

Thanks in advance!

A global temporary table (GTT) (##tempTable) might not be what you need. Here are the downsides of using a GTT:

1. You cannot use it to cache data for a long period of time, say 1 day. If all references to the GTT are dropped, the data in it is lost forever.

2. A GTT is visible to every stored procedure (and session) using it, hence the concurrency issue. This won't cause you any problem if the data in it is read-only, meaning you process it only once for the entire day. If you'll be changing the contents of the GTT from time to time, this will cause you headaches (see the illustration below).
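A quick illustration of the scope difference (the object names here are made up for the example):

-- A local temp table (#) created inside a procedure is dropped as soon as
-- the procedure returns, which is why P2 cannot see it:
CREATE PROCEDURE dbo.P1_Local
AS
BEGIN
    SELECT 1 AS Id, 'x' AS Val INTO #tempTable;  -- gone when P1_Local ends
END;
GO

-- A global temp table (##) lives until the last session referencing it
-- disconnects, and every session can read or modify it -- the concurrency risk:
CREATE PROCEDURE dbo.P1_Global
AS
BEGIN
    IF OBJECT_ID('tempdb..##tempTable') IS NOT NULL
        DROP TABLE ##tempTable;
    SELECT 1 AS Id, 'x' AS Val INTO ##tempTable; -- visible to all other sessions
END;
GO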

Two recommendations:

1. If the result set of P1 has to live for at least 1 day and there will be no processing required during that span of time, I would recommend a physical table instead.

2. If the result set of P1 is dynamic and P1 needs to be reused from time to time, I recommend that P1 be transformed into a table-valued function (a sketch follows below).
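A minimal sketch of that second recommendation, assuming P1's output can be reduced to a result set with a couple of columns (the column and source names are illustrative):

-- P1 rewritten as a multi-statement table-valued function (illustrative names)
CREATE FUNCTION dbo.fn_P1_Results ()
RETURNS @Results TABLE
(
    Id  INT         NOT NULL,
    Val VARCHAR(50) NULL
)
AS
BEGIN
    -- The "lot of processing" from P1 goes here, populating @Results
    INSERT INTO @Results (Id, Val)
    SELECT Id, Val
    FROM   dbo.SourceTable;   -- stands in for P1's real logic

    RETURN;
END;
GO

-- P2 (or any other sproc) can then consume the result set directly:
-- SELECT * FROM dbo.fn_P1_Results();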

|||

I totally agree. In fact, I would probably suggest (based on the phrase "lot of processing") that you create a couple of permanent tables: one to hold the results, and the other to keep either:

1. The users that you expect to read the data. As each user reads the data, delete their row; when the last reader finishes, delete the cached results.

2. An amount of time before the results are invalidated (sketched below). For example, if you run it daily, have it invalidated at midnight; users of the data would then do their select, and if no data was there or it was past its date, the process would rebuild the cache, and if the data was there, it would just be fetched.

If the "lot of processing" is not really a lot, then you might use a table-valued function, but there are limitations, such as no side effects, if any data is being written in the process.
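A minimal sketch of the time-based version (option 2), assuming a permanent cache table and a daily expiry; all object names here are illustrative:

-- Permanent results table with an expiry marker (illustrative names)
CREATE TABLE dbo.P1_ResultsCache
(
    Id         INT         NOT NULL,
    Val        VARCHAR(50) NULL,
    ValidUntil DATETIME    NOT NULL
);
GO

CREATE PROCEDURE dbo.GetP1Results
AS
BEGIN
    SET NOCOUNT ON;

    -- Rebuild the cache if it is empty or past its expiry time
    IF NOT EXISTS (SELECT 1 FROM dbo.P1_ResultsCache WHERE ValidUntil > GETDATE())
    BEGIN
        DELETE FROM dbo.P1_ResultsCache;

        INSERT INTO dbo.P1_ResultsCache (Id, Val, ValidUntil)
        SELECT Id, Val,
               DATEADD(dd, DATEDIFF(dd, 0, GETDATE()) + 1, 0)  -- next midnight
        FROM   dbo.SourceTable;   -- stands in for P1's heavy processing
    END;

    SELECT Id, Val FROM dbo.P1_ResultsCache;
END;
GO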