
Friday, March 30, 2012

Loss of records

Hi, I have the following problem.
My flat file has 250,000 rows, but when I load them into the DB only 249,995 arrive; 5 get lost.

I can't tell where the error occurs; the logging doesn't report any error.
I don't know what could be causing the problem.
Does anyone know what might be happening here?

Help please.

If you execute the package in BIDS you can see how many rows are output from each component. This should make it very easy to see where the rows are being lost.

-Jamie

Monday, March 26, 2012

Looping through rows in flat file

I'd appreciate some guidance on how to solve a problem. I have a flat file of the format

Header1
Header2
Header3
Data Record
Data Record
... etc. ...
Data Record
Trailer1
Trailer2

I want to loop through each data record. I want to take a column value (string) from each row, and call a stored procedure passing in that value. The s/p will return another string which I want to use to update the original column value from the flat file. If I can't update the original column value I'm happy to add a derived column.

Once I've done this "lookup" for all records, I want to re-create the flat file again with the new column values.

I don't need a detailed solution ... just some pointers on how to go about it.

Greg.

On the surface it looks like you should be able to do this (i.e. execute a sproc) using the OLE DB Command transform; however, I don't think this will work because you want the return value from the sproc put into the pipeline, which you cannot do with the OLE DB Command transform.

Can you not use the LOOKUP transform to get the new values? If all the sproc does is look up the input value in a mapping table then the LOOKUP transform will do the job.

If not - this will be a little harder. But let's eliminate (or otherwise) the option of using the LOOKUP transform first.

-Jamie

|||The sproc parses the incoming string to attempt to extract two distinct values from it. Each of these values is a primary key column in a separate table, both of which are needed to obtain a PK value from another table, so it's not a straightforward lookup.

I guess I could do the lookup using custom SQL in the Lookup transform, and re-write the parsing of the input column as a Script Component (I've already done it in C# so it should be simple to port to VB.NET).

Apologies as it's a bit of a muddle, but just trying to get my head around a decent solution.

(I've already done this in BizTalk but am looking to use SSIS instead - it's a pity SSIS support for the Web Service tasks is so poor otherwise it would be simpler).

Thanks,

Greg.

|||Did you think about using a staging table? In this scenario you would load the file to a temp (staging) table 1 to 1, then apply a stored procedure to the temp table that would transfer only the needed rows to the production table...|||

Custom SQL in the LOOKUP is a good way to go. In fact, best practice states that this is the best thing to do because if you simply select a table you will be selecting unused data into your LOOKUP cache - VERY BAD.
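To illustrate the point (dbo.MappingTable and both column names are made up for the example):

```sql
-- Avoid: SELECT * caches every column of the table, bloating the Lookup cache.
-- Prefer: cache only the key you match on and the value you return.
SELECT InputValue, MappedValue
FROM   dbo.MappingTable
```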

You could probably do the parsing in a derived column component - no need to resort to code (unless you really want to of course). If you need to derive some values in order to do the LOOKUP then there is nothing at all wrong with doing this in the pipeline - that's what the pipeline is for.

Anyway... anything that enables you to use the LOOKUP is definitely the way to go, otherwise you'll have to hack around it, and that isn't nice.

Incidentally, I agree that the Web Services Task is basically useless. I am currently working on a project where we are consuming web services in SSIS left right and centre and we haven't used the Web Services Task once - it is all done in script. Donald Farmer's book contains an example of how to do this.

-Jamie

|||I'd like to avoid a staging table here as these files really only "pass" the database on their way somewhere else, doing a lookup on the way.

Thanks Jamie. I'll have a look at derived columns. Last question for the moment is,

If I have a Script Component and define an InputColumn with Usage Type of Read/Write ... how do I write to it? Do I need to do it in script using Row.ColumnName = "whatever value"?

Greg.

|||

Jamie,

you can get values from a sproc back to the pipeline using OUTPUT parameters... That shouldn't be any problem...

However I agree that using lookups (perhaps even several lookups chained behind each other) will be the better solution...

|||

Thomas Pagel wrote:

Jamie,

you can get values from a sproc back to the pipeline using OUTPUT parameters... That shouldn't be any problem...

However I agree that using lookups (perhaps even several lookups chained behind each other) will be the better solution...

Ah, thanks Thomas. Care to post a demo on your blog? :)

|||

GregAbd wrote:


If I have a Script Component and define an InputColumn with Usage Type of Read/Write ... how do I write to it? Do I need to do it in script using Row.ColumnName = "whatever value"?

Yeah, basically. There's loads of demo code about for doing this. Here's some from my own page - you'll find stuff on Kirk's blog, Ash's blog, SQLIS.com amongst others as well I'm sure:

http://blogs.conchango.com/jamiethomson/archive/2005/07/25/1841.aspx

http://blogs.conchango.com/jamiethomson/archive/2005/04/09/1265.aspx
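
For reference, the basic shape is only a couple of lines. This sketch assumes a made-up input column named MyColumn, selected on the component's input with Usage Type ReadWrite:

```vb
' Script Component (SSIS 2005, VB.NET). "MyColumn" is a hypothetical
' column name; a property is generated on the buffer for each column
' you select on the input.
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    ' Row.<ColumnName> maps to the pipeline column of the same name.
    Row.MyColumn = "whatever value"
End Sub
```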

-Jamie

|||

Jamie,

here you are: http://sqljunkies.com/WebLog/tpagel/archive/2006/01/03/17744.aspx

|||

BTW, in SP1, the Web Services task gets dynamic. :)

K

|||Thanks for all the help so far, folks. In my data flow, I've got my flat file coming in. This goes into a Script Component which takes one column (string), parses it (complicated) and sets two variables in the PostExecute method. This then goes into a lookup.

How can I use those variables in the custom SQL window? I know I can build a custom SQL query in a variable as an expression and use that in the OLE DB Command, but this doesn't seem possible in the Lookup

FWIW, my lookup is a three table join along the lines of

SELECT FieldB FROM A, B, C
WHERE A.FieldA = Variable1
AND C.FieldC = Variable2
AND A.PK = B.FK1
AND C.PK = B.FK2

Greg.
|||

Greg,

I guess you haven't quite understood the concept of the Lookup yet. The Lookup takes a query (when the package starts) and caches the result of that query (by default; you can change that...). In your data flow you define one field (or more) from the pipeline to be matched against the same number of fields in the cache. If there is a match, you get the other fields of the found record in return.

So you don't have to pass the variables of each record to the Lookup. What you have to do is store the results of the script not in variables but in new fields in the pipeline. Then you match these fields against the result of the complete query in the Lookup (so the Lookup doesn't hold just one record but all possible records, and you get back only the fields of the matching record).
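Concretely, the Lookup's reference query then returns the match columns alongside the output column. Reshaping the join Greg posted (table and column names as in his sketch, rewritten with ANSI joins), it would look something like this:

```sql
-- Cached once when the package starts. The two derived pipeline
-- fields are matched against A.FieldA and C.FieldC; on a match,
-- B.FieldB is returned to the pipeline.
SELECT A.FieldA,
       C.FieldC,
       B.FieldB
FROM   A
       JOIN B ON B.FK1 = A.PK
       JOIN C ON C.PK = B.FK2
```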

HTH

|||I was just in the process of realising that when you posted Thomas. It's much clearer now.

SSIS has a pretty steep learning curve in terms of understanding what it's capable of, but it's a pretty cool tool to use.

Plenty more questions to come.

Greg.

Wednesday, March 21, 2012

Loop Through Flat Files Based On A Date Range

Hello,

I currently have a For Each File container that loops through all files from a specific directory. The files have a naming convention that looks like this:

CDNSC.CDNSC.SC00015.01012007

The last segment of the file name is the date of the data in the file (mmddyyyy). The create date for these files is always a day later than indicated in the file name.

What I would like to do is to have more control over the 'range' of files that are looped through by using the date portion of the file name to define what group of files should be looped through. Ideally, I would like to have a 'StartDate' variable and an 'EndDate' variable that I could define at run time for the package, and the package would loop through all of the files where the date portion of the file name fell between 'StartDate' and 'EndDate'.

Any ideas on this?

Thank you for your help!

cdun2

One way is to move the files to an archive directory when you're done with them. That way you only pick up the files you need to process. Just throwing that idea out there.

|||Thanks for your response. One thing that I have to be prepared for is that I may have to 'reload' data. The date range for any reloading could be anything.

|||Well, you could (and perhaps should?) stage the data from the flat files in a staging table. Then, once in the staging table, you can keep n number of loads around, all marked by either the load date or a unique load number. After n loads, you can delete the old data. Each row in the staging table would also contain the source file name.

Or, you could simply load a table with the load date and a filename. That way, you've got a table that tells you which files to grab for a given load.
Load  File
1     xyz.txt
1     fhs.txt
2     jfb.txt

The above still assumes that you "archive" the files into a subdirectory or somewhere.|||The data from the flat files will wind up in one or more SQL Server tables. One idea I was considering was to store the file names in a table, and maybe parse the date portion of the file name into a column. Then somehow use that as a lookup table to determine which files would be processed.|||

cdun2 wrote:

The data from the flat files will wind up in one or more SQL Server tables. One idea I was considering was to store the file names in a table, and maybe parse the date portion of the file name into a column. Then somehow use that as a lookup table to determine which files would be processed.

Right, but the point is that your first data flow simply loads the data from the files AS IS into a table. Then in the data flow that you use today, you point it at the staging table instead of the flat file(s).

|||Oh, I see. The data that I need to process is a subset, and will be aggregated for reporting. The concern that I have is that somewhere down the road, the reporting requirements may change, and the users may need another column from the data. I'm thinking that if that happens, I'll need to add the additional column to the subset table, and reload everything.

|||

cdun2 wrote:

Oh, I see. The data that I need to process is a subset, and will be aggregated for reporting. The concern that I have is that somewhere down the road, the reporting requirements may change, and the users may need another column from the data. I'm thinking that if that happens, I'll need to add the additional column to the subset table, and reload everything.

One argument would be that you're going to have to add the column to the flat file source anyway.

Regardless, this is just one way to tackle this problem. There are other ways I'm sure.|||

Actually, the destination table will have just a subset of the columns that are found in the flat files. If a need is found for an additional column of data, it will already be in the flat files. I'll just need to add it to the destination table.

What kind of component could I use in a DataFlow Task to read filenames?

Thanks again.

|||

cdun2 wrote:

Actually, the destination table will have just a subset of the columns that are found in the flat files. If a need is found for an additional column of data, it will already be in the flat files. I'll just need to add it to the destination table.

What kind of component could I use in a DataFlow Task to read filenames?

Thanks again.

If this is the case, then the staging table example still works. In your CONTROL flow, you'd use a foreach loop to spin through the files. Each filename can be put in a variable, which can be used in a data flow inside the foreach loop. Then, using a derived column in the data flow, you'd have access to the variable that contains the current filename.|||

cdun2 wrote:


Hello,

I currently have a For Each File container that loops through all files from a specific directory. The files have a naming convention that looks like this:

CDNSC.CDNSC.SC00015.01012007

The last segment of the file name is the date of the data in the file (mmddyyyy). The create date for these files is always a day later than indicated in the file name.

What I would like to do is to have more control over the 'range' of files that are looped through by using the date portion of the file name to define what group of files should be looped through. Ideally, I would like to have a 'StartDate' variable and an 'EndDate' variable that I could define at run time for the package, and the package would loop through all of the files where the date portion of the file name fell between 'StartDate' and 'EndDate'.

Any ideas on this?

Thank you for your help!

cdun2

Cdun2,

This is an alternative approach to the issue of processing only the files that are within a start/end date range. It uses expressions and precedence constraints:

http://rafael-salas.blogspot.com/2007/02/ssis-loop-through-files-in-date-range.html

BTW,

Just to save extra work down the road, I would recommend importing all the columns of the files from the beginning.
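
The date test that the linked post does with an expression could also be done in a Script Task inside the ForEach loop. A rough sketch follows; the variable names (User::FileName, User::StartDate, User::EndDate, User::InRange) are assumptions for the example, not from the thread:

```vb
' Hypothetical Script Task (SSIS 2005, VB.NET). User::FileName is
' mapped from the ForEach File enumerator; the other variables are
' package variables you would add yourself.
Imports System.Globalization

Public Sub Main()
    Dim fileName As String = CStr(Dts.Variables("User::FileName").Value)

    ' The last dot-separated segment holds the date, e.g. "01012007".
    Dim segment As String = fileName.Substring(fileName.LastIndexOf(".") + 1)
    Dim fileDate As Date = _
        Date.ParseExact(segment, "MMddyyyy", CultureInfo.InvariantCulture)

    Dim startDate As Date = CDate(Dts.Variables("User::StartDate").Value)
    Dim endDate As Date = CDate(Dts.Variables("User::EndDate").Value)

    ' Flag whether the current file falls inside the requested range.
    Dts.Variables("User::InRange").Value = _
        (fileDate >= startDate AndAlso fileDate <= endDate)

    Dts.TaskResult = Dts.Results.Success
End Sub
```

On the precedence constraint from the Script Task to the Data Flow Task, you would then set the evaluation operation to "Expression and Constraint" with the expression @[User::InRange], so the Data Flow only runs for files inside the range.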

|||Thanks again for your input. I'll dig a little deeper into this next week. As far as bringing in all of the columns of data, the attempt is to aggregate so that we can meet a reporting need and reduce the number of rows of data that we need to store. It's a resource/performance issue for us. I can't bring in any more than I need.|||

cdun2 wrote:

I can't bring in any more than I need.

But you can. To Rafael's point, you can at least have ALL of the flat file columns mapped in the connection manager. You don't have to do anything with them in the data flow, but at least they are there if you need to pick them up later; you won't have to redefine the connection manager.|||

Phil Brammer wrote:

cdun2 wrote:

I can't bring in any more than I need.

But you can. To Rafael's point, you can at least have ALL of the flat file columns mapped in the connection manager. You don't have to do anything with them in the data flow, but at least they are there if you need to pick them up later; you won't have to redefine the connection manager.

Yes, that makes sense. Thanks again.

|||

I took a look at the example here: http://rafael-salas.blogspot.com/2007/02/ssis-loop-through-files-in-date-range.html

Thank you for providing this example. I am having a difficult time making the bridge between the precedence constraint and the Data Flow Task where the content of the qualifying flat files is inserted into the SQL Server table.