Showing posts with label range. Show all posts

Wednesday, March 21, 2012

Loop Through Flat Files Based On A Date Range

Hello,

I currently have a For Each File container that loops through all files from a specific directory. The files have a naming convention that looks like this:

CDNSC.CDNSC.SC00015.01012007

The last segment of the file name is the date of the data in the file (mmddyyyy). The create date for these files is always a day later than indicated in the file name.

What I would like to do is to have more control over the 'range' of files that are looped through by using the date portion of the file name to define what group of files should be looped through. Ideally, I would like to have a 'StartDate' variable and an 'EndDate' variable that I could define at run time for the package, and the package would loop through all of the files where the date portion of the file name fell between 'StartDate' and 'EndDate'.

Any ideas on this?

Thank you for your help!

cdun2

One way is to move the files to an archive directory when you're done with them. That way you only pick up the files you need to process. Just throwing that idea out there.|||Thanks for your response. One thing that I have to be prepared for is that I may have to 'reload' data. The date range for any reloading could be anything.|||Well, you could (and perhaps should?) stage the data from the flat files in a staging table. Then, once in the staging table you can keep n number of loads around, all marked by either the load date, or a unique load number. After n number of loads, you can delete the old data. Each row in the staging table would also contain the source file name.

Or, you could simply load a table with the load date and a filename. That way, you've got a table that tells you which files to grab for a given load.
Load File
1 xyz.txt
1 fhs.txt
2 jfb.txt

The above still assumes that you "archive" the files into a subdirectory or somewhere.|||The data from the flat files will wind up in one or more SQL Server tables. One idea I was considering was to store the file names in a table, and maybe parse the date portion of the file name into a column. Then somehow use that as a lookup table to determine which files would be processed.|||
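The lookup-table idea in the last post can be sketched outside SSIS. This is only an illustration under stated assumptions: the table name `file_catalog`, its columns, the helper names, and the second and third file names are all hypothetical, and SQLite stands in for SQL Server. The date segment is stored in ISO form (yyyy-mm-dd) so that a plain BETWEEN comparison orders correctly.

```python
import sqlite3
from datetime import datetime

# In-memory SQLite database standing in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_catalog (file_name TEXT PRIMARY KEY, file_date TEXT NOT NULL)"
)

def register_file(file_name):
    """Store a file name with its trailing mmddyyyy segment parsed to an ISO date."""
    segment = file_name.rsplit(".", 1)[-1]
    iso_date = datetime.strptime(segment, "%m%d%Y").date().isoformat()
    conn.execute(
        "INSERT OR IGNORE INTO file_catalog VALUES (?, ?)", (file_name, iso_date)
    )

def files_to_process(start_date, end_date):
    """Return catalogued file names whose date lies in [start_date, end_date]."""
    rows = conn.execute(
        "SELECT file_name FROM file_catalog "
        "WHERE file_date BETWEEN ? AND ? ORDER BY file_date",
        (start_date, end_date),
    )
    return [row[0] for row in rows]

# Hypothetical file names following the thread's naming convention.
for name in (
    "CDNSC.CDNSC.SC00015.01012007",
    "CDNSC.CDNSC.SC00015.01022007",
    "CDNSC.CDNSC.SC00015.02152007",
):
    register_file(name)

reload_list = files_to_process("2007-01-01", "2007-01-02")
print(reload_list)
```

Because the catalog keeps every file name it has seen, a reload for an arbitrary date range only needs different start and end values.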

cdun2 wrote:

The data from the flat files will wind up in one or more SQL Server tables. One idea I was considering was to store the file names in a table, and maybe parse the date portion of the file name into a column. Then somehow use that as a lookup table to determine which files would be processed.

Right, but the point is that your first data flow simply loads the data from the files AS IS into a table. Then in the data flow that you use today, you point them to the staging table instead of the flat file(s).|||Oh, I see. The data that I need to process is a subset, and will be aggregated for reporting. The concern that I have is that somewhere down the road, the reporting requirements may change, and the users may need another column from the data. I'm thinking that if that happens, I'll need to add the additional column to the subset table, and reload everything.|||

cdun2 wrote:

Oh, I see. The data that I need to process is a subset, and will be aggregated for reporting. The concern that I have is that somewhere down the road, the reporting requirements may change, and the users may need another column from the data. I'm thinking that if that happens, I'll need to add the additional column to the subset table, and reload everything.

You're going to have to add the column in the flat file source anyway, would be one argument.

Regardless, this is just one way to tackle this problem. There are other ways I'm sure.|||

Actually, the destination table will have just a subset of the columns that are found in the flat files. If a need is found for an additional column of data, it will already be in the flat files. I'll just need to add it to the destination table.

What kind of component could I use in a DataFlow Task to read filenames?

Thanks again.

|||

cdun2 wrote:

Actually, the destination table will have just a subset of the columns that are found in the flat files. If a need is found for an additional column of data, it will already be in the flat files. I'll just need to add it to the destination table.

What kind of component could I use in a DataFlow Task to read filenames?

Thanks again.

If this is the case, then the staging table example still works. In your CONTROL flow, you'd use a foreach loop to spin through the files. Each filename can be put in a variable, which can be used in a data flow inside the foreach loop. Then, using a derived column in the data flow, you'd have access to the variable that contains the current filename.|||
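As a rough analogy only (this is Python, not SSIS), the pattern described above looks like the sketch below: an outer loop holds the current file name in a variable, and every row read from that file gets the name appended as an extra column. The function name, the `CDNSC.*` pattern, and the assumption that the files are comma-delimited are all hypothetical.

```python
import csv
from pathlib import Path

def rows_with_source(directory, pattern="CDNSC.*"):
    """Yield each row of each matching file with the source file name appended."""
    for path in sorted(Path(directory).glob(pattern)):
        # Plays the role of the package variable that holds the current file name.
        current_file = path.name
        with path.open(newline="") as handle:
            for row in csv.reader(handle):
                # Equivalent of the derived column: add the file name to the row.
                yield row + [current_file]
```

Calling `list(rows_with_source("some_directory"))` would give staging rows that each record which file they came from, which is what makes a later reload by file possible.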

cdun2 wrote:


Hello,

I currently have a For Each File container that loops through all files from a specific directory. The files have a naming convention that looks like this:

CDNSC.CDNSC.SC00015.01012007

The last segment of the file name is the date of the data in the file (mmddyyyy). The create date for these files is always a day later than indicated in the file name.

What I would like to do is to have more control over the 'range' of files that are looped through by using the date portion of the file name to define what group of files should be looped through. Ideally, I would like to have a 'StartDate' variable and an 'EndDate' variable that I could define at run time for the package, and the package would loop through all of the files where the date portion of the file name fell between 'StartDate' and 'EndDate'.

Any ideas on this?

Thank you for your help!

cdun2

Cdun2,

This is an alternative approach to the issue of processing only the files that are within a start/end date range. It uses Expression and precedence constraints:

http://rafael-salas.blogspot.com/2007/02/ssis-loop-through-files-in-date-range.html
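Whatever mechanism does the filtering, the underlying test is whether the date embedded in the file name lies between the two variables. A minimal Python sketch of that test follows; it is not the code from the linked post, and the function names and the extra file names are hypothetical. Note that the mmddyyyy text has to be parsed into a real date first, since comparing the raw strings would not order them chronologically.

```python
from datetime import date, datetime

def file_date(file_name: str) -> date:
    """Parse the trailing mmddyyyy segment of a name like CDNSC.CDNSC.SC00015.01012007."""
    return datetime.strptime(file_name.rsplit(".", 1)[-1], "%m%d%Y").date()

def files_in_range(file_names, start_date: date, end_date: date):
    """Return the names whose embedded date is in [start_date, end_date], oldest first."""
    selected = [n for n in file_names if start_date <= file_date(n) <= end_date]
    return sorted(selected, key=file_date)

# Hypothetical file names following the thread's naming convention.
names = [
    "CDNSC.CDNSC.SC00015.02152007",
    "CDNSC.CDNSC.SC00015.01012007",
    "CDNSC.CDNSC.SC00015.01022007",
]
picked = files_in_range(names, date(2007, 1, 2), date(2007, 1, 31))
print(picked)
```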

BTW,

Just to save extra work down the road, I would recommend importing all the columns of the files from the beginning.

|||Thanks again for your input. I'll dig a little deeper into this next week. As far as bringing in all of the columns of data, the attempt is to aggregate so that we can meet a reporting need, and reduce the number of rows of data that we need to store. It's a resource/performance issue for us. I can't bring in any more than I need.|||

cdun2 wrote:

I can't bring in any more than I need.

But you can. To Rafael's point, you can at least have ALL of the flat file columns mapped in the connection manager. You don't have to do anything with them in the data flow, but at least they are there if you need to pick them up later; you won't have to redefine the connection manager.|||

Phil Brammer wrote:

cdun2 wrote:

I can't bring in any more than I need.

But you can. To Rafael's point, you can at least have ALL of the flat file columns mapped in the connection manager. You don't have to do anything with them in the data flow, but at least they are there if you need to pick them up later; you won't have to redefine the connection manager.

Yes, that makes sense. Thanks again.

|||

I took a look at the example here; http://rafael-salas.blogspot.com/2007/02/ssis-loop-through-files-in-date-range.html

Thank you for providing this example. I am having a difficult time bridging the gap between the precedence constraint and the Data Flow Task where the contents of the qualifying flat files are inserted into the SQL Server table.

Monday, March 12, 2012

Lookup on a range

I want to check to see if my document number (let's say, 11) is in between two values: from a table in my database, I have a min value and a max value; how can I check to see if the number is between the min and the max?

Thanks!

Jim Work
In the control flow, you may want to issue an execute sql task with the query being: select min(value) as minvalue, max(value) as maxvalue from table.

Then map the results to two variables.

Then, in the data flow, you can compare input to the two variables using a conditional split, derived column, etc...
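The three steps above can be sketched in Python, with SQLite standing in for the database. This is an illustration under assumptions: the table `doc_range`, the column `doc_number`, and the sample values are hypothetical. The query runs once, its two results are kept in variables, and each incoming value is then routed by comparing it to those variables, much as a conditional split would.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_range (doc_number INTEGER)")
conn.executemany("INSERT INTO doc_range VALUES (?)", [(5,), (12,), (40,)])

# Step 1 and 2: run the min/max query once and map the result to two variables.
min_value, max_value = conn.execute(
    "SELECT MIN(doc_number), MAX(doc_number) FROM doc_range"
).fetchone()

def in_range(document_number):
    """True when the number lies between the stored min and max, inclusive."""
    return min_value <= document_number <= max_value

# Step 3: split incoming rows on the comparison.
incoming = [3, 11, 40, 41]
inside = [n for n in incoming if in_range(n)]
outside = [n for n in incoming if not in_range(n)]
print(inside, outside)
```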

Lookup / Merge Join / Script - How to look up values by comparing to a range of values?

Hello all,

I am trying to think my way through a solution which I believe others have probably come across... I am trying to implement a matching routine wherein I need to match an address against a high value and a low value (or, for that matter, an input date vs. a start and end date) to return the desired row, i.e. if I were to use a straight VB program I would just use the following lookup:

"SELECT DISTINCT fire_id, police_ID, fire_opt_in_out, police_opt_in_out FROM ipt_tbl " & _
" WHERE zip_code = @zip_code AND addr_prim_lo <= @street_number AND addr_prim_hi >= @street_number " & _
" AND addr_prim_oe = @addr_prim_oe AND street_pre = @street_pre AND street_name = @street_name " & _
" AND street_suff = @street_suff AND street_post = @street_post " & _
" AND (expiry_date = '' OR expiry_date = '00000000' OR expiry_date > @expiry_date)" & _
" GROUP BY fire_ID, police_ID, fire_opt_in_out, police_opt_in_out"

My question, then, is how would you perform this type of query using a lookup / merge join or script? I have not found a way to set the input columns for it. I can set the straight matches without a problem, i.e. lookup zip code = input zip code, but can't think of the correct way to set range comparisons, i.e. lookup value 1 <= input value AND lookup value 2 >= input value.

Any suggestions?

thanks for your time...

After a bit of research, I have found a few different solutions to this problem. The first uses the lookup by altering the sql statement and parameter set under the advanced tab (enable memory restriction / modify the sql statement) as discussed here http://www.julian-kuiters.id.au/article.php/ssis-lookup-with-range. The second (a bit less likeable) option was to use a script component and found here http://sqlblogcasts.com/blogs/simons/archive/2005/10/04/628.aspx

I am a bit disappointed that the range functionality was not prebuilt in to the components (other than through the memory restriction)... Does anyone else have any ideas or suggestions?

Thanks for your time.

|||Although the logic in your query is pretty clear, the context and output is not - at least not to me. What exactly do you want to have happen in your data flow as the output of this process?|||

Reading back I can see how you would be confused...

I am trying to look up information based on an address range (i.e. if you are on the 23rd block (2300 - 2400) there will be one set of police and fire municipality codes; if you are on the 24th block (2401 - 2500) there will be separate municipality code outputs). We are aggregating these facts based on addresses supplied from our customer base against a table provided by the state, which maps out the various address blocks by which we need to break down our earned premiums so that we can pay taxes to the firefighter and police pension plans.

I think that the main point was that there is not an intuitive way of looking up information based on a fact table's value being within the dimension table's value range. I have also had this problem looking up values based on date ranges...
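The address-block example can be made concrete with a small Python sketch that runs a parameterized range query per input row, which is the same comparison the VB query performs. It is an illustration only: SQLite stands in for the real database, the table is cut down to a few of the columns named earlier, and the zip code and municipality codes are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ipt_tbl (zip_code TEXT, addr_prim_lo INTEGER, "
    "addr_prim_hi INTEGER, fire_id TEXT, police_id TEXT)"
)
# Hypothetical rows for the 23rd and 24th blocks described above.
conn.executemany(
    "INSERT INTO ipt_tbl VALUES (?, ?, ?, ?, ?)",
    [
        ("12345", 2300, 2400, "F23", "P23"),
        ("12345", 2401, 2500, "F24", "P24"),
    ],
)

def lookup_codes(zip_code, street_number):
    """Return (fire_id, police_id) for the block containing street_number, or None."""
    return conn.execute(
        "SELECT fire_id, police_id FROM ipt_tbl "
        "WHERE zip_code = ? AND addr_prim_lo <= ? AND addr_prim_hi >= ?",
        (zip_code, street_number, street_number),
    ).fetchone()

print(lookup_codes("12345", 2350))
print(lookup_codes("12345", 2450))
print(lookup_codes("12345", 2600))
```

A street number outside every block returns `None`, which corresponds to the no-match case a lookup would have to handle.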

|||

I don't know how much anyone is interested, but here is a topic on the feedback site which you can vote on to bring this to the SSIS team's attention...

https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=263594

Monday, February 20, 2012

Looking for help with an SQL query.

well, i want to select rows by date from a file. but I want
in particular one sum of values from the rows that fall WITHIN
a supplied date range, and a second sum of values from the
rows that have dates FOR ALL TIME UP TO the second date in the date range.

the former, by itself, might be:

SELECT id, value RangedValue
FROM myFile
WHERE date >= [lower date range value]
AND date <= [higher date range value]

and the latter, by itself, might be:

SELECT id, value AllTimeValue
FROM myFile
WHERE date <= [higher date range value]

but I need to grab the two separate sums (RangedValue and AllTimeValue)
using one SQL statement.

I'm thinking that the UNION might work, but my preliminary results are
taking a huge amount of time, and apparently smegging up the (rather
stupid, external) report generator to boot.
If you like the UNION idea, please give me an example.

I should mention that the report generator at very least can do the
(summing) part. I could do the summing at either the SQL level or the
report level. I should also mention that although I only talk about the
one file here (myFile), in fact I need to join to and pull values from
its "parent" file, although I don't think that that should change my
fundamental problem.

Any ideas?

Cheers in advance!

-Glenn

You could try something like this:
Select Id
, Sum(Value) Alltimevalue
, Sum(Case
When Date >= [Lower Date Range Value] Then Value
Else 0 End) As Rangedvalue
From Myfile
Where Date <= [Higher Date Range Value]
Group By Id;
;)
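To see the single-statement approach return both sums, here is a small runnable sketch using Python's built-in `sqlite3`. The sample rows and the date bounds are invented for illustration, and dates are stored as ISO text so that string comparison matches date order. The query groups by id, which it needs in order to return one pair of sums per id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myFile (id INTEGER, date TEXT, value INTEGER)")
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO myFile VALUES (?, ?, ?)",
    [
        (1, "2011-12-15", 10),  # before the range: counts toward all-time only
        (1, "2012-01-10", 20),  # inside the range: counts toward both sums
        (1, "2012-03-01", 40),  # after the upper bound: excluded entirely
        (2, "2012-01-20", 5),
    ],
)

lower, upper = "2012-01-01", "2012-01-31"
rows = conn.execute(
    """
    SELECT id,
           SUM(value) AS AllTimeValue,
           SUM(CASE WHEN date >= ? THEN value ELSE 0 END) AS RangedValue
    FROM myFile
    WHERE date <= ?
    GROUP BY id
    ORDER BY id
    """,
    (lower, upper),
).fetchall()
print(rows)
```

The WHERE clause applies the upper bound shared by both sums, and the CASE expression applies the lower bound to the ranged sum only, so the table is read once and no UNION is needed.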