Showing posts with label input. Show all posts
Showing posts with label input. Show all posts

Wednesday, March 28, 2012

'Loosing' input variables

Hi,

Excuse me for the 'noobish' question but it seems in my mining models, i am 'loosing' input variables. I am using the Microsoft Decision Tree algorithm and eventhough i have set 4 variables as 'input' and all 4 of them are in my mining structure, the model is using only 3. That 4th variable is also missing from the dependency network graph. Can anyone help me solve this problem?

An input attribute might not be used for a split in Decision Tree if the input has a low correlation to the output. Try changing the algorithm parameter COMPLEXITY_PENALTY to increase the number of splits and see if the additional input shows up in the tree.

Another possible factor might be the MINIMUM_SUPPORT parameter. In some cases, a attribute used to split a tree at a level might result in leaf nodes with support less than MINIMUM_SUPPORT. Try decreasing this parameter along with the first change and see if this is the case.

|||

You're not "losing" it - the algorithm simply isn't finding that it is relevant. This could be for a couple of reasons:

1: It's not relevant - it has no relationship to the target variable

2: It is relevant, but it closely correlated to another input. Say for instance you were trying to predict "Will Buy Glasses" and had two input variables, among others, "Has Poor Eyesight" and "Currently Wears Glasses". Say again, that it happens to be that almost everyone with poor eyesight already wears glasses. Now assume the tree splits on "Has Poor Eyesight". Now you have two sub-populations where one has "Has Poor Eyesight" = true and the other "Has Poor Eyesight" = false. Since "Has Poor Eyesight" and "Will Buy Glasses" are so highly correlated, there is no differentiation in "Will Buy Glasses" in either sub-population, and nothing for the tree to further split on.

One way to find if your fourth variable has any impact on your target is to create a Naive Bayes model where all attributes are treated independently and you don't have the "sub-population" behavior.

sql

Monday, March 19, 2012

Lookup with multiple hits

In a Lookup component I've defined a SQL query which returns a sorted resultset. For each Lookup component input row I want to have a single output row. Problem is that for each input row there is possibility of multiple matches in SQL query resultset. From all of the possible multiple hits I want only the first one to be returned, and if no match is found then no output row. How to implement this?

Try changing your lookup query to only return the rows you are interesting in.

Code Block

select

colA

,colB

,max or min (colC)

from

tableA

group by

colA

,colB

|||

The lookup component does this by default does it not?

Edit: that is return only the first result that it runs into. Just make sure that you return the correct row first and you should be fine...

Try the following to prove:

select 19000101 as datekey, '01/01/1900' as datename

UNION

select 19000102 as datekey, '01/02/1900' as datename

UNION

select 19000103 as datekey, '01/03/1900' as datename

UNION

select 19000104 as datekey, '01/04/1900' as datename

UNION

select 19000105 as datekey, '01/05/1900' as datename

in an oledb source

and then in the lookup

select 19000101 as datekey, 'myname11' as name

UNION

select 19000102 as datekey, 'myname21' as name

UNION

select 19000103 as datekey, 'myname31' as name

UNION

select 19000104 as datekey, 'myname41' as name

UNION

select 19000101 as datekey, 'myname12' as name

UNION

select 19000102 as datekey, 'myname22' as name

UNION

select 19000102 as datekey, 'myname23' as name

Notice, there is no match for 19000105, it will redirect.

You will return the values

19000101, '01/01/1900', 'myname11'

19000102, '01/02/1900', 'myname21'

19000103, '01/03/1900', 'myname31'

19000104, '01/04/1900', 'myname41'

|||

Eric Wisdahl wrote:

The lookup component does this by default does it not?

I believe so. The message about duplicate values is just a warning, not an error. I don't think there's a guaranteed order -- it just picks the first one it comes across.|||

Eric Wisdahl wrote:

The lookup component does this by default does it not?

Yes. Lookup is a synchronous component; hence the number of rows in the output is the same than the number of rows in the input. If your lookup query returns more than one row for an incoming row; then 'the first' one would be used; and you don't have control over which one would be used.|||SQL query returns result set of multiple contact persons for multiple companies. On output I need for each company to filter out just one of the contact persons. Result set is sorted (ORDER BY) so that if many contact persons are found per comapny one contact person that should be chosen as highest in order. On input of the Lookup component I've put OLE_SRC component which fetches all the companies.

I couldn't use just GROUP BY because I need columns in resultset which aren't used in aggregate function nor should be be group on.
|||

I would not feel comfortable relying in SSIS picking the 1st from the list. If the query in the lookup component is against SQL Server, Oracle or other RDBS where the rank function I would use a query like:

http://rafael-salas.blogspot.com/2007/04/remove-duplicates-using-t-sql-rank.html

Notice that you could mimic the result set even without the rank() function; but the query could get little complex|||Great, thank you all, especially to Rafael.

In an OLE_SRC I manually entered SQL command. I had to do it manually because it seems that Query Builder doesn't (yet) support OVER construct - when I click Build Query on OLE_SRC component it says "The OVER SQL construct or statement is not supported.".
|||

Is your source sql server 2005?

If not, you'll have to use another technique to get the row number

Code Block

select

colA

,(

select count(*)

from tableA b

where b.colA <= a.colA

) as RowNum

from tableA a

|||Yes, the source is SQL Server 2005. OLE DB Source component has Query Builder GUI which doesn't support OVER construct. I entered query manually and it works.

Lookup transform with multiple matches

Please indulge my ignorance, as I have only been using SSIS for a couple of weeks.
I'm trying to create a data warehouse using two input tables.
A column needs to be added to one table by using a lookup into the second table.
SSIS seems to handle the "no matches" and "single match" cases perfectly.
I can't for the life of me figure out how to properly handle multiple matches.
SSIS defaults to the first match, but I need to compute the "best" match.

Many thanks in advance
Scott!

What's the criteria for the "best" match? Knowing the criteria would make it easier to help Smile|||Here are some specifics:
Table A contains repair records.
Table B contains inspection records.
Each piece of equipment can have multiple repairs and many more inspections.
A lookup can be performed based on the equipment number.

I need to determine which repair immediately proceeds each inspection.
This can be found by computing the smallest positive date difference.

Programmactically, I would normally do this with nested loops.
Using SSIS, I suppose it could also be done this way, using a foreach loop
and a condition...
|||

Interesting problem Smile

I haven't tested this, but could you join the rows in SQL on the equipment, calc the date difference and order by it, and use a rank function to get the first row?

Monday, March 12, 2012

Lookup on input

I need to validate my input rows. The row is valid if there exist some other input rows in the same table (I am importing data from excel and access). I'll give an example to make everything clear:

Input table boys has following columns:First_Name ,Surname and Date_of_birth.

Output table is Twin_Triple_More_Brothers. I would like to insert into this table only boys that surnames are equal and difference in date of birth is less then one day.

I was thinking about lookup component, but I cannot use it in that way (or I just do not know how).

Maybe someone has an idea how to do this?
Thanks for help.

You could try something like this...

-Send your source to a lookup with the lookup query on the surnames and the date of birth

-Send error rows of this lookup to another lookup with the lookup query on the surnames and the date of birth-1

-Send the successfull lookups from the two lookup components to a union all component

-Send the output of the union all component to your destination

|||Use a lookup to match on Surname. Then return Date_of_birth from the lookup table.

Next, hook up to a derived column to calculate the difference between the input Date_of_birth and the lookup Date_of_birth.

Hook that up to a conditional split, where you test the calculation from above. If it's within your parameters, send to the insert path, otherwise do nothing.|||But how can I lookup in input? I know how to search in other table, but I do not know how to search in input columns.
Again thanks for help.
|||

Piotr Stapp wrote:

But how can I lookup in input? I know how to search in other table, but I do not know how to search in input columns.
Again thanks for help.

Create an initial data flow that simply takes your input and loads it to a table. Then in your second data flow, you can use an OLE DB source hooked up to that staging table, having a lookup pointing to the same table. You might have to work through the matching logic a bit, but it could work for you.|||So the only way is to use temporary table?
Maybe it is not as good as I think, but it is also not so bad.
|||You could do this in SQL, probably, but this isn't the forum for that... Wink

Lookup finds match on empty reference table

Hi all,

In BOL it says: "The Lookup transformation performs an equi-join between values in the transformation input and values in the reference dataset. Using an equi-join means that each row in the transformation input must match at least one row from the reference dataset. If there is no matching entry in the reference dataset, no join occurs and no values are returned from the reference dataset. This is an error, and the transformation fails, unless it is configured to ignore errors or redirect error rows to the error output. "

I have a lookup transformation which is supposed to find a match on two fields in the reference dataset (a table in my case) but strangely, when I execute my package and the reference table is empty the lookup still finds match for each row of my input dataset.

Does anyone have an idea why? I could'nt find anything about that in BOL.

Sbastien.

If the lookup is returning rows, yet now rows exists, it cannot be matching, the returned columns if any will be null. You can however allow this behaviour, by ignoring errors. A failed lookup match is classed as an error, and can be set under the Configure Error Outputs button/screen. See if you have set Ignore for errors.|||Indeed I have configured the error to "Ignore errors" but I thought it will simply avoid the package to fail if it couldn't find any match. I don't want to use Redirect rows either as I have nothing to do with the non matching rows.|||

Sebastian,

The solution here then is quite simple. Use a Conditional Split to filter out the columns that have NULL values in the lookup columns - thereby giving you the rows that yielded a match.

-Jamie

Saturday, February 25, 2012

Looking for input on SQL 2005 and IIS Security

I am just looking for feedback and maybe pointers for research regarding securing SQL 2005 for IIS access. I am currently working on a project for building a new retail website and our sysadm guys have some concerns regarding exposing our SQL box to the DMZ via direct connections to the IIS box. Now we have not completely come to a conclusion of saying that it is not possible, but we are very concerned due to recent credit card and customer data problems in the industry.

So far we have mainly talked about just exposing web services(limiting the exposure of data that can be retrieved) on an internal IIS box that would be called allowed to be accessed from only the IIS box in the DMZ using User Accounts and Client SSL Certs. What I am most concerned about is the performance of this design. I would like to try and provide a connection to the SQL box directly for the devs, but I am not sure what the best practice would be for securing this connection through the firewall and also monitoring it in case our DMZ IIS box gets owned.

Any input or direction to resources would be much appreciated. I have read quite a few papers so far and just want to get feedback on architectures and designs.

Thanks in advance.

Options include: using IPSEC (or SSL) to control the connection between the IIS box and SQL Server (and Developers). Only 'approved' IP addresses would be allowed to connect to SQL Server. That protects SQL Server from being directly attacked, but not from attack from the IIS box.

Using IIS's capability to use Application Pools, you can further restrict access to SQL Server through the Application.

Here are some resources that might be useful:

Web Applications -Connect to SQL Server
Configuring an ASP.NET 2.0 Application to Work with Microsoft SQL Server 2000 or SQL Server 2005
http://imar.spaanjaars.com/QuickDocId.aspx?quickdoc=395
How To: Create a Service Account for an ASP.NET 2.0 Application
http://msdn2.microsoft.com/en-us/library/ms998297.aspx
How To: Connect to SQL Server Using Windows Authentication in ASP.NET 2.0
http://msdn2.microsoft.com/en-us/library/ms998300.aspx