Wednesday, March 28, 2012

'Losing' input variables

Hi,

Excuse me for the 'noobish' question, but it seems that in my mining models I am 'losing' input variables. I am using the Microsoft Decision Trees algorithm, and even though I have set 4 variables as 'input' and all 4 of them are in my mining structure, the model is using only 3. The 4th variable is also missing from the dependency network graph. Can anyone help me solve this problem?

An input attribute might not be used for a split in a decision tree if the input has a low correlation to the output. Try lowering the algorithm parameter COMPLEXITY_PENALTY, which makes the tree more willing to split, and see if the additional input shows up in the tree.

Another possible factor might be the MINIMUM_SUPPORT parameter. In some cases, splitting the tree on an attribute at a given level would produce leaf nodes with support less than MINIMUM_SUPPORT, so the split is not made. Try decreasing this parameter along with the first change and see if this is the case.
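To make the direction of the two parameters concrete, here is a toy sketch in Python. It is not Microsoft's actual scoring method: the function name, the default values, and the idea of comparing a split's gain directly against the penalty are all assumptions made for illustration. It only shows why lowering either parameter lets more splits through.

```python
def split_is_allowed(child_sizes, gain, minimum_support=10, complexity_penalty=0.5):
    """Toy pre-pruning gate (hypothetical, not the real algorithm)."""
    # A split is rejected if any resulting node has too few cases.
    if any(size < minimum_support for size in child_sizes):
        return False
    # Assumption: treat the penalty as a bar that the split's gain must clear.
    return gain > complexity_penalty

# A split that sends 48 cases one way and 2 the other, with a modest gain.
strict = split_is_allowed([48, 2], gain=0.4)
relaxed = split_is_allowed([48, 2], gain=0.4, minimum_support=2, complexity_penalty=0.1)
print("default settings:", strict)
print("relaxed settings:", relaxed)
```

With the stricter settings the split is blocked, so the attribute never appears in the tree; relaxing both settings lets it through.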

|||

You're not "losing" it - the algorithm simply isn't finding that it is relevant. This could be for a couple of reasons:

1: It's not relevant - it has no relationship to the target variable

2: It is relevant, but it is closely correlated to another input. Say, for instance, you were trying to predict "Will Buy Glasses" and had two input variables, among others, "Has Poor Eyesight" and "Currently Wears Glasses". Say also that almost everyone with poor eyesight already wears glasses. Now assume the tree splits on "Has Poor Eyesight". You now have two sub-populations, one with "Has Poor Eyesight" = true and the other with "Has Poor Eyesight" = false. Since "Has Poor Eyesight" and "Currently Wears Glasses" are so highly correlated, there is almost no variation in "Currently Wears Glasses" within either sub-population, so it adds nothing new about "Will Buy Glasses" and gives the tree nothing further to split on.
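The second case can be checked with a small Python sketch. The counts below are invented for illustration, and information gain is used as a stand-in split score (the Microsoft algorithm has its own scoring methods). The point is only that "Currently Wears Glasses" looks useful on the whole population but contributes almost nothing once the data has been split on "Has Poor Eyesight".

```python
from collections import Counter
from math import log2

# Hypothetical counts: (has_poor_eyesight, wears_glasses, will_buy, number_of_cases)
COUNTS = [
    (True, True, True, 38), (True, True, False, 10),
    (True, False, True, 1), (True, False, False, 1),
    (False, False, True, 5), (False, False, False, 43),
    (False, True, True, 1), (False, True, False, 1),
]
ROWS = [
    {"poor": poor, "wears": wears, "buy": buy}
    for poor, wears, buy, n in COUNTS
    for _ in range(n)
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target="buy"):
    """Reduction in target entropy obtained by splitting rows on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Gain of "wears" on everyone, then inside each branch of a split on "poor".
gain_alone = information_gain(ROWS, "wears")
gain_in_poor_branch = information_gain([r for r in ROWS if r["poor"]], "wears")
gain_in_good_branch = information_gain([r for r in ROWS if not r["poor"]], "wears")

print(f"whole population:      {gain_alone:.3f}")
print(f"poor-eyesight branch:  {gain_in_poor_branch:.3f}")
print(f"good-eyesight branch:  {gain_in_good_branch:.3f}")
```

The first figure is much larger than the other two, which is why the variable can be genuinely related to the target and still never appear in the tree.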

One way to find if your fourth variable has any impact on your target is to create a Naive Bayes model where all attributes are treated independently and you don't have the "sub-population" behavior.
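As a rough picture of what "treated independently" means, the Python sketch below looks at each input on its own against the target, which is the per-attribute view a Naive Bayes model is built on. It is not the Microsoft Naive Bayes algorithm itself, and the counts and the helper name are made up for illustration.

```python
# Hypothetical counts: (has_poor_eyesight, wears_glasses, will_buy, number_of_cases)
COUNTS = [
    (True, True, True, 38), (True, True, False, 10),
    (True, False, True, 1), (True, False, False, 1),
    (False, False, True, 5), (False, False, False, 43),
    (False, True, True, 1), (False, True, False, 1),
]

def buy_rate_by_value(attribute_index):
    """P(will_buy = True | attribute = value), ignoring every other input."""
    totals, buyers = {}, {}
    for row in COUNTS:
        value, will_buy, n = row[attribute_index], row[2], row[3]
        totals[value] = totals.get(value, 0) + n
        if will_buy:
            buyers[value] = buyers.get(value, 0) + n
    return {value: buyers.get(value, 0) / totals[value] for value in totals}

rates_poor = buy_rate_by_value(0)   # "Has Poor Eyesight"
rates_wears = buy_rate_by_value(1)  # "Currently Wears Glasses"
print("Has Poor Eyesight:      ", rates_poor)
print("Currently Wears Glasses:", rates_wears)
```

Because each input is scored without first conditioning on the others, an input that a tree skipped as redundant still shows its relationship to the target here.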
