Overview
Continuing on from the previous blog on random forests, I decided to investigate how to perform automated feature selection using feature importance measurements. This kind of process becomes important when the feature set grows very large and the speed of the machine learning algorithm is a function of the number of features. Dimensionality reduction (feature set reduction) addresses this problem using algorithms such as principal component analysis. At this point in time I chose to keep my problem simple, expanding my knowledge of the random forest algorithm while improving my R skills.
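For comparison, principal component analysis is straightforward to run in base R. The snippet below is only a hedged sketch of that alternative route, not something used later in this post; numericPredictors is a hypothetical data frame containing numeric columns only.

# A minimal PCA sketch, not part of the algorithm in this post.
# 'numericPredictors' is a hypothetical data frame of purely numeric columns.
pca <- prcomp( numericPredictors, center = TRUE, scale. = TRUE )
summary( pca )             # proportion of variance explained per component
reduced <- pca$x[ , 1:5 ]  # keep, say, the first five principal components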
The implementation I present here is far from perfect, let alone production level; for example, notice how I have omitted all the tuning arguments to randomForest, never mind the R inefficiencies. However, the code is reasonable enough to demonstrate the basic idea of automated feature selection and some interesting R snippets for the novice R programmer. Source code is now available on GitHub.
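For context, a more explicit call would spell out the common tuning arguments. The values below are illustrative only and are not the settings used in this post:

library(randomForest)

# Illustrative values only; mtry and nodesize should be tuned for the data at hand.
fit <- randomForest( V16 ~ ., data = trainingSet,
                     ntree = 500,        # number of trees to grow
                     mtry = 3,           # variables tried at each split
                     nodesize = 1,       # minimum size of terminal nodes
                     importance = TRUE ) # also compute permutation importance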
Sequential feature selection algorithm
The algorithm to reduce the number of features while finding the minimum error rate is based on a two-step process:
- Feature selection is performed by taking each feature's MeanDecreaseGini value and checking it against an incrementing threshold. If the value is greater than the threshold the feature remains a member of the feature set, otherwise it is removed from further processing (see the sketch after this list).
- Using the new feature set, find the optimal number of trees with the lowest error rate.
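As a standalone illustration of the first step, the following minimal sketch keeps only the features whose MeanDecreaseGini exceeds the threshold. Here fit and threshold are assumed names for an already trained classification forest and the current cut-off; they are not part of the code later in this post.

library(randomForest)

# 'fit' is an already trained classification forest, 'threshold' the current
# Gini cut-off; both are assumed to exist for the purpose of this sketch.
imp  <- importance( fit )                         # one row per predictor
keep <- imp[ , "MeanDecreaseGini" ] > threshold   # logical vector over predictors
selectedFeatures <- rownames( imp )[ keep ]       # names of the surviving features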
The following sections present the code to perform learning and feature selection. I have reduced the code to the most interesting components; notice I have introduced a user library and imported it using the 'source' function.
The program flow, shown in the listing below, is as follows:
- Load and prepare the data.
- Set the algorithm parameters.
- Execute the algorithm.
- Display various plots and model measurements.
- Export and save model to PMML.
# Load the required package and the user lib files
library(randomForest)
source( 'lib/rforestHelperFunctions.R' )

# Load the data, partition to train/test sets etc.
creditapps <- read.table( "../data/creditcard_applications.txt", sep=",", header=FALSE )
# etc.......

# Algorithm parameters
formula       <- "V16 ~ ."
attrThreshold <- 4
iterations    <- 10
initTreeCount <- 55
treeCountStep <- 25

# Now call the algorithm
bestFittingTree <- sqBestFeatureModel.rf( formula, trainingSet, "V16", iterations,
                                          initTreeCount, treeCountStep, attrThreshold )
tree   <- bestFittingTree$tree
errors <- bestFittingTree$errors

# Output error plot
plot( errors )

# Importance of each predictor
print( bestFittingTree )
importance( tree )
varImpPlot( tree )
plot( predict( object=tree, newdata=testingSet ), testingTarget )
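The last step in the flow above, exporting the model to PMML, is not shown in the listing. One possible way to do it, assuming the pmml and XML packages are installed, is sketched below; the file name is made up for illustration.

# A hedged sketch of the PMML export step; requires the 'pmml' and 'XML'
# packages, and assumes 'tree' is the randomForest object returned above.
library(pmml)
library(XML)
saveXML( pmml( tree ), file = "rforest_credit_model.pmml" )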
Algorithm implementation
The algorithm is split into two sections: selecting the best features and finding the best number of trees for the given features. Below is a description of the required parameters.
Function parameters
- modelFormula - The R formula used for the randomForest function.
- data - The training data frame.
- targetVar - The name of the target variable in the training data.
- maxIterations - Number of iterations to perform the search.
- numStartTrees - The initial number of trees to grow.
- treeStepCount - The tree growth increment for each step.
- attrThreshold - The minimum Gini threshold to use to remove features.
sqBestFeatureModel.rf <- function( modelFormula, data, targetVar, maxIterations,
                                   numStartTrees, treeStepCount, attrThreshold ){
    # Start with every column of the training data as a candidate feature
    features       <- colnames( data )
    errMeasure     <- c()
    bestFit        <- c()
    bestFeatureErr <- 2000

    # Loop through all features we have
    for (n in 1:maxIterations){
        X <- data[ features ]

        # Find optimal number of trees
        bestTree <- optimalTreeSize.rf( modelFormula, X, maxIterations,
                                        numStartTrees, treeStepCount )

        # Get error of training and track
        featureErr <- sum( data.frame(bestTree$confusion)["class.error"] )/2
        errMeasure <- c( errMeasure, featureErr )

        # Save the best performing forest and feature set
        if( featureErr < bestFeatureErr ){
            bestFit        <- bestTree
            bestFeatures   <- features
            bestFeatureErr <- featureErr

            # Keep only the predictors whose Gini importance beats the threshold
            imp      <- data.frame( bestFit$importance )
            keepAttr <- imp$MeanDecreaseGini > attrThreshold
            features <- rownames( bestFit$importance )[ keepAttr ]
            #print(bestFit)
            #print( features)
        }

        # Add back in the target variable
        if( ! targetVar %in% features ){
            features <- append( features, targetVar )
        }

        # Increase threshold
        attrThreshold <- attrThreshold + 1
    }

    return( list( tree=bestFit, errors=c(errMeasure) ) )
}
Optimal trees function
This function returns the best forest found while varying the number of trees grown for the given feature set. This is performed by growing N more trees per iteration and keeping the best performing forest.
Function parameters
- modelFormula - The R formula used for the randomForest function.
- trainingData - The training data frame.
- iterations - Number of iterations to perform the search.
- initNumTree - The initial number of trees to grow.
- treeStep - The tree growth increment for each step.
optimalTreeSize.rf <- function( modelFormula, trainingData, iterations, initNumTree, treeStep ){
    bestTree <- c()
    bestErr  <- 100

    for(i in 1:iterations){
        # Train the RF using the selected set of features for N random trees
        fit     <- randomForest( as.formula(modelFormula), data=trainingData, ntree=initNumTree )
        currErr <- sum( data.frame(fit$confusion)["class.error"] )/2

        # Keep the forest with the lowest error seen so far
        if( currErr < bestErr ){
            bestTree <- fit
            bestErr  <- currErr
        }

        initNumTree <- initNumTree + treeStep
    }

    return( bestTree )
}
Algorithm Results
This section demonstrates that the algorithm works by keeping the best model with the lowest error while reducing the feature set. As we can see from the textual and plot output, the final forest has the lowest error rate while reducing the number of features to nine. The result is significantly better than manually working through permutations of features.
> print(bestFittingTree)
$tree

Call:
 randomForest(formula = as.formula(modelFormula), data = trainingData, ntree = initNumTree)
               Type of random forest: classification
                     Number of trees: 230
No. of variables tried at each split: 3

        OOB estimate of  error rate: 11.59%
Confusion matrix:
     NO YES class.error
NO  190  31   0.1402715
YES  17 176   0.0880829

$errors
 [1] 0.1222774 0.1141772 0.1238834 0.1216210 0.1193585 0.1264741 0.1245399 0.1238834 0.1290648 0.1271306

> importance( tree )
    MeanDecreaseGini
V2          14.73728
V3          13.69773
V6          20.97015
V8          20.19242
V9          71.86184
V10         12.66245
V11         23.24506
V14         12.78868
V15         13.48829
Summary
This algorithm improves on the previous blog's error rate by 29%, and while testing I have seen this improvement reach 39%, with an error rate approaching 10%. Given its simple approach it is rather effective, although before committing to this algorithm I would want to work with much larger datasets to determine its true ability to select features. Another algorithm I would like to explore is the "Wiggle Algorithm" (a made-up name), where you adjust the feature variables to determine the effect on the prediction output, although this is only applicable to purely numerical datasets.
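Purely as a thought experiment, a first cut of that wiggle idea might look something like the sketch below. Everything here is hypothetical, including the function name and the fixed 10% nudge; it simply perturbs one numeric feature and measures how much the predicted class probabilities move.

# Hypothetical sketch only; not an implementation from this post.
# Nudge a single numeric feature up and down by 'delta' and report the
# average shift in the forest's predicted class probabilities.
wiggleFeature <- function( model, data, feature, delta = 0.1 ){
    up   <- data; up[[feature]]   <- up[[feature]]   * (1 + delta)
    down <- data; down[[feature]] <- down[[feature]] * (1 - delta)

    pUp   <- predict( model, newdata = up,   type = "prob" )
    pDown <- predict( model, newdata = down, type = "prob" )

    mean( abs( pUp - pDown ) )   # larger value => prediction more sensitive to this feature
}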
On an ending note, I can imagine there are a number of R inefficiencies, so please feel free to either enhance the code or send me the details.