Wednesday 4 June 2014

Automated feature selection using random forest

Overview

Continuing on from the previous blog on random forest, I decided to investigate how to perform automated feature selection using feature importance measurements. This kind of process becomes important when the feature set is very large and the speed of the machine learning algorithm is a function of the number of features. Dimensionality reduction (feature set reduction) addresses this problem with algorithms such as principal component analysis. At this point in time I chose to keep my problem simple, to expand my knowledge of the random forest algorithm while improving my R skills.

The implementation I present here is far from perfect or even production level; for example, notice how I have omitted all of the function arguments to randomForest, never mind the R inefficiencies. However, the code is reasonable enough to demonstrate the basic idea of automated feature selection and some interesting R snippets for the novice R programmer. Source code is now available on GitHub.

Sequential feature selection algorithm

The algorithm reduces the number of features while searching for the minimum error rate, using a two-step process.

  1. Feature selection is performed by taking each feature's MeanDecreaseGini value and determining whether it passes the incrementing threshold function. If the value is greater than the threshold the feature remains a member of the feature set, otherwise it is removed from further processing.
  2. Using the new feature set, find the optimal number of trees with the lowest error rate.
On start-up all features are used. On each pass of the feature selection loop the threshold is increased by one to help reduce the feature space.
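
As a minimal sketch of the selection step (the names fit and threshold are illustrative here, not taken from the code below), the filter boils down to keeping the predictors whose Gini importance clears the current threshold:

# Sketch: keep a predictor only while its MeanDecreaseGini importance exceeds the threshold
imp <- data.frame( fit$importance )                      # one row per predictor
selected <- rownames( imp )[ imp$MeanDecreaseGini > threshold ]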

The following sections present the code used to perform the learning and feature selection. I have reduced the code to the most interesting components; notice that I have introduced a user library and imported it using the 'source' function.

The program flow is as follows (see the code section below):

  1. Load and prepare the data.
  2. Set the algorithm parameters.
  3. Execute the algorithm.
  4. Display various plots and model measurements.
  5. Export and save model to PMML.
The initial feature set, including the target variable, consists of sixteen variables.

# Load the user lib files
source( 'lib/rforestHelperFunctions.R')

# Load the data, partition to train/test sets etc.
creditapps <- read.table("../data/creditcard_applications.txt",sep=",", header=FALSE)

etc.......

# Algorithm parameters
formula <- "V16 ~." 
attrThreshold <- 4
iterations <- 10
initTreeCount <- 55
treeCountStep <- 25

# Now call the algorithm
bestFittingTree <- sqBestFeatureModel.rf( formula,  trainingSet, "V16", iterations, initTreeCount, treeCountStep, attrThreshold )

tree <- bestFittingTree$tree
errors <- bestFittingTree$errors

# Output error plot
plot( errors)

# importance of each predictor
print(bestFittingTree )
importance( tree )
varImpPlot( tree)
plot(predict( object=tree, newdata=testingSet), testingTarget)
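
Step 5 of the flow, exporting the best forest to PMML, is not shown in the snippet above. A minimal sketch, reusing the saveXML/pmml pattern from the previous credit approval post (the model and file names here are illustrative), would be:

# Export the best performing forest to PMML (sketch)
require(pmml)
saveXML( pmml( tree, model.name = "BestFeatureRForest", app.name = "RR/PMML" ),
         file = "../models/creditapp-best-rforest.xml" )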


Algorithm implementation

The algorithm is split into two sections: selecting the best features and finding the best number of trees for the given features. Below is a description of the required parameters.

Function parameters

  1. modelFormula - The R formula used for the randomForest function.
  2. data - The training data frame.
  3. targetVar - The target variable name.
  4. maxIterations - Number of iterations to perform the search.
  5. numStartTrees - The initial number of trees to grow.
  6. treeStepCount - The tree growth increment for each step.
  7. attrThreshold - The Gini importance threshold; features scoring below it are removed.
sqBestFeatureModel.rf <- function(modelFormula, data, targetVar, maxIterations, numStartTrees, treeStepCount, attrThreshold ){
  
  # Start with every column of the training data, including the target variable
  features <- colnames( data )
  errMeasure <- c()
  bestFit <- NULL
  bestFeatureErr <- Inf
  
  # Iterate, shrinking the feature set on each pass
  for (n in 1:maxIterations){
    
    X <- data[ features ]
    
    # Find optimal number of trees
    bestTree <- optimalTreeSize.rf( modelFormula, X, maxIterations, numStartTrees, treeStepCount)
    
    # Get the mean class error of this pass and track it
    featureErr <- sum( data.frame(bestTree$confusion)["class.error"] )/2
    errMeasure <- c( errMeasure, featureErr )
    
    # Save the best performing forest and feature set
    if( featureErr < bestFeatureErr ){
      bestFit <- bestTree  
      bestFeatures <- features
      bestFeatureErr <- featureErr
      
      # Keep only the predictors whose Gini importance exceeds the current threshold
      imp <- data.frame( bestFit$importance )
      features <- rownames( imp )[ imp$MeanDecreaseGini > attrThreshold ]
      #print(bestFit)
      #print( features)
    }
    
    # Add back in the target variable
    if( ! targetVar %in% features){
      features <- append( features, targetVar)
    } 
    
    # Increase threshold  
    attrThreshold <- attrThreshold + 1
  }
  
  return( list(tree=bestFit, errors=c(errMeasure)))
}


Optimal trees function

This function returns the forest with the optimal number of trees for the given feature set. It does this by growing an increasing number of trees on each iteration while keeping the best-performing forest.

Function parameters

  1. modelFormula - The R formula used for the randomForest function.
  2. trainingData - The training data frame.
  3. iterations - Number of iterations to perform the search.
  4. initNumTree - The initial number of trees to grow.
  5. treeStep - The tree growth increment for each step.
optimalTreeSize.rf <- function( modelFormula, trainingData, iterations, initNumTree, treeStep ){
  bestTree <- NULL
  bestErr <- Inf
  for(i in 1:iterations){
    # Train the RF using the selected set of features and the current tree count
    fit <- randomForest( as.formula(modelFormula), data=trainingData, ntree=initNumTree)
    currErr <- sum( data.frame(fit$confusion)["class.error"] )/2
    # Keep the forest with the lowest mean class error seen so far
    if( currErr < bestErr ){
      bestTree <- fit
      bestErr <- currErr
    }
    initNumTree <- initNumTree + treeStep
  }
  return(bestTree)
}


Algorithm Results

This section shows the algorithm working: it keeps the best model with the lowest error while reducing the feature set. As we can see from the textual and plot output, the final forest has the lowest error rate while reducing the number of features to nine. The result was reached with far less effort than manually working through permutations of features.

> print(bestFittingTree)
$tree

Call:
 randomForest(formula = as.formula(modelFormula), data = trainingData, ntree = initNumTree) 
               Type of random forest: classification
                     Number of trees: 230
No. of variables tried at each split: 3

        OOB estimate of  error rate: 11.59%
Confusion matrix:
     NO YES class.error
NO  190  31   0.1402715
YES  17 176   0.0880829

$errors
 [1] 0.1222774 0.1141772 0.1238834 0.1216210 0.1193585 0.1264741 0.1245399 0.1238834 0.1290648 0.1271306

> importance( tree )
    MeanDecreaseGini
V2          14.73728
V3          13.69773
V6          20.97015
V8          20.19242
V9          71.86184
V10         12.66245
V11         23.24506
V14         12.78868
V15         13.48829


Summary

This algorithm improves on the previous blog's error rate by roughly 29% (from a 16.3% OOB error down to 11.59%), and while testing I saw this improvement reach 39%, with an error rate approaching 10%. Given its simple approach it is rather effective, although before committing to this algorithm I would want to work with much larger datasets to determine its true ability to select features. Another algorithm I would like to explore is the Wiggle Algorithm (a made-up name), where you adjust the feature variables to determine the effect on the prediction output, although this is only applicable to purely numerical datasets.
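
To make the wiggle idea concrete, here is a rough sketch of my own (not tested code from this post): nudge one numeric feature, re-run the prediction and see how far the class probabilities move. It assumes V2 has been typed as numeric, as in the previous post.

# Sketch of the "wiggle" idea: perturb one numeric feature and compare class probabilities
wiggled <- testingSet
wiggled$V2 <- wiggled$V2 * 1.05                               # nudge the feature by 5%
before <- predict( tree, newdata = testingSet, type = "prob" )
after  <- predict( tree, newdata = wiggled, type = "prob" )
summary( after[, "YES"] - before[, "YES"] )                   # shift in the YES probability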

On an ending note, I can imagine there are a number of R inefficiencies, so please feel free to either enhance the code or send me the details.

Technologies

  1. Source code now available on GitHub
  2. RStudio
  3. Random Forest R package
  4. Good explanation of Random Forest

Monday 5 May 2014

Random Forest - Credit Approval using R/JPMML

Overview

The last blog post showed how R (offline modelling) and JPMML can be used to predict the number of rings an abalone would have using linear regression. Now, I did say I would cover kNN next, as I am interested in how it can be used for machine learning problems, but I got side-tracked by a random conversation and thought I would have a look at decision trees, Random Forest, instead. So if you were expecting kNN I am sorry; this blog will instead describe my journey while roaming the random forest.

Initially the journey started with the Iris data set to classify flower species. This worked out rather well, and I felt I needed to push a bit further to gain a more realistic use case while gaining a better understanding of the random forest algorithm. This led me to look at the Credit Approval data set, to determine whether decision trees work well for differing feature types and a real-world use case.

Random Forest - R sample

The R script below sets out how to train a random forest, set the column types and convert cell values using the ifelse statement. The processing flow repeats the previous examples: load, prepare and split the data into training and test sets, then save the resulting model as PMML and the test data set as CSV.

require(randomForest)
require(pmml)  # for storing the model in xml

# select random rows function
randomRows = function(df,n){
  return(df[sample(nrow(df),n),])
}

# Load the data and keep the column headers
ca <- read.table("../data/credit_apps.txt",sep=",", header=FALSE)

# Get the number of rows loaded
sizeOfData <- nrow( ca )

# Convert +/- to YES/NO
ca$V16 <- ifelse(ca$V16 == "+", "YES", "NO")

# Type numeric columns
ca$V2 <- as.numeric( ca$V2 )
ca$V3 <- as.numeric( ca$V3 )
ca$V8 <- as.numeric( ca$V8 )
ca$V11 <- as.numeric( ca$V11 )
ca$V15 <- as.numeric( ca$V15 )
ca$V16 <- as.factor( ca$V16 )

# Randomise the data
ca <- randomRows( ca, sizeOfData)

# Now split the dataset in to test 60%  and train 40%
indexes <- sample(1:sizeOfData, size=0.6*sizeOfData)
test <- ca[indexes,]
train <- ca[-indexes,]

# Train the RF using selected features for 35 internal trees.
# importance=TRUE is needed to produce the MeanDecreaseAccuracy figures shown later.
fit <- randomForest(V16 ~ V6+V8+V9+V10+V11+V15, data=train, ntree=35, importance=TRUE)
test <- subset(test, select = -c(V1,V2,V3,V4,V5,V7,V12,V13,V14))

# Write the model to pmml file
localfilename <- "../models/creditapp-randomforest-prediction.xml"
saveXML(pmml( fit, model.name = "CreditAppPredictionRForest", 
              app.name = "RR/PMML") , 
              file = localfilename)

# Write test file to csv
write.table( test, file="../data/test_creditapp.csv", 
             sep=",", row.names=FALSE)

Model Review

Let's take a look at how well the forest has been trained. An error rate of 16% isn't so bad given the small training set used. By printing out the resulting trained model we can get an insight into how well the model was trained.

# view results 
print(fit) 

               Type of random forest: classification
                     Number of trees: 35
No. of variables tried at each split: 2

        OOB estimate of error rate: 16.3%
Confusion matrix:
     NO YES class.error
NO  127  24   0.1589404
YES  21 104   0.1680000

A useful tool for understanding which features have the strongest influence on the prediction process is the importance(fit) function. It reports mean measures of each predictor variable's contribution to prediction performance. Below is a description of the key values calculated.

  1. MeanDecreaseAccuracy: Measures how much the inclusion of this predictor in the model reduces classification error (how much accuracy is lost when its values are permuted).
  2. MeanDecreaseGini: Gini is defined as "inequity" when used to describe a society's distribution of income, or as a measure of "node impurity" in tree-based classification. A high MeanDecreaseGini (i.e. a larger total decrease in node impurity from splits on that variable) means the predictor plays a greater role in partitioning the data into the defined classes. It is a hard one to describe without noting that data in classification trees are split at individual nodes based on the values of predictors; I am not so clear on how this translates into better predictive performance.
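
As a quick illustration of how these values are used in practice, the predictors can be ranked by their Gini importance in one line (a small sketch against the fitted model above):

# Rank the predictors by Gini importance, highest first
imp <- importance( fit )
imp[ order( imp[, "MeanDecreaseGini"], decreasing = TRUE ), ]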

# importance of each predictor
> importance(fit)
           NO        YES MeanDecreaseAccuracy MeanDecreaseGini
V6  3.0935653  0.4156078            2.1715664         23.67625
V8  0.5473186  1.3803261            1.1956919         14.57790
V9  8.8322674 10.7936519           13.3683007         39.24454
V10 2.4112358  1.3696413            2.1851807          6.49598
V11 3.6485718  0.1613717            2.7293181         16.47854
V15 2.0332454 -3.6738851           -0.9700909         11.24369 

# Variable importance plot
varImpPlot(fit)

We can also see this on a plot using the varImpPlot routine called above, to gain a visual sense of the importance values.


JPMML - Java Processing (Streaming)

The Java code to perform the actual prediction is straightforward thanks to the JPMML API. The below code block performs the core calls for prediction.

// Get the list of required feature set model needs to predict.
List<FieldName> requiredModelFeatures = evaluator.getActiveFields();

// Build the feature set
features = JPMMLUtils.buildFeatureSet( 
             evaluator, 
             requiredModelFeatures,    
             rfeatures                // CSV features
            );

// Execute the prediction
Map<FieldName, ?> rs = evaluator.evaluate( features );

// Get the set of prediction responses
Double yesProb = (Double)rs.get(new FieldName("Probability_YES"));
Double noProb = (Double)rs.get(new FieldName("Probability_NO"));

Prediction Results

The final section presents the results using the held-back test set. The resulting prediction accuracy, 362 of 414 correct (roughly 87%), is OK given the time spent putting this example together. Notice how little Java code is required to perform the prediction process. Again, credit to the JPMML framework.

[CORRECT] Prediction: YES(0.085714) NO(0.914286) : Expected [NO]
[CORRECT] Prediction: YES(0.000000) NO(1.000000) : Expected [NO]
[INCORRECT] Prediction: YES(0.028571) NO(0.971429) : Expected [YES]
[CORRECT] Prediction: YES(1.000000) NO(0.000000) : Expected [YES]
[CORRECT] Prediction: YES(0.028571) NO(0.971429) : Expected [NO]
[CORRECT] Prediction: YES(0.114286) NO(0.885714) : Expected [NO]
[CORRECT] Prediction: YES(0.000000) NO(1.000000) : Expected [NO]
[CORRECT] Prediction: YES(0.971429) NO(0.028571) : Expected [YES]
[CORRECT] Prediction: YES(0.971429) NO(0.028571) : Expected [YES]
[CORRECT] Prediction: YES(0.028571) NO(0.971429) : Expected [NO]
[CORRECT] Prediction: YES(0.342857) NO(0.657143) : Expected [NO]
[CORRECT] Prediction: YES(0.771429) NO(0.228571) : Expected [YES]
[CORRECT] Prediction: YES(1.000000) NO(0.000000) : Expected [YES]
[CORRECT] Prediction: YES(0.828571) NO(0.171429) : Expected [YES]

Predicted 414 items in 595ms. Correct[362] Incorrect [52]

Summary

Overall I am very impressed by how little one needs to do to get a reasonable model, which would justify additional effort to gain a deeper understanding and to reduce the total misclassification error. Further sessions will look at feature selection, the number of trees to create and how to automate the process. Stay tuned for my next post, hopefully with a conclusion to this algorithm. If you require any further information on how I did this, please email me directly.

Technologies

  1. JPMML
  2. RStudio
  3. Random Forest R package
  4. Good explanation of Random Forest
  5. Java 1.7

Friday 4 April 2014

Java (JPMML) Prediction using R PMML model (Part 2)

The last post showed how to generate a linear regression PMML model using R. This post will load that model into Java, via the excellent JPMML package, and perform predictions using a streaming test data set.


The method below shows how a PMML file is loaded into memory using the JPMML library. Once the PMML has been loaded we are ready to perform prediction using streaming data. In this example our stream is a test CSV file, but in a production environment it could be a real-time stream or some offline prediction mechanism using captured data, ready for a start-of-day process.
/**
 * Load a PMML model from the file system.
 *
 * @param file
 * @return PMML
 * @throws Exception
 */
public final static PMML loadModel(final String file) throws Exception {

   PMML pmml = null;

   File inputFilePath = new File( file );

   try( InputStream in = new FileInputStream( inputFilePath ) ){

     Source source = ImportFilter.apply(new InputSource(in));
     pmml = JAXBUtil.unmarshalPMML(source);

   } catch( Exception e) {
      logger.error( e.toString() );
      throw e;
   }
   return pmml;
}

Now that we have loaded the linear regression model as a PMML object we can start the real work of setting up the prediction processing. The next code section shows how to obtain a predictive evaluator and get the list of input feature fields the evaluator needs.

// Load the file using our simple util function.
PMML pmml = JPMMLUtils.loadModel( pmmlFilePath );

// Now we need a prediction evaluator for the loaded model
PMMLManager mgr = new PMMLManager( pmml );

ModelEvaluator modelEvaluator = (ModelEvaluator) mgr.getModelManager(modelName, ModelEvaluatorFactory.getInstance());
Evaluator evaluator = modelEvaluator;

// Get the list of required feature set model needs to predict.
List<FieldName> requiredModelFeatures = evaluator.getActiveFields();

Below is the main body of the prediction routine. The code performs the following steps:

  1. Parse the passed csv line into tokens and split into required variables.
  2. Get the required feature variable field name.
  3. Prepare the field to value mapping and assign to feature hashmap.
  4. Perform the prediction (evaluate function).
  5. Convert values back to original state and print out result.

 // For each CSV line perform a predict.
 while ((line = br.readLine()) != null) {

     String[] tokens = line.split( cvsSplitBy );
     
     double d =  Double.valueOf( tokens[2] );
     double e = Double.valueOf( tokens[3] );

     FieldName fieldName = requiredModelFeatures.get(0);

     // In this instance I know there is only one feature
     // For a production system this would be performed in a transformation stage and may collect data externally.
     FieldValue value = evaluator.prepare(fieldName, Double.valueOf(d));
     features.put( fieldName, value );

     Map<FieldName, ?> results = evaluator.evaluate( features );

     // Convert back to original ring value so the prediction become meaningful.
     double y = (Double)results.get( evaluator.getTargetField());
     int predictedRings = (int) Math.abs( Math.pow( 10, y));

     int expectedRings = (int) Math.abs( Math.pow( 10, e));

     double diameter =  Math.pow( 10, d);
     System.out.println(String.format("Diameter %f - Expected rings %d : Predicted rings: %d", diameter, expectedRings, predictedRings));
}

The section below presents the predictions performed using the JPMML linear regression evaluator with the offline R model. As you can see they are not perfect, but the algorithm is generalising reasonably well given the minor effort put into building the model.
    Diameter 0.225000 - Expected rings 7 : Predicted rings: 6
    Diameter 0.530000 - Expected rings 11 : Predicted rings: 11
    Diameter 0.500000 - Expected rings 11 : Predicted rings: 11
    Diameter 0.460000 - Expected rings 8 : Predicted rings: 10
    Diameter 0.270000 - Expected rings 7 : Predicted rings: 7
    Diameter 0.510000 - Expected rings 11 : Predicted rings: 11
    Diameter 0.295000 - Expected rings 10 : Predicted rings: 7
    Diameter 0.450000 - Expected rings 9 : Predicted rings: 10
    Diameter 0.350000 - Expected rings 7 : Predicted rings: 8
    Diameter 0.195000 - Expected rings 6 : Predicted rings: 5
    Diameter 0.435000 - Expected rings 18 : Predicted rings: 10
    Diameter 0.410000 - Expected rings 11 : Predicted rings: 9
    Diameter 0.455000 - Expected rings 9 : Predicted rings: 10
    Diameter 0.440000 - Expected rings 11 : Predicted rings: 10
    Diameter 0.270000 - Expected rings 7 : Predicted rings: 7
    Diameter 0.495000 - Expected rings 11 : Predicted rings: 11
    Diameter 0.525000 - Expected rings 11 : Predicted rings: 11
    Diameter 0.205000 - Expected rings 3 : Predicted rings: 5
    Diameter 0.510000 - Expected rings 10 : Predicted rings: 11
    Diameter 0.375000 - Expected rings 8 : Predicted rings: 9
    Diameter 0.135000 - Expected rings 5 : Predicted rings: 4
    Diameter 0.465000 - Expected rings 11 : Predicted rings: 10

This now concludes the R to Java PMML integration sessions. The next algorithm I will be looking at is k-NN, using the same technique. If you require any further information on how I did this, please email me directly.

Saturday 29 March 2014

Linear Regression in R to PMML (Part 1)

As part of my efforts to increase my understanding of machine learning algorithms I am putting together a set of blogs to experiment with algorithm usage. My chosen languages are R, Java and Python to implement, train and test algorithms. I am very much a novice in the area of R but very much the opposite in Java, so any feedback is most welcome.

Models will be created in R and exported using the Predictive Model Markup Language (PMML). This enables cross-language support for ML algorithms while providing a platform to perform deep offline learning using Big Data solutions; SAS and Greenplum DB are good platform examples to investigate.

The first ML algorithm I shall start with is the most basic: linear regression. For this project I shall be using the famous abalone dataset to predict the number of rings an abalone would have based on a set of chosen features. To achieve this I shall be using R to train and validate the model, a PMML model export, and a Java JPMML example that reads the resulting offline model to predict the number of abalone rings from the held-out test set. So let's get started.

Data Preparation stage
First load the data, add the column names, select the required features and move only those features into log space. In this instance the data does not need to be cleaned.
require(pmml)     # Exporting the model in xml
require(ggplot2)  # Visualization

# Load the data
abalone <- read.table("../data/abalone.csv",sep=",", header=TRUE)
abaloneNames <- c("sex","length","diameter","height","whole_weight","shucked_weight","viscera_weight","shell_weight","rings")
colnames(abalone) <- abaloneNames

# Remove the sex column.
abalone$sex <- NULL

# Select the feature space
features <- c("whole_weight", "diameter", "rings", "length", "height")
abalone <- abalone[ features ]

# Move the features into log space since lm works better when all features are on a similar scale.
# (Predictions are converted back to ring counts later with 10^y.)
abalone <- log10( abalone )

# Now split the dataset in to train 80%  and test 20%
indexes <- sample(1:nrow(abalone), size=0.2*nrow(abalone))
test <- abalone[indexes,]
train <- abalone[-indexes,]

Training stage
Now start the training process. In this instance I am only going to use a single feature as an indicator of the number of rings. The resulting graph provides a sense of how the prediction performs by plotting the training data against the fitted regression line.
# Now train only using a single feature
ringsmodel <- lm(formula = rings ~ diameter, data = train)

# Make sure the p-value is less than 5%
summary( ringsmodel)

Call:
lm(formula = rings ~ diameter, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.25960 -0.06873 -0.01987  0.05219  0.40065 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.284061   0.005876  218.52   <2e-16 ***
diameter    0.761835   0.013837   55.06   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.1005 on 3340 degrees of freedom
Multiple R-squared: 0.4758, Adjusted R-squared: 0.4756 
F-statistic:  3031 on 1 and 3340 DF,  p-value: < 2.2e-16 


# Get a sense how the prediction performed
qplot(x = diameter, 
      y = rings,
      data = train,
      alpha = I(0.2), # alpha makes the points semitransparent so you can see stacked points
      geom = "jitter") + # jitter helps spread the points so they don't stack so much
  geom_smooth(method = lm)

At this point we could iterate on the training process by adding additional features, hoping to improve the fit further. However, given the fit is good enough for this example, I will stop here. As a side point, a p-value of less than 5% only indicates the coefficient is statistically significant; how well the model generalises is better judged against the held-out test set.
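
For example, a second feature could be added and the two fits compared. A quick sketch (my choice of whole_weight as the extra feature is purely illustrative):

# Sketch: add a second feature and compare the fits
ringsmodel2 <- lm( rings ~ diameter + whole_weight, data = train )
summary( ringsmodel2 )            # check the coefficients and adjusted R-squared
anova( ringsmodel, ringsmodel2 )  # does the extra feature significantly improve the fit?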

Test the predictions against actual outcomes stage
Now we are ready to test how our model performs using the test data we put aside earlier. To help the review, I have joined the predicted values with the actual values, along with the error.
# Use the test data against the predictive model.
p <- predict.lm( ringsmodel, test, se.fit=TRUE )
pred.w.plim <- predict( ringsmodel, test, interval="prediction")
pred.w.clim <- predict( ringsmodel, test, interval="confidence")
matplot(test$diameter,cbind(pred.w.clim, pred.w.plim[,-1]),
        lty=c(1,2,2,3,3), type="l", ylab="predicted y")


# Join the actual with the predicted rings
prediction <- data.frame(actual = test$rings,  predicted = p$fit, error = test$rings - p$fit )

As you can see, the result is reasonable enough to suggest the model has not overfitted the data and is generalising reasonably well.
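
Since the features were moved into log space, the predictions can also be mapped back to actual ring counts with the same 10^y conversion later used on the Java side. A quick sketch of that check:

# Sketch: map the log10-space predictions back to ring counts
predictedRings <- round( 10 ^ p$fit )
actualRings    <- round( 10 ^ test$rings )
head( data.frame( actual = actualRings, predicted = predictedRings ) )
mean( abs( actualRings - predictedRings ) )   # mean absolute error in rings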

Export the final model stage
The final part of this process is to export the model to PMML for a consuming application to predict with. The next blog will read the XML file using JPMML and perform the same tests used above.
# Export the resulting model as PMML file.
localfilename <- "../models/abalone-rings-lm-prediction.xml"
saveXML(pmml( ringsmodel, model.name = "AbaloneRingsPredictionLM", app.name = "RR/PMML") , file = localfilename)

# Save the test data for later use as csv
write.csv(test, file = "../data/test_abalone.csv")