Monday, 5 May 2014

Random Forest - Credit Approval using R/JPMML

Overview

The last blog post showed how R (offline modelling) and JPMML can be used to predict the number of rings an abalone would have using linear regression. I did say I would cover kNN next, as I am interested in how nearest-neighbour methods can be used for machine learning problems, but I got sidetracked by a random conversation and decided to look at decision trees and Random Forest instead. So if you were expecting kNN, I am sorry; this post instead describes my journey while roaming the random forest.

Initially the journey started with the Iris data set to classify flower species. This worked out rather well, and I felt I needed to push a bit further towards a more realistic use case while gaining a better understanding of the random forest algorithm. This led me to the Credit Approval data set, to see whether decision trees work well with differing types of features in a real-world use case.

Random Forest - R sample

The R script below sets out how to train a random forest, type the column fields and convert cells using the ifelse function. The processing flow repeats that of previous examples: load, prepare and split the data into training and test sets, then, as the final step, save the resulting model as PMML and the test data set as CSV.

require(randomForest)
require(pmml)  # for storing the model in xml

# select random rows function
randomRows = function(df,n){
  return(df[sample(nrow(df),n),])
}

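# Load the data; the file has no header row, so R names the columns V1, V2, ...
# stringsAsFactors=TRUE keeps text columns as factors (not the default since R 4.0)
ca <- read.table("../data/credit_apps.txt",sep=",", header=FALSE, stringsAsFactors=TRUE)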

# Get the number of rows loaded
sizeOfData <- nrow( ca )

# Convert +/- to YES/NO
ca$V16 <- ifelse(ca$V16 == "+", "YES", "NO")

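# Type the numeric columns and the target column.
# as.character() first, so that a column read as a factor (for example
# because it holds non-numeric markers) converts by value, not by level code
ca$V2 <- as.numeric( as.character( ca$V2 ) )
ca$V3 <- as.numeric( ca$V3 )
ca$V8 <- as.numeric( ca$V8 )
ca$V11 <- as.numeric( ca$V11 )
ca$V15 <- as.numeric( ca$V15 )
ca$V16 <- as.factor( ca$V16 )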

# Randomise the data
ca <- randomRows( ca, sizeOfData)

# Now split the dataset into test (60%) and train (40%)
indexes <- sample(1:sizeOfData, size=0.6*sizeOfData)
test <- ca[indexes,]
train <- ca[-indexes,]

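# Train the RF on the selected features with 35 trees; importance=TRUE
# is needed for importance(fit) to report the accuracy-based measures
fit <- randomForest(V16 ~ V6+V8+V9+V10+V11+V15, data=train, ntree=35, importance=TRUE)

# Drop the columns the model does not use from the test set
test <- subset(test, select = -c(V1,V2,V3,V4,V5,V7,V12,V13,V14))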

# Write the model to a PMML file
localfilename <- "../models/creditapp-randomforest-prediction.xml"
saveXML(pmml( fit, model.name = "CreditAppPredictionRForest", 
              app.name = "RR/PMML", dataset = train) , 
              file = localfilename)

# Write test file to csv
write.table( test, file="../data/test_creditapp.csv", 
             sep=",", row.names=FALSE)

Model Review

Let's take a look at how well the forest has been trained. Printing the trained model gives an insight into this: an out-of-bag (OOB) error rate of about 16% isn't so bad given the small training set used.

# view results 
print(fit) 

               Type of random forest: classification
                     Number of trees: 35
No. of variables tried at each split: 2

        OOB estimate of error rate: 16.3%
Confusion matrix:
     NO YES class.error
NO  127  24   0.1589404
YES  21 104   0.1680000
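The figures in this printout follow directly from the confusion matrix. As a minimal sketch (the class and method names are made up for illustration, and this is not how randomForest computes them internally), the arithmetic can be checked in a few lines of Java. It should reproduce the two class.error values and the 16.3% OOB estimate shown above.

```java
import java.util.Locale;

public class ConfusionMatrixCheck {

    // Error rate for one actual class: misclassified rows / all rows of that class.
    static double classError(int correct, int wrong) {
        return (double) wrong / (correct + wrong);
    }

    // Overall error rate: all misclassified rows / all rows.
    // Rows of m are the actual class, columns the predicted class.
    static double overallError(int[][] m) {
        int wrong = m[0][1] + m[1][0];
        int total = m[0][0] + m[0][1] + m[1][0] + m[1][1];
        return (double) wrong / total;
    }

    public static void main(String[] args) {
        // Counts copied from the print(fit) output, in the order NO, YES.
        int[][] m = { {127, 24}, {21, 104} };

        System.out.printf(Locale.ROOT, "NO  class.error: %.7f%n", classError(m[0][0], m[0][1]));
        System.out.printf(Locale.ROOT, "YES class.error: %.7f%n", classError(m[1][1], m[1][0]));
        System.out.printf(Locale.ROOT, "OOB error rate : %.1f%%%n", 100 * overallError(m));
    }
}
```

The four counts add up to 276 rows, which is the 40% training share of the data set.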

A useful tool for understanding which features have the stronger influence on classification is the importance(fit) function. It reports, for each predictor variable, mean values that describe how much that variable contributes to prediction performance. The two key measures are described below.

  1. MeanDecreaseAccuracy: Measures how much classification accuracy drops when the values of this predictor are randomly permuted in the out-of-bag data, i.e. how much the predictor contributes to reducing classification error.
  2. MeanDecreaseGini: Gini is defined as "inequity" when used in describing a society's distribution of income, or as a measure of "node impurity" in tree-based classification. Data in a classification tree is split at each node on the value of one predictor, and a good split leaves child nodes that are purer (lower Gini) than their parent. A higher mean decrease in Gini therefore means that a predictor plays a greater role in partitioning the data into the defined classes.
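To make the idea of "node impurity" more concrete, here is a minimal Java sketch of the Gini impurity of a two-class node and of the decrease achieved by one split. The class name, method names and counts are made up for illustration; randomForest accumulates such decreases over every split on a variable and averages them over the trees, and its internal scaling may differ from this sketch.

```java
public class GiniSketch {

    // Gini impurity of a node holding `no` NO rows and `yes` YES rows: 1 - sum(p_k^2).
    static double gini(int no, int yes) {
        double n = no + yes;
        if (n == 0) {
            return 0.0; // an empty node has no impurity
        }
        double pNo = no / n;
        double pYes = yes / n;
        return 1.0 - pNo * pNo - pYes * pYes;
    }

    // Decrease in impurity when a parent node is split into a left and a right child,
    // weighting each child by its share of the parent's rows.
    static double giniDecrease(int leftNo, int leftYes, int rightNo, int rightYes) {
        int left = leftNo + leftYes;
        int right = rightNo + rightYes;
        double n = left + right;
        double parent = gini(leftNo + rightNo, leftYes + rightYes);
        return parent
                - (left / n) * gini(leftNo, leftYes)
                - (right / n) * gini(rightNo, rightYes);
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only; not taken from the credit data.
        System.out.println("evenly mixed node : " + gini(50, 50));
        System.out.println("pure node         : " + gini(100, 0));
        System.out.println("useful split      : " + giniDecrease(45, 5, 5, 45));
        System.out.println("uninformative split: " + giniDecrease(25, 25, 25, 25));
    }
}
```

A split that separates the classes well gives a large decrease, while a split that leaves both children as mixed as the parent gives none, which is why predictors with a high MeanDecreaseGini are the ones doing most of the partitioning.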

# importance of each predictor
importance(fit)
           NO        YES MeanDecreaseAccuracy MeanDecreaseGini
V6  3.0935653  0.4156078            2.1715664         23.67625
V8  0.5473186  1.3803261            1.1956919         14.57790
V9  8.8322674 10.7936519           13.3683007         39.24454
V10 2.4112358  1.3696413            2.1851807          6.49598
V11 3.6485718  0.1613717            2.7293181         16.47854
V15 2.0332454 -3.6738851           -0.9700909         11.24369 

# Variable importance plot
varImpPlot(fit)

The same importance values can also be viewed as a chart using the varImpPlot routine.


JPMML - Java Processing (Streaming)

The Java code to perform the actual prediction is straightforward thanks to the JPMML API. The code block below performs the core calls for prediction.

// Get the list of input fields the model needs in order to predict.
List<FieldName> requiredModelFeatures = evaluator.getActiveFields();

// Build the feature set
features = JPMMLUtils.buildFeatureSet( 
             evaluator, 
             requiredModelFeatures,    
             rfeatures                // CSV features
            );

// Execute the prediction
Map<FieldName, ?> rs = evaluator.evaluate( features );

// Get the set of prediction responses
Double yesProb = (Double)rs.get(new FieldName("Probability_YES"));
Double noProb = (Double)rs.get(new FieldName("Probability_NO"));
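The post does not show how the two probabilities are turned into the result lines printed in the next section. One plausible way, sketched below in plain Java with no JPMML dependency, is to pick the label with the higher probability and compare it with the expected label from the CSV. The class name, method names and the tie-breaking rule are assumptions made for this sketch.

```java
import java.util.Locale;

public class PredictionLine {

    // Pick the label with the higher probability.
    // Ties go to NO, an arbitrary choice made for this sketch.
    static String predictedLabel(double yesProb, double noProb) {
        return yesProb > noProb ? "YES" : "NO";
    }

    // Format one result line in the same shape as the output in the Prediction Results section.
    static String resultLine(double yesProb, double noProb, String expected) {
        String status = predictedLabel(yesProb, noProb).equals(expected) ? "CORRECT" : "INCORRECT";
        return String.format(Locale.ROOT,
                "[%s] Prediction: YES(%f) NO(%f) : Expected [%s]",
                status, yesProb, noProb, expected);
    }

    public static void main(String[] args) {
        // Probabilities expressed as fractions of 35, the number of trees in the forest.
        System.out.println(resultLine(3.0 / 35, 32.0 / 35, "NO"));
        System.out.println(resultLine(1.0 / 35, 34.0 / 35, "YES"));
    }
}
```

The probabilities in the results are consistent with vote fractions across the 35 trees: 0.085714 is 3/35 and 0.028571 is 1/35.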

Prediction Results

The final section presents the results using the held-back test set. The resulting prediction accuracy, 87% correct (362 of 414), is OK given the time spent putting this example together. Notice how little Java code is required to perform the prediction process. Again, credit goes to the JPMML framework.

[CORRECT] Prediction: YES(0.085714) NO(0.914286) : Expected [NO]
[CORRECT] Prediction: YES(0.000000) NO(1.000000) : Expected [NO]
[INCORRECT] Prediction: YES(0.028571) NO(0.971429) : Expected [YES]
[CORRECT] Prediction: YES(1.000000) NO(0.000000) : Expected [YES]
[CORRECT] Prediction: YES(0.028571) NO(0.971429) : Expected [NO]
[CORRECT] Prediction: YES(0.114286) NO(0.885714) : Expected [NO]
[CORRECT] Prediction: YES(0.000000) NO(1.000000) : Expected [NO]
[CORRECT] Prediction: YES(0.971429) NO(0.028571) : Expected [YES]
[CORRECT] Prediction: YES(0.971429) NO(0.028571) : Expected [YES]
[CORRECT] Prediction: YES(0.028571) NO(0.971429) : Expected [NO]
[CORRECT] Prediction: YES(0.342857) NO(0.657143) : Expected [NO]
[CORRECT] Prediction: YES(0.771429) NO(0.228571) : Expected [YES]
[CORRECT] Prediction: YES(1.000000) NO(0.000000) : Expected [YES]
[CORRECT] Prediction: YES(0.828571) NO(0.171429) : Expected [YES]

Predicted 414 items in 595ms. Correct[362] Incorrect [52]
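The summary line carries enough information to work out the headline figures. The short Java sketch below (class and method names are made up for illustration) derives the accuracy, the test error and the average time per prediction from the counts above.

```java
import java.util.Locale;

public class AccuracySummary {

    // Fraction of predictions that matched the expected label.
    static double accuracy(int correct, int incorrect) {
        return (double) correct / (correct + incorrect);
    }

    // Average wall-clock time per predicted item, in milliseconds.
    static double millisPerItem(int items, long elapsedMillis) {
        return (double) elapsedMillis / items;
    }

    public static void main(String[] args) {
        // Counts copied from the summary line: Correct[362] Incorrect [52], 414 items in 595ms.
        int correct = 362;
        int incorrect = 52;
        long elapsedMillis = 595;

        double acc = accuracy(correct, incorrect);
        System.out.printf(Locale.ROOT, "Accuracy   : %.1f%%%n", 100 * acc);
        System.out.printf(Locale.ROOT, "Test error : %.1f%%%n", 100 * (1 - acc));
        System.out.printf(Locale.ROOT, "Per item   : %.2f ms%n",
                millisPerItem(correct + incorrect, elapsedMillis));
    }
}
```

The 414 test rows are the 60% share of the data set that was held back, and the test error comes out a little lower than the 16.3% OOB estimate reported by R.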

Summary

Overall I am very impressed by how little one needs to do to get a reasonable model. The results justify additional effort to gain a deeper understanding and to reduce the total misclassification error. Further sessions will look at feature selection, the number of trees to create and how to automate the process. Stay tuned for my next post, hopefully with a conclusion to this algorithm. If you require any further information on how I did this, please email me directly.

Technologies

  1. JPMML
  2. RStudio
  3. Random Forest R package
  4. Good explanation of Random Forest
  5. Java 1.7