As part of my efforts to deepen my understanding of machine learning algorithms, I am putting together a set of blog posts experimenting with algorithm usage. My chosen languages for implementing, training and testing algorithms are R, Java and Python. I am very much a novice in R but very much the opposite in Java, so any feedback is most welcome.
Models will be created in R and exported using the Predictive Model Markup Language (PMML). This enables cross-language support for ML algorithms while providing a platform for deep offline learning with Big Data solutions; SAS and Greenplum DB are good example platforms to investigate.
The first ML algorithm I shall start with is the most basic: linear regression. In this project I shall be using the famous abalone dataset to predict the number of rings an abalone has based on a set of chosen features. To achieve this I shall use R to train and validate the model, export the model as PMML, and use a Java JPMML example to read the resulting offline model and predict the number of abalone rings for the held-out test set. So let's get started.
Data Preparation stage

First load the data, add the column names, select the required features and scale only those features. In this instance the data does not need to be cleaned.

require(pmml)     # Exporting the model as XML
require(ggplot2)  # Visualisation

# Load the data
abalone <- read.table("../data/abalone.csv", sep=",", header=TRUE)
abaloneNames <- c("sex","length","diameter","height","whole_weight",
                  "shucked_weight","viscera_weight","shell_weight","rings")
colnames(abalone) <- abaloneNames

# Remove the sex column (R is case sensitive, so use the lower-case name set above)
abalone$sex <- NULL

# Select the feature space
features <- c("whole_weight", "diameter", "rings", "length", "height")
abalone <- abalone[ features ]

# Standardise the features (centre and scale), since lm works better when all
# features are equally scaled. scale() returns a matrix, so convert back to a
# data frame for lm and qplot.
abalone <- as.data.frame( scale( abalone ) )

# Now split the dataset into 80% train and 20% test
indexes <- sample(1:nrow(abalone), size=0.2*nrow(abalone))
test <- abalone[indexes,]
train <- abalone[-indexes,]

Training stage

Now start the training process. In this instance I am only going to use a single feature as an indicator of the number of rings. The resulting graph gives a sense of how the prediction performs by plotting the training data against the fitted regression line.
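To make the preparation step concrete, here is a minimal Python sketch (Python being one of my chosen languages) of the same two operations: standardising a feature the way R's scale() does, and sampling a 20% test split. The small diameters list and the fixed seed are purely illustrative, not real abalone data.

```python
import random

def standardize(column):
    """Centre to mean 0 and scale to unit variance, like R's scale()."""
    n = len(column)
    mean = sum(column) / n
    # R's scale() divides by the sample standard deviation (n - 1 denominator).
    sd = (sum((x - mean) ** 2 for x in column) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in column]

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Sample row indices for the test set; the remainder form the train set."""
    rng = random.Random(seed)
    indexes = set(rng.sample(range(len(rows)), int(test_fraction * len(rows))))
    test = [rows[i] for i in sorted(indexes)]
    train = [rows[i] for i in range(len(rows)) if i not in indexes]
    return train, test

# Illustrative stand-in for the scaled diameter column.
diameters = [0.35, 0.44, 0.50, 0.38, 0.46, 0.41, 0.52, 0.39, 0.48, 0.43]
scaled = standardize(diameters)
train, test = train_test_split(scaled)
print(len(train), len(test))  # → 8 2
```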
# Now train using only a single feature
ringsmodel <- lm(formula = rings ~ diameter, data = train)

# Make sure the p-value is less than 5%
summary( ringsmodel )

Call:
lm(formula = rings ~ diameter, data = train)

Residuals:
     Min       1Q   Median       3Q      Max
-0.25960 -0.06873 -0.01987  0.05219  0.40065

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.284061   0.005876  218.52   <2e-16 ***
diameter    0.761835   0.013837   55.06   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1005 on 3340 degrees of freedom
Multiple R-squared:  0.4758,    Adjusted R-squared:  0.4756
F-statistic:  3031 on 1 and 3340 DF,  p-value: < 2.2e-16

# Get a sense of how the prediction performed
qplot(x = diameter, y = rings, data = train,
      alpha = I(0.2),     # alpha makes the points semi-transparent so you can see stacked points
      geom = "jitter") +  # jitter spreads the points so they don't stack so much
  geom_smooth(method = lm)
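For readers who want to see what lm is doing under the hood, simple linear regression has a closed form: slope = cov(x, y) / var(x) and intercept = ȳ − slope·x̄. The Python sketch below is illustrative only; the noiseless toy data is constructed so the recovered coefficients echo the intercept (≈1.284) and diameter coefficient (≈0.762) reported by summary() above.

```python
def fit_simple_lm(x, y):
    """Ordinary least squares for y ~ x: slope = cov(x, y) / var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# Noiseless toy data: rings = 1.28 + 0.76 * diameter exactly,
# so OLS must recover those coefficients.
diameter = [-1.0, -0.5, 0.0, 0.5, 1.0]
rings = [1.28 + 0.76 * d for d in diameter]
b0, b1 = fit_simple_lm(diameter, rings)
print(round(b0, 2), round(b1, 2))  # → 1.28 0.76
```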
At this point we could iterate on the training process by adding further features in the hope of improving the fit. However, given the fit is good enough for this example, I will stop here. As a side point, a coefficient p-value below 5% indicates the feature is a statistically significant predictor, which is a reasonable bar for this example.
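What "good enough" means here is easiest to see through R-squared, the quantity summary() reported as 0.4758: the fraction of the variance in rings that the model explains, R² = 1 − SSE/SST. A small Python check on toy data (not the abalone set) shows the calculation:

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SSE/SST: fraction of variance explained by the model."""
    mean_y = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual error
    sst = sum((a - mean_y) ** 2 for a in actual)                # total variance
    return 1 - sse / sst

# Illustrative actual vs. predicted values.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(actual, predicted), 2))  # → 0.98
```

A perfect fit gives R² = 1, while predicting the mean everywhere gives R² = 0; the abalone model's 0.4758 sits in between, explaining roughly half the variance with a single feature.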
Test the prediction against actual outcomes stage

Now we are ready to test how our model performs using the test data we put aside earlier. To help the review, I have joined the predicted data with the actual data, along with the error.

# Use the test data against the predictive model.
p <- predict.lm( ringsmodel, test, se.fit=TRUE )
pred.w.plim <- predict( ringsmodel, test, interval="prediction")
pred.w.clim <- predict( ringsmodel, test, interval="confidence")
matplot(test$diameter, cbind(pred.w.clim, pred.w.plim[,-1]),
        lty=c(1,2,2,3,3), type="l", ylab="predicted y")

# Join the actual rings with the predicted rings
prediction <- data.frame(actual = test$rings,
                         predicted = p$fit,
                         error = test$rings - p$fit)

As you can see, the result is reasonable enough to suggest the model has not overfitted the data and is generalising reasonably well.
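The actual/predicted/error join in the R snippet above can be sketched in Python as follows; the coefficients and the three test rows are illustrative stand-ins, not values from the real held-out set.

```python
def predict(intercept, slope, xs):
    """Apply the fitted line to a list of feature values."""
    return [intercept + slope * x for x in xs]

def join_with_actuals(actual, predicted):
    """Mirror the R data.frame(actual, predicted, error) join."""
    return [{"actual": a, "predicted": p, "error": a - p}
            for a, p in zip(actual, predicted)]

intercept, slope = 1.284, 0.762   # rounded coefficients from summary() above
test_diameter = [-0.5, 0.0, 0.5]  # illustrative scaled test rows
test_rings = [0.95, 1.30, 1.60]

rows = join_with_actuals(test_rings, predict(intercept, slope, test_diameter))
for r in rows:
    print(f"actual={r['actual']:.2f} predicted={r['predicted']:.3f} error={r['error']:+.3f}")
```

Small errors scattered on both sides of zero, as here, are what a well-generalising fit should produce; a one-sided or growing error column would point to bias or overfitting.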
Export the final model stage

The final part of this process is to export the model to PMML for a consuming application to predict with. The next blog will read the XML file using JPMML and perform the same tests used above.
# Export the resulting model as a PMML file.
localfilename <- "../models/abalone-rings-lm-prediction.xml"
saveXML(pmml( ringsmodel,
              model.name = "AbaloneRingsPredictionLM",
              app.name = "RR/PMML" ),
        file = localfilename)

# Save the test data as CSV for later use
write.csv(test, file = "../data/test_abalone.csv")