As part of my efforts to deepen my understanding of machine learning algorithms, I am putting together a set of blog posts experimenting with algorithm usage. My chosen languages for implementing, training and testing algorithms are R, Java and Python. I am very much a novice in R but very much the opposite in Java, so any feedback is most welcome.
Models will be created in R and exported using the Predictive Model Markup Language (PMML). This enables cross-language support for ML algorithms while providing a platform for deep offline learning with Big Data solutions; SAS and Greenplum DB are good example platforms to investigate.
The first ML algorithm I shall start with is the most basic: linear regression. In this project I shall be using the famous abalone dataset to predict the number of rings an abalone has based on a set of chosen features. To achieve this I shall use R to train and validate the model, export the model as PMML, and use a Java JPMML example to read the resulting offline model and predict the number of abalone rings for the held-out test set. So let's get started.
Data Preparation stage

First load the data, add the column names, select the required features and scale only those features. In this instance the data does not need to be cleaned.

require(pmml)     # Exporting the model as XML
require(ggplot2)  # Visualisation

# Load the data
abalone <- read.table("../data/abalone.csv", sep=",", header=TRUE)
abaloneNames <- c("sex","length","diameter","height","whole_weight",
                  "shucked_weight","viscera_weight","shell_weight","rings")
colnames(abalone) <- abaloneNames

# Remove the sex column (R is case sensitive, so use the lower-case name set above)
abalone$sex <- NULL

# Select the feature space
features <- c("whole_weight", "diameter", "rings", "length", "height")
abalone <- abalone[ features ]

# Standardise the features (centre and scale), since lm works better when all
# features are equally scaled. scale() returns a matrix, so convert back to a
# data frame for lm and qplot.
abalone <- as.data.frame( scale( abalone ) )

# Now split the dataset into 80% train and 20% test
indexes <- sample(1:nrow(abalone), size=0.2*nrow(abalone))
test <- abalone[indexes,]
train <- abalone[-indexes,]

Training stage

Now start the training process. In this instance I am only going to use a single feature as an indicator of the number of rings. The resulting graph gives a sense of how the prediction performs by plotting the training data against the fitted regression line.
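To make the preparation step concrete, here is a minimal Python sketch (Python being one of my chosen languages) of the same two operations: standardising a feature the way R's scale() does, and sampling a 20% test split. The small diameters list and the fixed seed are purely illustrative, not real abalone data.

```python
import random

def standardize(column):
    """Centre to mean 0 and scale to unit variance, like R's scale()."""
    n = len(column)
    mean = sum(column) / n
    # R's scale() divides by the sample standard deviation (n - 1 denominator).
    sd = (sum((x - mean) ** 2 for x in column) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in column]

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Sample row indices for the test set; the remainder form the train set."""
    rng = random.Random(seed)
    indexes = set(rng.sample(range(len(rows)), int(test_fraction * len(rows))))
    test = [rows[i] for i in sorted(indexes)]
    train = [rows[i] for i in range(len(rows)) if i not in indexes]
    return train, test

# Illustrative stand-in for the scaled diameter column.
diameters = [0.35, 0.44, 0.50, 0.38, 0.46, 0.41, 0.52, 0.39, 0.48, 0.43]
scaled = standardize(diameters)
train, test = train_test_split(scaled)
print(len(train), len(test))  # → 8 2
```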
# Now train using only a single feature
ringsmodel <- lm(formula = rings ~ diameter, data = train)

# Make sure the p-value is less than 5%
summary( ringsmodel )

Call:
lm(formula = rings ~ diameter, data = train)

Residuals:
     Min       1Q   Median       3Q      Max
-0.25960 -0.06873 -0.01987  0.05219  0.40065

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.284061   0.005876  218.52   <2e-16 ***
diameter    0.761835   0.013837   55.06   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1005 on 3340 degrees of freedom
Multiple R-squared:  0.4758,    Adjusted R-squared:  0.4756
F-statistic:  3031 on 1 and 3340 DF,  p-value: < 2.2e-16

# Get a sense of how the prediction performed
qplot(x = diameter, y = rings, data = train,
      alpha = I(0.2),     # alpha makes the points semi-transparent so you can see stacked points
      geom = "jitter") +  # jitter spreads the points so they don't stack so much
  geom_smooth(method = lm)
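For readers who want to see what lm is doing under the hood, simple linear regression has a closed form: slope = cov(x, y) / var(x) and intercept = ȳ − slope·x̄. The Python sketch below is illustrative only; the noiseless toy data is constructed so the recovered coefficients echo the intercept (≈1.284) and diameter coefficient (≈0.762) reported by summary() above.

```python
def fit_simple_lm(x, y):
    """Ordinary least squares for y ~ x: slope = cov(x, y) / var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# Noiseless toy data: rings = 1.28 + 0.76 * diameter exactly,
# so OLS must recover those coefficients.
diameter = [-1.0, -0.5, 0.0, 0.5, 1.0]
rings = [1.28 + 0.76 * d for d in diameter]
b0, b1 = fit_simple_lm(diameter, rings)
print(round(b0, 2), round(b1, 2))  # → 1.28 0.76
```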
At this point we could iterate on the training process by adding further features in the hope of improving the fit. However, given the fit is good enough for this example, I will stop here. As a side point, a coefficient p-value below 5% indicates the feature is a statistically significant predictor, which is a reasonable bar for this example.
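What "good enough" means here is easiest to see through R-squared, the quantity summary() reported as 0.4758: the fraction of the variance in rings that the model explains, R² = 1 − SSE/SST. A small Python check on toy data (not the abalone set) shows the calculation:

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SSE/SST: fraction of variance explained by the model."""
    mean_y = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual error
    sst = sum((a - mean_y) ** 2 for a in actual)                # total variance
    return 1 - sse / sst

# Illustrative actual vs. predicted values.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(actual, predicted), 2))  # → 0.98
```

A perfect fit gives R² = 1, while predicting the mean everywhere gives R² = 0; the abalone model's 0.4758 sits in between, explaining roughly half the variance with a single feature.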
Test the prediction against actual outcomes stage

Now we are ready to test how our model performs using the test data we put aside earlier. To help the review, I have joined the predicted data with the actual data, along with the error.

# Use the test data against the predictive model.
p <- predict.lm( ringsmodel, test, se.fit=TRUE )
pred.w.plim <- predict( ringsmodel, test, interval="prediction")
pred.w.clim <- predict( ringsmodel, test, interval="confidence")
matplot(test$diameter, cbind(pred.w.clim, pred.w.plim[,-1]),
        lty=c(1,2,2,3,3), type="l", ylab="predicted y")

# Join the actual rings with the predicted rings
prediction <- data.frame(actual = test$rings,
                         predicted = p$fit,
                         error = test$rings - p$fit)

As you can see, the result is reasonable enough to suggest the model has not overfitted the data and is generalising reasonably well.
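The actual/predicted/error join in the R snippet above can be sketched in Python as follows; the coefficients and the three test rows are illustrative stand-ins, not values from the real held-out set.

```python
def predict(intercept, slope, xs):
    """Apply the fitted line to a list of feature values."""
    return [intercept + slope * x for x in xs]

def join_with_actuals(actual, predicted):
    """Mirror the R data.frame(actual, predicted, error) join."""
    return [{"actual": a, "predicted": p, "error": a - p}
            for a, p in zip(actual, predicted)]

intercept, slope = 1.284, 0.762   # rounded coefficients from summary() above
test_diameter = [-0.5, 0.0, 0.5]  # illustrative scaled test rows
test_rings = [0.95, 1.30, 1.60]

rows = join_with_actuals(test_rings, predict(intercept, slope, test_diameter))
for r in rows:
    print(f"actual={r['actual']:.2f} predicted={r['predicted']:.3f} error={r['error']:+.3f}")
```

Small errors scattered on both sides of zero, as here, are what a well-generalising fit should produce; a one-sided or growing error column would point to bias or overfitting.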
Export the final model stage

The final part of this process is to export the model to PMML for a consuming application to predict with. The next blog will read the XML file using JPMML and perform the same tests used above.
# Export the resulting model as a PMML file.
localfilename <- "../models/abalone-rings-lm-prediction.xml"
saveXML(pmml( ringsmodel,
              model.name = "AbaloneRingsPredictionLM",
              app.name = "RR/PMML" ),
        file = localfilename)

# Save the test data as CSV for later use
write.csv(test, file = "../data/test_abalone.csv")