Building a ML Pipeline: Part 4

If nothing else, one thing I have learnt working through Machine Learning projects is that they become much easier when there is a proper structure in place.

This is the fourth article in the series Building a ML Pipeline, continuing on from the articles below:

Part 1 : Building Blocks of a ML Pipeline

Part 2 : Reading the Data to Feature Engineering

Part 3 : Feature Selection

1) Intro

Since this is the fourth article in the Building a ML Pipeline series, I would highly recommend going through the prior articles first; you can find the links above.

In this article I will sail through Feature Transformation and Building a ML Pipeline.

Although Feature Transformation takes up relatively little space in this article, it is in fact quite an in-depth topic within Machine Learning.

As far as the ML Pipeline is concerned, it is nothing but a class encapsulating all the procedures we have carried out so far in this analysis, assuming you followed the articles preceding this one. Once encapsulated, each procedure is available to us as a function on the class's object.

FUN FACT: The 100 pleats on a chef's hat represent 100 ways of making an egg, signifying the chef's experience!

The beautiful thing about this ML Pipeline is that we can use it as a skeleton architecture and keep adding models or evaluation metrics to it, simply by updating the function definitions of __init__() and calcMetrics() and updating the Pipeline Configuration accordingly.

This architecture also lets us accommodate an Ensemble Of Ensemble model quite easily, which is one aspect that has led me to use it in numerous projects. All we need to do is add a second layer of models to it.

Note: Introducing the Ensemble Of Ensemble architecture in this post would make it far longer than it already is, and quite frankly it deserves a post of its own. Hence I have limited myself to a single-layer modelling architecture for this article. If this didn't make any sense, forget that I mentioned Ensemble Of Ensemble.

2) Feature Transformation

It is always healthy to have a target variable that is normally distributed; for statistical forecasting models especially, it helps a lot. Although tree-based models are indifferent to whether or not the target column follows a Gaussian distribution, I still find it helpful in some cases. For the reason why, look at this SO answer:

“..there are a small number of extremely large salaries, transforming the salaries might improve predictions because loss functions which minimize square error will not be so strongly influenced by these large values…”

Now, diving into the code: remember that in the last article we had already selected the features, so…

Code for Analysing the Target Column
Target Column Before and After log transformation
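The article's original gist is not reproduced here, but a minimal sketch of the idea might look like this (the file name, the DataFrame df, and the target column "SalePrice" are hypothetical stand-ins):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical names: df holds the prepared data, "SalePrice" is the target.
df = pd.read_csv("data.csv")
target = "SalePrice"

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Raw target distribution (typically right-skewed for prices/salaries)
df[target].hist(bins=50, ax=axes[0])
axes[0].set_title("Target before log transformation")

# log1p = log(1 + x), safe even if the target contains zeros
df["target_log"] = np.log1p(df[target])
df["target_log"].hist(bins=50, ax=axes[1])
axes[1].set_title("Target after log transformation")

plt.show()
```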

Now, to see how much normality we have gained through this transformation, we can look at the Q-Q plot.

Ideally, if the target variable's distribution were the SAME as an ideal Gaussian distribution, the points would fall on a straight line.

Code for Q-Q Plot
Q-Q (Quantile-Quantile) Plot
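As a sketch, reusing the hypothetical df["target_log"] column from above, such a Q-Q plot can be produced with scipy:

```python
import scipy.stats as stats
import matplotlib.pyplot as plt

# Q-Q plot: sample quantiles of the log target vs. theoretical normal quantiles.
# The closer the points hug the reference line, the closer to Gaussian we are.
stats.probplot(df["target_log"], dist="norm", plot=plt)
plt.title("Q-Q plot of the log-transformed target")
plt.show()
```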

It is evident from the graph above that, although not entirely, we have to some extent managed to bring the points onto a straight line.

3) ML Pipeline

As I stated earlier, the ML Pipeline is nothing but all the work we have done so far brought together into a structured class, giving us a basic skeleton architecture that can easily accommodate new models or new metrics at any time in the future.

FLOW
  • We start by simply getting the complete data into the environment.
  • Then we split the data into Training and Test Datasets, as specified in previous posts.
  • The skeleton of the pipeline will look something like this (a sketch follows the caption below):
MLPipeline Function Definitions
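The exact gist is not shown here; based on the function list in the sections that follow, a plausible skeleton (method names are from the article, the signatures are my assumptions) is:

```python
class MLPipeline:
    """Skeleton of the pipeline; each method is fleshed out below."""

    def __init__(self, models_L1, metrics):
        ...  # model definitions + pipeline configuration

    def ingestData(self, train_df, test_df):
        ...  # bring the raw data into the pipeline

    def cleanTransformData(self):
        ...  # fix dtypes, engineer features, log-transform the target

    def modelling(self, model_name=None, finetune_model=False,
                  finetune_params=None, finetunemodel_savepath=None):
        ...  # base modelling or hyper-parameter tuning

    def evaluateModel(self, saved_model_path=None):
        ...  # final predictions on the test set

    def calcMetrics(self, y_true, y_pred):
        ...  # RMSLE, R2, ...
```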

3.1) Reading and Splitting the Data

Reading the data
Shape of the data.
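A sketch of this step, assuming a CSV source (the file name and split ratio are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name; substitute your own dataset.
df = pd.read_csv("data.csv")

# Hold out 20% of the rows as the Test Dataset, as in the previous posts.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(train_df.shape, test_df.shape)
```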

3.2) ML Pipeline

A class which encapsulates the various functions mentioned in the flow above:

Function 1 : __init__()

Initialisation function for the ML Pipeline. This function definition contains a categorical_mapping dictionary for both the train and test sets, which keeps a record of all the columns that were converted to categorical codes. It also contains the various individual model definitions and, correspondingly, their name-to-definition mapping in self.MODELS.
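A minimal sketch of how this could look; the specific estimators listed are my assumptions, not necessarily the ones in the original gist:

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

class MLPipeline:
    def __init__(self, models_L1, metrics):
        # Record of every column converted to categorical codes,
        # kept separately for the train and test sets.
        self.categorical_mapping_train = {}
        self.categorical_mapping_test = {}

        # Individual model definitions and their name -> definition mapping.
        self.MODELS = {
            "LinearRegression": LinearRegression(),
            "RandomForestRegressor": RandomForestRegressor(random_state=42),
            "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
        }

        # Pipeline configuration: which models to run, which metrics to report.
        self.models_L1 = models_L1
        self.metrics = metrics
```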

Function 2 : ingestData()

This function ingests the data into the Modelling Pipeline.
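Continuing the class sketch, this can be as simple as:

```python
    def ingestData(self, train_df, test_df):
        # Work on copies so repeated runs always start from the raw frames.
        self.train_df = train_df.copy()
        self.test_df = test_df.copy()
```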

Function 3: cleanTransformData()

This function is responsible for cleaning and transforming the data, covering everything from fixing the data types to engineering features. After this stage, the data residing in the Modelling Pipeline is ready for modelling.
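A sketch of the two pieces the article calls out, the categorical-code conversion and the target log transform (the target name is hypothetical, and numpy is assumed imported as np):

```python
    def cleanTransformData(self, target="SalePrice"):
        self.target = target
        for df, mapping in ((self.train_df, self.categorical_mapping_train),
                            (self.test_df, self.categorical_mapping_test)):
            # Convert object columns to categorical codes, remembering the mapping.
            for col in df.select_dtypes(include="object").columns:
                df[col] = df[col].astype("category")
                mapping[col] = dict(enumerate(df[col].cat.categories))
                df[col] = df[col].cat.codes
            # Log-transform the target, as in the Feature Transformation section.
            df[target] = np.log1p(df[target])
```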

Function 4: modelling()

This function is responsible for modelling, be it Base Modelling or hyper-parameter tuning. Training is done using k-fold cross-validation on the data prepared in the prior step.
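A sketch of both modes in one method, using scikit-learn's k-fold utilities (the signature and scoring choice are assumptions):

```python
    def modelling(self, model_name=None, finetune_model=False,
                  finetune_params=None, finetunemodel_savepath=None, n_splits=5):
        import joblib
        from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

        X = self.train_df.drop(columns=[self.target])
        y = self.train_df[self.target]
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=42)

        if not finetune_model:
            # Base modelling: k-fold CV RMSE (on the log target) per model.
            return {name: -cross_val_score(
                        self.MODELS[name], X, y, cv=cv,
                        scoring="neg_root_mean_squared_error").mean()
                    for name in self.models_L1}

        # Fine-tune mode: grid-search the chosen model, save the best estimator.
        search = GridSearchCV(self.MODELS[model_name], finetune_params, cv=cv,
                              scoring="neg_root_mean_squared_error")
        search.fit(X, y)
        self.finetuned_model = search.best_estimator_
        if finetunemodel_savepath:
            joblib.dump(self.finetuned_model, finetunemodel_savepath)
        return search.best_params_
```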

Function 5: evaluateModel()

This function is responsible for making the final predictions on the Test Dataset.
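Continuing the sketch, it either reuses the fine-tuned model held in memory or loads a saved one:

```python
    def evaluateModel(self, saved_model_path=None):
        import joblib
        # Use the fine-tuned model in memory, or load one saved earlier.
        model = (joblib.load(saved_model_path) if saved_model_path
                 else self.finetuned_model)
        X_test = self.test_df.drop(columns=[self.target])
        preds = model.predict(X_test)
        return self.calcMetrics(self.test_df[self.target], preds)
```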

Function 6 : calcMetrics()

This function calculates various metrics for a particular model's predictions. *Note: the RMSLE metric actually uses the formula for RMSE, but on the log-transformed target this is equivalent to RMSLE, since we already took the log of the target variable during cleanTransformData.
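That equivalence makes the sketch short: RMSE computed on log(1 + y) values is exactly the RMSLE on the original scale.

```python
    def calcMetrics(self, y_true, y_pred):
        from sklearn.metrics import mean_squared_error, r2_score
        results = {}
        if "RMSLE" in self.metrics:
            # RMSE on the already log-transformed target *is* the RMSLE
            # on the original scale, hence the label.
            results["RMSLE"] = mean_squared_error(y_true, y_pred) ** 0.5
        if "R2" in self.metrics:
            results["R2"] = r2_score(y_true, y_pred)
        return results
```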

3.3) ML Pipeline — Base Modelling

Once done building the pipeline, we turn to Base Modelling: figuring out which model will work best for the given training data.

Code for Base Modelling: evaluating multiple models at once. models_L1 and metrics act as the configuration settings for the pipeline run you are about to start; they determine which models will be run in this procedure and which metrics each of them will be evaluated against. The rest is fairly intuitive.
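Under the sketch above, such a run might look like this (the model and metric names are assumptions):

```python
# Pipeline configuration: which models to run and which metrics to report.
models_L1 = ["LinearRegression", "RandomForestRegressor",
             "GradientBoostingRegressor"]
metrics = ["RMSLE", "R2"]

pipe = MLPipeline(models_L1, metrics)
pipe.ingestData(train_df, test_df)
pipe.cleanTransformData()
print(pipe.modelling())  # one cross-validated score per model name
```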
Results Generated from Base Modelling

So basically, what this chunk of code figures out is which kind of model (among those available in the pipeline) fits the prepared training data best.

From the results we see that RandomForestRegressor performs best of all the models; note that this is only indicative, since every model was tested with its default hyper-parameters.

So we will pick RandomForestRegressor for the fine-tuning phase.

3.4) ML Pipeline — Fine Tuning Model

Code for fine-tuning the selected model. The function that does the fine-tuning is the same one used for Base Modelling, modelling(); setting finetune_model = True along with the corresponding fine-tuning parameters runs that block in fine-tune mode. To evaluate the model's performance on various combinations of hyper-parameters, just change the finetune_params dictionary and re-run the cell.
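A hypothetical fine-tuning call under the same sketch (the parameter grid and save path are made up for illustration):

```python
# Hypothetical hyper-parameter grid for the model selected above.
finetune_params = {
    "n_estimators": [200, 500],
    "max_depth": [None, 10, 20],
}

best_params = pipe.modelling(model_name="RandomForestRegressor",
                             finetune_model=True,
                             finetune_params=finetune_params,
                             finetunemodel_savepath="models/rf_finetuned.joblib")
print(best_params)
```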
Result of the fine-tuning block. At the end, you can see that once fine-tuning of the model is complete, the trained model is saved to the file path specified in finetunemodel_savepath.

3.5) ML Pipeline — Predicting on Test Set

Code for evaluating the fine-tuned model on the Test Dataset. You can either use the existing fine-tuned model residing in the ML Pipeline instance, or load a saved model from disk and make the predictions.
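With the sketch above, both routes look like this (the save path is the hypothetical one from the previous step):

```python
# Route 1: evaluate the fine-tuned model held in the pipeline instance.
print(pipe.evaluateModel())

# Route 2: load a previously saved model from disk and evaluate that instead.
print(pipe.evaluateModel(saved_model_path="models/rf_finetuned.joblib"))
```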
Result of evaluate block.

OK, so the R2 is low and the RMSLE is a bit high on the test set… maybe overfitting? Maybe, and we can check for that; but if you made it this far, you have effectively created a Machine Learning Pipeline that can be augmented with various Machine Learning models and have them all evaluated in one go. Which was in fact the purpose of this entire exercise: to create the ML Pipeline.

It is a bit hard to see the structural advantage of this pipeline until you add another layer of modelling, which effectively converts the entire pipeline into an Ensemble Of Ensemble pipeline.

Phew!!! Quite something, eh? I appreciate you staying with me so far. In the next article I go through the final aspect that any machine learning model faces: Model Interpretability.

🤥 I pitched this camp… Yes I did. Please believe me.

Till next time… Au Revoir!🙋🏽‍♂️
