Building a ML Pipeline: Part 4
If nothing else, one thing I have learnt while working through Machine Learning projects is that they become much easier when a proper structure is in place.
This is the fourth article in the Building a ML Pipeline series, in continuation of the articles linked above. I would highly recommend going through the prior articles before reading this one.
In this article I will walk through Feature Transformation and building the ML Pipeline itself.
Although Feature Transformation takes up relatively little space in this article, it is in fact quite a deep topic within Machine Learning.
As far as the ML Pipeline is concerned, it is nothing but a class encapsulating all the procedures we have performed so far in this analysis (assuming you have followed all the preceding articles). Once encapsulated, each procedure is available to us as a method on the class's object.
The beautiful thing about this ML Pipeline is that we can use it as a skeleton architecture and keep adding models or evaluation metrics to it, simply by updating the definition of the calcMetrics() function and updating the Pipeline Configuration accordingly.
This architecture also accommodates an Ensemble Of Ensembles model quite easily, which is one reason I have used it in numerous projects. All we need to do is add a second layer of models on top.
Note: Introducing the Ensemble Of Ensembles architecture in this post would make it far longer than it already is, and quite frankly it deserves a post of its own. Hence I have limited myself to a single-layer modelling architecture for this article. If that didn't make any sense, forget I mentioned Ensemble Of Ensembles.
2) Feature Transformation
It is always healthy to have a normally distributed target variable; it helps a lot, especially for statistical forecasting models. Although tree-based models are indifferent to whether or not the target column follows a Gaussian distribution, I still find the transformation helpful in some cases. For the reason why, look at this SO answer:
“..there are a small number of extremely large salaries, transforming the salaries might improve predictions because loss functions which minimize square error will not be so strongly influenced by these large values…”
Now, diving into the code: remember that in the last article we had already selected the features.
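A minimal sketch of the transformation step, using synthetic stand-in data (the column name and the data itself are illustrative; in the article the transform is applied to the target selected in the previous posts). `np.log1p` compresses a long right tail, which is exactly the situation described in the quoted SO answer:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed target (e.g. salaries); the real column comes
# from the feature-selection step in the previous article.
rng = np.random.default_rng(42)
df = pd.DataFrame({"target": rng.lognormal(mean=10, sigma=1, size=1000)})

# log1p = log(1 + x): compresses extreme values and is safe at zero.
df["target_log"] = np.log1p(df["target"])

print("skew before:", round(df["target"].skew(), 2))
print("skew after: ", round(df["target_log"].skew(), 2))
```

The skew of the transformed column should be far closer to zero than that of the raw column.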
Now, to see how much normality we have gained through this transformation, we can look at the Q-Q plot.
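A sketch of how such a Q-Q check can be produced with `scipy.stats.probplot` (the data here is a synthetic stand-in for the log-transformed target; passing `plot=plt.gca()` would draw the actual figure):

```python
import numpy as np
from scipy import stats

# Stand-in for the log-transformed target from the step above.
rng = np.random.default_rng(0)
y_log = np.log1p(rng.lognormal(mean=10, sigma=1, size=1000))

# probplot returns (theoretical quantiles, ordered sample values) plus a
# least-squares fit; r close to 1 means the points lie on a straight line,
# i.e. the distribution is close to Gaussian.
(osm, osr), (slope, intercept, r) = stats.probplot(y_log, dist="norm")
print(f"Q-Q fit correlation r = {r:.3f}")
```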
Ideally, if the target variable's distribution were exactly an ideal Gaussian distribution, the points would fall on a straight line.
It is evident from the graph above that, although not entirely, we have to some extent managed to straighten the line.
3) ML Pipeline
As I stated earlier, the ML Pipeline is nothing but all the work we have done so far brought into a structured class, giving us a basic skeleton architecture that can easily accommodate new models or new metrics at any time in the future.
- We start with simply getting the complete data into the environment.
- Then split the data into Training and Test Dataset as specified in previous posts.
- The skeleton of the pipeline will look something like this
3.1) Reading and Splitting the Data
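A sketch of this step, assuming the same 80/20 split convention as the earlier posts; the DataFrame here is a synthetic stand-in for the prepared dataset (which the article would read from disk, e.g. with `pd.read_csv`):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the prepared data; column names are illustrative.
df = pd.DataFrame({
    "feat_1": range(100),
    "feat_2": range(100, 200),
    "target": range(200, 300),
})

X = df.drop(columns="target")
y = df["target"]

# 80/20 train/test split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```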
3.2) ML Pipeline
A class which encapsulates the various functions mentioned in the flow above:
Function 1: __init__()
Function 2: ingestData()
Function 3: cleanTransformData()
Function 4: modelling()
Function 5: evaluateModel()
Function 6: calcMetrics()
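A hedged sketch of what that skeleton might look like. The method names follow the list above; everything else (the candidate models, the metric choices, the method bodies) is illustrative, not the article's exact implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error, r2_score


class MLPipeline:
    """Skeleton encapsulating the end-to-end flow described above."""

    def __init__(self, models=None):
        # The "Pipeline Configuration": candidate models to compare.
        self.models = models or {
            "linear": LinearRegression(),
            "random_forest": RandomForestRegressor(random_state=42),
        }
        self.results = {}

    def ingestData(self, X, y):
        self.X, self.y = X, y
        return self

    def cleanTransformData(self):
        # Placeholder: cleaning / log-transform steps would live here.
        return self

    def modelling(self):
        for model in self.models.values():
            model.fit(self.X, self.y)
        return self

    def evaluateModel(self, X_test, y_test):
        for name, model in self.models.items():
            self.results[name] = self.calcMetrics(y_test, model.predict(X_test))
        return self.results

    @staticmethod
    def calcMetrics(y_true, y_pred):
        # Extend this one method (and the config) to add new metrics.
        return {
            "r2": r2_score(y_true, y_pred),
            "rmsle": np.sqrt(mean_squared_log_error(y_true, y_pred)),
        }
```

Adding a model is one line in the configuration dict; adding a metric touches only calcMetrics(), which is the structural point being made above.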
3.3) ML Pipeline — Base Modelling
Once the pipeline is built, we move on to Base Modelling: figuring out which model works best for the given training data.
Essentially, this chunk of code tries to figure out which model (among the ones available in the pipeline) best fits the prepared training data.
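The base-modelling step can be sketched as a cross-validated comparison of candidate models on their default hyperparameters. The candidate set and the synthetic data here are assumptions for illustration, not the article's exact code:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared training data.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

# Candidate models, all on default hyperparameters.
candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
}

# Mean 5-fold cross-validated R2 for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
print(scores)
print("best base model:", best)
```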
From the results we see that RandomForestRegressor performs best among all the models. This is only indicative, since all the models were tested with their default hyperparameters.
So we will pick RandomForestRegressor for the fine-tuning phase.
3.4) ML Pipeline — Fine Tuning Model
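Fine-tuning can be sketched with a grid search over the chosen model. The search space below is illustrative (the article's actual grid is not shown here):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training data.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Illustrative (not the article's) search space for RandomForestRegressor.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

For larger grids, `RandomizedSearchCV` with the same interface is usually a cheaper starting point.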
3.5) ML Pipeline — Predicting on Test Set
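A sketch of the final step, producing the two metrics discussed below (R2 and RMSLE) on a held-out test set; the data and model settings are stand-ins:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, shifted positive so the log-based metric is defined.
X, y = make_regression(n_samples=400, n_features=5, noise=5.0, random_state=1)
y = y - y.min() + 1.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# In the article this would be the fine-tuned RandomForestRegressor.
model = RandomForestRegressor(random_state=1).fit(X_train, y_train)
preds = model.predict(X_test)

r2 = r2_score(y_test, preds)
rmsle = np.sqrt(mean_squared_log_error(y_test, preds))
print(f"R2 = {r2:.3f}, RMSLE = {rmsle:.3f}")
```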
OK, so the R2 is low and the RMSLE is a bit high on the test set; maybe overfitting? Maybe, and we can check for that. But if you made it this far, you have effectively created a Machine Learning Pipeline which can be augmented with various Machine Learning models and evaluated in one go, which was in fact the purpose of this entire exercise: to create the ML Pipeline.
It is a bit hard to see the structural advantage of this pipeline until you add another layer of modelling, which effectively converts the entire pipeline into an Ensemble Of Ensembles pipeline.
Phew! Quite something, eh? I appreciate you staying with me this far. In the next article I go through the final aspect that any machine learning model faces: Model Interpretability.
Till next time… Au Revoir!🙋🏽♂️