Gradient Boosting is one of the boosting ensemble methods that have been used heavily of late in both regression and classification problems. As the heading suggests, we are going to understand Gradient Boosting in classification. But first, let's have a brief introduction to what ensemble methods are.
Ensemble Methods
Ensemble methods are used in machine learning to create a better and more optimized model by learning from several other models. An ensemble method takes a collection of models together with their results and combines them to get a more optimized result, so it does not have to depend on a single predictive model. One of the ways to apply an ensemble technique to a classification problem is to use a Gradient Boosting Classifier.
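As a quick illustration of that last point, here is a minimal sketch of fitting scikit-learn's GradientBoostingClassifier; the feature rows (Pclass, Age, Fare) and labels below are made-up values for illustration only, not part of any real data set.

# A minimal sketch using scikit-learn's GradientBoostingClassifier.
# The tiny feature matrix X (Pclass, Age, Fare) and labels y are made up for illustration.
from sklearn.ensemble import GradientBoostingClassifier

X = [[3, 22.0, 7.25],
     [1, 38.0, 71.28],
     [3, 26.0, 7.92],
     [1, 35.0, 53.10]]
y = [0, 1, 1, 1]          # 1 = survived, 0 = did not survive

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X, y)
print(model.predict([[2, 30.0, 13.0]]))        # predicted class
print(model.predict_proba([[2, 30.0, 13.0]]))  # predicted probabilities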
Gradient Boosting in classification
In Gradient Boosting we build multiple decision trees one after another: we gather the predictions of each individual tree and combine them with the next decision tree we build. Let's understand the working and intuition of Gradient Boosting with the help of an example.
Here we have a small snippet of the Titanic data set:
First, we have to build a base model. We calculate the log(odds), i.e. the logarithm of the odds of survival: log(odds) = log(number of passengers who survived / number of passengers who did not survive). This log(odds) value is the base model's initial prediction for every individual.
We have rounded off the log(odds) value. Now we convert the log(odds) into a probability using the logistic (sigmoid) function, so that it can be used for classification: probability = e^log(odds) / (1 + e^log(odds)).
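To make the arithmetic concrete, the short sketch below computes the base log(odds) from a made-up Survived column (four survivors and two non-survivors, chosen only so the numbers round to the 0.7 used in this walkthrough) and converts it into a probability with the sigmoid function.

import math

# Made-up Survived labels for six passengers (four survived, two did not),
# chosen only so the numbers round to the 0.7 used in this walkthrough.
survived = [0, 1, 1, 1, 1, 0]

# Base prediction: log(odds) of survival = log(number survived / number not survived)
log_odds = math.log(sum(survived) / (len(survived) - sum(survived)))
print(round(log_odds, 2))     # 0.69, rounded off to 0.7 in the example

# Convert the log(odds) into a probability with the logistic (sigmoid) function
probability = math.exp(log_odds) / (1 + math.exp(log_odds))
print(round(probability, 2))  # 0.67, rounded off to 0.7 in the example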
The probability value calculated above has also been rounded off. This probability is now used to calculate the residuals, or errors, which tell us by how much the predicted probability differs from the actual values given in the training data set. The residual (R1) is calculated using the formula:
R1 = actual(observed) value - predicted probability
Here, the predicted probability is 0.7 for every record, so we get a residual of 0.3 for each passenger who survived and -0.7 for each passenger who did not.

Our main objective is to minimize these residuals, and to do so we will build multiple decision trees to optimize our model. Now we build a tree on the residuals from the independent features such as Pclass, Age, Fare and Gender. For simplicity the decision tree used here is small, but in practice a much larger decision tree is used.
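As a rough sketch of these two steps (calculating the residuals and fitting a small tree to them), using the same made-up labels as before and invented feature values:

from sklearn.tree import DecisionTreeRegressor

# Made-up feature rows (Pclass, Age, Fare) and the same made-up labels as before
X = [[3, 22.0, 7.25], [1, 38.0, 71.28], [3, 26.0, 7.92],
     [1, 35.0, 53.10], [2, 27.0, 10.50], [3, 54.0, 8.05]]
survived = [0, 1, 1, 1, 1, 0]

# Pseudo-residuals: observed value minus the predicted probability (0.7)
residuals = [y - 0.7 for y in survived]
print([round(r, 1) for r in residuals])   # [-0.7, 0.3, 0.3, 0.3, 0.3, -0.7]

# Fit a small regression tree to the residuals (kept tiny here for simplicity)
tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X, residuals)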
According to the split conditions, you can see that the data are classified and separated into leaf nodes. In the above tree some new predictions have been made, and these results now need to be combined with the previous prediction. Therefore, we calculate an output value for each leaf node.
Formula to calculate the output value of a leaf node:

output value = (sum of residuals in the leaf) / (sum of previous probability * (1 - previous probability), taken over the residuals in the leaf)
Using the above formula for each leaf node:

For leaves 1 and 2, which coincidentally contain the same values (in real cases they can of course hold different data):
(0.3+0.3) / ((0.7*0.3)+(0.7*0.3)) = 1.43
For leaf 3: (-0.7 - 0.7) / ((0.7*0.3) + (0.7*0.3)) = -3.33
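A small sketch of this leaf output calculation, with the previous predicted probability assumed to be 0.7 for every record as in the example:

# Output value of a leaf = sum(residuals in leaf) / sum(p * (1 - p)),
# where p is the previous predicted probability (assumed 0.7 for every record here).
def leaf_output(residuals, prev_prob=0.7):
    return sum(residuals) / sum(prev_prob * (1 - prev_prob) for _ in residuals)

print(round(leaf_output([0.3, 0.3]), 2))     # 1.43  (leaves 1 and 2)
print(round(leaf_output([-0.7, -0.7]), 2))   # -3.33 (leaf 3)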
The tree now basically looks like this, with each leaf labelled by its output value.
So, to get the new log(odds) values, we combine the previous log(odds) prediction with the output values derived from the new tree, using the following formula:
log(odds) = previous log(odds) + learningRate * (output of leaf in new tree)
We use the learning rate to scale down the output values. The most commonly used learning rate is 0.1.
So, by substituting the values for the first and second records of the data set shown above, we get:
0.7 + 0.1*(-3.33) = 0.367
0.7 + 0.1*(1.43) = 0.843
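The same update expressed in code, using the rounded values from above:

# New log(odds) = previous log(odds) + learning rate * output value of the leaf
learning_rate = 0.1
previous_log_odds = 0.7

print(round(previous_log_odds + learning_rate * (-3.33), 3))  # 0.367
print(round(previous_log_odds + learning_rate * 1.43, 3))     # 0.843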
After getting the new log(odds) value for every record, the algorithm again calculates new probabilities, and from those another set of pseudo-residuals is computed. This process is repeated in a loop until the residuals become very small or the loop has run m times, where m is the number of decision trees that has been specified or built.
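Putting the steps together, here is a rough from-scratch sketch of the loop just described. The function names, the use of scikit-learn's DecisionTreeRegressor for the residual trees and the default parameter values are illustrative assumptions on my part; in practice you would simply use sklearn.ensemble.GradientBoostingClassifier as shown earlier.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(log_odds):
    return 1.0 / (1.0 + np.exp(-log_odds))

def fit_gb_classifier(X, y, m=100, learning_rate=0.1, max_depth=2):
    """Very simplified gradient boosting training loop for binary classification."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    # Base model: log(odds) of the positive class
    base_log_odds = np.log(y.sum() / (len(y) - y.sum()))
    log_odds = np.full(len(y), base_log_odds)
    trees = []
    for _ in range(m):
        prob = sigmoid(log_odds)
        residuals = y - prob                   # pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                 # the tree is built on the residuals
        # Output value per leaf: sum(residuals) / sum(p * (1 - p))
        leaf_ids = tree.apply(X)
        outputs = {}
        for leaf in np.unique(leaf_ids):
            in_leaf = leaf_ids == leaf
            outputs[leaf] = residuals[in_leaf].sum() / (prob[in_leaf] * (1 - prob[in_leaf])).sum()
        # New log(odds) = previous log(odds) + learning rate * leaf output value
        log_odds = log_odds + learning_rate * np.array([outputs[l] for l in leaf_ids])
        trees.append((tree, outputs))
    return base_log_odds, trees

def predict_proba(X, base_log_odds, trees, learning_rate=0.1):
    X = np.asarray(X, dtype=float)
    log_odds = np.full(len(X), base_log_odds)
    for tree, outputs in trees:
        leaf_ids = tree.apply(X)
        log_odds = log_odds + learning_rate * np.array([outputs[l] for l in leaf_ids])
    return sigmoid(log_odds)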