Gradient Boosting is one of the boosting ensemble methods that have been used heavily of late in both regression and classification problems. As the heading suggests, we are going to understand Gradient Boosting in classification. But first, let's have a brief introduction to what ensemble methods are.
Ensemble Methods
Ensemble methods are used in machine learning to create a better and more optimized model by learning from several other models. An ensemble method takes a collection of models together with their results and combines them to get a more optimized result, so it does not have to depend on a single predictive model. One of the ways to apply an ensemble technique to a classification problem is to use a Gradient Boosting Classifier.
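As a quick illustration of that last point, here is a minimal sketch of fitting scikit-learn's GradientBoostingClassifier; the feature rows (Pclass, Age, Fare) and labels below are made-up values for illustration only, not part of any real data set.

# A minimal sketch using scikit-learn's GradientBoostingClassifier.
# The tiny feature matrix X (Pclass, Age, Fare) and labels y are made up for illustration.
from sklearn.ensemble import GradientBoostingClassifier

X = [[3, 22.0, 7.25],
     [1, 38.0, 71.28],
     [3, 26.0, 7.92],
     [1, 35.0, 53.10]]
y = [0, 1, 1, 1]          # 1 = survived, 0 = did not survive

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X, y)
print(model.predict([[2, 30.0, 13.0]]))        # predicted class
print(model.predict_proba([[2, 30.0, 13.0]]))  # predicted probabilities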
Gradient Boosting in classification
In Gradient Boosting we build multiple decision trees one after another: we gather the predictions of each individual tree and combine them with the next decision tree we build. Let's understand the working and intuition of Gradient Boosting with the help of an example.
Here we have a small snippet of the Titanic data set:
First, we have to build a base model. We calculate the log(odds), i.e. the logarithm of the odds of survival: log(odds) = log(number of passengers who survived / number of passengers who did not survive). This log(odds) value is the base model's initial prediction for every individual.
We have rounded off the log(odds) value. Now we convert the log(odds) into a probability using the logistic (sigmoid) function, so that it can be used for classification: probability = e^log(odds) / (1 + e^log(odds)).
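To make the arithmetic concrete, the short sketch below computes the base log(odds) from a made-up Survived column (four survivors and two non-survivors, chosen only so the numbers round to the 0.7 used in this walkthrough) and converts it into a probability with the sigmoid function.

import math

# Made-up Survived labels for six passengers (four survived, two did not),
# chosen only so the numbers round to the 0.7 used in this walkthrough.
survived = [0, 1, 1, 1, 1, 0]

# Base prediction: log(odds) of survival = log(number survived / number not survived)
log_odds = math.log(sum(survived) / (len(survived) - sum(survived)))
print(round(log_odds, 2))     # 0.69, rounded off to 0.7 in the example

# Convert the log(odds) into a probability with the logistic (sigmoid) function
probability = math.exp(log_odds) / (1 + math.exp(log_odds))
print(round(probability, 2))  # 0.67, rounded off to 0.7 in the example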
The probability value calculated above has also been rounded off. This probability is now used to calculate the residuals, or errors, which tell us by how much the predicted probability differs from the actual values given in the training data set. The residual (R1) is calculated using the formula:
R1 = actual(observed) value - predicted probability
Here, the predicted probability is 0.7 for every record, so we get a residual of 0.3 for each passenger who survived and -0.7 for each passenger who did not.

Our main objective is to minimize these residuals, and to do so we will build multiple decision trees to optimize our model. Now we build a tree on the residuals from the independent features such as Pclass, Age, Fare and Gender. For simplicity the decision tree used here is small, but in practice a much larger decision tree is used.
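As a rough sketch of these two steps (calculating the residuals and fitting a small tree to them), using the same made-up labels as before and invented feature values:

from sklearn.tree import DecisionTreeRegressor

# Made-up feature rows (Pclass, Age, Fare) and the same made-up labels as before
X = [[3, 22.0, 7.25], [1, 38.0, 71.28], [3, 26.0, 7.92],
     [1, 35.0, 53.10], [2, 27.0, 10.50], [3, 54.0, 8.05]]
survived = [0, 1, 1, 1, 1, 0]

# Pseudo-residuals: observed value minus the predicted probability (0.7)
residuals = [y - 0.7 for y in survived]
print([round(r, 1) for r in residuals])   # [-0.7, 0.3, 0.3, 0.3, 0.3, -0.7]

# Fit a small regression tree to the residuals (kept tiny here for simplicity)
tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X, residuals)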
According to the split conditions, you can see that the data are classified and separated into leaf nodes. In the above tree some new predictions have been made, and these results now need to be combined with the previous prediction. Therefore, we calculate an output value for each leaf node.
Formula to calculate the output value of a leaf node:

output value = (sum of residuals in the leaf) / (sum of previous probability * (1 - previous probability), taken over the residuals in the leaf)
Using the above formula for each leaf node:

For leaves 1 and 2, which coincidentally contain the same values (in real cases they can of course hold different data):
(0.3+0.3) / ((0.7*0.3)+(0.7*0.3)) = 1.43
For leaf 3: (-0.7 - 0.7) / ((0.7*0.3) + (0.7*0.3)) = -3.33
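A small sketch of this leaf output calculation, with the previous predicted probability assumed to be 0.7 for every record as in the example:

# Output value of a leaf = sum(residuals in leaf) / sum(p * (1 - p)),
# where p is the previous predicted probability (assumed 0.7 for every record here).
def leaf_output(residuals, prev_prob=0.7):
    return sum(residuals) / sum(prev_prob * (1 - prev_prob) for _ in residuals)

print(round(leaf_output([0.3, 0.3]), 2))     # 1.43  (leaves 1 and 2)
print(round(leaf_output([-0.7, -0.7]), 2))   # -3.33 (leaf 3)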
The tree now basically looks like this, with each leaf labelled by its output value.
So, to get the new log(odds) values, we combine the previous log(odds) prediction with the output values derived from the new tree, using the following formula:
log(odds) = previous log(odds) + learningRate * (output of leaf in new tree)
We use the learning rate to scale down the output values. The most commonly used learning rate is 0.1.
So, by substituting the values for the first and second records of the data set shown above, we get:
0.7 + 0.1*(-3.33) = 0.367
0.7 + 0.1*(1.43) = 0.843
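The same update expressed in code, using the rounded values from above:

# New log(odds) = previous log(odds) + learning rate * output value of the leaf
learning_rate = 0.1
previous_log_odds = 0.7

print(round(previous_log_odds + learning_rate * (-3.33), 3))  # 0.367
print(round(previous_log_odds + learning_rate * 1.43, 3))     # 0.843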
After getting the new log(odds) value for every record, the algorithm again calculates new probabilities, and from those another set of pseudo-residuals is computed. This process is repeated in a loop until the residuals become very small or the loop has run m times, where m is the number of decision trees that has been specified or built.
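Putting the steps together, here is a rough from-scratch sketch of the loop just described. The function names, the use of scikit-learn's DecisionTreeRegressor for the residual trees and the default parameter values are illustrative assumptions on my part; in practice you would simply use sklearn.ensemble.GradientBoostingClassifier as shown earlier.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(log_odds):
    return 1.0 / (1.0 + np.exp(-log_odds))

def fit_gb_classifier(X, y, m=100, learning_rate=0.1, max_depth=2):
    """Very simplified gradient boosting training loop for binary classification."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    # Base model: log(odds) of the positive class
    base_log_odds = np.log(y.sum() / (len(y) - y.sum()))
    log_odds = np.full(len(y), base_log_odds)
    trees = []
    for _ in range(m):
        prob = sigmoid(log_odds)
        residuals = y - prob                   # pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                 # the tree is built on the residuals
        # Output value per leaf: sum(residuals) / sum(p * (1 - p))
        leaf_ids = tree.apply(X)
        outputs = {}
        for leaf in np.unique(leaf_ids):
            in_leaf = leaf_ids == leaf
            outputs[leaf] = residuals[in_leaf].sum() / (prob[in_leaf] * (1 - prob[in_leaf])).sum()
        # New log(odds) = previous log(odds) + learning rate * leaf output value
        log_odds = log_odds + learning_rate * np.array([outputs[l] for l in leaf_ids])
        trees.append((tree, outputs))
    return base_log_odds, trees

def predict_proba(X, base_log_odds, trees, learning_rate=0.1):
    X = np.asarray(X, dtype=float)
    log_odds = np.full(len(X), base_log_odds)
    for tree, outputs in trees:
        leaf_ids = tree.apply(X)
        log_odds = log_odds + learning_rate * np.array([outputs[l] for l in leaf_ids])
    return sigmoid(log_odds)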