Gradient Boosting
In gradient boosting decision trees (GBDT), we combine many weak learners to build one strong learner. The weak learners here are the individual decision trees. The trees are connected in series, and each tree tries to correct the errors made by the one before it.
Because of this sequential connection, boosting algorithms are usually slow to learn (the developer can control this with the learning rate parameter) but often highly accurate. In statistical learning, models that learn slowly tend to generalize better.
The weak learners are fit so that each new learner targets the residuals of the previous step, and the model improves step by step. The final model adds up the contribution of each step, and a stronger learner is eventually obtained.
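As a quick illustration of this series-of-trees idea, here is a minimal sketch using scikit-learn's GradientBoostingRegressor on synthetic data (the dataset and all hyperparameter values below are illustrative choices, not prescriptions):

```python
# Minimal sketch: trees added in series, each contribution shrunk by learning_rate.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# learning_rate shrinks each tree's contribution: smaller values learn
# more slowly but usually need more trees (n_estimators).
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                  max_depth=2, random_state=0)
model.fit(X_train, y_train)

# staged_predict shows how the error shrinks as trees are added in series.
for i, y_pred in enumerate(model.staged_predict(X_test)):
    if (i + 1) % 50 == 0:
        print(f"{i + 1:3d} trees -> test MSE: {mean_squared_error(y_test, y_pred):.1f}")
```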
The main idea here is to overcome the errors in the previous learner’s predictions. This type of boosting has three main components:
- Loss Function – The role of the loss function is to estimate how good the model is at making predictions on the given data. This varies with the problem at hand. For example, if we’re trying to predict a person’s weight from some input variables (a regression problem), the loss function would be based on the difference between the predicted weights and the observed weights. On the other hand, if we’re trying to classify whether a person will like a certain movie based on their personality, we need a loss function that tells us how accurate the model is at separating people who did and didn’t like the movie.
- Weak Learner – A weak learner is one that classifies our data poorly, perhaps barely better than random guessing; in other words, it has a high error rate. In gradient boosting these are typically shallow decision trees (in the extreme case, one-split trees known as decision stumps, which are much simpler than a full decision tree).
- Additive Model – This is the iterative, sequential approach of adding the trees (weak learners) one step at a time. Each iteration should bring us closer to the final model; in other words, each iteration should reduce the value of our loss function. A from-scratch sketch of this loop follows the list.
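Here is one possible from-scratch sketch of that additive loop for squared-error loss, where each stump is fit to the residuals of the current ensemble (the toy data, learning rate, and number of trees are all illustrative assumptions):

```python
# From-scratch sketch of the additive model with squared-error loss:
# each weak learner (a stump) is fit to the residuals of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_trees, learning_rate = 50, 0.1
prediction = np.full_like(y, y.mean())          # start from a constant model
trees = []

for _ in range(n_trees):
    residuals = y - prediction                  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=1)   # a stump as the weak learner
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # additive update
    trees.append(tree)

print("final training MSE:", np.mean((y - prediction) ** 2))
```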
Example: let’s take age prediction as an example (yes, I love this example). Say we have 4 people A, B, C, D with ages 14, 16, 24, 26. If we use a single decision tree to predict the ages, the result will probably look like this:

Then we use GBDT to learn the same task, limiting each tree to a maximum of 2 leaf nodes and setting n_estimators=2; we may get a figure like this:

Note that the second tree does not learn the values 14, 16, 24, 26 directly; it learns the residuals left by the previous base learner. A small worked sketch of this example follows.
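Below is a hypothetical reconstruction of this example in scikit-learn, interpreting the node limit as max_leaf_nodes=2 (i.e. stumps). The two binary features are made up purely for illustration, chosen so that the first stump separates {A, B} from {C, D} and the second stump can then fit the leftover residuals:

```python
# Hypothetical reconstruction of the age example with two made-up binary features.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# columns: [older pair?, second person in the pair?]  (purely illustrative)
X = np.array([[0, 0],   # A
              [0, 1],   # B
              [1, 0],   # C
              [1, 1]])  # D
y = np.array([14, 16, 24, 26], dtype=float)

# two trees, each limited to 2 leaf nodes (a stump); learning_rate=1.0
# keeps the arithmetic exact (no shrinkage).
gbdt = GradientBoostingRegressor(n_estimators=2, max_leaf_nodes=2,
                                 learning_rate=1.0, random_state=0)
gbdt.fit(X, y)

stage1, stage2 = gbdt.staged_predict(X)
print("after tree 1:", stage1)          # [15. 15. 25. 25.]
print("residuals   :", y - stage1)      # [-1.  1. -1.  1.]  <- what tree 2 fits
print("after tree 2:", stage2)          # [14. 16. 24. 26.]
```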
AdaBoost and GBDT: they are both boosting models, so what’s the difference between them?
AdaBoost computes a weight for each instance based on the previous prediction: instances that were wrongly classified get more weight, while correctly classified instances are down-weighted. In that sense, the intuition behind bootstrap resampling and AdaBoost is similar: both change how much influence each instance has on the next learner.
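A minimal sketch of that reweighting step for discrete AdaBoost might look like this (the labels and predictions are toy values, and labels are assumed to be in {-1, +1}):

```python
# Toy AdaBoost reweighting step: misclassified instances gain weight.
import numpy as np

y_true = np.array([ 1, -1,  1,  1, -1])   # labels in {-1, +1}
y_pred = np.array([ 1, -1, -1,  1, -1])   # weak learner got instance 3 wrong

w = np.full(len(y_true), 1 / len(y_true))        # start with uniform weights
err = np.sum(w[y_pred != y_true])                # weighted error rate
alpha = 0.5 * np.log((1 - err) / err)            # the weak learner's vote weight

# misclassified instances get their weight increased, the rest decreased
w = w * np.exp(-alpha * y_true * y_pred)
w = w / w.sum()                                  # renormalise
print("updated weights:", np.round(w, 3))        # [0.125 0.125 0.5 0.125 0.125]
```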
But what about GBDT?
GBDT instead learns the residual function of the previous prediction. In a sense, you can think of each new tree as focusing on whatever the current model still finds hard to learn.
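Both families ship with scikit-learn, so one way to see them side by side is to run them on the same synthetic classification task (the dataset and hyperparameters below are arbitrary illustrative choices):

```python
# AdaBoost (reweights instances) vs. GBDT (fits residuals/gradients) on the same data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=100, random_state=0)

for name, clf in [("AdaBoost", ada), ("GBDT", gbdt)]:
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:8s} mean CV accuracy: {score:.3f}")
```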