Random Forest in Bytes
This is the second post in the In Bytes series. Refer to this link to read the first post.
Random Forest:
A random forest is an ensemble built by combining a large number of decision trees.
Ensemble:
An ensemble means a group of things viewed as a whole rather than individually. In an ensemble, a collection of models is used to make predictions, rather than a single individual model.
In principle, ensembles can be made by combining all types of models. An ensemble can have a logistic regression, a neural network, and a few decision trees working in unison.
While choosing the models, we need to check for two things: Diversity and Acceptability.
Diversity ensures that even if some trees overfit, the other trees in the ensemble will neutralize the effect. The independence among the trees results in a lower variance of the ensemble compared to a single tree.
Acceptability implies that each model is at least better than a random model, i.e. it is correct with probability p > 0.50.
If each of the individual models is acceptable, i.e. wrong with a probability of less than 50%, you can show that the probability of the ensemble being wrong (i.e. the majority vote going wrong) is far lower than that of any individual model.
Ensembles also avoid being misled by the assumptions made by individual models. For example, ensembles (particularly random forests) successfully reduce the problem of overfitting. If a decision tree in an ensemble overfits, you let it; the chances are extremely low that more than 50% of the models have overfitted.
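To see why the majority vote helps, here is a minimal sketch (assuming the models are independent and each is wrong with the same probability; the numbers are purely illustrative) that computes the chance of the majority being wrong from the binomial distribution:

```python
from math import comb

def majority_wrong_prob(n_models, p_wrong):
    # Probability that a strict majority of n_models independent models is wrong,
    # assuming each model is wrong independently with probability p_wrong.
    k_min = n_models // 2 + 1  # smallest number of wrong votes that sinks the majority
    return sum(
        comb(n_models, k) * p_wrong**k * (1 - p_wrong)**(n_models - k)
        for k in range(k_min, n_models + 1)
    )

# A single model that is wrong 30% of the time vs. a majority vote over 101 such models.
print(majority_wrong_prob(1, 0.3))    # 0.3
print(majority_wrong_prob(101, 0.3))  # orders of magnitude smaller than 0.3
```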
Bagging:
Bagging stands for bootstrap aggregation. It is a technique for drawing random samples of observations from a dataset; each of these samples is then used to train one tree in the forest.
You create a large number of models (say, 100 decision trees), each one on a different bootstrap sample drawn from the training set. To get the final result, you aggregate the decisions taken by all the trees in the ensemble.
Bootstrapping means creating bootstrap samples from a given data set. A bootstrap sample is created by sampling the given data set uniformly and with replacement. A bootstrap sample typically contains about 30–70% data from the data set. Aggregation implies combining the results of different models present in the ensemble.
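As a minimal sketch of what a bootstrap sample looks like in code (the 10,000-element array below is just a stand-in for a real training set):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(10_000)  # stand-in for 10,000 training observations

# Draw row indices uniformly and with replacement: a bootstrap sample
# of the same size as the original data set.
bootstrap_idx = rng.choice(len(X), size=len(X), replace=True)
bootstrap_sample = X[bootstrap_idx]

unique_fraction = len(np.unique(bootstrap_idx)) / len(X)
print(f"Fraction of original rows present in the bootstrap sample: {unique_fraction:.2f}")
# ~0.63 here; the rows that never get drawn are "out of bag" for this tree.
```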
Random forest selects a random sample of data points (bootstrap sample) to build each tree, and a random sample of features while splitting a node. Randomly selecting features ensures that each tree is diverse and is not impacted by the prominent features present in the dataset.
For example, in a heart-attack dataset, blood pressure and weight may be strongly correlated with the target; because only a few randomly chosen attributes are considered at each split, no single tree is dominated by these prominent features.
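Putting the two sources of randomness together, here is a minimal scikit-learn sketch; the dataset and hyperparameter values are placeholders rather than recommendations. n_estimators sets the number of bootstrapped trees, and max_features caps how many features each split may consider.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of bootstrapped trees in the ensemble
    max_features="sqrt",  # features considered at each split (adds diversity)
    n_jobs=-1,            # build the trees in parallel on all CPU cores
    random_state=0,
)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
```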
Advantages:
Diversity arises because you create each tree with a subset of the attributes/features/variables, i.e. you don’t consider all the attributes while making each tree. The choice of the attributes considered for each tree is random. This ensures that the trees are independent of each other.
Stability arises because the answers given by a large number of trees average out. A random forest has a lower model variance than an ordinary individual tree.
Immunity to the curse of dimensionality: Since each tree does not consider all the features, the feature space (the number of features a model has to consider) reduces. This makes the algorithm immune to the curse of dimensionality. A large feature space causes computational and complexity issues.
Parallelizability: You need a number of trees to make a forest. Since the trees are built independently, on different data and attributes, they can be built in parallel. This means you can make full use of your multi-core CPU to build a random forest. Suppose there are 4 cores and 100 trees to be built; each core can build 25 trees to make the forest.
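In scikit-learn, for instance, this is what the n_jobs argument in the sketch above controls: setting n_jobs=-1 builds the trees on all available CPU cores.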
Testing/training data and the OOB or out-of-bag error: You always want to avoid violating the fundamental tenet of learning: “never test a model on data it has been trained on”. While building individual trees, you choose a random subset of the observations to train each one. If you have 10,000 observations, each tree may be built from only 7,000 (70%) randomly chosen observations. The OOB error is the mean prediction error on each training sample xᵢ, using only the trees that do not have xᵢ in their bootstrap sample. If you think about it, this is very similar to a cross-validation error: in cross-validation, you measure performance on a subset of data the model hasn’t seen before.
In fact, it has been shown that using the OOB estimate is about as accurate as using a test set of the same size as the training set.
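As a concrete illustration, scikit-learn exposes this estimate through the oob_score option (the dataset is again just a placeholder):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,   # the default; OOB estimates require bootstrap sampling
    oob_score=True,   # score each observation only with trees that never saw it
    random_state=0,
)
forest.fit(X, y)
print("OOB accuracy estimate:", forest.oob_score_)
```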