Decision Tree in Bytes

Sudhanshu Shukla
6 min read · Jan 1, 2021

This is the first post in the Byte Series, where we discuss an overall concept in brief, to act as a refresher.

Decision Trees:

As the name goes, a decision tree uses a tree-like model to make predictions. It resembles an upside-down tree.

  • Decision trees are easy to interpret; you can always go back and identify the various factors leading to a decision.
  • A decision tree requires you to perform tests on attributes in order to split the data into multiple partitions.

Types:

  • Binary and multiway splits:

If a test splits the data into exactly two partitions, it is a binary split; if it splits the data into more than two partitions, the tree is called a multiway decision tree.

  • Classification and Regression:

In regression problems, a decision tree splits the data into multiple subsets just as in classification. The difference is in what a leaf predicts: in regression, each leaf outputs a numeric value (a constant such as the mean of the targets in that leaf, or, in model trees, a linear regression model) instead of a class label.

Homogeneity:

We always try to generate partitions that result in homogeneous data points. For classification tasks, a data set is completely homogeneous if it contains only a single class label.

At each node, we try to find the test that yields the most homogeneous partitions (above a given threshold) at the next level.

Gini Index

The Gini index gives a homogeneity value for the data set based on the probability of each label in the data.

The Gini index, also known as Gini impurity, measures the probability that a randomly chosen element would be misclassified if it were labelled at random according to the class distribution in the data set. If all the elements belong to a single class, the set is called pure.

The Gini index varies between 0 and 1. A value of 0 expresses complete purity, i.e. all the elements belong to a single class, while values approaching 1 indicate that the elements are distributed randomly across many classes. For a binary target the maximum is 0.5, which corresponds to an equal split between the two classes.

The Gini index for a binary target variable is

Gini = 1 − P(Target=0)² − P(Target=1)²

and, more generally, Gini = 1 − Σᵢ pᵢ² over the class probabilities pᵢ.
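
As a quick illustration, here is a minimal sketch of the calculation in NumPy (the helper name gini_impurity and the toy label array are illustrative, not part of the original post):

import numpy as np

def gini_impurity(labels):
    # Gini index: 1 minus the sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

labels = np.array([0, 0, 1, 1, 1, 1])       # 2 of class 0, 4 of class 1
print(gini_impurity(labels))                # 1 - (1/3)^2 - (2/3)^2 ≈ 0.444
print(gini_impurity(np.array([1, 1, 1])))   # pure node -> 0.0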

Entropy:

If a data set is completely homogenous, then the entropy of such a data set is 0, i.e. there’s no disorder.

The Gini index is a measure of homogeneity (impurity), while entropy is a measure of chaos (disorder).

Entropy is more computationally heavy because of the log in its equation (Entropy = −Σᵢ pᵢ log₂ pᵢ). Like the Gini index, the basic idea is to gauge the disorder of a grouping by the target variable.

Information Gain

It measures how much the entropy has decreased between the parent set and the partitions obtained after splitting.

Information Gain = Entropy(before split) − Entropy(after split), where the entropy after the split is the weighted average entropy of the resulting partitions.
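
Continuing the sketch above, entropy and information gain can be computed the same way (entropy and information_gain are illustrative helper names, and the split below is made up for the example):

import numpy as np

def entropy(labels):
    # Shannon entropy of the class distribution, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # entropy before the split minus the weighted entropy after it
    n = len(parent)
    after = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - after

parent = np.array([0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 1]), np.array([1, 1, 1])
print(information_gain(parent, [left, right]))  # positive: the split reduces disorder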

Splitting by R-squared:

Split the data such that the R-squared of the partitions obtained after splitting is greater than that of the original (parent) data set. In other words, the fit of the model should be as 'good' as possible after splitting.
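
sklearn's regression trees do not expose R-squared as a split criterion directly; they minimise the squared error of each partition, which has the same effect of improving the fit at every split. A minimal sketch with made-up data (make_regression is used here purely for illustration):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# toy regression data, for illustration only
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# each split minimises the squared error of the resulting partitions,
# i.e. each split tries to improve the fit of the piecewise model
reg = DecisionTreeRegressor(max_depth=4, random_state=42)
reg.fit(X_tr, y_tr)
print(reg.score(X_te, y_te))  # .score() reports R-squared on held-out data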

Decision Tree Algorithm:

Advantages:

  • Predictions made by a decision tree are easily interpretable.
  • A decision tree does not assume anything specific about the nature of the attributes in a data set. It can seamlessly handle all kinds of data — numeric, categorical, strings, Boolean, and so on.
  • It does not require normalisation since it has to only compare the values within an attribute.
  • Decision trees often give us an idea of the relative importance of the explanatory attributes that are used for prediction.

Disadvantages:

  • Decision trees tend to overfit the data. If allowed to grow with no check on its complexity, a tree will keep splitting till it has correctly classified (or rather, mugged up) all the data points in the training set.
  • Decision trees tend to be very unstable, which is an implication of overfitting. A few changes in the data can change a tree considerably.

There are various ways to truncate or prune trees. The DecisionTreeClassifier class in sklearn provides the following hyperparameters, which you can control (a short sketch after the list shows where each one goes):

  1. criterion ("gini" or "entropy"): The function used to measure the quality of a split. Sklearn supports "gini" for the Gini index and "entropy" for information gain. By default it takes the value "gini".
  2. max_features: The number of features to consider when looking for the best split. It accepts an integer, a float, a string or None:
  • If an integer is given, that many features are considered at each split.
  • If a float is given, it is treated as the fraction of features considered at each split.
  • If "auto" or "sqrt" is given, then max_features=sqrt(n_features).
  • If "log2" is given, then max_features=log2(n_features).
  • If None, then max_features=n_features. By default it takes the value None.
  3. max_depth: The maximum depth of the tree. It can take any integer value or None. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. By default it takes the value None.
  4. min_samples_split: The minimum number of samples required to split an internal node. If an integer is given, it is used directly as the minimum number; if a float, it is treated as a fraction of the samples. By default it takes the value 2.
  5. min_samples_leaf: The minimum number of samples required to be at a leaf node. If an integer is given, it is used directly as the minimum number; if a float, it is treated as a fraction of the samples. By default it takes the value 1.
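
For reference, a small sketch showing where each of these hyperparameters goes when constructing the classifier (the values here are arbitrary, not tuned):

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(
    criterion="entropy",      # split quality: "gini" (default) or "entropy"
    max_features="sqrt",      # number of features considered at each split
    max_depth=5,              # maximum depth of the tree
    min_samples_split=20,     # minimum samples needed to split an internal node
    min_samples_leaf=10,      # minimum samples required at a leaf
    random_state=42,
)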

Decision Trees in Python (Snippets Only):

# Importing decision tree classifier from sklearn library
from sklearn.tree import DecisionTreeClassifier
# Fitting the decision tree with default hyperparameters, apart from
# max_depth which is 5 so that we can plot and read the tree.
dt_default = DecisionTreeClassifier(max_depth=5)
dt_default.fit(X_train, y_train)
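
The snippet above caps the depth at 5 so the tree stays readable when plotted; one way to actually plot it (not shown in the original snippet) is sklearn's plot_tree:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# visualise the fitted tree (kept readable by max_depth=5 above)
plt.figure(figsize=(20, 10))
plot_tree(dt_default, filled=True, fontsize=8)
plt.show()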

Hyperparameter Tuning:

# GridSearchCV to find optimal max_depth
from sklearn.model_selection import GridSearchCV

# specify number of folds for k-fold CV
n_folds = 5

# parameters to build the model on
parameters = {'max_depth': range(1, 40)}

# instantiate the model
dtree = DecisionTreeClassifier(criterion="gini", random_state=100)

# fit tree on training data; return_train_score=True is needed so that
# mean_train_score is available for the plot below
tree = GridSearchCV(dtree, parameters,
                    cv=n_folds,
                    scoring="accuracy",
                    return_train_score=True)
tree.fit(X_train, y_train)
scores = tree.cv_results_

Visualizing:

# plotting accuracies with max_depth
import matplotlib.pyplot as plt

plt.figure()
plt.plot(scores["param_max_depth"],
         scores["mean_train_score"],
         label="training accuracy")
plt.plot(scores["param_max_depth"],
         scores["mean_test_score"],
         label="test accuracy")
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

Grid Search CV for Decision Tree:

# Create the parameter grid
param_grid = {
    'max_depth': range(5, 15, 5),
    'min_samples_leaf': range(50, 150, 50),
    'min_samples_split': range(50, 150, 50),
    'criterion': ["entropy", "gini"]
}
n_folds = 5

# Instantiate the grid search model
dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator=dtree, param_grid=param_grid,
                           cv=n_folds, verbose=1)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# printing the optimal accuracy score and hyperparameters
print("best accuracy", grid_search.best_score_)
print("best hyperparameters", grid_search.best_params_)

# model with optimal hyperparameters
clf_gini = DecisionTreeClassifier(criterion="gini",
                                  random_state=100,
                                  max_depth=10,
                                  min_samples_leaf=50,
                                  min_samples_split=50)
clf_gini.fit(X_train, y_train)

# accuracy score on the test set
clf_gini.score(X_test, y_test)


Sudhanshu Shukla

Training Machines to take over Humans. To read more about topics related to Machine Learning and Artificial Intelligence→ https://naivedata.blogspot.com/