Decision Trees
Posted on 13 Jun 2020
Authored by Sandesh Bhusal
Example: Finding object by asking questions, a game I used to play.
Decision trees are exactly what the name sounds like – a method to make decisions. Wheather those decisions are related to finance (decision trees are used in economics to analyse the utility of a decision) or to draw inference from a set of data. Technically, decision trees are a series of questions that split the total solution space to smaller and smaller fragments until a conclusion can be drawn.
1. Introduction:
One game comes to mind when I start thinking about decision trees. Whenever we were taken on an outing by the school on a bus, one of the students would spot something and everyone else would try to guess it.
A: No
Q: Is it black?
A: Yes!
Q: Is it shaped like a ring?
A: Yes!
Q: Is it the steering wheel?
A: You are right!!
While building a decision trees, every algorithm follows a generic guideline. Before learning about the guideline, let’s learn about the parts of a decision tree:
2. Structure of a Decision Tree:
A decision tree contains nodes, leafs and edges. Some authors like to distinguish between root and internal nodes, but I do not like the distinction, as all internal nodes have the same properties as the root note. (Please feel free to contact me here if you have any contesting ideas!) Anywhoo, any node represents the collection of data at that particular node. As in the example above, we can easily imagine a lot of things in the bus. The first question asks if the thing is red, and when a negative answer is given then a conclusion is reached, where all the red things in the bus are eliminated. That reduces the amount of information to be processed next.
From the example we can see that, the first node is the collection of all objects in the bus, but from the first split, a new state is reached which is characterized by the collection of all objects in the bus that are not red. The question to be asked is the edge in the tree. Amongst many attributes of objects in the bus, we have chosen “color” as the criteria. This is an attribute. The objective of the game is to find the object in as few questions as possible. So, while selecting spiltting attributes, we need to choose the one, that minimizes the number of objects to be considered in the next iteration.
Parts of a decision tree.
So the main things to keep in mind while trying to construct a decision tree are the following:
- Each node in the decision tree represents a collection of objects that are to be further ‘split’ in order to determine our final object.
- Each node gives rise to one or more nodes, that represent a subset of objects contained in the node before it (parent node).
- Splitting criteria for a node is any attribute on the set of objects that determines how the set is going to be split in different nodes.
- We try to find the final object in as few questions as possible. This means at each step of splitting a node, we need to select a suitable attribute for the split that will try to keep our position neutral (if we try to split using a attribute that produces a very skewed distribution of data, say, 7:10 then if we land on the 10 side, we will need much more effort to find out the object. So we try a middle split like 1:1 )
Although the principles of the game may not exactly translate to construction of decision trees, they are quite similar. Point #4 denotes the ‘entropy’ of the dataset under split, and we want a split that will minimize the entropy of the overall dataset after split. For calculation of entropy under a probability distribution, check this out!
3. Problem Statement:
Decision Trees are widely used in classification tasks, and that is going to be the primary focus of ours in this blog post. While decision trees are also used in other domains, such as CART trees which can perform both classification and regression, we are not going to discuss about those in this blog post.
Classification is a supervised learning task, opposed to clustering which is an unsupervised learning task. For classification, we need a labeled dataset that contains labeled data, i.e. the data that contains some classes. Let’s consider a textbook example for now:
Example of a classification task: A decision must be made if we want to play golf or not based on current weather conditions.
The above table is a widely used textbook example while explaining the concepts of a decision tree. In the posed problem, we are given a table of labeled data, which tells us the decision to take under given weather conditions. For classification purposes, the data may be from the given table, or from somewhere else.
4. Hunt’s Algorithm:
Hunt’s algorithm forms the basis for many Decision Tree algorithms, including ID3, C4.5, CART and so on. Hunt’s algorithm is quite simple actually if you look at it. In this section, we will go over a simple dataset that I have constructed for the purposes of this tutorial, and we will talk about Hunt’s algorithm in brief.
Hunt’s algorithm is an iterative algorithm that tries to partition the given dataset progressively, reducing the class labels of the partitioned dataset into purer and purer classes as further as we go. Let us consider a dataset . Dataset is the set of a set of tuples, where is the total number of samples available to us in the dataset. Each tuple in each data point corresponds to some value of an attribute, i.e. where is the number of attributes that we have. For an example let’s consider the example below:
| S.N. | Age | Sex | Married | Homeowner |
|---|---|---|---|---|
| 1 | 30 | Female | No | No |
| 2 | 35 | Female | Yes | Yes |
| 3 | 48 | Male | Yes | Yes |
Here, the total number of samples available to us is 3. So . Also, there are four attributes (not counting Serial Number), namely Age, Sex, Married and Homeowner. The number of attributes in consideration is actually , as the fourth attribute, Homeowner is also our class label, i.e. the thing we are trying to predict.
Hunt’s algorithm begins with a complete description of a dataset, which is also called as the root node. Then, the algorithm proceeds in the following way:
1. For a Node N, the input dataset is D.
2. Construct a root node, containing all data.
3. let currentNode = root node
4. For each attribute in current Node:
a. Split given dataset in current node with attribute 'a', with respect to some testing condition
b. The testing condition should return a ranking of some kind to rank attributes suitable for partitioning the dataset.
5. Once all attributes are ranked in current node, choose one attribute with best rank.
6. Partition the dataset according to the attribute into different datasets. The new datasets do not contain the attribute that was used to partition them.
7. Repeat steps 4 to 6 until:
a. A pure node is reached, where all data belong the same class.
b. No more attributes are left to partition the dataset, so a majority voting is done to assign class label to such node.
Steps 7(a) and 7(b) produce what we call as leaf nodes. The ranking system can be different depending on requirement and algorithm. We can test fit for partition attributes using Entropy or Gini Index and so on. We can see Hunt’s algorithm is iterative in nature from above example.
The partitioning of node might give a rise to different conditions. In the above example, if we choose “Married” as the partitioning attribute, then we get two new nodes, one for “Yes” and the other one for “No”. However, the case might be different for continuous values such as age. In such condition, we might get new nodes like “Age < 35” and “Age >= 36” and so on. In case of ordinal categorical values, however, it is important to maintain the ordinality of variable if fewer splits than possible are to be made.
For an example, we have a “Temperature” attribute that can be “Hot”, “Lukewarm”, “Mild”, “Cold” and “Freezing”. If we wish to partition this into two branches only, we cannot make new branches like {{Hot, Cold} and {Lukewarm, Mild, Freezing}}. What I am trying to say is that in case of ordinal categorical variables, the order of the values must be preserved before partitioning the dataset.
Left image shows correct way to partition using ordinal categorical variables. Right is wrong. (pun)
Applying Hunt’s algorithm to the above example, might lead to a variety of trees being created. For an example, with Entropy method, “Married” and “Age” attributes are equally likely to partition the dataset right at the beginning into pure nodes, and we can choose either one of them. Depending on the attribute selection method, different decision trees may be generated!
Alternative decision trees that can be made!
5. Construction of a Decision Tree – An Example:
I have implemented a simple and dirty example of a decision tree used to create a decision tree for the textbook example mentioned above. The notebook can be found on my github profile. Here’s the code:
# This is the pandas module, used to read and manipulate csv file data
import pandas as pd
# Numpy module for numerical computation
import numpy as np
# Our dataset, named "golf.csv"
dataset = pd.read_csv("golf.csv")
# The first few lines in our dataset.
dataset.head()
| Outlook | Temp | Humidity | Windy | Play Golf | |
|---|---|---|---|---|---|
| 0 | Rainy | Hot | High | False | No |
| 1 | Rainy | Hot | High | True | No |
| 2 | Overcast | Hot | High | False | Yes |
| 3 | Sunny | Mild | High | False | Yes |
| 4 | Sunny | Cool | Normal | False | Yes |
# Let's view more details about our dataset!
dataset.describe()
dataset['Outlook'].unique()
Output: array(['Rainy', 'Overcast', 'Sunny'], dtype=object)
# Let's write a function that takes in data and attribute, and splits it into datasets containing attribute == value
def split_data_by_attribute(dataset, attribute):
# Let's check if the attribute is in the data.
try:
# values_in_attribute represent values in attribute. E.g. "sex" attribute may have values "male", "female", "LGBTQ" or "nopreference"
values_in_attribute = dataset[attribute].unique()
split_datasets = []
for value in values_in_attribute:
split = dataset[dataset[attribute] == value]
split_datasets.append((split, value))
# for each split_datasets, we need to remove the attribute that was used to split it
returnable_datasets = []
for dataset_s, value in split_datasets:
returnable_datasets.append((dataset_s.drop(columns=[attribute], axis=1), value))
return returnable_datasets
except:
raise print("\n No such attribute\n\n")
# An example of execution:
for item in split_data_by_attribute(dataset, 'Outlook'):
print(item)
print()
Output:
( Temp Humidity Windy Play Golf
0 Hot High False No
1 Hot High True No
7 Mild High False No
8 Cool Normal False Yes
10 Mild Normal True Yes, 'Rainy')
( Temp Humidity Windy Play Golf
2 Hot High False Yes
6 Cool Normal True Yes
11 Mild High True Yes
12 Hot Normal False Yes, 'Overcast')
( Temp Humidity Windy Play Golf
3 Mild High False Yes
4 Cool Normal False Yes
5 Cool Normal True No
9 Mild Normal False Yes
13 Mild High True No, 'Sunny')
We can see that the above dataset got split by the “outlook” category
Into three datasets, where values for “outlook” are distinct. The values were “Sunny”, “Rainy” and “Overcast”.
# This function calculates the entropy of a given dataset. LABEL is a field that is used to calculate the entropy
def calculate_entropy(dataset, label):
# Occurence of each item in "Label". For our original dataset, "Yes" is repeated 9 times and "No" 5 times.
occurences = dataset[label].value_counts()
# Total number of items in dataset.
total_count = sum(occurences)
# Let's convert occurences to probabilities
occurences /= total_count
# Entropy is the sumtotal value of probability multiplied by log(base 2) of inverted probability.
entropy = np.sum(occurences * np.log2(1/occurences))
return entropy
calculate_entropy(dataset, "Play Golf")
Output: 0.9402859586706309
Okay! Now our data processing functions are complete!
Let’s try to iteratively construct a decision tree that will minimize the entropy, i.e. maximize the information gain at each step along the way!
def build_decision_tree(dataset, case = ""):
print("")
print("Building from previous split of", case)
# Let's establish base cases here.
# 1. If all labels are same, then return a leaf node. Stop here.
if len(dataset['Play Golf'].unique()) == 1:
print("Leaf node reached. Stopping.")
print("--> RULE REACHED :: ", case, " then ", dataset['Play Golf'].unique())
print()
print()
return
# 2. If all attributes are exhausted, stop.
if len(dataset.keys()) == 1:
print("Attributed exhausted. Stopping.")
return
# Otherwise, continue:
# Get all attributes of dataset!
attributes = dataset.keys()
# Let's calculate the entropy of our dataset first
# TODO: make this "Play Golf" as a standard thing above.
parent_entropy = calculate_entropy(dataset, "Play Golf")
information_gains = []
print(attributes)
# take all attributes into consideration, except for "Play Golf"
for attribute in attributes[:len(attributes)1]:
# Split the dataset using the attribute above!
split_datasets_using_this_attribute = split_data_by_attribute(dataset, attribute)
# For each split dataset, calculate the entropy!
entropies = []
for dataset_s in split_datasets_using_this_attribute:
entropy = calculate_entropy(dataset_s[0], "Play Golf")
entropy *= len(dataset_s[0])
entropy /= len(dataset)
entropies.append(entropy)
# Sum of entropies:
sum_of_entropies = sum(entropies)
# calculate the information gain.
information_gains.append(parent_entropy sum_of_entropies)
# Calculate max for information gain
max_ig = max(information_gains)
max_ig_id = information_gains.index(max_ig)
print("Max information gain found using ", attributes[max_ig_id], ":", max_ig)
print("Splitting from", attributes[max_ig_id], "into different datasets.")
print("")
new_nodes_datasets = split_data_by_attribute(dataset, attributes[max_ig_id])
for dataset_s in new_nodes_datasets:
# Find what value the attribute had in original dataset.
build_decision_tree(dataset_s[0], case = case + " and " + attributes[max_ig_id] + "==" + str(dataset_s[1]))
Let’s Run the program!
build_decision_tree(dataset, case = "Start")
Output:
Building from previous split of Start
Index(['Outlook', 'Temp', 'Humidity', 'Windy', 'Play Golf'], dtype='object')
Max information gain found using Outlook : 0.246749819774439
Splitting from Outlook into different datasets.
Building from previous split of Start and Outlook==Rainy
Index(['Temp', 'Humidity', 'Windy', 'Play Golf'], dtype='object')
Max information gain found using Humidity : 0.9709505944546687
Splitting from Humidity into different datasets.
Building from previous split of Start and Outlook==Rainy and Humidity==High
Leaf node reached. Stopping.
--> RULE REACHED :: Start and Outlook==Rainy and Humidity==High then ['No']
Building from previous split of Start and Outlook==Rainy and Humidity==Normal
Leaf node reached. Stopping.
--> RULE REACHED :: Start and Outlook==Rainy and Humidity==Normal then ['Yes']
Building from previous split of Start and Outlook==Overcast
Leaf node reached. Stopping.
--> RULE REACHED :: Start and Outlook==Overcast then ['Yes']
Building from previous split of Start and Outlook==Sunny
Index(['Temp', 'Humidity', 'Windy', 'Play Golf'], dtype='object')
Max information gain found using Windy : 0.9709505944546687
Splitting from Windy into different datasets.
Building from previous split of Start and Outlook==Sunny and Windy==False
Leaf node reached. Stopping.
--> RULE REACHED :: Start and Outlook==Sunny and Windy==False then ['Yes']
Building from previous split of Start and Outlook==Sunny and Windy==True
Leaf node reached. Stopping.
--> RULE REACHED :: Start and Outlook==Sunny and Windy==True then ['No']
Finally generated decision tree. Can we play golf? Let the leaf nodes answer!!
6. Build your own Decision Tree
Hmm.. The code above seems quite long though not complicated and it would be very difficult for us to visualize it in case of a large dataset. Also, we have only worked with nominal discrete values in the dataset (values that are categorical and not ordered in terms of value). For building more complicated decision trees, we can use different python packages out there which greatly simplify the task at hand, and also help us visualize the output of the decision tree! I leave the task to the reader of this blog post – once you’ve created your own decision tree, using other peoples’ libraries is a nominal task. But bear in mind the categorical values, and continuous values in decision tree making process. Some libraries don’t play very well with those! For example, sklearn requires user to input non-nominal values, so for that you can do different types of variable encodings like one-hot encoding, binary encoding and so on.
Scikit Learn has an excellent documentation on decision trees and I would like to request you to hop on over there and check it out for yourself!
Scikit Learn Decision Tree Documentation.
7. The pros and cons of classification using Decision Trees:
We are almost at the end of this article. We have learned about what a decision tree is, how it is used, the parts of a decision tree and how you can implement a simple decision tree yourself in python! Let’s discuss some strengths and weaknesses of a decision tree classifier. Let’s begin with the strengths first:
Strengths:
- Implementing a decision tree is a trivial task if you are working with nominal variables. Same code can be tweaked to work with continuous values and ordinal values too!
- Visualizing the output of a decision tree is very close to human comprehension, compared to black box methods like a neural network.
- Very little to none data preprocessing is required for generating a decision tree.
Weaknesses:
- Decision trees are prone to overfitting. That is why we use Random Forest Classifiers.
- Decision trees are not very flexible to change in data.
If you have something to add to the list, feel free to contact me!
That’s all for this article, folks!
Ok
ReplyDelete