Decision Tree

Mariam Kili Bechir/ Techgirl_235
5 min read · Apr 15, 2023


The decision tree is a popular machine learning algorithm used for both classification and regression tasks. It is a type of supervised learning algorithm that works by recursively splitting the dataset into smaller subsets based on the features that best separate the data. In this article, we will delve deeper into the concept of decision trees, the types of decision trees, and how they work in machine learning.

A decision tree is a tree-like model in which each internal node tests a feature, and the branches represent the decision rules or conditions that separate the data into different classes or regression values. The decision rule usually takes the form of a binary question that can be answered with a yes or no.

The decision tree is constructed in a top-down manner, starting with the root node, and each node is split into two child nodes until a stopping criterion is reached. The stopping criterion could be when all the samples in a node belong to the same class, or when a specified maximum depth is reached.

![Decision tree structure (source: javatpoint)](https://static.javatpoint.com/tutorial/machine-learning/images/decision-tree-classification-algorithm.png)

There are two types of decision trees: classification trees and regression trees. Classification trees are used for categorical or discrete target variables, while regression trees are used for continuous target variables. In this article, we will focus on classification trees.
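Both variants are available in scikit-learn as separate estimator classes. A minimal sketch of the distinction (the toy data below is illustrative):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# A single numeric feature, just to show the two estimator types side by side
X = [[1], [2], [3], [4]]

# Classification tree: the target is a discrete class label
clf = DecisionTreeClassifier().fit(X, [0, 0, 1, 1])
print(clf.predict([[1.5]]))  # predicts a class label (0 or 1)

# Regression tree: the target is a continuous value
reg = DecisionTreeRegressor().fit(X, [1.0, 2.0, 3.0, 4.0])
print(reg.predict([[1.5]]))  # predicts a numeric value
```

The same recursive-splitting machinery drives both; only the leaf predictions (majority class versus mean value) and the split criterion differ.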

Let’s illustrate how a decision tree works using an example. Suppose we have a dataset of 10 patients with two features, age and gender, and we want to predict whether or not they have a certain disease. The target variable is binary: 0 represents the absence of the disease, and 1 represents its presence.

We start by selecting the feature with the highest information gain or the lowest Gini impurity to split the data at the root node. Suppose the age feature has the highest information gain, so we split the data into two subsets based on age: patients over 45 years and patients under 45 years. We then repeat the process for each subset until a stopping criterion is reached. The resulting decision tree is described below.

The decision tree splits the data into three groups based on age and gender. Patients over 45 years of age are classified as having the disease if they are male, while patients under 45 years of age are classified as having the disease if they are female. The other group is classified as not having the disease.
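The split-selection criteria mentioned above can be computed by hand. A minimal sketch, where the class labels for the ten patients and the age-based split are illustrative assumptions:

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# Ten patients with illustrative labels, split on age > 45
parent = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
left = [1, 1, 1, 0]           # patients over 45
right = [1, 0, 0, 0, 0, 0]    # patients under 45

print(gini(parent))                            # 0.48
print(information_gain(parent, left, right))   # positive: the split helps
```

The algorithm evaluates every candidate split this way and greedily keeps the one with the largest impurity reduction.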

To implement this decision tree in Python, we can use the scikit-learn library, which provides a decision tree classifier class. Here is an example code:
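A sketch of what such code might look like; the patient records below are synthetic placeholders (with gender encoded as 0 = female, 1 = male), chosen only to illustrate the workflow:

```python
from sklearn.tree import DecisionTreeClassifier

# Synthetic patient data: [age, gender], gender encoded 0 = female, 1 = male
X = [[50, 1], [55, 1], [58, 0], [60, 1], [70, 1],
     [25, 1], [30, 0], [35, 1], [40, 0], [44, 0]]
# Target: 1 = disease present, 0 = disease absent
y = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

# Decision tree classifier with a maximum depth of 2
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# Predict the disease status of two new patients
new_patients = [[52, 1], [28, 0]]
print(model.predict(new_patients))  # → [1 0]
```

With this data the tree first splits on age and then refines the older group by gender, mirroring the structure described above.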

In this code, we first load the patient data and split it into features and target variable. We then create a decision tree classifier with a maximum depth of 2 and fit it to the data. Finally, we use the model to predict the disease for two new patients.

Here is another example of how to implement a decision tree in machine learning, using the iris dataset from scikit-learn:
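A minimal sketch of such an example, using a held-out test split to evaluate the tree:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the iris dataset (150 flowers, 4 features, 3 species)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Fit a decision tree on the training portion
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
```

Because iris is small and well separated, even an unconstrained tree scores well here; on noisier data you would tune `max_depth` or prune.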

Advantages of Decision Trees

  1. Easy to interpret: Decision trees are easy to understand and interpret, even by non-experts. The tree structure allows for a clear visualization of the decision-making process.
  2. Handles both categorical and numerical data: Decision trees can handle both categorical and numerical data. This makes them suitable for a wide range of applications.
  3. Non-parametric: Decision trees are a non-parametric method, which means they do not require any assumptions about the underlying data distribution.
  4. Can handle missing data: some decision tree implementations can handle missing values directly, for example through surrogate splits, without requiring the data to be imputed first.

Disadvantages of Decision Trees

  1. Overfitting: Decision trees are prone to overfitting, which occurs when the model is too complex and fits the noise in the data instead of the underlying pattern.
  2. Instability: Decision trees are unstable, which means that small changes in the data can lead to large changes in the resulting tree.
  3. Bias: Decision trees can be biased towards features with more levels or categories. This can be mitigated by using feature selection techniques or regularization.
  4. Inefficient: finding the globally optimal decision tree is computationally intractable, so practical algorithms rely on greedy heuristics; even these can become slow on very large datasets with many features.
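The overfitting risk listed above is commonly controlled by limiting tree growth or by pruning. A sketch using synthetic data and scikit-learn's cost-complexity pruning parameter `ccp_alpha`; the dataset and pruning strength are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (20% flipped labels) to make overfitting visible
X, y = make_classification(n_samples=400, n_features=10,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree: memorizes the training set, noise included
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: cost-complexity pruning trades training fit for simplicity
pruned = DecisionTreeClassifier(ccp_alpha=0.01,
                                random_state=0).fit(X_train, y_train)

print(full.score(X_train, y_train), full.score(X_test, y_test))
print(pruned.score(X_train, y_train), pruned.score(X_test, y_test))
print(full.tree_.node_count, pruned.tree_.node_count)
```

The unpruned tree reaches perfect training accuracy, while the pruned tree is much smaller and typically generalizes better on noisy data like this.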

In this article, we have explored the concept of decision trees in machine learning. We have learned that decision trees are a popular algorithm used for both classification and regression problems. They consist of a series of decisions and their corresponding outcomes, and the process of creating a decision tree involves selecting the best feature to split the data and recursively splitting each child node until a stopping criterion is reached.

We have also discussed the advantages and disadvantages of decision trees. They are easy to interpret, can handle both categorical and numerical data, are non-parametric, and can handle missing data. However, they are prone to overfitting, instability, bias, and inefficiency for large datasets.

Finally, we have provided example code for implementing a decision tree in machine learning using scikit-learn. Overall, decision trees are a powerful tool in machine learning that can help us make accurate predictions and gain insights into the decision-making process.
