Decision Tree Induction
A Decision Tree is a supervised learning algorithm used for both classification and regression tasks in data mining. It builds a tree-like model by repeatedly dividing the dataset into smaller, more homogeneous subsets, and the resulting structure can then be followed from top to bottom to reach a decision.
The final structure consists of decision nodes and leaf nodes. A decision node represents a feature or attribute and splits the data based on a specific condition, with at least two branches leading to further nodes or leaf nodes. Leaf nodes represent the outcome or decision, such as a class label for classification or a numerical value for regression. No further splits are made on the leaf nodes.
The top-most decision node, which is considered the best predictor for the dataset, is called the root node. Decision trees can handle both categorical and numerical data for creating the splits, making them versatile in various applications.
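As a rough illustration of this structure, the sketch below models decision nodes and leaf nodes as two small classes and walks from the root to a leaf. The class names, the `predict` helper, and the weather tree are all hypothetical and chosen only for illustration; the sketch also handles categorical tests only.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    """Terminal node: stores the outcome (a class label or a number)."""
    prediction: Union[str, float]


@dataclass
class DecisionNode:
    """Internal node: tests one feature and has one child per outcome."""
    feature: str
    children: Dict[str, Union["DecisionNode", Leaf]]


def predict(node, sample):
    """Follow the branch matching the sample until a leaf is reached."""
    while isinstance(node, DecisionNode):
        node = node.children[sample[node.feature]]
    return node.prediction


# Hypothetical tree: the root tests "weather"; one branch tests "windy".
root = DecisionNode("weather", {
    "sunny": Leaf("play"),
    "rainy": DecisionNode("windy", {"yes": Leaf("stay in"), "no": Leaf("play")}),
})

print(predict(root, {"weather": "rainy", "windy": "yes"}))  # stay in
```

A numerical feature would need a threshold test (for example `value < t`) instead of a lookup by category.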
How Decision Tree Induction Works:
- Start with the whole dataset: The process begins with the entire dataset at the root node.
- Splitting the data: The algorithm evaluates every candidate attribute and splits the data on the one that best separates the target variable. "Best" is measured with a criterion such as Gini impurity or Information Gain (for classification), or Mean Squared Error (for regression).
- Recursive splitting: Each resulting subset is split again in the same way until a stopping condition is met, such as:
  - All data points in a node belong to the same class (pure node).
  - A maximum tree depth is reached.
  - The subset at a node is too small to split further.
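- Leaf nodes: Once a subset can't be split further, the algorithm turns it into a leaf node and assigns it a class label (for classification, typically the majority class in the subset) or a numerical value (for regression, typically the mean of the target values in the subset).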
- Tree formation: The tree structure, consisting of decision nodes and leaf nodes, is built in this manner.
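The splitting criteria named in the steps above can be written in a few lines. The following is a minimal sketch using their standard textbook definitions; the function names and the four-label sample are chosen here for illustration and are not taken from any particular library.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, subsets):
    """Parent entropy minus the size-weighted entropy of the subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted


def mse(values):
    """Mean squared error when every value is predicted by the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


labels = ["Yes", "Yes", "No", "No"]  # illustrative, evenly mixed node
print(gini(labels))     # 0.5
print(entropy(labels))  # 1.0
print(information_gain(labels, [["Yes", "Yes"], ["No", "No"]]))  # 1.0
```

A pure node scores 0 under both Gini impurity and entropy, so a split that produces only pure subsets achieves the largest possible gain.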
Example 1: Classification
Suppose we want to predict whether a person buys a product based on their age and income.
Age | Income | Buys Product |
---|---|---|
22 | Low | No |
30 | High | Yes |
25 | Low | No |
35 | High | Yes |
- Step 1: Start with the root node (the whole dataset).
- Step 2: Calculate which attribute (Age or Income) best separates the data. For example, if we split by Income, we see that:
  - Low income: Neither customer buys the product (class "No").
  - High income: Both customers buy the product (class "Yes").
- Step 3: This split might result in a decision tree like this:
        Income
        /    \
      Low    High
      /        \
     No        Yes
- Leaf nodes: The leaf nodes represent the classification result for each income group: "No" for low income, "Yes" for high income.
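To make Step 2 concrete, the sketch below computes the information gain of each attribute on the four rows of the table. The helper names are illustrative, and the Age threshold of 27.5 (the midpoint between ages 25 and 30) is an assumption made here, since the example does not state one.

```python
from collections import Counter
from math import log2

# Rows from the table above: (age, income, buys_product)
rows = [(22, "Low", "No"), (30, "High", "Yes"), (25, "Low", "No"), (35, "High", "Yes")]


def entropy(labels):
    """Shannon entropy in bits over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain(rows, key):
    """Information gain of partitioning rows by the value of key(row)."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row[2])
    parent = [row[2] for row in rows]
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(parent) - remainder


gain_income = gain(rows, key=lambda r: r[1])
# Assumed threshold: 27.5 lies halfway between ages 25 and 30.
gain_age = gain(rows, key=lambda r: r[0] < 27.5)

print(gain_income)  # 1.0
print(gain_age)     # 1.0
```

On this tiny table both attributes separate the classes perfectly, so choosing Income for the root is effectively a tie-break. A larger, noisier dataset would usually give the attributes different scores.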
Example 2: Regression
Now, let's predict the price of a house based on its size (in square feet) and the number of rooms.
Size (sq ft) | Rooms | Price (in thousands) |
---|---|---|
800 | 2 | 150 |
1200 | 3 | 200 |
1500 | 4 | 250 |
1800 | 4 | 300 |
- Step 1: Start with the root node (the whole dataset).
- Step 2: The algorithm might first check Size as a predictor. Splitting at 1300 sq ft separates the two smaller houses (prices 150 and 200, average 175) from the two larger ones (prices 250 and 300, average 275).
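- Step 3: After a further split, the tree might look like this. The second split is again on Size, because both larger houses have 4 rooms, and its threshold of 1650 sq ft is an illustrative midpoint between 1500 and 1800:
        Size < 1300 sq ft?
         /            \
       Yes             No
        |               |
   Price = 175    Size < 1650 sq ft?
                   /          \
                 Yes           No
                  |             |
             Price = 250   Price = 300
- Leaf nodes: Each leaf node holds the predicted price for one size range, here the mean price of the training houses that fall into that range.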
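The choice of the first split can be checked numerically. The sketch below searches for the Size threshold that minimises the mean squared error over the four houses in the table. Trying the midpoints between adjacent sizes is a common convention assumed here, and the function names are illustrative.

```python
sizes = [800, 1200, 1500, 1800]   # sq ft, from the table above
prices = [150, 200, 250, 300]     # in thousands, from the table above


def sse(values):
    """Sum of squared errors around the mean; 0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_threshold(xs, ys):
    """Try the midpoint between each pair of adjacent sorted x values."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical x values cannot be separated
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        error = (sse(left) + sse(right)) / len(pairs)
        if best is None or error < best[1]:
            best = (t, error)
    return best


threshold, error = best_threshold(sizes, prices)
print(threshold, error)  # 1350.0 625.0
```

The search returns 1350 rather than 1300, but no house lies between 1200 and 1500 sq ft, so both thresholds produce the same two groups. The left group's mean price is 175, which matches the first leaf of the tree.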
Decision Tree Algorithm:
The decision tree algorithm may seem complex, but it is built from a few simple steps applied recursively. It takes three parameters: D, attribute_list, and attribute_selection_method.
- D represents the data partition, typically the entire set of training tuples along with their corresponding class labels (input training data).
- attribute_list refers to the set of attributes that define the tuples.
- attribute_selection_method specifies a heuristic procedure for choosing the attribute that best discriminates the tuples according to their class label. It does so by applying an attribute selection measure, such as information gain or the Gini index.
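The following is a minimal sketch of how these three parameters might fit together in a recursive procedure. It is one possible reading rather than a definitive implementation: the function names `generate_decision_tree` and `info_gain_selector` are chosen here, only categorical attributes are handled, and the tree is returned as nested dictionaries.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain_selector(D, attribute_list):
    """One possible attribute_selection_method: highest information gain."""
    labels = [label for _, label in D]

    def gain(attr):
        groups = {}
        for row, label in D:
            groups.setdefault(row[attr], []).append(label)
        remainder = sum(len(g) / len(D) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    return max(attribute_list, key=gain)


def generate_decision_tree(D, attribute_list, attribute_selection_method):
    """D is a list of (row_dict, class_label) pairs.

    Returns a class label (leaf) or {attribute: {value: subtree}}.
    """
    labels = [label for _, label in D]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop on a pure node, or when no attributes remain (majority vote).
    if len(set(labels)) == 1 or not attribute_list:
        return majority
    best = attribute_selection_method(D, attribute_list)
    remaining = [a for a in attribute_list if a != best]
    partitions = {}
    for row, label in D:
        partitions.setdefault(row[best], []).append((row, label))
    return {best: {value: generate_decision_tree(part, remaining,
                                                 attribute_selection_method)
                   for value, part in partitions.items()}}


# Income column and labels from Example 1; Age is left out so that
# every attribute is categorical.
D = [({"Income": "Low"}, "No"), ({"Income": "High"}, "Yes"),
     ({"Income": "Low"}, "No"), ({"Income": "High"}, "Yes")]
tree = generate_decision_tree(D, ["Income"], info_gain_selector)
print(tree)  # {'Income': {'Low': 'No', 'High': 'Yes'}}
```

Passing the selection method as a parameter means the same recursion can be reused with a different measure, such as one based on the Gini index.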
Advantages of using Decision Trees
- A decision tree does not require feature scaling or standardization, because each split depends only on the ordering of a feature's values, not on their magnitude.
- Missing values need not block training: some algorithms, such as C4.5 and CART with surrogate splits, can handle them directly, although support varies between implementations.
- The decision tree model is straightforward and easy to explain to both technical teams and stakeholders.
- Compared with many other algorithms, decision trees require less effort in data preparation during the pre-processing phase.
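The point about scaling can be checked on the house data from Example 2. The sketch below standardizes the Size column and confirms that the error-minimising split still sends the same houses to each side. The helper `best_partition` is illustrative and assumes all feature values are distinct.

```python
from statistics import mean, pstdev

sizes = [800, 1200, 1500, 1800]   # sq ft, from Example 2
prices = [150, 200, 250, 300]     # in thousands, from Example 2


def sse(values):
    """Sum of squared errors around the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_partition(xs, ys):
    """Indices sent left by the threshold split with the lowest total SSE.

    Assumes all x values are distinct.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_cost, best_left = None, None
    for k in range(1, len(order)):
        left, right = order[:k], order[k:]
        cost = sse([ys[i] for i in left]) + sse([ys[i] for i in right])
        if best_cost is None or cost < best_cost:
            best_cost, best_left = cost, set(left)
    return best_left


# Standardize Size to zero mean and unit variance.
mu, sigma = mean(sizes), pstdev(sizes)
standardized = [(s - mu) / sigma for s in sizes]

print(best_partition(sizes, prices) == best_partition(standardized, prices))  # True
```

Only the numeric value of the threshold changes after standardization; the grouping of the houses, and therefore the predictions, stay the same.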