To understand in a better way what the two formulas described in the previous post mean really let's build a practical use case: building simple rules to determine whether an account will be recovered.
The table below contains a table of 10 accounts. For example; the first account is Under 30 days DSO having a High Amount due, Credit Score and has been recovered.
In our example, as the root node contains 6 recovered account and 4 not recovered, we need to split on the root node. This process is then repeated to create child nodes with the least impurity.
If we split the root node based on DSO, we will get the following entropies :
Entropy(Under 30) = −[(4 / 4) * (Log2 (4 /43) + (0/4) * Log2 (0 / 4)] = 0
Entropy(Between 31 and 60) = −[(1 / 4) * Log2 (1 / 4) + (3 / 4) * Log2 (3 / 4)] = 0.811
Entropy(Over 61 days) = −[(1 / 2) * Log2 (1 / 2) + (1 / 2) * Log2 (1 / 2)] = 1
Because the dataset contains four accounts Under 30, four between 31 and 60 days and 2 accounts over 61 days, the impurity of this split is calculated :
Impurity (DSO) (4 / 10) * 0 + (4 / 10) * (0.811) + (2 / 10) * (1) = 0.52
To build this calculation in Excel we will use :
- SUM to compute the number of accounts for a given attribute value,
- DIV to compute the fraction of observations having each possible attribute value.
- LOG to calculate the impurity by also using IFERROR to ensure that the undefined value P(0)Log2(0)=0 the undefined value is replaced by 0.
- SUM to compute the entropy for each possible node split
- SUMPRODUCT to calculate the impurity for each split
As shown in the result above the impurity of DSO is the smallest (0.324), among the 3 features. We should then start by splitting over the DSO feature.
The process same Process can be repeated on the new subtree. The DSO Under 30 and DSO Over 61 are pure nodes, we don't need to split them.
DSO between 31 and 60 is impure. Splitting on DSO gives an impurity of 0.8, whereas Credit Score or Amount due yield impurity of 0. So we can either split over the Credit Score or the Amount due variables.
Finally, we are able to build the following decision tree, holding a simple classification rule of Accounts Receivables.
This simple rule-based classification tree might help Receivables Analysts to prioritize their actions towards receivables that might need more effort to collect.
Get for Free the Spreadsheet receivables Decision- Tree template
If you are interested in having your own copy of the Decision Tree template, implementing the calculations for the entropies measure and impurities for Receivables Analytics, please subscribe below and get the download link.
In this article, we learned how to model a simple decision tree using Excel. The key issue in creating a decision tree is its size. By going building deep trees we might have pure terminal nodes, but this usually overfits, which means the model will perform poorly on new data (different from the training set). This overfitting problem bird to a lot of research, such as Leo Breiman’s CART algorithm, and more advanced Tree-based models such as Random Forest and GBT, all of them have been successfully implemented as part of packages in R and python. Some of their implementations, Xgboost for instance, are considered to be the most winning models in Data Science competition (Kaggle).