Decision trees practice
What to measure?
- Entropy measures the uncertainty (randomness) in our data
- Gini index measures the probability of misclassifying a randomly chosen sample
- Both are used to evaluate how well a split separates classes
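Both measures take the same input, a list of class probabilities, so they can be written as small helper functions. A minimal sketch (the function names here are illustrative, not from any library):

```python
from math import log2

def entropy(probs):
    # Shannon entropy in bits: -sum(p * log2(p)); skip p == 0, since 0 * log2(0) is taken as 0
    return -sum(p * log2(p) for p in probs if p > 0)

def gini(probs):
    # Gini impurity: 1 - sum(p^2)
    return 1 - sum(p ** 2 for p in probs)

print(entropy([0.5, 0.5]))  # 1.0 -- a 50/50 split is maximally uncertain
print(gini([0.5, 0.5]))     # 0.5
```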
Practice
In our sample dataset we have 3 setosa samples, 3 versicolor samples, and 4 virginica samples (10 in total).
To calculate entropy:
$$ -\sum_i p_i \log_2(p_i) $$ where $p_i$ is the probability of class $i$. So for each class:
- setosa: $-(\frac{3}{10} \cdot \log_2(\frac{3}{10}))$
- versicolor: $-(\frac{3}{10} \cdot \log_2(\frac{3}{10}))$
- virginica: $-(\frac{4}{10} \cdot \log_2(\frac{4}{10}))$
Then sum the calculated values to get the final entropy, as worked out below.
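Plugging the numbers in:
$$ -\left(\tfrac{3}{10}\log_2\tfrac{3}{10} + \tfrac{3}{10}\log_2\tfrac{3}{10} + \tfrac{4}{10}\log_2\tfrac{4}{10}\right) \approx 0.5211 + 0.5211 + 0.5288 \approx 1.5710 $$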
To calculate Gini: the formula is $1 - \sum_i p_i^2$, where $p_i$ is the probability of class $i$. For each class:
- setosa: $(\frac{3}{10})^2$
- versicolor: $(\frac{3}{10})^2$
- virginica: $(\frac{4}{10})^2$
Then subtract the sum from 1 to get the final Gini index, as worked out below.
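Plugging the numbers in:
$$ 1 - \left(\left(\tfrac{3}{10}\right)^2 + \left(\tfrac{3}{10}\right)^2 + \left(\tfrac{4}{10}\right)^2\right) = 1 - (0.09 + 0.09 + 0.16) = 0.66 $$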
decision-trees-practice.py
```python
import pandas as pd
from math import log2

# Sample dataset: 3 setosa, 3 versicolor, 4 virginica samples
data = {
    'species': [
        'setosa', 'setosa', 'setosa',
        'versicolor', 'versicolor', 'versicolor',
        'virginica', 'virginica', 'virginica', 'virginica'
    ]
}
df = pd.DataFrame(data)

# Per-class probabilities: p_i = count_i / total
class_counts = df['species'].value_counts()
total_samples = len(df)
probabilities = class_counts / total_samples

print("Dataset Distribution")
print(class_counts)
print("\nProbabilities")
print(probabilities)

# Entropy: -sum(p_i * log2(p_i))
entropy = -sum(p * log2(p) for p in probabilities)
print(f"\nEntropy = {entropy:.4f}")

# Gini index: 1 - sum(p_i^2)
gini = 1 - sum(p ** 2 for p in probabilities)
print(f"\nGini = {gini:.4f}")
```
The output of the code above:
```
Dataset Distribution
species
virginica     4
setosa        3
versicolor    3
Name: count, dtype: int64

Probabilities
species
virginica     0.4
setosa        0.3
versicolor    0.3
Name: count, dtype: float64

Entropy = 1.5710

Gini = 0.6600
```
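As a cross-check, here is a quick sketch assuming scikit-learn is available (not part of the original script): the impurity stored at the root node of a fitted decision tree is exactly the impurity of the full label set, so both values can be read back from `tree_.impurity`. The constant feature `X` is a dummy placeholder.

```python
from sklearn.tree import DecisionTreeClassifier

species = ['setosa'] * 3 + ['versicolor'] * 3 + ['virginica'] * 4
X = [[0]] * len(species)  # dummy constant feature; only the labels matter at the root

for criterion in ('gini', 'entropy'):
    clf = DecisionTreeClassifier(criterion=criterion).fit(X, species)
    # tree_.impurity[0] is the impurity of the root node
    print(f"{criterion}: {clf.tree_.impurity[0]:.4f}")  # expect gini: 0.6600, entropy: 1.5710
```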