Machine Learning (CPSC 540): Decision trees

First, I recommend visiting this website:
https://brilliant.org/wiki/entropy-information-theory/
It describes the meaning of entropy in information theory very nicely.

To help you build intuition, I'll describe information content with the following scenario.
Suppose there are two doors in a lobby. The first door is the main entrance, and the second is an entrance that only important people use, so the probability that the first door opens is much larger than the probability that the second door opens. You are the supervisor of the lobby, and your staff tells you which door opened.
Now, which report carries more information? Would you be interested in hearing that the first door opened? No: it happens all the time, so it tells you almost nothing, while the rare event of the second door opening is surprising and informative. This is captured by the following formula from the site.
$$I(x_i) = \log_2\left(\frac{1}{p_i}\right).$$How are entropy and uncertainty related? Entropy is defined as the expected information content of a message:
$$H = \sum_{i=1}^{n}p_i\log_2\left(\frac{1}{p_i}\right).$$If entropy is high, a single message gives you a lot of information, which means uncertainty is high: you could not have predicted the outcome. If entropy is low, a single message gives you little information, which means uncertainty is low: you already knew roughly what would happen. In this sense the two concepts, entropy and uncertainty, are linked.
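We can check this intuition numerically. Here is a minimal sketch of the two formulas above, applied to the two-door scenario; the door probabilities (95% vs. 5%) are made-up numbers for illustration.

```python
import math

def information(p):
    # Information content of an event with probability p, in bits.
    return math.log2(1 / p)

def entropy(probs):
    # Entropy = expected information content over all outcomes.
    return sum(p * math.log2(1 / p) for p in probs)

# Hypothetical door probabilities: the main entrance opens 95% of the time.
p_main, p_vip = 0.95, 0.05

print(information(p_main))       # small: the common event is unsurprising
print(information(p_vip))        # large: the rare event is informative
print(entropy([p_main, p_vip]))  # low entropy: the outcome is predictable
print(entropy([0.5, 0.5]))       # maximal for two outcomes: 1 bit
```

Note that entropy peaks at the uniform distribution, where you are maximally uncertain about which door will open.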

Then, what are Expected Entropy (EH) and Information Gain (I) in decision trees?
Consider the figure above. Each node has an entropy, and a test point has some probability of ending up in each child node. Whichever node a test point falls into, we want the uncertainty of that node to be low. So we take the expectation of the uncertainty over the child nodes, which gives the following formula.
$$EH(A) = \sum_{i=1}^{K}\frac{p_i + n_i}{p + n}H\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right).$$We use this to calculate the information gain. There is some uncertainty at the root node; after we split the data points on attribute $A$, the uncertainty changes. If the uncertainty becomes lower, we have gained information. We measure this with the information gain $I(A)$, and we choose the split that maximizes it:
$$I(A) = H\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - EH(A).$$
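The two formulas above can be sketched directly in code. This is an illustrative implementation, not from the course materials; the positive/negative counts in the examples are made up to show one informative split and one useless split.

```python
import math

def H(*probs):
    # Entropy of a distribution; the 0 * log(0) term is taken as 0.
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def info_gain(splits):
    # splits: list of (p_i, n_i) positive/negative counts, one pair per child.
    p = sum(pi for pi, ni in splits)
    n = sum(ni for pi, ni in splits)
    # Entropy at the root, before splitting on attribute A.
    root = H(p / (p + n), n / (p + n))
    # Expected entropy EH(A): child entropies weighted by the fraction
    # of data points that reach each child.
    eh = sum((pi + ni) / (p + n) * H(pi / (pi + ni), ni / (pi + ni))
             for pi, ni in splits)
    return root - eh

# A good split: one child is pure, the other is an even mix.
print(info_gain([(4, 0), (2, 2)]))
# A useless split leaves the class mix unchanged in every child:
print(info_gain([(3, 1), (3, 1)]))  # → 0.0
```

The second case shows why decision-tree learners maximize $I(A)$: a split that reproduces the root's class proportions in every child reduces no uncertainty, so its gain is zero.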
