Decision Trees

Some definitions:

1) Suppose P1, P2, ..., Pn are probabilities with P1 + P2 + ... + Pn = 1. We define the entropy as follows:

     Entropy(P1, P2, ..., Pn) = Sum_i -Pi*lg(Pi), where the sum runs from i = 1 to n

Consider partitioning a set S of examples into K classes S1, S2, ..., SK. Let |Si| denote the number of elements in class Si. We define the probability that an element is a member of Si as Pi = |Si|/|S|. We define

    Entropy(S) = Entropy(P1, P2, ..., PK)

We will frequently consider a set S split into positive examples Sp and negative examples Sn. Let p = |Sp| and n = |Sn|. Then

    Entropy(S) = - ( p/(p+n) * lg(p/(p+n)) + n/(p+n) * lg(n/(p+n)) )

For example, if S contains 9 positive examples and 5 negative examples, then

    Entropy(S) = - ( (9/14)*lg(9/14) + (5/14)*lg(5/14) ) = 0.940
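
To make this concrete, here is a minimal Python sketch (the function names entropy and entropy_from_counts are my own, not from the text) that evaluates the definition above for the 9-positive/5-negative example:

    import math

    def entropy(probs):
        """Entropy(P1, ..., Pn) = Sum_i -Pi*lg(Pi), with 0*lg(0) taken as 0."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def entropy_from_counts(counts):
        """Entropy of a set described by class counts, e.g. [9, 5]."""
        total = sum(counts)
        return entropy([c / total for c in counts])

    print(round(entropy_from_counts([9, 5]), 3))   # 0.94, the value computed above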

2) Consider a set S and an attribute A that splits S into finitely many subsets A1, A2, ..., An. Let |Ai| denote the number of elements in the subset Ai. Each Ai contains positive and negative examples, so we can define Entropy(Ai). We define Info(A) as:

        Info(A) = Sum_i (|Ai|/|S|) * Entropy(Ai)

For example, consider the following partial dataset:

    Temperature     Plays Soccer
    High            Yes
    Medium          Yes
    High            No
    Medium          Yes
    Low             Yes
    Low             No

    A_High has 2 elements, 1 positive and 1 negative.       Entropy(A_High) = 1
    A_Medium has 2 elements, both positive.                 Entropy(A_Medium) = 0
    A_Low has 2 elements, 1 positive and 1 negative.        Entropy(A_Low) = 1

    Info(A) = (2/6)*1 + (2/6)*0 + (2/6)*1 = 0.667
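
Here is a small Python sketch of Info(A); the helper names entropy_of_labels and info are my own, and the rows are the partial Temperature dataset above encoded as (attribute value, class label) pairs:

    from collections import Counter, defaultdict
    import math

    def entropy_of_labels(labels):
        """Entropy of a list of class labels, e.g. ["Yes", "No", "Yes"]."""
        counts = Counter(labels)
        total = len(labels)
        return -sum(c/total * math.log2(c/total) for c in counts.values())

    def info(rows):
        """Info(A) for rows given as (attribute value, class label) pairs."""
        by_value = defaultdict(list)
        for value, label in rows:
            by_value[value].append(label)
        return sum(len(labels)/len(rows) * entropy_of_labels(labels)
                   for labels in by_value.values())

    temperature_rows = [("High", "Yes"), ("Medium", "Yes"), ("High", "No"),
                        ("Medium", "Yes"), ("Low", "Yes"), ("Low", "No")]
    print(round(info(temperature_rows), 3))   # 0.667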

3) Consider a set S and an attribute A. We define Gain as:
   
        Gain(S,A) = Entropy(S) – Info(A)

The decision tree algorithms (ID3, C4.5, C5.0) iteratively pick the attribute that maximizes Gain. Maximizing Gain is biased toward attributes that have many values, so it has been proposed to instead pick the attribute that maximizes the Gain Ratio. The Gain Ratio is defined as follows:

        Gain Ratio(S,A) = Gain(S,A) / Split_Info(A), where Split_Info(A) is the entropy of the split produced by A, i.e. Entropy(|A1|/|S|, ..., |An|/|S|)
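
Continuing the Python sketch above (it assumes entropy_of_labels, info, Counter, and math are already in scope), Gain, Split_Info, and Gain Ratio can be written as:

    def split_info(rows):
        """Entropy of the split itself: Entropy(|A1|/|S|, ..., |An|/|S|)."""
        sizes = Counter(value for value, _ in rows)
        total = len(rows)
        return -sum(s/total * math.log2(s/total) for s in sizes.values())

    def gain(rows):
        """Gain(S, A) = Entropy(S) - Info(A)."""
        return entropy_of_labels([label for _, label in rows]) - info(rows)

    def gain_ratio(rows):
        return gain(rows) / split_info(rows)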

Example: Consider the following table, also used in our majority rule example:

    Outlook = {sunny, overcast, rainy}
    Temperature = {hot, mild, cool}
    Humidity = {high, normal}
    Windy = {false, true}
    Play = {no, yes}

    Outlook     Temperature   Humidity   Windy   Play
    Sunny       Hot           High       False   No
    Sunny       Hot           High       True    No
    Overcast    Hot           High       False   Yes
    Rainy       Mild          High       False   Yes
    Rainy       Cool          Normal     False   Yes
    Rainy       Cool          Normal     True    No
    Overcast    Cool          Normal     True    Yes
    Sunny       Mild          High       False   No
    Sunny       Cool          Normal     False   Yes
    Rainy       Mild          Normal     False   Yes
    Sunny       Mild          Normal     True    Yes
    Overcast    Mild          High       True    Yes
    Overcast    Hot           Normal     False   Yes
    Rainy       Mild          High       True    No

Note that there are 5 No and 9 Yes examples, so the initial entropy is -(5/14)*lg(5/14) - (9/14)*lg(9/14) = 0.940. We note the following quantities:

    Attribute       Gain                      Split Info   Gain Ratio
    Outlook         0.940 - 0.693 = 0.247     1.577        0.156
    Temperature     0.940 - 0.911 = 0.029     1.557        0.019
    Humidity        0.940 - 0.788 = 0.152     1.000        0.152
    Windy           0.940 - 0.892 = 0.048     0.985        0.049

Maximizing Gain would yield Outlook as the clear choice to split the tree on. Maximizing the Gain Ratio would again pick Outlook, but now the choice is not as clear because Humidity is almost as good. Further investigation would be required.
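
As a cross-check, the sketch below (again assuming the gain, split_info, and gain_ratio helpers defined earlier are in scope) recomputes these quantities directly from the dataset; the printed values should match the Gain, Split Info, and Gain Ratio columns above:

    # Dataset transcribed from the table: (Outlook, Temperature, Humidity, Windy, Play)
    data = [
        ("Sunny",    "Hot",  "High",   "False", "No"),
        ("Sunny",    "Hot",  "High",   "True",  "No"),
        ("Overcast", "Hot",  "High",   "False", "Yes"),
        ("Rainy",    "Mild", "High",   "False", "Yes"),
        ("Rainy",    "Cool", "Normal", "False", "Yes"),
        ("Rainy",    "Cool", "Normal", "True",  "No"),
        ("Overcast", "Cool", "Normal", "True",  "Yes"),
        ("Sunny",    "Mild", "High",   "False", "No"),
        ("Sunny",    "Cool", "Normal", "False", "Yes"),
        ("Rainy",    "Mild", "Normal", "False", "Yes"),
        ("Sunny",    "Mild", "Normal", "True",  "Yes"),
        ("Overcast", "Mild", "High",   "True",  "Yes"),
        ("Overcast", "Hot",  "Normal", "False", "Yes"),
        ("Rainy",    "Mild", "High",   "True",  "No"),
    ]
    attributes = ["Outlook", "Temperature", "Humidity", "Windy"]

    for col, name in enumerate(attributes):
        # Pair each attribute value with the Play label in the last column
        rows = [(row[col], row[4]) for row in data]
        print(f"{name:12s} gain={gain(rows):.3f} "
              f"split_info={split_info(rows):.3f} ratio={gain_ratio(rows):.3f}")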