Decision Trees

Some definitions:

1) Suppose P1, P2, ..., Pn are probabilities with P1 + P2 + ... + Pn = 1. We define the entropy as follows:

     Entropy(P1, P2, ..., Pn) = Sum_i -Pi*lg(Pi), where the sum runs from i = 1 to n

Consider partitioning a set S of examples into K classes S1, S2, ..., SK. Let |Si| denote the number of elements in class Si. We define the probability that an element is a member of Si as Pi = |Si|/|S|. We define

    Entropy(S) = Entropy(P1, P2, ..., PK)

We will frequently consider a set S split into positive examples Sp and negative examples Sn. Let p = |Sp| and n = |Sn|. Then

    Entropy(S) = - ( p/(p+n) * lg(p/(p+n)) + n/(p+n) * lg(n/(p+n)) )

For example, if S contains 9 positive examples and 5 negative examples, then

    Entropy(S) = - ( (9/14)*lg(9/14) + (5/14)*lg(5/14) ) = 0.940
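
To make this concrete, here is a minimal Python sketch (the function names entropy and entropy_from_counts are my own, not from the text) that evaluates the definition above for the 9-positive/5-negative example:

    import math

    def entropy(probs):
        """Entropy(P1, ..., Pn) = Sum_i -Pi*lg(Pi), with 0*lg(0) taken as 0."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def entropy_from_counts(counts):
        """Entropy of a set described by class counts, e.g. [9, 5]."""
        total = sum(counts)
        return entropy([c / total for c in counts])

    print(round(entropy_from_counts([9, 5]), 3))   # 0.94, the value computed above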

2) Consider a set S and an attribute A that splits S into finitely many subsets A1, A2, ..., An. Let |Ai| denote the number of elements in the subset Ai. Each Ai contains positive and negative examples, so we can define Entropy(Ai). We define Info(A) as:

        Info(A) = Sum_i (|Ai|/|S|) * Entropy(Ai)

For example, consider the following partial dataset:

    Temperature     Plays Soccer
    High            Yes
    Medium          Yes
    High            No
    Medium          Yes
    Low             Yes
    Low             No

    A_High has 2 elements, 1 positive and 1 negative.       Entropy(A_High) = 1
    A_Medium has 2 elements, both positive.                 Entropy(A_Medium) = 0
    A_Low has 2 elements, 1 positive and 1 negative.        Entropy(A_Low) = 1

    Info(A) = (2/6)*1 + (2/6)*0 + (2/6)*1 = 0.667
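
Here is a small Python sketch of Info(A); the helper names entropy_of_labels and info are my own, and the rows are the partial Temperature dataset above encoded as (attribute value, class label) pairs:

    from collections import Counter, defaultdict
    import math

    def entropy_of_labels(labels):
        """Entropy of a list of class labels, e.g. ["Yes", "No", "Yes"]."""
        counts = Counter(labels)
        total = len(labels)
        return -sum(c/total * math.log2(c/total) for c in counts.values())

    def info(rows):
        """Info(A) for rows given as (attribute value, class label) pairs."""
        by_value = defaultdict(list)
        for value, label in rows:
            by_value[value].append(label)
        return sum(len(labels)/len(rows) * entropy_of_labels(labels)
                   for labels in by_value.values())

    temperature_rows = [("High", "Yes"), ("Medium", "Yes"), ("High", "No"),
                        ("Medium", "Yes"), ("Low", "Yes"), ("Low", "No")]
    print(round(info(temperature_rows), 3))   # 0.667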

3) Consider a set S and an attribute A. We define Gain as:
   
        Gain(S,A) = Entropy(S) – Info(A)

The decision tree algorithms (ID3, C4.5, C5.0) iteratively pick the attribute that maximizes Gain. Maximizing Gain is biased toward attributes that have many values, so it has been proposed to instead pick the attribute that maximizes the Gain Ratio. The Gain Ratio is defined as follows:

        Gain Ratio(S,A) = Gain(S,A) / Split_Info(A), where Split_Info(A) is the entropy of the split produced by A, i.e. Entropy(|A1|/|S|, ..., |An|/|S|)
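
Continuing the Python sketch above (it assumes entropy_of_labels, info, Counter, and math are already in scope), Gain, Split_Info, and Gain Ratio can be written as:

    def split_info(rows):
        """Entropy of the split itself: Entropy(|A1|/|S|, ..., |An|/|S|)."""
        sizes = Counter(value for value, _ in rows)
        total = len(rows)
        return -sum(s/total * math.log2(s/total) for s in sizes.values())

    def gain(rows):
        """Gain(S, A) = Entropy(S) - Info(A)."""
        return entropy_of_labels([label for _, label in rows]) - info(rows)

    def gain_ratio(rows):
        return gain(rows) / split_info(rows)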

Example: Consider the following table, also used in our majority rule example:

    Outlook = {sunny, overcast, rainy}
    Temperature = {hot, mild, cool}
    Humidity = {high, normal}
    Windy = {false, true}
    Play = {no, yes}

    Outlook     Temperature   Humidity   Windy   Play
    Sunny       Hot           High       False   No
    Sunny       Hot           High       True    No
    Overcast    Hot           High       False   Yes
    Rainy       Mild          High       False   Yes
    Rainy       Cool          Normal     False   Yes
    Rainy       Cool          Normal     True    No
    Overcast    Cool          Normal     True    Yes
    Sunny       Mild          High       False   No
    Sunny       Cool          Normal     False   Yes
    Rainy       Mild          Normal     False   Yes
    Sunny       Mild          Normal     True    Yes
    Overcast    Mild          High       True    Yes
    Overcast    Hot           Normal     False   Yes
    Rainy       Mild          High       True    No

Note that there are 5 No and 9 Yes examples, so the initial entropy is -(5/14)*lg(5/14) - (9/14)*lg(9/14) = 0.940. We note the following quantities:

    Attribute       Gain                      Split Info   Gain Ratio
    Outlook         0.940 - 0.693 = 0.247     1.577        0.156
    Temperature     0.940 - 0.911 = 0.029     1.557        0.019
    Humidity        0.940 - 0.788 = 0.152     1.000        0.152
    Windy           0.940 - 0.892 = 0.048     0.985        0.049

Maximizing Gain would yield Outlook as the clear choice to split the tree on. Maximizing the Gain Ratio would again pick Outlook, but now the choice is not as clear because Humidity is almost as good. Further investigation would be required.
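
As a cross-check, the sketch below (again assuming the gain, split_info, and gain_ratio helpers defined earlier are in scope) recomputes these quantities directly from the dataset; the printed values should match the Gain, Split Info, and Gain Ratio columns above:

    # Dataset transcribed from the table: (Outlook, Temperature, Humidity, Windy, Play)
    data = [
        ("Sunny",    "Hot",  "High",   "False", "No"),
        ("Sunny",    "Hot",  "High",   "True",  "No"),
        ("Overcast", "Hot",  "High",   "False", "Yes"),
        ("Rainy",    "Mild", "High",   "False", "Yes"),
        ("Rainy",    "Cool", "Normal", "False", "Yes"),
        ("Rainy",    "Cool", "Normal", "True",  "No"),
        ("Overcast", "Cool", "Normal", "True",  "Yes"),
        ("Sunny",    "Mild", "High",   "False", "No"),
        ("Sunny",    "Cool", "Normal", "False", "Yes"),
        ("Rainy",    "Mild", "Normal", "False", "Yes"),
        ("Sunny",    "Mild", "Normal", "True",  "Yes"),
        ("Overcast", "Mild", "High",   "True",  "Yes"),
        ("Overcast", "Hot",  "Normal", "False", "Yes"),
        ("Rainy",    "Mild", "High",   "True",  "No"),
    ]
    attributes = ["Outlook", "Temperature", "Humidity", "Windy"]

    for col, name in enumerate(attributes):
        # Pair each attribute value with the Play label in the last column
        rows = [(row[col], row[4]) for row in data]
        print(f"{name:12s} gain={gain(rows):.3f} "
              f"split_info={split_info(rows):.3f} ratio={gain_ratio(rows):.3f}")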