COSC 667 - Machine Learning and Data Mining
  Fall 2018
16340
PH 221 MW 5:30-6:45

Instructor - Dr. William Sverdlik ( wsverdlik@emich.edu )
Office - 512 E
Phone - 487-7081

Office Hours - Walk In (office): MW 9:00-9:50, MW 2:00-3:15, W 4:30-5:30. Other times by appointment.
If these times don't work out for you, let me know. We'll figure out some other time. Please note that these times may change.
Text: No Text. You need to take good notes!! More on this later....

Software: WEKA (it's free!)




The Algorithms:
1) Majority Rule
2) One-Rule
3) ID3/C4/C5
    We need the goofy table
    From Temple University
    From Wikipedia
4) Models of Information Retrieval
5) Clustering
    I am looking for nice links for this topic. So far, the best reference I can find is the textbook.
    Clustering Part 1
    Clustering Part 2
6) Apriori Algorithm and Market Basket Analysis

    Apriori 0
    Apriori 1
    Apriori 2

    Sample Data from Text
    Apriori Algorithm from SUNY Buffalo (pdf)
    Apriori Algorithm from University of Iowa (pdf)
7) Neural Networks
    Part 1
    Part 2a
    Part 2b
    Some More
    Neural Network Homework
8) Regression Models
    Simple Linear Regression
    Correlation Coefficient (note Wikipedia's complicated name for this)
    General Linear Model (higher dimensions)
9) Probabilistic Methods
    Intro Stuff (slides 1-31)
    Abduction
    Markov Models and Hidden Markov Models
    Nice Easy Viterbi Algorithm Discussion (and the one we'll do in class)



Class Format:
We will discuss various machine learning and data mining algorithms; homework (including some computer programming) will be assigned periodically. Students are expected to respect submission deadlines; late submissions will be penalized 25% per class period late (homework is due at the beginning of class). There will be two exams, homework, and a term project.


Approximate Weighting:
    - Homeworks   25%
    - Exam 1      25%   September 24, 2018
    - Exam 2      25%   October 22, 2018
    - Project     25%
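
For anyone who wants to estimate where they stand, here is a minimal sketch of how the approximate weighting and the late penalty could be combined. Only the four 25% weights and the "25% per class period late" figure come from this page. The 0-100 score scale, the function names, and the reading of the penalty as "lose a quarter of the earned score per period, never below zero" are assumptions made for illustration; the actual grading may differ.

```python
# Weights are from the "Approximate Weighting" list above.
# Everything else (names, 0-100 scale, penalty interpretation) is assumed.
WEIGHTS = {"homework": 0.25, "exam1": 0.25, "exam2": 0.25, "project": 0.25}


def late_adjusted(score, periods_late):
    """Assumed late policy: lose 25% of the earned score per class period late."""
    factor = max(0.0, 1.0 - 0.25 * periods_late)  # never drops below zero
    return score * factor


def course_score(scores):
    """Weighted average of the four components, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical numbers, not real grades.
    example = {"homework": 80, "exam1": 90, "exam2": 70, "project": 100}
    print(course_score(example))
    print(late_adjusted(80, 1))
```

Under this reading, a homework handed in one class period late keeps three quarters of its score, and one handed in four or more periods late earns nothing.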


Cheating:
It's not a good thing to do. It's counter-productive, you don't learn anything, and most importantly, it violates University policy. Cheating is defined as presenting any work as your own that you obtained from some other source. This includes copying programs and copying on quizzes.

If you are having problems with the class, come see me!

OK. What about books, references, etc.?

Good question! You must take good notes. In fact, you must take notes for the entire class!! Here's how it will work:

Every week, there will be a note taker and a reviewer. The note taker takes class notes for both Monday and Wednesday classes. Then, the note taker passes along his/her notes to the reviewer by Wednesday night at the latest.
The reviewer reviews, edits and corrects the notes. Finally, the reviewer will email me the final version of the week's notes no later than Saturday evening. I will post these notes on the web before class the following Monday.

           Note Taker                       Reviewer

           Colvin, Rayaan                   Browning, Nicholas
           Browning, Nicholas               Bouzid, Abderraouf
           Bouzid, Abderraouf               Mylavarapu, Deepthi
           Mylavarapu, Deepthi              Nettem, Sindhura Lakshmi
           Nettem, Sindhura Lakshmi         Nijhawan, Dhwani Sunil
           Nijhawan, Dhwani Sunil           Uddur Gowrishankar, Manjushri
           Uddur Gowrishankar, Manjushri    Arif, Muhammad Sohaib

The class notes will be part of your homework grade. I will confer with the class on the quality of the notes provided.

Data Sets!!

        - WEKA Datasets. Note: these are already in ARFF format
        - University of California Irvine Data Repository (KDD). Well known collection!
        - University of California Irvine Data Repository (ML)
        - Federal Statistics. You will have to do some work to create the data files, but there are interesting things to find
        - Major League Baseball. You can find statistics for any major league sport! Try a Google search.

Group Presentations (find something you like or suggest something else)
Let's put a due date of Monday, October 1 on a 2-4 page proposal. What will you do, what data will be gathered, and who is involved?

Final Project Presentations

Your talk should last approximately 20 minutes and allow another 5 minutes for questions. Remember, you must submit a paper summarizing your results, as well as citing any references. It's a small class; we will discuss team size.

Here's a partial list of talks from previous years:

    - Logistic Regression and Email Spam 
    - Lunar Cycles and the Stock Market 
    - Mushroom Classification
    - Social Networks (structure)
    - Abandoned Objects in Video Feeds 

Papers:

Principal Component Analysis

Decision Trees and Entropy 


HOMEWORK 1 - Due Monday October 8:
Analyze some data!

HOMEWORK 2 - Due Monday October 15:
Read the following papers and submit a 1-2 page summary of each:
What's the difference between Data Mining and Statistics? Read this and find out.
One-Rule Rules! Forget these silly decision trees, one level suffices??

HOMEWORK 3a) - Due Monday October 29:

Read the following papers and submit a 1-2 page summary. Be prepared to discuss in class.
Nepotistic Links!! (A bit outdated, but interesting!!)

HOMEWORK 3b) - Due Monday October 29:
1-2 page write up for your final project. Be prepared to present and discuss.

HOMEWORK 4:
Summarize the New York Times article on Link Spam. Due TBA


Neural Network 1
Neural Network 2
Neural Network 3
Neural Network 4

HOMEWORK: Neural Network Program      THIS PROGRAM IS DUE ???????