Modeling 2: Intro to classifiers with R¶
Intro and Objectives¶
In class we’ll spend some time learning to use logistic regression for binary classification problems - i.e., when our response variable has two possible outcomes (e.g., a customer either defaults on a loan or does not). We’ll also explore other simple classification approaches such as k-Nearest Neighbors and basic classification trees. Trees, forests, and their many variants have proved to be some of the most robust and effective techniques for classification problems.
This module will take us 1.5 weeks.
Readings¶
RforE - Sec 20.1 (logistic regression), Sec 23.4 (decision trees), Ch 26 (caret)
PDSwR - Ch 6 (kNN), 7.2 (logistic regression), 6.3 & 9.1 (trees and forests)
ISLR - Sec 3.5 (kNN), Sec 4.1-4.3 (Classification, logistic regression), Ch 8 (trees)
Downloads and other resources¶
Activities¶
We will work through a number of R Markdown and other files as we learn to build basic classifiers using R. Everything is available in the Downloads file above.
Intro to classification problems and the k-Nearest Neighbor technique¶
In this first part we’ll:
get a sense of what classification problems are all about,
get our first look at the very famous Iris dataset,
use a simple, model-free technique known as k-Nearest Neighbors to try to classify iris species from a few physical measurements.
You’ll use knn/kNN_notes.Rmd and follow along with this screencast:
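If you'd like a quick preview before opening the notes, here's a minimal sketch of the kNN workflow using the class package. The 70/30 split and k = 5 are illustrative choices, not necessarily what the notes use.

```r
library(class)

set.seed(123)
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
train_x <- iris[train_idx, 1:4]    # sepal/petal length and width
test_x  <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]
test_y  <- iris$Species[-train_idx]

# Classify each test flower by majority vote among its 5 nearest neighbors.
# (In practice you'd usually scale the predictors first; the iris
# measurements happen to be on similar scales already.)
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# Overall accuracy on the held-out flowers
mean(pred == test_y)
```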
Logistic regression¶
Logistic regression is a variant of multiple linear regression in which the response variable is binary (two possible outcomes). It is a commonly used technique for binary classification problems. It’s definitely more “mathy” than kNN. I’ll try to help you develop some intuition and understanding of this technique without getting too deeply into the math/stat itself. See the Explore section at the bottom of this page for some good resources on the underlying math and stat of logistic regression.
You’ll use logistic_regression/IntroLogisticRegression_Loans_notes.Rmd and these screencasts:
We’ll start with a short introduction.
Now, we’ll review the statistical model and compare it to standard linear regression.
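For reference (standard notation, not tied to the symbols used in the notes): linear regression models the mean of the response directly as a linear function of the predictors, while logistic regression models the log-odds of the positive outcome, which keeps the implied probabilities between 0 and 1:

$$
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad p = \Pr(Y = 1 \mid x_1, \ldots, x_k)
$$

Solving for $p$ gives the familiar S-shaped logistic function, $p = 1 / \left(1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}\right)$.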
To do logistic regression in R, we use the glm(), or generalized linear model, command.
SCREENCAST - The glm (11:24)
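Here's a minimal sketch of the call. The loans data frame and its default, income, and balance columns are hypothetical stand-ins for whatever the notes actually use.

```r
# Fit a logistic regression; family = binomial is what makes glm() logistic.
# NOTE: `loans`, `default`, `income`, and `balance` are hypothetical names.
loan_fit <- glm(default ~ income + balance,
                data = loans,
                family = binomial(link = "logit"))

summary(loan_fit)   # coefficients are on the log-odds scale

# type = "response" back-transforms from log-odds to probabilities
pred_prob <- predict(loan_fit, type = "response")
```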
Next, we’ll do some model assessment and make predictions.
More model and prediction assessment using confusionMatrix().
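Here's a hedged sketch of how that call looks, continuing the hypothetical loan example above (caret must be installed, and the 0.5 cutoff is just a common default, not a rule):

```r
library(caret)

# Convert predicted probabilities to class labels at a 0.5 cutoff
# (the factor levels must match those of the reference factor)
pred_class <- factor(ifelse(pred_prob > 0.5, "yes", "no"),
                     levels = c("no", "yes"))

# Cross-tabulate predictions against the truth; caret also reports
# accuracy, kappa, sensitivity, and specificity
confusionMatrix(data = pred_class,
                reference = loans$default,   # hypothetical column
                positive = "yes")
```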
We’ll end with our final model comparisons and attempts at improvement.
Decision trees¶
Now on to learning about decision trees and variants such as random forests. You’ll use trees/classification_trees_notes.Rmd with these screencasts.
We’ll start with a short introduction.
So, how do decision trees decide how to create their branches? We’ll take a very brief look at this and point you to some resources to go deeper if you want.
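As one concrete example of what goes on under the hood: rpart’s default splitting criterion for classification is Gini impurity, 1 − Σ pₖ², where pₖ is the proportion of class k in a node; candidate splits are scored by how much they reduce it. A tiny hand-rolled illustration:

```r
# Gini impurity of a node: 1 minus the sum of squared class proportions
gini <- function(classes) {
  p <- table(classes) / length(classes)
  1 - sum(p^2)
}

node <- c("yes", "yes", "no", "no", "no", "no")  # toy node: 2 yes, 4 no
gini(node)                  # 1 - ((2/6)^2 + (4/6)^2) = 0.444...

gini(c("no", "no", "no"))   # a pure node has impurity 0
```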
We’ll end with our final model comparisons and attempts at improvement.
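If you want to experiment beyond the notes, here's a minimal sketch of fitting a single classification tree with rpart and a random forest with randomForest. It reuses the hypothetical loan variables from the logistic regression example; all three packages must be installed.

```r
library(rpart)
library(rpart.plot)
library(randomForest)

# A single classification tree (method = "class" for classification)
tree_fit <- rpart(default ~ income + balance,
                  data = loans, method = "class")
rpart.plot(tree_fit)   # visualize the splits

# A random forest: many trees grown on bootstrap samples, with a random
# subset of predictors considered at each split
rf_fit <- randomForest(default ~ income + balance, data = loans)
rf_fit                 # prints the out-of-bag error estimate
```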
Putting it all together - the Kaggle Titanic challenge (OPTIONAL)¶
This is the famous Kaggle practice competition that so many people used as a first introduction to predictive modeling and to Kaggle. A number of very nice tutorials have been developed to help newcomers to Kaggle. So, take a look at the following R Markdown document. In addition to a little bit of EDA and some basic model building, you’ll find some interesting attempts at feature engineering as well as creating output files suitable for submitting to Kaggle to get scored. The Titanic Challenge is perpetually running, so feel free to try it out. Don’t pay much attention to the leaderboard, though, as people have figured out ways to get 100% predictive accuracy.
titanic/Titanic_kaggle.Rmd
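As a hedged sketch of the submission mechanics (the model and predictors here are deliberately simple placeholders; the PassengerId/Survived output format is what Kaggle requires):

```r
# Read the Kaggle-provided files (download train.csv and test.csv first)
train <- read.csv("train.csv")
test  <- read.csv("test.csv")

# A deliberately simple baseline model; real entries do much more
fit <- glm(Survived ~ Pclass + Sex + Age, data = train, family = binomial)

# Predict on the test set; Age has missing values there, so fall back to
# 0 (did not survive) for rows we can't score
surv_prob <- predict(fit, newdata = test, type = "response")
submission <- data.frame(
  PassengerId = test$PassengerId,
  Survived = ifelse(is.na(surv_prob), 0L, as.integer(surv_prob > 0.5))
)

write.csv(submission, "titanic_submission.csv", row.names = FALSE)
```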
Explore (OPTIONAL)¶
- StatQuest YouTube Channel - Josh Starmer
  - StatQuest: Logistic regression - there are a bunch of follow-on videos with various details of logistic regression
  - StatQuest: Random Forests: Part 1 - Building, using and evaluating
- The vtreat package for data preparation for statistical learning models
- Predictive analytics at Target: the ethics of data analytics
- Kappa statistic defined in plain English - Kappa is a statistic used (among other things) to compare how well a classifier does against a random-choice model, while taking into account the underlying prevalence of the classes in the data.
- Applied Predictive Modeling - This is another really good textbook on this topic that is well suited for business school students. You can see details about the book at its companion website, and you can actually get the book as an electronic resource through the OU Library.
- The caret package for classification and regression training - Widely used R package for all aspects of building and evaluating classifier models. A few summers ago I wrote a three-part series of blog posts on automating caret for efficient evaluation of models over various parameter spaces.
- Tidymodels - “a collection of packages for modeling and machine learning using tidyverse principles.” The Tidy Modeling with R online book by Kuhn and Silge provides a very good introduction to the tidymodels framework and how its constituent packages can be used for different parts of the modeling process.