Modeling 2: Intro to classifiers with R (UNDER CONSTRUCTION)
NOTE: This is new material for Winter 2025
In class we’ll spend some time learning about using logistic regression for binary classification problems - i.e. when our response variable has two possible outcomes (e.g. customer defaults on loan or does not default on loan). We’ll explore other simple classification approaches such as k-Nearest Neighbors and basic classification trees. Trees, forests, and their many variants have proved to be some of the most robust and effective techniques for classification problems.
This module will take us 1.5 weeks.
Readings
ISLR - Sec 3.5 (kNN), Sec 4.1-4.3 (Classification, logistic regression), Ch 8 (trees)
ISLR Tidymodels labs - Sec 4.1, 4.2 (Classification and logistic regression) and Ch 8 (Tree based methods)
The Tidy Modeling with R book also has useful info throughout.
Downloads and other resources
Activities
We will work through a number of Quarto documents as we learn to build basic classifiers using R. Everything is available in the Downloads file above.
Intro to classification problems and the k-Nearest Neighbor technique
In this first part we’ll:
get a sense of what classification problems are all about,
get our first look at the very famous Iris dataset,
use a simple, model-free technique, known as k-Nearest Neighbors, to try to classify Iris species using a few physical characteristics.
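To make the idea concrete, here is a minimal sketch of k-Nearest Neighbors on the built-in `iris` data. It is not the contents of knn.qmd. The 70/30 split, the choice of the two petal measurements, `k = 5`, and all object names are assumptions made just for illustration. It uses `knn()` from the `class` package, which ships with standard R installations.

```r
library(class)  # provides knn(); a "recommended" package included with R

set.seed(42)

# Hold out roughly 30% of the flowers to check our predictions against.
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
features  <- c("Petal.Length", "Petal.Width")  # illustrative choice of predictors

# kNN is distance based, so put the features on a common scale.
# The test set is scaled with the training set's means and standard deviations.
train_x <- scale(iris[train_idx, features])
test_x  <- scale(iris[-train_idx, features],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

train_species <- iris$Species[train_idx]
test_species  <- iris$Species[-train_idx]

# Each test flower gets the majority species among its 5 nearest training flowers.
pred <- knn(train = train_x, test = test_x, cl = train_species, k = 5)

# Confusion matrix and overall accuracy on the held-out flowers.
table(predicted = pred, actual = test_species)
mean(pred == test_species)
```

Notice there is no "fitting" step. kNN simply stores the training data and looks up neighbors at prediction time, which is why it is often described as model free.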
You’ll use knn.qmd and follow along with this screencast:
Logistic regression
Logistic regression is a variant of multiple linear regression in which the response variable is binary (two possible outcomes). It is a commonly used technique for binary classification problems. It’s definitely more “mathy” than kNN. I’ll try to help you develop some intuition and understanding of this technique without getting too deeply into the math/stat itself. See the Explore section at the bottom of this page for some good resources on the underlying math and stat of logistic regression.
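For reference, the core of the model fits in one line. This is the standard textbook form, with notation chosen here for illustration: p is the probability that the response takes the outcome coded as 1, the x's are the predictors, and the betas are the coefficients to be estimated.

```latex
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
```

The left-hand form says the log-odds are linear in the predictors, which is the link to multiple linear regression. The right-hand form shows that the predicted probability always stays between 0 and 1. To classify, we compare p to a cutoff (0.5 is a common default, not a requirement).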
You’ll use logistic_regression.qmd and these screencasts:
We’ll start with a short introduction to the problem, the data and an ill-advised multiple linear regression model. We will also partition the data into training and test sets.
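Here is a rough sketch of both steps using base R only. The built-in `mtcars` data is a stand-in (not the dataset used in logistic_regression.qmd), with `am` (0 = automatic, 1 = manual) playing the role of the binary response. The 75/25 split, the predictors, and the object names are assumptions for illustration.

```r
set.seed(123)

# Partition the rows: 75% for training, the rest held out for testing.
n          <- nrow(mtcars)
train_rows <- sample(n, size = round(0.75 * n))
cars_train <- mtcars[train_rows, ]
cars_test  <- mtcars[-train_rows, ]

# The ill-advised model: ordinary linear regression on a 0/1 response.
lm_fit <- lm(am ~ wt + hp, data = cars_train)

# Nothing constrains these predictions to lie between 0 and 1,
# so they cannot safely be read as probabilities.
new_cars <- data.frame(wt = c(1.5, 3.2, 5.5), hp = c(100, 100, 100))
predict(lm_fit, newdata = new_cars)

# Check the range of predictions on the held-out cars as well.
range(predict(lm_fit, newdata = cars_test))
```

Linear regression happily extrapolates a straight line, so for unusually light or heavy cars the "probability" of a manual transmission can come out above 1 or below 0. That is the main reason we move on to logistic regression.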
Next, we will review the main ideas of the logistic regression model.
We’ll create a null model and then use glm() to build our first few models. We’ll see some of the challenges in interpretation of statistical models (and this is a tiny model).
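As a small illustration of what this looks like in code, the sketch below fits a null model and a one-predictor model with `glm()`. The built-in `mtcars` data and the predictor `wt` are stand-ins chosen for illustration, not the data or variables from logistic_regression.qmd.

```r
cars <- mtcars  # am is the binary response: 0 = automatic, 1 = manual

# Null model: intercept only. It predicts the same probability for every car.
null_model <- glm(am ~ 1, data = cars, family = binomial)

# The intercept is on the log-odds scale. Converting it back with plogis()
# recovers the overall proportion of manual cars.
plogis(coef(null_model)[[1]])
mean(cars$am)

# A first real model: transmission type as a function of weight.
wt_model <- glm(am ~ wt, data = cars, family = binomial)
summary(wt_model)

# Coefficients are changes in log-odds; exponentiate to get odds ratios.
exp(coef(wt_model))

# Predicted probability of a manual transmission for a car with wt = 3.
predict(wt_model, newdata = data.frame(wt = 3), type = "response")

# Does adding wt improve on the null model?
anova(null_model, wt_model, test = "Chisq")
```

The interpretation challenge shows up right away: the `wt` coefficient is a change in log-odds per unit of weight, which is not how most people think about probabilities. Exponentiating to an odds ratio helps a bit, but it still takes practice.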
We’ll learn how to use tidymodels to build, fit and assess logistic regression models.
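The build, fit, and assess workflow might look roughly like the sketch below. It assumes the tidymodels package is installed and again uses the built-in `mtcars` data as a stand-in. The split proportion, predictor, and object names are illustrative assumptions.

```r
library(tidymodels)

# tidymodels classification expects a factor outcome.
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Partition into training and test sets, keeping the class mix similar in both.
set.seed(2025)
cars_split <- initial_split(cars, prop = 0.75, strata = am)
cars_train <- training(cars_split)
cars_test  <- testing(cars_split)

# Build: a model specification, separate from any data.
logit_spec <- logistic_reg() %>%
  set_engine("glm")

# Fit: estimate the model on the training data.
logit_fit <- logit_spec %>%
  fit(am ~ wt, data = cars_train)

# Assess: predict classes and probabilities for the test set,
# then compare against the true labels.
preds <- predict(logit_fit, new_data = cars_test) %>%
  bind_cols(predict(logit_fit, new_data = cars_test, type = "prob")) %>%
  bind_cols(cars_test %>% select(am))

accuracy(preds, truth = am, estimate = .pred_class)
conf_mat(preds, truth = am, estimate = .pred_class)
```

One thing to like about this approach is that the specification step is engine agnostic. Swapping in a different kind of classifier later mostly means changing the specification, while the fitting and assessing code stays nearly the same.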
Decision trees
Now on to learning about decision trees and variants such as random forests. You’ll use trees.qmd with these screencasts.
We’ll start with a short introduction to the problem and the data. We’ll also partition the data and do a bit of data exploration.
So, how do decision trees decide how to create their branches? We’ll take a very brief look at this and point you to some resources to go deeper if you want.
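One common answer is that the tree tries many candidate splits and keeps the one that leaves the resulting groups as "pure" as possible. The sketch below uses Gini impurity as the purity measure, which is one widely used criterion (the screencast may discuss others). The helper functions `gini()` and `split_gini()` are hypothetical names written for this example, and the `iris` data is used only as a convenient illustration.

```r
# Gini impurity of a set of class labels: 0 when all labels agree,
# larger when the classes are more evenly mixed.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted average impurity of the two groups created by splitting x at a threshold.
split_gini <- function(x, labels, threshold) {
  left  <- labels[x <  threshold]
  right <- labels[x >= threshold]
  n <- length(labels)
  (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
}

# Impurity before any split: three equally common species.
gini(iris$Species)

# Candidate thresholds: midpoints between consecutive distinct petal lengths.
vals       <- sort(unique(iris$Petal.Length))
candidates <- (head(vals, -1) + tail(vals, -1)) / 2

# Score every candidate and keep the one with the lowest weighted impurity.
scores         <- sapply(candidates, function(t) split_gini(iris$Petal.Length, iris$Species, t))
best_threshold <- candidates[which.min(scores)]
best_threshold
min(scores)
```

A real tree algorithm repeats this search over every predictor, picks the best split overall, and then recurses on each of the two resulting groups until some stopping rule is met.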
Now we’ll look at some more advanced tree-based models.
Putting it all together - the Kaggle Titanic challenge (OPTIONAL)
This is the famous Kaggle practice competition that so many people have used as a first introduction to predictive modeling and to Kaggle. A number of very nice tutorials have been developed to help newcomers to Kaggle. So, take a look at the following R Markdown document. In addition to a little bit of EDA and some basic model building, you’ll find some interesting attempts at feature engineering as well as code for creating output files suitable for submitting to Kaggle to get scored. The Titanic Challenge is perpetually running, so feel free to try it out. Don’t pay much attention to the leaderboard, though, as people have figured out ways to get 100% predictive accuracy.
titanic/Titanic_kaggle.Rmd
Explore (OPTIONAL)
- StatQuest YouTube Channel - Josh Starmer
StatQuest: Logistic regression - there are a bunch of follow-on videos with various details of logistic regression
StatQuest: Random Forests: Part 1 - Building, using and evaluating
The vtreat package for data preparation for statistical learning models
Predictive analytics at Target: the ethics of data analytics
Kappa statistic defined in plain English - Kappa is a statistic used (among other things) to measure how well a classifier performs compared to a random-choice model, taking into account the underlying prevalence of the classes in the data.
Applied Predictive Modeling - This is another really good textbook on this topic that is well suited for business school students. You can see details about the book at its companion website and you can actually get the book as an electronic resource through the OU Library.
- The caret package for classification and regression training - Widely used R package for all aspects of building and evaluating classifier models. A few summers ago I wrote a three part series of blog posts on automating caret for efficient evaluation of models over various parameter spaces.
Tidymodels - “a collection of packages for modeling and machine learning using tidyverse principles.” The Tidy Modeling with R online book by Kuhn and Silge provides a very good introduction to the tidymodels package and how its constituent packages can be used for different parts of the modeling process.
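For the kappa statistic mentioned in the list above, here is the standard definition along with a small worked example. The confusion matrix counts are made up for illustration: 100 cases, of which 30 are truly positive, with 20 true positives, 10 false negatives, 5 false positives, and 65 true negatives. Here p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance given the class prevalences.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}

% Worked example with the hypothetical counts above:
p_o = \frac{20 + 65}{100} = 0.85

p_e = \underbrace{\frac{30}{100}\cdot\frac{25}{100}}_{\text{both say positive}}
    + \underbrace{\frac{70}{100}\cdot\frac{75}{100}}_{\text{both say negative}}
    = 0.075 + 0.525 = 0.60

\kappa = \frac{0.85 - 0.60}{1 - 0.60} = \frac{0.25}{0.40} = 0.625
```

So an accuracy of 85% sounds impressive, but since a chance-level classifier with the same class mix would agree with the truth 60% of the time, kappa credits the classifier with 62.5% of the possible improvement over chance.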