Modeling 1: Overview and linear regression in R with tidymodels

NOTE: This is new material for Winter 2025

Intro and Objectives

One of the workhorses of statistical predictive modeling is the family of linear models. We’ll use multiple linear regression for numeric predictions and logistic regression as a classifier for binary response variables. These relatively simple models also give us a way to learn about important modeling topics such as partitioning data into training and test sets, model training, validation and diagnostics. We’ll also use regression to introduce parameter estimation and error metrics, both for assessing model fit and for comparing candidate models against each other. These topics underlie all statistical learning algorithms.

Readings

  • ISLR - Ch 1-3 and 4.1-4.3

An Introduction to Statistical Learning with applications in R

This is an outstanding book, which is available as a free pdf. ISLR covers the main statistical learning topics at a nice introductory level.

I will be listing the associated reading that you can do from ISLR as we explore various statistical learning topics - starting with linear regression.

Downloads and other resources

If you are rusty on statistics, there’s a really good OpenIntro Stats book. It is available free online, or you can pay what you want for a paperback copy. It includes R-based material.

Activities

We are going to work through a series of tutorials exploring the topic of building, using and evaluating predictive linear regression models.

Probability distributions in R

First we’ll do a brief review of working with probability distributions in R.
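As a rough sketch of what such a review tends to cover (the specific distributions and values here are illustrative choices, not taken from the tutorial), base R names its distribution functions with a one-letter prefix: d for density, p for cumulative probability, q for quantile and r for random draws.

```r
# Illustrative values only; the seed is arbitrary and just makes the draws reproducible.
set.seed(42)

# Normal distribution: the four prefixes in action
dens  <- dnorm(0, mean = 0, sd = 1)       # density of the standard normal at 0
prob  <- pnorm(1.96)                      # P(Z <= 1.96)
quant <- qnorm(0.975)                     # value with 97.5% of the probability below it
draws <- rnorm(1000, mean = 50, sd = 10)  # 1000 simulated values

# The same naming pattern applies to discrete distributions:
# P(X = 3) for X ~ Binomial(n = 10, p = 0.5)
p_three <- dbinom(3, size = 10, prob = 0.5)

# Simulated values should roughly recover the parameters we asked for
mean(draws)
sd(draws)
```

Note that pnorm() and qnorm() are inverses of each other, which is a handy check when you are unsure which one you need.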

Simple linear regression

A review of simple linear regression will set the stage for more complex linear models.
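For orientation, a minimal sketch of a one-predictor fit might look like the following. It assumes the built-in cars dataset (stopping distance vs. speed) as a stand-in; the tutorial itself may use different data.

```r
# Assumption: the built-in cars dataset stands in for the tutorial's data.
# Model: dist = b0 + b1 * speed + error
fit <- lm(dist ~ speed, data = cars)

coef(fit)                 # estimated intercept (b0) and slope (b1)
summary(fit)$r.squared    # share of variance in dist explained by speed

# Predict stopping distance for speeds not necessarily in the data
new_speeds <- data.frame(speed = c(10, 20))
predict(fit, newdata = new_speeds)
```

The column name in the newdata data frame has to match the predictor name used in the formula, otherwise predict() cannot find it.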

Multiple linear regression

Now we are ready for multiple linear regression. In this part, we will use the built-in lm() function for fitting linear models. We will deal with missing data, partition data into training and test datasets, build and fit linear models with lm(), assess model fit, make predictions and assess the predictive accuracy of our models using simple techniques.
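The workflow described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the built-in airquality dataset stands in for the course data, incomplete rows are simply dropped (the simplest of several missing-data strategies), and the 80/20 split and the seed are arbitrary choices.

```r
# Assumption: airquality (built in, contains NAs) stands in for the course dataset.
set.seed(2025)                       # arbitrary seed for a reproducible split

# 1. Missing data: drop rows with any NA (simplest option, not the only one)
aq <- na.omit(airquality)

# 2. Partition into training (80%) and test (20%) sets
n         <- nrow(aq)
train_idx <- sample(n, size = floor(0.8 * n))
train     <- aq[train_idx, ]
test      <- aq[-train_idx, ]

# 3. Fit a multiple linear regression on the training data only
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = train)

# 4. Assess in-sample fit
summary(fit)$adj.r.squared

# 5. Predict on the held-out test set and compute root mean squared error
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$Ozone - pred)^2))
rmse
```

Computing the error on rows the model never saw is what makes this a measure of predictive accuracy rather than of fit.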

Comparing competing models

There are better (or at least, newer) ways to build and compare models in R. We’ll use the tidymodels collection of packages to repeat much of what we just did.
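As a hedged sketch of what comparing two candidate models could look like with tidymodels, the block below fits a one-predictor and a three-predictor model and compares their test-set RMSE. The airquality data, the two formulas and the helper test_rmse() are illustrative assumptions, not the tutorial's own choices; the native pipe |> needs R 4.1 or later.

```r
library(tidymodels)   # attaches rsample, parsnip, yardstick, dplyr and others

# Assumption: airquality stands in for the course data; seed and split are arbitrary.
set.seed(2025)
aq <- na.omit(airquality)

# rsample: create and extract the training/test partition
aq_split <- initial_split(aq, prop = 0.8)
train    <- training(aq_split)
test     <- testing(aq_split)

# parsnip: one model specification, reused for both candidates
lm_spec <- linear_reg() |> set_engine("lm")

fit_small <- lm_spec |> fit(Ozone ~ Temp, data = train)
fit_full  <- lm_spec |> fit(Ozone ~ Solar.R + Wind + Temp, data = train)

# Hypothetical helper: test-set RMSE for a fitted model (yardstick::rmse)
test_rmse <- function(model) {
  predict(model, new_data = test) |>   # tibble with a .pred column
    bind_cols(test) |>
    rmse(truth = Ozone, estimate = .pred)
}

# One row per candidate model, labelled by the argument names
results <- bind_rows(
  small = test_rmse(fit_small),
  full  = test_rmse(fit_full),
  .id   = "model"
)
results
```

Because both candidates share one specification and one test set, the only thing that differs between the rows of results is the formula, which is what makes the comparison fair.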

Predicting winning percentage with MLB data

Now we’ll revisit the MLB data and try to predict winning percentage based on team-level data.

Residual analysis

We’ll do a brief review of a few residual analysis techniques.
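Two common residual checks can be sketched in base R as shown below. The model and the built-in cars dataset are assumptions for illustration; the tutorial may use other data and other diagnostics.

```r
# Assumption: a simple model on the built-in cars data, for illustration only.
fit <- lm(dist ~ speed, data = cars)

res         <- residuals(fit)   # observed minus fitted values
fitted_vals <- fitted(fit)

# Residuals vs. fitted: look for curvature or a funnel shape,
# which suggest a missed nonlinearity or non-constant variance
plot(fitted_vals, res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line suggest non-normal residuals
qqnorm(res)
qqline(res)
```

Calling plot(fit) on the fitted model produces a similar set of diagnostic plots in one step.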

Explore (OPTIONAL)

Regression modeling