Group by analysis and data wrangling with R

Note

The notes for this session have been updated to use Quarto instead of R Markdown. You can find the old version here.

Intro and Objectives

The term data wrangling is used to describe the process of reading, exploring, cleaning, and transforming data in preparation for modeling, statistics and machine learning. A popular rule of thumb is that you’ll typically spend 80% of total project time on data wrangling! Having done analytics work professionally both in industry and academia, this rings true. Getting good at data wrangling will pay off big time. We’ll start with basic wrangling tasks using R. Later in the semester we’ll revisit wrangling with Python and learn some more advanced methods. This does NOT mean that R can’t do the wrangling things that Python can - it can. It just makes sense to save some of the more advanced stuff like regex, JSON and web scraping until we get to Python. This will allow us to get going quicker doing statistical modeling in R. Common data wrangling tasks such as dealing with missing data and feature engineering will pop up over and over again as we get into statistical learning models.Some of the things we’ll do include:

  • We’ll learn about various ways to do “pivot” or “group by analysis” with a focus on the widely used dplyr package (more Hadley Wickham creations) - though I’ll also introduce classic R techniques using the apply family of functions.

  • reading csv, Excel, and SQLite data into R,

  • string and date/time manipulation for cleaning/transforming data,

  • tidying and reshaping data,

  • saving/writing data.

Readings

  • R4DS - 5, 7, 14, 17, 19, 20, 21

Downloads and other resources

Activities

We will do a few different data wrangling things in this session.

Group by analysis with dplyr

Tools like Excel Pivot Tables and aggregate SQL queries are commonly used for what is known as “group by” analysis.

We’ll start by learning about the new dplyr library - think pivot tables but way more powerful.

Here’s the file we’ll be using.

  • File: dplyr_basics_shell.qmd

I’m going to break this up into logical chunks.

Ingesting data and saving data

Now let’s see some alternatives to read.table and read.csv, a bit on tibbles, RData and RDS files, built in datasets.

Combining data

Let’s learn about combining data by rows and columns as well as doing SQL style joins.

Reshaping data

The tidyverse approach to reshaping data between long and wide formats is next.

Older material (OPTIONAL)

If you have the time and interest, check out the “old school” approach use base R tools such as the apply family of functions as well as a tool called plyr.

  • File GroupByAnalysis_oldschool_notes.Rmd (OPTIONAL)

Prior to the tidyverse, the reshape2 package was used to go between wide and long data formats. Then tidyr came along. The next screencast uses reshape2. It also gets into some basic string manipulation like substringing as well as some more advanced ggplot2 plot formatting.

Explore (OPTIONAL)

Split-Apply-Combine

Data reshaping

  • Tidy Data - This is the original paper by Hadley Wickham.