Group by analysis and data wrangling with R¶
Note
The notes for this session have been updated to use Quarto instead of R Markdown. You can find the old version here.
Intro and Objectives¶
The term data wrangling is used to describe the process of reading, exploring, cleaning, and transforming data in preparation for modeling, statistics and machine learning. A popular rule of thumb is that you’ll typically spend 80% of total project time on data wrangling! Having done analytics work professionally both in industry and academia, this rings true. Getting good at data wrangling will pay off big time. We’ll start with basic wrangling tasks using R. Later in the semester we’ll revisit wrangling with Python and learn some more advanced methods. This does NOT mean that R can’t do the wrangling things that Python can - it can. It just makes sense to save some of the more advanced stuff like regex, JSON and web scraping until we get to Python. This will allow us to get going quicker doing statistical modeling in R. Common data wrangling tasks such as dealing with missing data and feature engineering will pop up over and over again as we get into statistical learning models.Some of the things we’ll do include:
We’ll learn about various ways to do “pivot” or “group by analysis” with a focus on the widely used dplyr package (more Hadley Wickham creations) - though I’ll also introduce classic R techniques using the
apply
family of functions.reading csv, Excel, and SQLite data into R,
string and date/time manipulation for cleaning/transforming data,
tidying and reshaping data,
saving/writing data.
Readings¶
R4DS - 5, 7, 14, 17, 19, 20, 21
Downloads and other resources¶
Downloads_EDA2withR.zip - Group by analysis using old school tools like apply along with new era tools such as plyr and dplyr. Basic date/time manipulation.
Downloads_DataWranglngwithR.zip - More on getting data into and out of R. Also, data reshaping and basic string manipulation.
Data wrangling (dplyr and tidyr) cheat sheet (R Studio)
Activities¶
We will do a few different data wrangling things in this session.
Group by analysis with dplyr¶
Tools like Excel Pivot Tables and aggregate SQL queries are commonly used for what is known as “group by” analysis.
We’ll start by learning about the new dplyr library - think pivot tables but way more powerful.
Here’s the file we’ll be using.
File: dplyr_basics_shell.qmd
I’m going to break this up into logical chunks.
Ingesting data and saving data¶
Now let’s see some alternatives to read.table
and read.csv
, a bit on tibbles, RData and RDS files, built in datasets.
File data_wrangling_ingesting.qmd
SCREENCAST: Getting data into R (10:04)
Combining data¶
Let’s learn about combining data by rows and columns as well as doing SQL style joins.
File: data_wrangling_combining_notes.qmd
SCREENCAST: Combining data (22:01)
Reshaping data¶
The tidyverse approach to reshaping data between long and wide formats is next.
File: data_wrangling_reshaping_notes.qmd
Older material (OPTIONAL)¶
If you have the time and interest, check out the “old school” approach use base R tools such as the apply family of functions as well as a tool called plyr.
File GroupByAnalysis_oldschool_notes.Rmd (OPTIONAL)
Prior to the tidyverse, the reshape2 package was used to go between wide and long data formats. Then tidyr came along. The next screencast uses reshape2. It also gets into some basic string manipulation like substringing as well as some more advanced ggplot2 plot formatting.
File: DataWrangling_reshaping_pretidy_notes.Rmd
SCREENCAST (OPTIONAL): Wide <–> Long with reshape2 (melt and cast) (17:44)
Explore (OPTIONAL)¶
Split-Apply-Combine¶
The Split-Apply-Combine Strategy for Data Analysis - This is the original paper by Hadley Wickham. I’ve also included the pdf in the Downloads file for this session.
A brief introduction to “apply” in R - This is one of the best overviews I’ve found of the “apply” family of functions in R.
Introduction to dplyr for Faster Data Manipulation in R - Very good dplyr tutorial by Kevin Markham.
Data reshaping¶
Tidy Data - This is the original paper by Hadley Wickham.