Group by analysis and data wrangling with R

Intro and Objectives

The term data wrangling describes the process of reading, exploring, cleaning, and transforming data in preparation for modeling, statistics, and machine learning. A popular rule of thumb is that you’ll typically spend 80% of total project time on data wrangling! Having done analytics work professionally in both industry and academia, I can say this rings true. Getting good at data wrangling will pay off big time.

We’ll start with basic wrangling tasks using R. Later in the semester we’ll revisit wrangling with Python and learn some more advanced methods. This does NOT mean that R can’t do the wrangling things that Python can - it can. It just makes sense to save some of the more advanced stuff like regex, JSON, and web scraping until we get to Python. This will let us get going more quickly with statistical modeling in R. Common data wrangling tasks such as dealing with missing data and feature engineering will pop up over and over again as we get into statistical learning models.

Some of the things we’ll do include:

doing “pivot” or “group by” analysis, with a focus on the widely used dplyr package (another Hadley Wickham creation), along with classic R techniques using the apply family of functions,
reading csv, Excel, and SQLite data into R,
string and date/time manipulation for cleaning/transforming data,
tidying and reshaping data,
saving/writing data.

Readings

RforE - 6, 11, 12, 14-16
r4ds - 4, 6, 13-20
PDSwR - 4

Downloads and other resources

Downloads_EDA2withR.zip - Group by analysis using old school tools like apply along with new era tools such as plyr and dplyr. Basic date/time manipulation.
Downloads_DataWranglngwithR.zip - More on getting data into and out of R. Also, data reshaping and basic string manipulation.
Data wrangling (dplyr and tidyr) cheat sheet (RStudio)

Activities

Tools like Excel Pivot Tables and aggregate SQL queries are commonly used for what is known as “group by” analysis.

We’ll start by learning about the dplyr package - think pivot tables but way more powerful.

Here’s the file we’ll be using: GroupByAnalysis_dplyr_basics_shell.Rmd

I’m going to break this up into logical chunks.

SCREENCAST: Intro to group by analysis with dplyr (11:02)
SCREENCAST: Using filter() to get subset of rows (10:37)
SCREENCAST: Using select() to get subset of columns (4:49)
SCREENCAST: Using mutate() to compute new variables (12:06)
SCREENCAST: Using summarise() to do aggregations like sums, counts, means, … (20:03)
SCREENCAST: Using dplyr with databases (6:14)
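
The verbs covered in these screencasts chain together naturally. Here is a minimal sketch, assuming the dplyr package is installed and using the built-in mtcars dataset purely for illustration (it is not the data used in the screencasts). The column names hp_per_wt, n_cars, mean_mpg, and max_hp_per_wt are made up for this example.

```r
library(dplyr)

# mtcars is a built-in R dataset, used here only as a stand-in
cars_summary <- mtcars %>%
  filter(hp > 100) %>%               # keep a subset of rows
  select(mpg, cyl, hp, wt) %>%       # keep a subset of columns
  mutate(hp_per_wt = hp / wt) %>%    # compute a new variable
  group_by(cyl) %>%                  # define the groups
  summarise(
    n_cars = n(),                    # count of rows in each group
    mean_mpg = mean(mpg),            # mean of mpg in each group
    max_hp_per_wt = max(hp_per_wt)   # max of the derived variable
  )

print(cars_summary)
```

The group_by() plus summarise() pair plays the role of an Excel pivot table or a SQL GROUP BY query: one output row per group, one output column per aggregate.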

If you have the time and interest, check out the “old school” approach, which uses base R tools such as the apply family of functions as well as a package called plyr.

GroupByAnalysis_oldschool_notes.Rmd (OPTIONAL)
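
To give a flavor of the base R approach, here is a small sketch that computes the same grouped mean three ways using only functions that ship with R. It uses the built-in mtcars dataset as a stand-in and is not taken from the notes file. plyr is not shown here.

```r
# Mean mpg by number of cylinders, three base R ways

# tapply: apply a function to a vector split by a grouping variable
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)

# aggregate: formula interface, returns a data frame
mpg_agg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Explicit split-apply-combine: split() into groups, sapply() over them
mpg_split <- sapply(split(mtcars$mpg, mtcars$cyl), mean)

print(mpg_by_cyl)
print(mpg_agg)
```

tapply() and sapply() return named vectors, while aggregate() returns a data frame, which is often more convenient for further processing.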

Now let’s look at some alternatives to read.table and read.csv, along with a bit on tibbles, RData and RDS files, and built-in datasets. See GettingDataIntoR_notes.Rmd.

SCREENCAST: Getting data into R (12:54)
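
As a rough illustration of these options, the sketch below writes a small CSV to a temporary file and then reads and saves it in several ways. The file paths and object names are made up for the example, and the readr step runs only if that package happens to be installed.

```r
# Write a small CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(head(mtcars), csv_path, row.names = FALSE)

# Base R reader: returns a plain data frame
cars_df <- read.csv(csv_path)

# readr alternative (optional): returns a tibble
if (requireNamespace("readr", quietly = TRUE)) {
  cars_tbl <- readr::read_csv(csv_path)
}

# RDS: stores a single object; you choose the name when reading it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(cars_df, rds_path)
cars_from_rds <- readRDS(rds_path)

# RData: stores one or more objects; load() restores them under their original names
rdata_path <- tempfile(fileext = ".RData")
save(cars_df, file = rdata_path)
rm(cars_df)
load(rdata_path)   # cars_df exists again

# Built-in datasets can be loaded by name
data(iris)
```

A practical difference: readRDS() lets you assign the result to any name you like, whereas load() silently recreates objects under their saved names and can overwrite existing ones.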

Let’s use the DataWrangling_combining_notes.Rmd file to learn about combining data by rows and columns as well as doing SQL-style joins.

SCREENCAST: Combining data (16:05)
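
Here is a minimal sketch of these ideas, assuming dplyr is installed. The customers, orders, and regions tables are hypothetical toy data invented for this example, not data from the notes file.

```r
library(dplyr)

# Hypothetical toy tables
customers <- data.frame(cust_id = c(1, 2, 3), name = c("Ann", "Bo", "Cy"))
orders <- data.frame(
  order_id = c(10, 11, 12),
  cust_id = c(1, 1, 4),
  amount = c(25, 40, 15)
)

# Combine by rows: stack data frames that share columns
more_customers <- data.frame(cust_id = 4, name = "Di")
all_customers <- bind_rows(customers, more_customers)

# Combine by columns: rows are matched by position, so the order must agree
regions <- data.frame(region = c("N", "S", "E"))
customers_wide <- bind_cols(customers, regions)

# SQL-style joins on a key column
inner <- inner_join(customers, orders, by = "cust_id")  # only matching keys
left <- left_join(customers, orders, by = "cust_id")    # all customers, NA where no order

# Base R equivalent of the inner join
base_inner <- merge(customers, orders, by = "cust_id")
```

Note that bind_cols() matches rows purely by position, so a join on a key column is usually the safer choice when row order is not guaranteed.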

The tidyverse approach to reshaping data between long and wide formats is covered in DataWrangling_reshaping_notes.Rmd and in the following screencast.

SCREENCAST: Wide <-> Long with tidyr (11:18)
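
A small sketch of wide-to-long and back, assuming a recent tidyr (version 1.0.0 or later) that provides pivot_longer() and pivot_wider(). The screencast and notes may use different functions, such as the older gather() and spread(). The sales table is made-up data for illustration.

```r
library(tidyr)

# Hypothetical wide table: one row per store, one column per quarter
sales_wide <- data.frame(
  store = c("A", "B"),
  Q1 = c(100, 80),
  Q2 = c(120, 90)
)

# Wide -> long: quarter columns become (quarter, sales) pairs
sales_long <- pivot_longer(
  sales_wide,
  cols = c(Q1, Q2),
  names_to = "quarter",
  values_to = "sales"
)

# Long -> wide: spread the quarter values back out into columns
sales_wide_again <- pivot_wider(
  sales_long,
  names_from = quarter,
  values_from = sales
)

print(sales_long)
print(sales_wide_again)
```

The long form is usually what ggplot2 and dplyr group by operations want, while the wide form is often easier to read in a report.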

Prior to the tidyverse, the reshape2 package was used to go between wide and long data formats. Then tidyr came along. The next screencast uses reshape2. It also gets into some basic string manipulation like substringing as well as some more advanced ggplot2 plot formatting. The associated Rmd file is DataWrangling_reshaping_pretidy_notes.Rmd.

SCREENCAST (OPTIONAL): Wide <-> Long with reshape2 (melt and cast) (17:44)
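
For comparison, here is a rough sketch of the same kind of reshaping with reshape2, plus a simple substring step. It assumes the reshape2 package is installed. The scores table and the exam_num column are invented for illustration and do not come from the Rmd file.

```r
library(reshape2)

# Hypothetical wide table: one row per student, one column per exam
scores_wide <- data.frame(
  student = c("s01", "s02"),
  exam_1 = c(88, 75),
  exam_2 = c(92, 81),
  stringsAsFactors = FALSE
)

# Wide -> long with melt()
scores_long <- melt(
  scores_wide,
  id.vars = "student",
  variable.name = "exam",
  value.name = "score"
)

# Basic string manipulation: pull the exam number out of "exam_1", "exam_2".
# melt() returns the variable column as a factor, so convert it first.
scores_long$exam_num <- as.integer(substr(as.character(scores_long$exam), 6, 6))

# Long -> wide with dcast()
scores_back <- dcast(scores_long, student ~ exam, value.var = "score")

print(scores_long)
print(scores_back)
```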

Explore (OPTIONAL)

Split-Apply-Combine

The Split-Apply-Combine Strategy for Data Analysis - This is the original paper by Hadley Wickham. I’ve also included the PDF in the Downloads file for this session.
A brief introduction to “apply” in R - This is one of the best overviews I’ve found of the “apply” family of functions in R.
Introduction to dplyr for Faster Data Manipulation in R - Very good dplyr tutorial by Kevin Markham.

Data reshaping

Tidy Data - This is the original paper by Hadley Wickham.