Group by analysis and data wrangling with R

Intro and Objectives

The term data wrangling is used to describe the process of reading, exploring, cleaning, and transforming data in preparation for modeling, statistics and machine learning. A popular rule of thumb is that you’ll typically spend 80% of total project time on data wrangling! Having done analytics work professionally both in industry and academia, this rings true. Getting good at data wrangling will pay off big time. We’ll start with basic wrangling tasks using R. Later in the semester we’ll revisit wrangling with Python and learn some more advanced methods. This does NOT mean that R can’t do the wrangling things that Python can - it can. It just makes sense to save some of the more advanced stuff like regex, JSON and web scraping until we get to Python. This will allow us to get going quicker doing statistical modeling in R. Common data wrangling tasks such as dealing with missing data and feature engineering will pop up over and over again as we get into statistical learning models.Some of the things we’ll do include:

  • We’ll learn about various ways to do “pivot” or “group by analysis” with a focus on the widely used dplyr package (more Hadley Wickham creations) - though I’ll also introduce classic R techniques using the apply family of functions.

  • reading csv, Excel, and SQLite data into R,

  • string and date/time manipulation for cleaning/transforming data,

  • tidying and reshaping data,

  • saving/writing data.

Readings

  • RforE - 6, 11, 12, 14-16

  • r4ds - 4, 6, 13-20

  • PDSwR - 4

Downloads and other resources

Activities

Tools like Excel Pivot Tables and aggregate SQL queries are commonly used for what is known as “group by” analysis.

We’ll start by learning about the new dplyr library - think pivot tables but way more powerful.

Here’s the name of the file we’ll be using. * GroupByAnalysis_dplyr_basics_shell.Rmd

I’m going to break this up into logical chunks.

If you have the time and interest, check out the “old school” approach use base R tools such as the apply family of functions as well as a tool called plyr.

  • GroupByAnalysis_oldschool_notes.Rmd (OPTIONAL)

Now let’s see some alternatives to read.table and read.csv, a bit on tibbles, RData and RDS files, built in datasets. See GettingDataIntoR_notes.Rmd.

Let’s use DataWrangling_combining_notes.Rmd file to learn about combining data by rows and columns as well as doing SQL style joins.

The tidyverse approach to reshaping data between long and wide formats is covered in DataWrangling_reshaping_notes.Rmd and in the following screencast.

Prior to the tidyverse, the reshape2 package was used to go between wide and long data formats. Then tidyr came along. The next screencast uses reshape2. It also gets into some basic string manipulation like substringing as well as some more advanced ggplot2 plot formatting. The associated Rmd file is DataWrangling_reshaping_pretidy_notes.Rmd.

Explore (OPTIONAL)

Split-Apply-Combine

Data reshaping

  • Tidy Data - This is the original paper by Hadley Wickham.