EDA with R¶
Note
The notes for this session have been updated to use Quarto instead of R Markdown. You can find the old version here.
Intro and Objectives¶
We will begin to do exploratory data analysis in R. After completing the activities in this module, you should be able to explore a dataset using:
descriptive statistics,
simple R scripts including writing your own functions,
basic (and not so basic) plots with ggplot2.
We are going to explore a dataset related to New York City condo evaluations for fiscal year 2011-2012. It was obtained from the NYC Open Data initiative - https://data.cityofnewyork.us/.
Readings¶
I2R - Chapters 4-7
R4DS - Chapters 1-2
Downloads and other resources¶
Other Resources:
Activities¶
We will work through two tutorials on EDA (with a short detour on creating user-defined functions in R).
Summary statistics¶
R makes it easy to compute summary statistics. We will also see how to create R Projects to help you organize your R work.
File: EDA1a_summarystats_shell.qmd
SCREENCAST: Modify dataframe, summary statistics, save as binary file (22:24)
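As a minimal sketch of the kind of steps this tutorial covers, the example below computes a few summary statistics and saves a data frame as a binary file. It uses R's built-in mtcars data as a stand-in for the course's condo dataset, and saveRDS() is just one of several ways to write a binary file; both choices are assumptions for illustration, not necessarily what the tutorial file does.

```r
# Built-in dataset used as a stand-in for the course data (assumption)
df <- mtcars

# Min, quartiles, median and mean for one numeric column
mpg_summary <- summary(df$mpg)
print(mpg_summary)

# Individual statistics
mpg_mean <- mean(df$mpg)
mpg_sd <- sd(df$mpg)

# Group-wise statistic: mean mpg for each cylinder count
mpg_by_cyl <- tapply(df$mpg, df$cyl, mean)
print(mpg_by_cyl)

# Save the data frame as a binary file in a temp folder, then read it back
rds_path <- file.path(tempdir(), "mtcars_copy.rds")
saveRDS(df, rds_path)
df2 <- readRDS(rds_path)
```

Reading the file back with readRDS() restores the data frame with its column types intact, which is the main advantage over round-tripping through a CSV file.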
Writing your own functions¶
We will do a brief introduction to writing functions in R.
File: summarystats_4470.R
Note
The video above makes a few references to the “R for Everyone” text, which we are no longer using. Instead, see Chapter 7 of the An Introduction to R online textbook (https://intro2r.com/).
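To give a sense of what a user-defined function looks like, here is a small sketch. The function name summary_stats and its arguments are hypothetical, chosen for illustration; they are not taken from the course file.

```r
# Hypothetical helper: returns several summary statistics for a numeric vector
summary_stats <- function(x, na.rm = TRUE) {
  # Fail early if the input is not numeric
  stopifnot(is.numeric(x))
  c(
    n      = sum(!is.na(x)),            # count of non-missing values
    mean   = mean(x, na.rm = na.rm),
    sd     = sd(x, na.rm = na.rm),
    median = median(x, na.rm = na.rm),
    min    = min(x, na.rm = na.rm),
    max    = max(x, na.rm = na.rm)
  )
}

# Example call on a small made-up vector containing a missing value
stats <- summary_stats(c(2, 4, 4, 5, NA, 10))
print(stats)
```

Returning a named vector keeps the result easy to index by name, for example stats[["median"]], and the default na.rm = TRUE means missing values are ignored unless the caller asks otherwise.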
Plots and graphs¶
Now we are going to see an area where R really shines - plotting.
File: EDA1b_basicplots_shell.qmd
SCREENCAST: Intro to data visualization and base plot (9:35)
SCREENCAST: Intro to ggplot2 and the Grammar of Graphics (14:53)
SCREENCAST: Histograms, boxplots and violin plots, faceting and reusing plot objects (11:17)
SCREENCAST: Scatter plots (3:11)
SCREENCAST: Themes (6:37)
SCREENCAST: Density plots and other histogram alternatives (5:30)
SCREENCAST: Bar charts (6:54)
SCREENCAST: Correlation plots and Geeky fun with xkcd (7:10)
Explore (OPTIONAL)¶
A framework for exploratory data analysis - As you browse this, there’s a “More Pages” button at the bottom. You can also download the pdf from the GitHub site.
Data visualization¶
There’s no doubt that ggplot is awesome, but check out what can be done if you have a good grasp of base plotting in R. When I read this, it felt a bit like matplotlib, the venerable Python-based plotting package.
R Graph Catalog - Plot demos and code
Box plot comic - You can always count on xkcd.
Some data visualization resources from my MIS 5460: Business Analytics class
R Shiny app for visualizing NYC bike share data - Lots of interesting logistical challenges with bike share programs. EDA plays a role in gaining insight into system dynamics.
R Markdown¶
Now that you know some basic R Markdown, you might want to dig into its capabilities a little further. This overview provided by the folks at RStudio is a good place to start. In addition, Ch 27-28 of RforE cover R Markdown, knitr, and LaTeX.
Daring Fireball: Markdown Syntax site - This is John Gruber’s site - he developed Markdown.
Tips and tricks for working with images and figures in R Markdown documents - Very comprehensive post on the ins and outs of working with both external images as well as R generated images within R Markdown documents.
Percentiles¶
It’s easy to get enamored with averages. They don’t tell the whole story. Look at percentiles, too.
Examples of why percentiles are important for performance monitoring by Dynatrace, Elastic, AppSignal and Optimizely.
Percentiles in PostgreSQL - this SQL flavor has implemented some nice percentile and time-weighted stats functions, including approximate percentiles.
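To see why averages alone can mislead, here is a small sketch using base R's quantile() function. The response times are made-up values chosen for illustration, not real monitoring data.

```r
# Hypothetical response times in ms: 95 fast requests and 5 very slow ones
response_ms <- c(rep(100, 95), rep(2000, 5))

# The mean is pulled upward by the handful of slow requests
avg <- mean(response_ms)

# Percentiles show where the bulk of the data sits and how bad the tail is
p <- quantile(response_ms, probs = c(0.50, 0.95, 0.99))

print(avg)
print(p)
```

Here the mean is nearly double the median even though 95% of the requests took exactly 100 ms, and the 99th percentile exposes the slow tail that the average only hints at.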