***********************************
EDA with R
***********************************

.. note::

   The notes for this session have been updated to use Quarto instead of R Markdown. You can find the `old version here `_.

Intro and Objectives
--------------------

We will begin to do exploratory data analysis in R. After completing the activities in this module, you should be able to explore a dataset using:

* descriptive statistics,
* simple R scripts, including functions you write yourself,
* basic (and not so basic) plots with ggplot2.

We are going to explore a dataset related to New York City condo evaluations for fiscal year 2011-2012. It was obtained from the NYC Open Data initiative - https://data.cityofnewyork.us/.

Readings
---------

* I2R - Chapters 4-7
* R4DS - Chapters 1-2

Downloads and other resources
-----------------------------

* `Downloads_EDA1withR.zip `_

Other Resources:

* `Data files for the "R for Everyone" text `_
* `ggplot2 cheat sheet `_ (R Studio)
* `Hadley Wickham is working on a new ggplot2 book `_

Activities
-----------

We will work through two tutorials on EDA (with a short detour on creating user-defined functions in R).

Summary statistics
^^^^^^^^^^^^^^^^^^

R makes it easy to compute summary statistics. We will also see how to create R Projects to help you organize your R work.

* File: **EDA1a_summarystats_shell.qmd**
* `SCREENCAST: Overview of EDA and creating R Project `_ (8:30)
* `SCREENCAST: Initial exploration of housing dataframe `_ (11:21)
* `SCREENCAST: Modify dataframe, summary statistics, save as binary file `_ (22:24)

Writing your own functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^

We will do a brief introduction to writing functions in R.

* File: **summarystats_4470.R**
* `SCREENCAST: Create your own R function `_ (12:40)

.. note::

   The video above makes a few references to the "R for Everyone" text that we are no longer using. Instead, see Chapter 7 of the `An Introduction to R `_ online textbook.

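As a minimal sketch of the kind of function this detour covers, the hypothetical helper below bundles a few descriptive statistics into one named vector. The function name ``summary_stats`` and the sample values are illustrative assumptions; they are not taken from **summarystats_4470.R**.

.. code-block:: r

   # Hypothetical helper (not from summarystats_4470.R): returns a few
   # descriptive statistics for a numeric vector as a named vector.
   summary_stats <- function(x, na.rm = TRUE) {
     stopifnot(is.numeric(x))
     c(
       n      = sum(!is.na(x)),              # count of non-missing values
       mean   = mean(x, na.rm = na.rm),
       median = median(x, na.rm = na.rm),
       sd     = sd(x, na.rm = na.rm),
       min    = min(x, na.rm = na.rm),
       max    = max(x, na.rm = na.rm)
     )
   }

   # Made-up values, including one missing observation
   values <- c(10, 20, 30, 40, NA)
   stats <- summary_stats(values)
   print(stats)

Returning a named vector keeps the result easy to index by name, for example ``stats[["median"]]``.
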

Plots and graphs
^^^^^^^^^^^^^^^^^

Now we are going to see an area where R really shines - plotting.

- File: **EDA1b_basicplots_shell.qmd**
- `SCREENCAST: Intro to data visualization and base plot `_ (9:35)
- `SCREENCAST: Intro to ggplot2 and the Grammar of Graphics `_ (14:53)
- `SCREENCAST: Histograms, boxplots and violin plots, faceting and reusing plot objects `_ (11:17)
- `SCREENCAST: Scatter plots `_ (3:11)
- `SCREENCAST: Themes `_ (6:37)
- `SCREENCAST: Density plots and other histogram alternatives `_ (5:30)
- `SCREENCAST: Bar charts `_ (6:54)
- `SCREENCAST: Correlation plots and Geeky fun with xkcd `_ (7:10)

Explore (OPTIONAL)
------------------

* `A framework for exploratory data analysis `_ - As you browse this, there's a "More Pages" button at the bottom. You can also download the PDF from the GitHub site.

Data visualization
^^^^^^^^^^^^^^^^^^^

* There's no doubt that ggplot is awesome, but `check out what can be done if you have a good grasp of base plotting in R `_. When I read this, it felt a bit like matplotlib, the venerable Python-based plotting package.
* `R Graph Catalog `_ - Plot demos and code.
* `Data visualization Cheat Sheet from R Studio `_
* `Box plot comic `_ - You can always count on xkcd.
* `Some data visualization resources from my MIS 5460: Business Analytics class `_
* `R Shiny app for visualizing NYC bike share data `_ - Lots of interesting logistical challenges with bike share programs. EDA plays a role in gaining insight into system dynamics.

R Markdown
^^^^^^^^^^^

* Now that you know some basic R Markdown, you might want to dig into its capabilities a little further. This `overview provided by the folks at R Studio `_ is a good place to start. In addition, Ch 27-28 of RforE cover R Markdown, knitr, and LaTeX.
* `Daring Fireball: Markdown Syntax site `_ - This is John Gruber's site - he developed Markdown.
* `Tips and tricks for working with images and figures in R Markdown documents `_ - Very comprehensive post on the ins and outs of working with both external images and R-generated images within R Markdown documents.

Percentiles
^^^^^^^^^^^^^

It's easy to get enamored with averages. They don't tell the whole story. Look at percentiles, too.

* Examples of why percentiles are important for performance monitoring by `Dynatrace `_, `Elastic `_, `AppSignal `_ and `Optimizely `_.
* `Percentiles in PostgreSQL `_ - this SQL flavor has implemented some nice percentile and `time-weighted stats functions `_, including approximate percentiles.

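To illustrate why an average alone can mislead, here is a small sketch using base R's ``mean()`` and ``quantile()``. The ``response_ms`` values are made up for illustration and do not come from any of the linked sources.

.. code-block:: r

   # Made-up, right-skewed sample: most values are small, one is very large.
   response_ms <- c(12, 15, 14, 13, 16, 15, 14, 13, 12, 900)

   # The mean is pulled upward by the single extreme value.
   avg <- mean(response_ms)

   # Percentiles describe the distribution at several points instead of one.
   pcts <- quantile(response_ms, probs = c(0.50, 0.90, 0.95, 0.99))

   print(avg)
   print(pcts)

Comparing the mean with the 50th percentile shows how far a single outlier can move the average, while the upper percentiles reveal how extreme the tail is.
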