***********************************
EDA with R
***********************************

Intro and Objectives
====================

We will begin to do exploratory data analysis in R. After completing the activities in this module, you should be able to explore a dataset using:

* descriptive statistics,
* simple R scripts including writing your own functions,
* basic (and not so basic) plots with ggplot2.

We are going to explore a dataset related to New York City condo evaluations for fiscal year 2011-2012. It was obtained from the NYC Open Data initiative - https://data.cityofnewyork.us/.

Readings
========

* RforE - Chapters 6, 7, 8
* r4ds - Chapters 1-2, 10-12
* PDSwR - Chapters 3, 4

Downloads and other resources
=============================

* `Downloads_EDA1withR.zip `_

Other Resources:

* `Data files for the "R for Everyone" text `_
* `ggplot2 cheat sheet `_ (R Studio)
* `Hadley Wickham is working on a new ggplot2 book `_

Activities
==========

We will work through two tutorials on EDA (with a short detour on creating user-defined functions in R).

* **EDA1a_summarystats_shell.Rmd**

  - see pdf slides
  - `SCREENCAST: Overview of EDA and creating R Project `_ (7:17)
  - `SCREENCAST: Initial exploration of housing dataframe `_ (16:41)
  - `SCREENCAST: Modify dataframe, summary statistics, save as binary file `_ (25:13)

* **summarystats_4470.R**

  - `SCREENCAST: Create your own R function `_ (12:40)

* **EDA1b_basicplots_shell.Rmd**

  - `SCREENCAST: Intro to data visualization `_ (2:41)
  - `SCREENCAST: Plotting using R's base plot system `_ (8:53)
  - `SCREENCAST: Intro to ggplot2 `_ (2:55)
  - `SCREENCAST: The Grammar of Graphics underlying ggplot2 `_ (7:24)
  - `SCREENCAST: Starting with qplot `_ (2:52)
  - `SCREENCAST: Using the ggplot() function `_ (7:07)
  - `SCREENCAST: Histograms `_ (5:29)
  - `SCREENCAST: Boxplots and violin plots, Faceting and reusing plot objects `_ (5:40)
  - `SCREENCAST: Scatter plots `_ (2:33)
  - `SCREENCAST: Themes `_ (5:40)
  - `SCREENCAST: Density plots and other histogram alternatives `_ (8:45)
  - `SCREENCAST: Bar charts `_ (6:13)
  - `SCREENCAST: Correlation plots and Geeky fun with xkcd `_ (6:28)

Explore (OPTIONAL)
==================

* `A framework for exploratory data analysis `_ - As you browse this, there's a "More Pages" button at the bottom. You can also download the pdf from the GitHub site.

Data visualization
------------------

* There's no doubt that ggplot is awesome, but `check out what can be done if you have a good grasp of base plotting in R `_. When I read this, it felt a bit like matplotlib, the venerable Python-based plotting package.
* `R Graph Catalog `_ - Plot demos and code.
* `Data visualization Cheat Sheet from R Studio `_
* `Box plot comic `_ - You can always count on xkcd.
* `Working with colours in R `_
* `Some data visualization resources from my MIS 5460: Business Analytics class `_
* `R Shiny app for visualizing NYC bike share data `_ - Lots of interesting logistical challenges with bike share programs. EDA plays a role in gaining insight into system dynamics.

Percentiles
-----------

It's easy to get enamored with averages, but they don't tell the whole story. Look at percentiles, too.

* Examples of why percentiles are important for performance monitoring by `Dynatrace `_, `Elastic `_, `AppSignal `_ and `Optimizely `_.
* `Percentiles in PostgreSQL `_ - This SQL flavor has implemented some nice percentile and `time weighted stats functions `_, including approximate percentiles.

R Markdown
----------

* Now that you know some basic R Markdown, you might want to dig into its capabilities a little further. This `overview provided by the folks at R Studio `_ is a good place to start. In addition, Ch 27-28 of RforE cover R Markdown, knitr, and LaTeX.
* `Daring Fireball: Markdown Syntax site `_ - This is John Gruber's site - he developed Markdown.
* `Tips and tricks for working with images and figures in R Markdown documents `_ - Very comprehensive post on the ins and outs of working with both external images and R-generated images within R Markdown documents.
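
If you want a small chunk to experiment with in an R Markdown document, here is a minimal sketch that ties back to the point made under Percentiles: an average on its own can hide a lot. It is only an illustration. The data are simulated with ``rlnorm()`` and are not taken from the NYC condo dataset, and the function name ``summary_pctiles`` and its default ``probs`` are made up for this example. It uses only base R.

```r
# Simulated right-skewed values standing in for something like property values.
# ASSUMPTION: these numbers are made up; they are not from the NYC condo data.
set.seed(4470)
values <- rlnorm(1000, meanlog = 12, sdlog = 1)

# Hypothetical user-defined function: returns the mean alongside a few
# percentiles so the two kinds of summary can be compared side by side.
summary_pctiles <- function(x, probs = c(0.05, 0.25, 0.50, 0.75, 0.95)) {
  x <- x[!is.na(x)]                                  # drop missing values
  pct <- quantile(x, probs = probs, names = FALSE)   # base R default (type 7)
  names(pct) <- paste0("p", probs * 100)             # e.g. p5, p25, p50, ...
  c(mean = mean(x), pct)
}

stats <- summary_pctiles(values)
print(round(stats))
```

With right-skewed data like this, the mean is pulled upward by a few very large values and ends up above the median (``p50``), which is one reason to look at several percentiles rather than the average alone.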