Resource Center¶
Here are links to a variety of data science and business analytics related resources.
General business analytics, data science, statistical modeling¶
Analytics Magazine - published by INFORMS, the Institute for Operations Research and the Management Sciences. INFORMS is the premier professional society for analytics and it’s inexpensive to join as a student. Full disclosure, I’ve been an INFORMS member since about 1986 (when it was still ORSA/TIMS). We were doing analytics before it was called analytics. :)
Data science - A first introduction - a nice, free online text that gets you going with data science using R and the tidyverse, written by a group of faculty at the University of British Columbia
Awesome Data Science - giant curated list of data science resources
Statistical Modeling, Causal Inference, and Social Science - Andrew Gelman’s very high quality blog. Lots of stuff on doing stats correctly.
Statistical Modeling: The Two Cultures - paper by L. Breiman (2001) that gets at the heart of the statistics vs ML tension.
Open data science curriculum - collection of free courses on various data science, math and stat topics
Careers¶
There are numerous groups on Reddit related to analytics and data science. These can be very good resources for unvarnished conversations and opinions about careers and grad school, as well as technical advice.
December 2022: Data Science Career Resources and Essay on data science hiring process by Eric Ma, who writes the Data Science Programming Newsletter on Substack.
Programming tutorial hubs¶
Software Carpentry and Data Carpentry - helping scientists learn to do computational work with R, Python, SQL and other tools
Online courses¶
There are numerous online courses available through DataCamp, Coursera, EdX, Udemy and others. Here are a few Python and R ones I’ve checked out over the years.
Intro to Data Science in Python - I did this short course in Feb 2017 (Coursera UMich). Great fun. If you want a good pandas/python learning challenge, try the assignments.
Python for Everybody course - This site includes a bunch of videos and supplementary files. The whole thing was created by a professor at University of Michigan and is meant to be a totally open set of freely available learning materials for Python in the context of data analysis.
Coursera has some well-regarded R-based data science courses
Learning the command line¶
As you’ve no doubt gathered from this class, I’m a big fan of using the command line for certain data related tasks and think that command line skills are really important. There’s a second edition of the O’Reilly book “Data Science at the Command Line”, and it is freely available online.
https://datascienceatthecommandline.com/ - home page for the book
https://datascienceatthecommandline.com/2e/ - the free 2nd edition
Chapter 1 is a great overview of why you should become adept at using the command line.
Learning R¶
Online R tutorials, books and examples for getting started¶
R-bloggers - The aggregator for R related blogs.
R for Data Science - Free, online version of the book, R for Data Science by Hadley Wickham and Garrett Grolemund.
Quick-R - This is a great site dedicated to helping R newbies get over the somewhat steep R learning curve.
fasteR: The fast lane to learning R - created by Norm Matloff who is a big proponent of learning base R first before doing things with tidyverse packages.
STAT 545 - Data wrangling, exploration, and analysis with R - Jenny Bryan’s course developed at UBC and still used even though JB has moved on to RStudio. It not only covers R but also gets into things like version control, web scraping and Shiny.
Cookbook for R - Another great site for learning R. In their words: “The goal of the cookbook is to provide solutions to common tasks and problems in analyzing data.”
Webinars from RStudio - The creators of the hugely popular RStudio package have a ton of learning resources on their site.
The Official R Manuals - These are accessible from the main R Project page in the Documentation section.
Contributed Documentation - Many people have written tutorials, books, and other free documentation for various aspects of R. This is part of the magic of the R community.
Introducing R to a non-programmer in one hour - Just what it says.
Teach yourself Shiny - A somewhat recent development by the folks at RStudio is something called a Shiny web app. Learn to create interactive, R-driven web apps!
The base R vs tidyverse debate¶
The tidyverse has become increasingly popular and with this popularity has come more scrutiny. In particular, there’s a healthy debate on whether new R users should first learn base R and then move on to the tidyverse or whether they should immediately be taught the tidyverse approach. It really isn’t an either-or question and in this course you will learn both base R and tidyverse approaches. I do start with base R because I think you need a good understanding of things like vectors to make the most of the R language. At the end of the day, we use R to solve problems and the more tools you have to tackle those problems, the better off you will be. A few good resources on this debate include the following.
The TidyverseSkeptic project by Norm Matloff (his fasteR project is described above) is a well known essay on why new R learners should be taught base R first. Check out the Issues for some heated discussion.
David Robinson argues the tidyverse-first side in posts such as http://varianceexplained.org/r/teach-tidyverse/ and http://varianceexplained.org/r/why-I-use-ggplot2/ (yes, technically ggplot2 predates the tidyverse).
Data Carpentry has a post on base R and tidy equivalents
Caret vs tidymodels also (kinda) falls into this debate - see On not using tidymodels, Caret vs tidymodels: the old and the new, and this Reddit post.
One criticism of the tidyverse is that it can lead to dependency bloat - learn more from this essay about the “tinyverse”.
I did a short blog post on base vs tidy
There’s no doubt that ggplot is awesome, but check out what can be done if you have a good grasp of base plotting in R. When I read this, it felt a bit like matplotlib, the venerable Python based plotting package.
There are a few Reddit threads that address this topic including this one and this other one
Packages¶
The R ecosystem relies on high quality packages and its community of package developers. Here are some collections of package descriptions and links.
RStartHere - A very comprehensive and well-organized list of packages for doing data science in R.
Awesome R - Curated list of R packages by category (IDE, data manipulation, etc.)
Learning Python¶
Online Python tutorials, books and examples for getting started¶
Software Carpentry - Lessons - Software Carpentry is one of my all time favorite resources for teaching and learning practical programming skills. This link takes you to their list of “Lessons” (really entire mini-courses). In addition to a lesson on Python, you’ll find lessons on tons of stuff that is useful for business analytics and data science. Highly, highly recommended.
Whirlwind Tour of Python - Jake VanderPlas - Free 100-page PDF and associated Jupyter notebooks for those who want to learn Python for data science use and have some prior knowledge of programming.
Python for Everybody - Charles Severance - This is a remixed, freely available textbook on learning Python to do data analysis.
Think Python (Downey) - terrific book for newish Python learners
Automate the Boring Stuff with Python (Sweigart) - another really good free online book
Ted Petrou’s GitHub repos - I stumbled on this via LinkedIn. I went through his Jupyter notebooks in the Learn-Pandas repo and they were outstanding.
Blogs and listservs¶
Practical Business Python - Super relevant blog for business students learning Python.
Pycoders Weekly - Weekly email newsletter. Always has interesting stuff and almost always something directly data science related.
Libraries¶
Awesome Python - A curated list of awesome Python frameworks, libraries, software and resources
Statistics¶
If you are rusty on statistics, there’s a really good OpenIntro Stats book available as a free online book or you can pay what you want for a paperback copy. It includes R based material.
You can also find high quality free online statistics courses through the Open Learning Initiative as well as places like Coursera and EdX.
Cross Validated is a great Q&A forum for all things statistics. Lots of R related content.
Publicly available data¶
DrivenData Competitions - I’m not suggesting you compete (though you can), but these are a great source of high quality datasets. You’ll need to create a free account to be able to download data. I used this site as a motivation for my series of blog posts on algal bloom detection from satellite imagery.
Kaggle Datasets - need to create a free Kaggle account
Data is Plural - links to many interesting datasets
Modern plain text computing - this course has a list of practice data sources on the main page (check out tidy Tuesday)
Data.gov - US government data
Census.gov - US census data
Bureau of Transportation Statistics - tons of transportation related data
OpenML Datasets - site with many ML resources
cs109 Resources (2014) - Many links to datasets (as well as links to Python and misc data science stuff)
https://github.com/rstudio/RStartHere#data - From the RStartHere site
Workflow and reproducible analysis¶
Modern plain text computing - a course by Kieran Healy
Data Science Workflow: Overview and Challenges - Blog post by Philip Guo who did his dissertation on this topic.
Cookiecutter Data Science - “A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.”