*****************************************************
Text wrangling with Python
*****************************************************

Intro and Objectives
====================

Now that we've got some basic Python hacking skills and have learned a little about ingesting data files of various types, we are going to learn some more advanced data cleaning techniques using things like regular expressions (regex) and even "fuzzy matching". The days of purely analyzing numeric data are over, and text mining skills are really nice to add to your toolkit. These topics will start us on that part of our journey.

Readings
========

* WToP - p76-91 (covers regex and a preview of data science tools in Python)
* KDNuggets newsletter has a nice article on `Practical skills that practical data scientists need `_ (see also the first comment after the story).

Downloads and other resources
=============================

Regex
-----

* `Downloads_DataCleaning_regex.zip `_

Web-based tools and tutorials
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* `RegExr `_ - an HTML/JS based site for creating, testing, and learning about Regular Expressions.
* `regex101 `_ - Another nice interactive web-based tool for learning regex.
* `RegexOne `_ - Learn regular expressions with simple, interactive examples.
* `Regular-Expressions.info `_ - One of my go-to sites for regex for a long time now. Very complete, many examples with substantive explanations.
* `Learning to Use Regular Expressions - Gnossis.cx `_ - This is the site from which I first learned regular expressions. It has been around forever, is widely read, and quite good.
* `Regular Expressions - A Gentle User Guide and Tutorial `_ - This is a good tutorial, cleverly written, at a greater level of detail than some of the others above. It's got a browser-based regex testing tool, and the examples are based on matching parts of server logs, which is a relevant application for our class.
* `Regex Cheat Sheet `_
* `Regular Expressions: Now You Have Two Problems `_ - Classic blog post on regex and a related famous quote about regex. Good links to some resources on the bare minimum that every analyst/coder/hacker should know about the incredible world of regular expressions.
* https://xkcd.com/208/

Books/Chapters
^^^^^^^^^^^^^^^

* `Mastering Regular Expressions `_ (Friedl) - The definitive book on regex.
* `Regular Expressions Cookbook `_ (Goyvaerts & Levithan)
* Chapter 16 in *R for Everyone* covers string manipulation and shows how regex can be used in R.
* Chapter 7 in *Data Wrangling with Python* shows how regex can be used within Python.
* *Whirlwind Tour of Python* - p76-91

Activities
================================

* We will do several short exercises to learn the basics of regex, using the regex slides and `regexr.com `_

  - `SCREENCAST: Intro to regex `_ (2:29)
  - `SCREENCAST: First quantifiers `_ (9:51)
  - `SCREENCAST: Escaping special characters `_ (3:43)
  - `SCREENCAST: Character classes and more quantifiers `_ (13:09)
  - `SCREENCAST: Matching at end and/or beginning of line `_ (6:15)
  - `SCREENCAST: General regex tips `_ (4:19)

* Then we'll see how to use regex from within Python

  - **data_cleaning_regex.ipynb**

    - `SCREENCAST: The python re module - part 1 `_ (15:33)
    - `SCREENCAST: The python re module - part 2 `_ (4:29)
    - `SCREENCAST: The python re module - part 3 `_ (17:19)

  - **process_apache_log.py**

    - `SCREENCAST: Using re in Python program to process Apache log file `_ (15:49)

* Brief intro to "fuzzy string matching".

  - **data_cleaning_fuzzy_string_matching.ipynb** uses `fuzzywuzzy`

* [OPTIONAL] Using VSCode to work with Jupyter notebooks

  Another way to work with Jupyter notebooks in Windows is with Microsoft's free IDE known as `VS Code `_.

  - `16 Reasons to Use VS Code for Developing Jupyter Notebooks `_ - good blog post from Practical Business Python
  - `SCREENCAST: Intro to VS Code for Jupyter Notebooks `_

* [OPTIONAL] More cleaning examples that you can explore on your own

  - **data_cleaning_finding_duplicates.ipynb** shows the power of the `set` data structure
  - **data_cleaning_headerfixing.ipynb** is a good exercise in complex looping over dictionaries and lists

Explore
=======

* `A REALLY RELEVANT post in the Practical Business Python blog `_ - Includes slides from the author's talk entitled “Escaping Excel Hell with Python and Pandas.”
* `The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) `_
* `Plain Text. Really? `_ - Fascinating take on the world of plain text. Highly recommended.
* `On data types in languages like Python `_

More on learning to program in Python
-------------------------------------

* `Advanced R `_ - This is Hadley Wickham's new book for those who want to really dig deep into R programming.
* `Downloads_IntroPython3.zip `_ These are based on the Software Carpentry tutorials on programming with Python. They cover slightly more advanced topics. We won't cover them in class, but they are useful for going beyond basic programming.

  * Error Handling - https://swcarpentry.github.io/python-novice-inflammation/09-errors.html
  * Defensive programming - https://swcarpentry.github.io/python-novice-inflammation/10-defensive.html
  * Debugging - https://swcarpentry.github.io/python-novice-inflammation/11-debugging.html
  * Handling command line arguments - https://swcarpentry.github.io/python-novice-inflammation/12-cmdline.html
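
To give a feel for how the regex activities and the defensive programming topics above fit together, here is a minimal sketch of parsing web server log lines with Python's ``re`` module. It is not the course's **process_apache_log.py**; the pattern assumes the Apache Common Log Format, and the function name ``parse_log_line`` and the sample lines are made up for illustration.

```python
import re

# Assumed layout (Apache Common Log Format):
#   host ident user [timestamp] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '                    # client host, then ident and user (ignored)
    r'\[(?P<timestamp>[^\]]+)\] '                 # everything between the square brackets
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '  # quoted request line
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'         # 3-digit status, then byte count or "-"
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        # Defensive choice: signal a bad line to the caller instead of raising.
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A size of "-" means no bytes were sent; store it as 0.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Hypothetical sample lines, invented for this example.
sample_lines = [
    '192.0.2.10 - - [10/Oct/2023:13:55:36 -0400] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.11 - - [10/Oct/2023:13:56:01 -0400] "GET /missing.html HTTP/1.1" 404 -',
    'this line is not a log entry',
]

parsed = [parse_log_line(line) for line in sample_lines]
good = [record for record in parsed if record is not None]

for record in good:
    print(record["host"], record["status"], record["path"], record["size"])
```

Named groups (``(?P<name>...)``) keep the pattern readable and let ``groupdict()`` return the fields by name. Returning ``None`` for lines that do not match is one possible design; raising an exception or logging the bad line would be equally reasonable, depending on how messy the input is expected to be.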