Text wrangling with Python¶
Intro and Objectives¶
Now that we’ve got some basic Python hacking skills and have learned a little about ingesting data files of various types, we are going to learn some more advanced data cleaning techniques using regular expressions (regex) and “fuzzy matching”. The days of analyzing purely numeric data are over, and text mining skills are a valuable addition to your toolkit. These topics will start us on that part of our journey.
Readings¶
WToP - p76-91 (covers regex and preview of data science tools in Python)
The KDNuggets newsletter has a nice article on Practical skills that practical data scientists need (see also the first comment after the story).
Downloads and other resources¶
Regex¶
Web based tools and tutorials¶
RegExr - an HTML/JS based site for creating, testing, and learning about Regular Expressions.
regex101 - Another nice interactive web based tool for learning regex.
RegexOne - Learn regular expressions with simple, interactive examples.
Regular-Expressions.info - One of my go-to sites for regex for a long time now. Very complete, with many examples and substantive explanations.
Learning to Use Regular Expressions - Gnosis.cx - This is the site from which I first learned regular expressions. It has been around forever, is widely read, and is quite good.
Regular Expressions - A Gentle User Guide and Tutorial - This is a good tutorial, cleverly written, at a greater level of detail than some of the others above. It’s got a browser-based regex testing tool, and the examples are based on matching parts of server logs, which is a relevant application for our class.
Regular Expressions: Now You Have Two Problems - Classic blog post on regex and a related famous quote about regex. Good links to some resources on the bare minimum that every analyst/coder/hacker should know about the incredible world of regular expressions.
Books/Chapters¶
Mastering Regular Expressions (Friedl) - The definitive book on regex.
Regular Expressions Cookbook (Goyvaerts & Levithan)
Chapter 16 in R for Everyone covers string manipulation and shows how regex can be used in R.
Chapter 7 in Data Wrangling with Python shows how regex can be used within Python.
Whirlwind Tour of Python - p76-91
Activities¶
We will do several short exercises to learn the basics of regex
We will use the regex slides and regexr.com
SCREENCAST: Intro to regex (2:29)
Then we’ll see how to use regex from within Python
data_cleaning_regex.ipynb
process_apache_log.py
SCREENCAST: Using re in Python program to process Apache log file (15:49)
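As a rough idea of what processing an Apache log with re can look like, here is a minimal sketch. It is not the contents of process_apache_log.py; it assumes the log uses the Apache Common Log Format, and the names LOG_PATTERN and parse_log_line are made up for this illustration.

```python
import re

# Assumption: lines follow the Apache Common Log Format:
#   host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A size of "-" means no bytes were sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Hypothetical sample line in Common Log Format.
sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
parsed = parse_log_line(sample)
print(parsed["host"], parsed["status"], parsed["request"])
```

Named groups such as (?P<host>...) keep the pattern readable and let us pull the fields out as a dictionary with groupdict(). Returning None for lines that do not match makes it easy to count or log malformed lines instead of crashing partway through a file.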
Brief intro to “fuzzy string matching”.
data_cleaning_fuzzy_string_matching.ipynb uses fuzzywuzzy
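To give a feel for the idea behind fuzzy string matching, here is a small sketch that uses difflib from the Python standard library rather than fuzzywuzzy, so it runs without installing anything. The similarity helper and the sample names are invented for this illustration, and the scores difflib produces will not necessarily agree with those from fuzzywuzzy.

```python
from difflib import SequenceMatcher, get_close_matches


def similarity(a, b):
    """Similarity ratio between 0 and 1 after lowercasing and trimming whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


# Hypothetical list of "clean" names and one messy entry we want to match to it.
canonical = ["Oakland University", "Wayne State University", "University of Michigan"]
messy = "oakland univ."

# Pick the canonical name with the highest similarity to the messy entry.
best = max(canonical, key=lambda name: similarity(messy, name))
print(best, round(similarity(messy, best), 2))

# get_close_matches does the ranking for us; note that it is case sensitive.
matches = get_close_matches("Wayne State Univeristy", canonical, n=1, cutoff=0.8)
print(matches)
```

The cutoff value is a judgment call. Set it too low and unrelated strings get matched; set it too high and real typos slip through, so it is worth eyeballing the borderline matches on your own data.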
[OPTIONAL] Using VS Code to work with Jupyter notebooks
Another way to work with Jupyter notebooks in Windows is with Microsoft’s free IDE known as VS Code.
16 Reasons to Use VS Code for Developing Jupyter Notebooks - good blog post from Practical Business Python
[OPTIONAL] More cleaning examples that you can explore on your own
data_cleaning_finding_duplicates.ipynb shows power of set data structure
data_cleaning_headerfixing.ipynb is a good exercise in complex looping over dictionaries and lists
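For a quick taste of why the set data structure is handy for finding duplicates, here is a minimal sketch. It is not taken from data_cleaning_finding_duplicates.ipynb; the find_duplicates function and the sample names are made up, and treating names as equal after lowercasing and trimming is an assumption made for this example.

```python
def find_duplicates(items):
    """Return the set of normalized values that appear more than once."""
    seen = set()
    duplicates = set()
    for item in items:
        # Normalize so that "Alice" and "alice " count as the same value.
        key = item.strip().lower()
        if key in seen:
            duplicates.add(key)
        else:
            seen.add(key)
    return duplicates


# Hypothetical messy list of names.
names = ["Alice", "bob", "alice ", "Carol", "Bob", "dave"]
dupes = find_duplicates(names)
print(sorted(dupes))  # ['alice', 'bob']
```

Membership tests on a set take roughly constant time on average, so this makes a single pass over the data instead of comparing every item against every other item.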
Explore¶
A REALLY RELEVANT post in the Practical Business Python blog - Includes slides from the author’s talk entitled “Escaping Excel Hell with Python and Pandas.”
Plain Text. Really? - Fascinating take on the world of plain text. Highly recommended.
More on learning to program in Python¶
Advanced R - This is Hadley Wickham’s new book for those who want to really dig deep into R programming.
The following are based on the Software Carpentry tutorials on programming with Python. They cover slightly more advanced topics. We won’t cover them in class, but they are useful for going beyond basic programming.
Error Handling - https://swcarpentry.github.io/python-novice-inflammation/09-errors.html
Defensive programming - https://swcarpentry.github.io/python-novice-inflammation/10-defensive.html
Debugging - https://swcarpentry.github.io/python-novice-inflammation/11-debugging.html
Handling command line arguments - https://swcarpentry.github.io/python-novice-inflammation/12-cmdline.html
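To see how these topics fit together, here is a small sketch of a script that takes command line arguments, checks its inputs defensively, and handles errors. It is not from the Software Carpentry lessons; it uses argparse from the standard library, and the count_matches and main names are invented for this illustration.

```python
import argparse
import re
import sys


def count_matches(lines, pattern):
    """Count the lines that contain a match for the regex pattern."""
    # Defensive programming: fail early on an input we never expect.
    assert pattern, "pattern must be a non-empty string"
    try:
        compiled = re.compile(pattern)
    except re.error as err:
        # Turn the low-level regex error into a clearer message.
        raise ValueError(f"invalid regex {pattern!r}: {err}") from err
    return sum(1 for line in lines if compiled.search(line))


def main(argv=None):
    """Parse command line arguments, scan the file, and return an exit code."""
    parser = argparse.ArgumentParser(description="Count lines matching a regex.")
    parser.add_argument("pattern", help="regular expression to search for")
    parser.add_argument("filename", help="text file to scan")
    args = parser.parse_args(argv)
    try:
        with open(args.filename, encoding="utf-8") as f:
            print(count_matches(f, args.pattern))
    except (OSError, ValueError) as err:
        # Error handling: report the problem instead of showing a traceback.
        print(f"error: {err}", file=sys.stderr)
        return 1
    return 0


# Quick check of the core function on a hypothetical in-memory "file".
print(count_matches(["GET /a", "POST /b", "GET /c"], r"^GET"))  # 2
```

To run this as a script, add sys.exit(main()) at the bottom and call it with a pattern and a filename. Passing argv into main rather than reading sys.argv directly makes the function easy to test from a notebook.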