Over the last decade, both the variety and the volume of data available to social scientists have expanded. These new data sources include administrative records (e.g., voter files, campaign finance and lobbying records), geo-referenced data (e.g., satellite maps, geocoded event data), and texts (e.g., speeches, court rulings, legislative bills). Many of these sources can be accessed through the World Wide Web, and as a consequence, techniques such as web scraping have become an essential part of the social scientist’s toolkit. The objective of this workshop is to introduce basic tools and techniques for automatic content extraction, parsing, and other data-handling tasks commonly encountered in data-intensive research projects. The course will be taught in Python and requires basic knowledge of Python programming (such as that covered in the Introduction to Python workshop taught by the Office of Population Research). We will cover techniques ranging from Python regular expressions and file manipulation to the popular HTML/XML parsing library “Beautiful Soup” and PDF content extraction. The course ends with an introduction to the Twitter API for accessing Twitter content.
Web Scraping and Text Processing with Python 2015