Web Scraping and Text Processing with Python Workshop, January 27-31, 2014

Dates: January 27 - 31, 2014
Morning Session: 9:30-11:30am
Afternoon Session: 1:30-3:30pm
Location: TBA
Instructor: Radhika Saksena


Registration

Registration for this workshop is now closed due to space limitations. If you are interested in being added to the waitlist, please fill out the waitlist form.
*Note: This workshop is open only to Princeton affiliates.

Over the last decade, both the variety and amount of data available to social scientists
have expanded. These new data sources include administrative records (e.g., voter files,
campaign finance and lobbying records), geo-referenced data (e.g., satellite maps,
geocoded event data), and texts (e.g., speeches, court rulings, legislative bills). Many
of these data sources can be accessed through the World Wide Web and, as a consequence,
techniques such as web scraping have become an essential part of the social scientist's
toolkit. The objective of this workshop is to introduce basic tools and techniques for
automatic content extraction, parsing and other data-handling tasks that are commonly
encountered in data-intensive research projects. The course will be taught in Python,
and only a basic knowledge of general computing and programming (such as the R
statistical programming taught at the Introductory Statistical Programming Camp) is
assumed. We will cover techniques ranging from Python regular expressions and file
manipulation to the popular web scraping library Beautiful Soup and PDF content
extraction. The course ends with an introduction to the Twitter API for accessing Twitter
content.
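
As a rough taste of the kind of content extraction described above, the following minimal sketch uses only Python's built-in re module to pull link targets and titles out of a small HTML fragment. The fragment, the URLs in it, and the helper name extract_links are hypothetical and chosen purely for illustration; they are not taken from the workshop materials. For real pages, a dedicated parser such as Beautiful Soup is generally more robust than regular expressions.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded web page.
html = """
<ul>
  <li><a href="/speeches/2013-01-21.html">Inaugural Address</a></li>
  <li><a href="/speeches/2013-02-12.html">State of the Union</a></li>
</ul>
"""

# Capture each link target (group 1) and its visible text (group 2).
# This simple pattern assumes the exact tag layout used in the fragment above.
link_pattern = re.compile(r'<a href="([^"]+)">([^<]+)</a>')


def extract_links(page):
    """Return a list of (url, title) pairs found in the page text."""
    return link_pattern.findall(page)


links = extract_links(html)
for url, title in links:
    print(title, "->", url)
```

The pattern is deliberately narrow: it would miss links with extra attributes or nested tags, which is one reason a parsing library is usually preferred once pages become less regular.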

Syllabus

*Participants are encouraged to enroll in the OPR-sponsored Introduction to Python workshop. Details can be found on the Introduction to Python workshop page.