More and more research in the humanities and allied social sciences involves analyzing machine-readable data with computer software. But learning the techniques and perspectives that support this computational work is still difficult for students and researchers. The number of university courses and books addressed to researchers in the Geisteswissenschaften remains small, and they are unevenly distributed. This is unfortunate because scholars associated with the humanities stand to benefit from expanding reservoirs of trustworthy, machine-readable data. We wrote this book in response to this situation. Our goal is to make common techniques and practices used in data analysis more accessible and to describe in detail how researchers can use the programming language Python, together with its software ecosystem, in their work. When readers finish this book, they will have greater fluency in quantitative data analysis and will be equipped to move beyond deliberating about what one might do with large datasets and large text collections; they will be ready to begin proposing answers to questions of demonstrable interest.
This book is written with a particular group of readers in mind: students and researchers in the humanities and allied social sciences who are familiar with the Python programming language and who want to use Python in research related to their interests. (Readers fluent in a programming language other than Python should have no problem picking up the syntax of Python as they work through the initial chapters.) That such a population of readers exists, or is coming into existence, is clear. Python is the official programming language in secondary education in France and the most widely taught programming language in US universities [Guo, 2014, Ministère de l'Éducation Nationale et de la Jeunesse, 2018]. Python is also, increasingly, the dominant language used in software development in high-income countries such as the United States, United Kingdom, Germany, and Canada [Robinson, 2017]. There are vanishingly few barriers to learning the basics. This book should be accessible to all curious hackers interested in data-intensive research.
The book is limited in that it occasionally omits detailed coverage of the mathematics or algorithms behind procedures and models, opting to focus on supporting the reader in practical work. We compensate for this shortcoming by providing references to work describing the relevant algorithms and models in “Further Reading” sections at the end of each chapter. Basic knowledge of mathematics and mathematical notation is assumed; readers lacking this background knowledge may benefit from reviewing an introductory mathematical text such as Juola and Ramsay.
Although the book focuses on analyzing text data and tabular data sets, those interested in studying image and audio data using Python will find much of the material presented here useful. The techniques and libraries introduced here are regularly used in the analysis of images and of sound recordings. Grayscale images, for example, are frequently represented as fixed-length sequences of intensities. Because text documents are typically represented as fixed-length sequences of word frequencies, many of the tools used to analyze image data also work with text. And although low-level analysis of speech and music recordings requires familiarity with signal processing and libraries not covered in this book, analyzing features derived from sound recordings (e.g., Bertin-Mahieux et al.) will likely use the techniques and software presented in these pages.
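The shared representation described above can be illustrated with a minimal sketch. The toy documents, the toy pixel grids, and the helper names (`word_frequency_vector`, `flatten_image`, `cosine_similarity`) are assumptions invented for this illustration and are not taken from the book's chapters. The sketch uses only the Python standard library; in practice one would more likely reach for a numerical library.

```python
import math
from collections import Counter


def word_frequency_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in `text` (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def flatten_image(rows):
    """Turn a 2-D grid of pixel intensities into one flat sequence (hypothetical helper)."""
    return [intensity for row in rows for intensity in row]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy documents, invented for illustration.
docs = ["the cat sat on the mat", "the dog sat on the log"]
vocabulary = sorted(set(" ".join(docs).split()))
doc_vectors = [word_frequency_vector(doc, vocabulary) for doc in docs]

# Toy 2x3 grayscale "images" with intensities between 0 and 255.
image_a = [[0, 128, 255], [0, 128, 255]]
image_b = [[255, 128, 0], [255, 128, 0]]
image_vectors = [flatten_image(image_a), flatten_image(image_b)]

# The same function compares documents and images, because both
# are now fixed-length sequences of numbers.
doc_similarity = cosine_similarity(*doc_vectors)
image_similarity = cosine_similarity(*image_vectors)
print(doc_similarity, image_similarity)
```

Once both kinds of data are fixed-length numeric sequences, a single comparison function serves for either. Cosine similarity is only one possible choice of measure, chosen here for illustration.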
Also absent from this book is discussion of the practical and methodological implications of using computational tools and digital resources generally or in the Geisteswissenschaften specifically. We include in this category arguments which deny that the borrowing of useful (computational) methods from the natural and social sciences is welcome or desirable (e.g., Ramsay). Scholarly work published during the previous century and the current one has treated this topic in considerable depth [Cetina, 2009, Hayles, 2012, McCarty, 2005, Pickering, 1995, Suchman et al., 1999]. Moreover, we believe that students and researchers coming from the humanities and interpretive social sciences will already be familiar with the idea that methods are not neutral, that knowledge is situated, and that interpretation and description are inextricable. Indeed, there are few ideas more frequently and consistently discussed in the Geisteswissenschaften.
We came together to write this book because we share the conviction that tools and techniques for doing computer-supported data analysis in Python are useful in humanities research. Each of us came to this conclusion by a different path. One of us came to programming out of a frustration with existing tools for the exploration and analysis of text corpora. The rest of us came to data analysis out of a desire to move beyond the methodological monoculture of “close reading” associated with research in literary studies and, to a lesser extent, with cultural studies. We have grown to appreciate the ways in which a principled mixture of methods, including methods borrowed from certain corners of the social and natural sciences, permits research that attends to observed patterns across time and at multiple scales.
We are indebted to the numerous participants in the workshops we have taught over the past years for their feedback. We would also like to thank Antal van den Bosch, Bob Carpenter, Christof Schoech, Dan Rockmore, Fotis Jannidis, James Dietrich, Lindsey Geybels, Jeroen de Gussem, and Walter Daelemans for their advice and comments, as well as several anonymous reviewers for their valuable comments on early versions of the book.