{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "8b4dc661", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.style.use(\"../styles/hda.mplstyle\")" ] }, { "cell_type": "markdown", "id": "ed68a505", "metadata": {}, "source": [ "(chp-introduction-cook-books)=\n", "# Introduction\n", "\n", "## Quantitative Data Analysis and the Humanities\n", "\n", "The use of quantitative methods in humanities disciplines such as history, literary\n", "studies, and musicology has increased considerably in recent years. Now it is not uncommon\n", "to learn of a historian using geospatial data, a literary scholar applying techniques from\n", "computational linguistics, or a musicologist employing pattern matching methods. Similar\n", "developments occur in humanities-adjacent disciplines, such as archeology, anthropology,\n", "and journalism. An important driver of this development, we suspect, is the advent of\n", "cheap computational resources as well as the mass digitization of libraries and archives\n", "{cite:p}`Imai:2018,abello2012computational,borgman2010scholarship,vanKranenburg:2017`. It\n", "has become much more common in humanities research to analyze thousands, if not millions,\n", "of documents, objects, or images; an important part of the reason why quantitative methods\n", "are attractive now is that they promise the means to detect and analyze patterns in these\n", "large collections.\n", "\n", "A recent example illustrating the promise of data-rich book history and cultural analysis\n", "is {cite:t}`bode2012reading`. Bode's analysis of the massive online bibliography of Australian\n", "literature *AusLit* demonstrates how quantitative methods can be used to enhance our\n", "understanding of literary history in ways that would not be possible absent data-rich and\n", "computer-enabled approaches. 
Anchored in the cultural materialist focus of Australian\n", "literary studies, Bode uses data analysis to reveal unacknowledged\n", "shifts in the demographics of Australian novelists, track the entry of British publishers\n", "into the Australian book market in the 1890s, and identify ways Australian\n", "literary culture departed systematically from British practices. A second enticing example\n", "showing the potential of data-intensive research is found in {cite:t}`dasilva:2016`. Using\n", "data from large-scale folklore databases, {cite:t}`dasilva:2016` investigate the\n", "international spread of folktales. Based on quantitative analyses, they show how the\n", "diffusion of folktales is shaped by language, population histories, and migration. A third and final example---one that can\n", "be considered a landmark in the computational analysis of literary texts---is the\n", "influential monograph by {cite:t}`burrows:1987` on Jane Austen's oeuvre. Burrows uses relatively\n", "simple statistics to analyze the frequencies of\n", "inconspicuous, common words that typically escape the eye of the human reader. In doing so he\n", "identifies hallmarks of Austen's sentence style and carefully documents differences in characters' speaking styles. The\n", "book illustrates how quantitative analyses can yield valuable and lasting insights into\n", "literary texts, even if they are not applied to datasets that contain millions of texts.\n", "\n", "Although recent interest in quantitative analysis may give the impression that humanities\n", "scholarship has entered a new era, we should not forget that it is part of a development\n", "that began much earlier. In fact, for some, the ever so prominent \"quantitative turn\" we\n", "observe in humanities research nowadays is not a new feature of humanities scholarship; it\n", "marks a return to established practice. 
The use of quantitative methods such as linear\n", "regression, for example, was a hallmark of social history in the 1960s and 1970s\n", "{cite}`sewelljr2005political`. In literary studies, there are numerous examples of\n", "quantitative methods being used to explore the social history of literature\n", "{cite}`williams1961long,escarpit:1958` and to study the literary style of individual\n", "authors {cite}`yule:1944,muller1967etude`. Indeed, the founder of \"close reading,\" I. A.\n", "Richards, was himself concerned with the analysis and use of word frequency lists\n", "{cite}`igarashi2015statistical`.\n", "\n", "Quantitative methods fell out of favor in the 1980s as interest in cultural history\n", "displaced interest in social history (where quantitative methods had been indispensable).\n", "This realignment of research priorities in history is known as \"the cultural turn.\" In his widely circulated account, William Sewell\n", "offers two reasons for his and his peers' turn away from social history and quantitative\n", "methods in the 1970s. First, \"latent ambivalence\" about the use of quantitative methods\n", "grew in the 1960s because of their association with features of society that students of the era\n", "regarded as defective. Quantitative methods were associated with\n", "undesirable aspects of what Sewell labels \"the Fordist mode of socioeconomic regulation,\"\n", "including repressive standardization, big science, corporate conformity, and state\n", "bureaucracy. Erstwhile social historians like Sewell felt that \"in adopting quantitative\n", "methodology we were participating in the bureaucratic and reductive logic of big science,\n", "which was part and parcel of the system we wished to criticize\"\n", "[{cite:author}`sewelljr2005political` {cite:year}`sewelljr2005political`, 180-81]. 
Second, the \"abstracted empiricism\" of\n", "quantitative methods was seen as failing to give adequate attention to questions of human\n", "agency and the texture of experience, questions on which cultural history focused\n", "[{cite:author}`sewelljr2005political` {cite:year}`sewelljr2005political`, 182].\n", "\n", "We make no claims about the causes of the present revival of interest in\n", "quantitative methods. Perhaps it has something to do with previously dominant\n", "methods in the humanities, such as critique and close reading, \"running out of\n", "steam\" in some sense, as {cite:t}`latour2004why` has suggested. This would go some way\n", "towards explaining why researchers are now (re)exploring quantitative\n", "approaches. Or perhaps the real or perceived costs associated with the use of\n", "quantitative methods have declined to a point that the potential benefits\n", "associated with their use---for many, broadly the same as they were in the\n", "1960s---now attract researchers.\n", "\n", "What is clear, however, is that university curricula in the humanities do not\n", "at present devote sufficient time to acquainting students with, and involving them in,\n", "data-intensive and quantitative research, making it challenging for\n", "humanities students and scholars to move from spectatorship to active\n", "participation in (discussions surrounding) quantitative research. The aim of\n", "this book, then, is precisely to meet the growing desire of humanities students\n", "and scholars to understand how to tackle theoretical and descriptive\n", "questions using data-rich, computer-assisted approaches.\n", "\n", "Through several case studies, this book offers a guide to quantitative data analysis using\n", "the Python programming language. The Python language is widely used in academia, industry,\n", "and the public sector. 
It is the official programming language in secondary education\n", "in France and the most widely taught programming language in\n", "US universities {cite}`ministere2018projets,guo2014python`. If learning\n", "data carpentry in Python chafes, you may rest assured that improving your fluency in\n", "Python is likely to be worthwhile. In this book, we do not focus on learning how\n", "to code per se; rather, we wish to highlight how quantitative methods can be\n", "meaningfully applied in the particular context of humanities scholarship. The book\n", "concentrates on textual data analysis, because decades of research have been devoted to\n", "this domain and because current research remains vibrant. Although many research\n", "opportunities are emerging in music, audio, and image analysis, they fall outside the\n", "scope of the present undertaking\n", "{cite:p}`clarke:2004,tzanetakis:2007,cook:2013,clement2016measured`. All chapters focus\n", "on real-world datasets and aim to\n", "illustrate how quantitative data analysis can play more than an auxiliary role in tackling\n", "relevant research questions in the humanities.\n", "\n", "## Overview of the Book\n", "\n", "This book is organized into two parts. Part 1 covers essential techniques for gathering,\n", "cleaning, representing, and transforming textual and tabular data. \"Data carpentry\"---as\n", "this collection of techniques is sometimes called---precedes any effort to\n", "derive meaningful insights from data using quantitative methods. The four chapters of Part\n", "1 prepare the reader for the data analyses presented in the second part of this book.\n", "\n", "To give an idea of what a complete data analysis entails, the current chapter presents an\n", "exploratory data analysis of historical cookbooks. In a nutshell, we demonstrate which\n", "steps are required for a complete data analysis, and how Python facilitates the\n", "application of these steps. 
After sketching the main ingredients of quantitative data\n", "analysis, we take a step back in chapter {ref}`chp-getting-data` to describe essential\n", "techniques for data gathering and exchange. Built around a case study of extracting and\n", "visualizing the social network of the characters in Shakespeare's *Hamlet*, the chapter\n", "provides a detailed introduction to different models of data exchange, and how Python\n", "can be employed to effectively gather, read, and store different data formats, such as CSV,\n", "JSON, PDF, and XML. Chapter {ref}`chp-vector-space-model` builds on chapter\n", "{ref}`chp-getting-data`, and focuses on the question of how texts can be represented for\n", "further analysis, for instance for document comparison. One powerful form of\n", "representation that allows such comparisons is the so-called \"Vector Space Model\". The\n", "chapter provides a detailed guide to constructing document-term matrices from word\n", "frequencies derived from text documents. To illustrate the potential and benefits of the\n", "Vector Space Model, the chapter analyzes a large corpus of classical French drama, and\n", "shows how this representation can be used to quantitatively assess similarities and\n", "distances between texts and subgenres. While data analysis in, for example, literary\n", "studies, history, and folklore is often focused on text documents, subsequent analyses\n", "often require processing and analyzing tabular data. The final chapter of part 1\n", "(chapter {ref}`chp-working-with-data`) provides a detailed introduction to how such\n", "tabular data can be processed using the popular data analysis library \"Pandas\". The\n", "chapter centers on diachronic developments in child naming practices, and demonstrates\n", "how Pandas can be efficiently employed to quantitatively describe and visualize long-term\n", "shifts in naming. 
All topics covered in Part 1 should be accessible to everyone who has\n", "had some prior exposure to programming.\n", "\n", "Part 2 features more detailed and elaborate examples of data analysis using\n", "Python. Building on knowledge from chapter {ref}`chp-working-with-data`, the first chapter\n", "of part 2 (chapter {ref}`chp-statistics-essentials`) uses the Pandas library to\n", "statistically describe responses to a questionnaire about the reading of literature and\n", "appreciation of classical music. The chapter provides detailed descriptions of important\n", "summary statistics, allowing us to analyze whether, for example, differences between\n", "responses can be attributed to differences between demographic groups. Chapter\n", "{ref}`chp-statistics-essentials` paves the way for the introduction to probability in\n", "chapter {ref}`chp-intro-probability`. This chapter revolves around the classic case of\n", "disputed authorship of several essays in *The Federalist Papers*, and demonstrates how probability theory\n", "and Bayesian inference in particular can be applied to shed light on this still intriguing\n", "case. Chapter {ref}`chp-map-making` discusses a series of fundamental techniques to create\n", "geographic maps with Python. The chapter analyzes a dataset describing important battles\n", "fought during the American Civil War. Using narrative mapping techniques, the chapter\n", "provides insight into the trajectory of the war. After this brief intermezzo, chapter\n", "{ref}`chp-stylometry` returns to the topic of disputed authorship, providing a more\n", "detailed and thorough overview of common and essential techniques used to model the\n", "writing style of authors. The chapter aims to reproduce a stylometric analysis revolving\n", "around a challenging authorship controversy from the twelfth century. 
On the basis of a\n", "series of different stylometric techniques (including Burrows's Delta, Agglomerative\n", "Hierarchical Clustering, and Principal Component Analysis), the chapter illustrates how\n", "quantitative approaches help to objectify intuitions about document authenticity. The\n", "closing chapter of part 2 (chapter {ref}`chp-topic-models`) connects the preceding\n", "chapters, and challenges the reader to integrate the data analysis techniques learned so far\n", "and to apply them to a case study of trends in decisions issued by the United States\n", "Supreme Court. The chapter provides a detailed account of mixed-membership models or\n", "\"topic models\", and employs these to make visible topical shifts in the Supreme Court's\n", "decision-making. Note that the different chapters in part 2 make different assumptions\n", "about readers' background preparation. Chapter {ref}`chp-intro-probability` on disputed\n", "authorship, for example, will likely be easier for readers who have some familiarity with\n", "probability and statistics. Each chapter begins with a discussion of the background\n", "assumed.\n", "\n", "## Related Books\n", "\n", "Our monograph aims to fill a specific lacuna in the field, as a coherent, book-length discussion of Python programming for data analysis in the humanities. To manage the expectations of our readership, we believe it is useful to state how this book positions itself relative to some of the existing literature in the field, with which it inevitably intersects and overlaps. For the sake of brevity, we limit ourselves to more recent work. At the outset, it should be emphasized that resources other than the traditional monograph also play a vital role in the community surrounding quantitative work in the humanities. 
The (multilingual) website [The Programming Historian](https://programminghistorian.org/), for instance, is a tutorial platform that hosts a rich variety of introductory lessons that target specific data-analytic skills {cite:p}`ph2019`.\n", "\n", "The focus on Python distinguishes our work from a number of recent textbooks that use the programming language R {cite:p}`R13`, a robust and mature scripting platform for statisticians that is also used in the social sciences and humanities. A general introduction to data analysis using R can be found in {cite:t}`wickham:2017`. One can also consult {cite:t}`jockers:2014` or {cite:t}`arnold2015`, which have humanities scholars as their intended audience. Somewhat related are two worthwhile textbooks on corpus and quantitative linguistics, {cite:t}`baayen:2008` and {cite:t}`gries2013`, but these are less accessible to an audience outside of linguistics. There also exist some excellent, more general introductions to the use of Python for data analysis, such as {cite:t}`mckinney2017` and {cite:t}`vanderplas:2016`. These handbooks are valuable resources in their own right, but they do not specifically cater to researchers in the humanities. The exclusive focus on humanities data analysis clearly sets our book apart from these textbooks---which the reader might nevertheless find useful to consult at a later stage.\n", "\n", "## How to Use This Book\n", "\n", "This book takes a practical approach, in which descriptions and explanations of quantitative\n", "methods and analyses alternate with concrete implementations in programming code. We\n", "strongly believe that such a hands-on approach stimulates the learning process, enabling\n", "researchers to apply the newly acquired knowledge to their own research\n", "problems. 
While we generally assume a linear reading process, all chapters are constructed\n", "in such a way that they *can* be read independently, and code examples are not dependent on\n", "implementations in earlier chapters. As such, readers familiar with the principles and\n", "techniques of, for instance, data exchange or manipulating tabular data may safely skip\n", "chapters {ref}`chp-getting-data` and {ref}`chp-working-with-data`.\n", "\n", "The remainder of this chapter, like all the chapters in this book, includes Python code\n", "which you should be able to execute in your computing environment. All code presented here\n", "assumes your computing environment satisfies the basic requirement of having an\n", "installation of Python (version 3.9 or higher) available on a Linux, macOS, or Microsoft\n", "Windows system. A distribution of Python may be obtained from the [Python Software\n", "Foundation](https://www.python.org/) or through the operating system's package manager\n", "(e.g., `apt` on Debian-based Linux, or `brew` on macOS). Readers new to Python may wish to install the\n", "[Anaconda](https://www.continuum.io/) Python distribution, which bundles most of the Python\n", "packages used in this book. We recommend that macOS and Windows users, in particular, use\n", "this distribution.\n", "\n", "### What you should know\n", "As noted, this is not a book that teaches programming from scratch, and we assume the reader\n", "already has some working knowledge of programming and Python. However, we do not expect\n", "the reader to have mastered the language. A relatively short introduction to programming\n", "and Python will be enough to follow along (see, for example, *Python Crash Course* by\n", "{cite:t}`matthes:2016`). The following code blocks serve as a refresher of some important\n", "programming principles and aspects of Python. At the same time, they allow you to test\n", "whether you know enough about Python to start this book. 
We advise you to execute these\n", "examples as well as all code blocks in the rest of the book in so-called \"Jupyter notebooks\" (see https://jupyter.org/). Jupyter notebooks\n", "offer a wonderful environment for executing code, writing notes, and creating\n", "visualizations. The code in this book is assigned the DOI ``10.5281/zenodo.3563075``, and\n", "can be downloaded from https://doi.org/10.5281/zenodo.3563075.\n", "\n", "#### Variables\n", "First of all, you should know that variables are defined using the assignment operator\n", "`=`. For example, to define the variable `x` and assign the value `100` to it, we write:" ] }, { "cell_type": "code", "execution_count": 2, "id": "1af9413d", "metadata": {}, "outputs": [], "source": [ "x = 100" ] }, { "cell_type": "markdown", "id": "15e97258", "metadata": {}, "source": [ "Numbers such as `1`, `5`, and `100` are called integers and are of type `int` in\n", "Python. Numbers with a fractional part (e.g., `9.33`) are of the type `float`. The string\n", "data type (`str`) is commonly used to represent text. Strings can be expressed in multiple\n", "ways: they can be enclosed in single or double quotes. For example:" ] }, { "cell_type": "code", "execution_count": 3, "id": "f9cbfde2", "metadata": {}, "outputs": [], "source": [ "saying = \"It's turtles all the way down\"" ] }, { "cell_type": "markdown", "id": "2c5796c4", "metadata": {}, "source": [ "#### Indexing sequences\n", "Essentially, Python strings are sequences of characters, where characters are strings of\n", "length one. Sequences such as strings can be indexed to retrieve any component character\n", "in the string. 
For example, to retrieve the first character of the string defined above,\n", "we write the following:" ] }, { "cell_type": "code", "execution_count": 4, "id": "b5cfe344", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I\n" ] } ], "source": [ "print(saying[0])" ] }, { "cell_type": "markdown", "id": "6ddcad7e", "metadata": {}, "source": [ "Note that, like many other programming languages, Python starts counting from zero, which\n", "explains why the first character of a string is indexed using the number 0. We use the\n", "function `print()` to print the retrieved value to our screen.\n", "\n", "#### Looping\n", "You should also know about the concept of \"looping\". Looping involves a sequence of Python\n", "instructions, which is repeated until a particular condition is met. For example, we might\n", "loop (or iterate, as it's sometimes called) over the characters in a string and print each\n", "character to our screen:" ] }, { "cell_type": "code", "execution_count": 5, "id": "84434259", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P\n", "y\n", "t\n", "h\n", "o\n", "n\n" ] } ], "source": [ "string = \"Python\"\n", "for character in string:\n", "    print(character)" ] }, { "cell_type": "markdown", "id": "89b69a9f", "metadata": {}, "source": [ "#### Lists\n", "Strings are sequences of characters. Python provides a number of other sequence types,\n", "allowing us to store different data types. One of the most commonly used sequence types is\n", "the `list`. 
A list has similar properties to strings, but allows us to store any kind of\n", "data type inside:" ] }, { "cell_type": "code", "execution_count": 6, "id": "8e452c4e", "metadata": {}, "outputs": [], "source": [ "numbers = [1, 1, 2, 3, 5, 8]\n", "words = [\"This\", \"is\", \"a\", \"list\", \"of\", \"strings\"]" ] }, { "cell_type": "markdown", "id": "01e35c67", "metadata": {}, "source": [ "We can index and slice lists using the same syntax as with strings:" ] }, { "cell_type": "code", "execution_count": 7, "id": "8abb2717", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "8\n", "['list', 'of', 'strings']\n" ] } ], "source": [ "print(numbers[0])\n", "print(numbers[-1])  # use -1 to retrieve the last item in a sequence\n", "print(words[3:])  # use slice syntax to retrieve a subsequence" ] }, { "cell_type": "markdown", "id": "91d57e3a", "metadata": {}, "source": [ "#### Dictionaries and sets\n", "Dictionaries (`dict`) and sets (`set`) are unordered data types in Python. Dictionaries\n", "consist of entries, or \"keys\", that hold a value:" ] }, { "cell_type": "code", "execution_count": 8, "id": "3577106b", "metadata": {}, "outputs": [], "source": [ "packages = {'matplotlib': 'Matplotlib is a Python 2D plotting library',\n", "            'pandas': 'Pandas is a Python library for data analysis',\n", "            'scikit-learn': 'Scikit-learn helps with Machine Learning in Python'}" ] }, { "cell_type": "markdown", "id": "fda2b806", "metadata": {}, "source": [ "The keys in a dictionary are unique and immutable. 
To look up the value of a given key, we\n", "\"index\" the dictionary using that key, e.g.:" ] }, { "cell_type": "code", "execution_count": 9, "id": "d46b29b5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pandas is a Python library for data analysis\n" ] } ], "source": [ "print(packages['pandas'])" ] }, { "cell_type": "markdown", "id": "6f561dee", "metadata": {}, "source": [ "Sets represent unordered collections of unique, immutable objects. For example, the\n", "following code block defines a set of strings:" ] }, { "cell_type": "code", "execution_count": 10, "id": "3fdc5ddc", "metadata": {}, "outputs": [], "source": [ "packages = {\"matplotlib\", \"pandas\", \"scikit-learn\"}" ] }, { "cell_type": "markdown", "id": "13dbdc60", "metadata": {}, "source": [ "#### Conditional expressions\n", "We expect you to be familiar with conditional expressions. Python provides the statements\n", "`if`, `elif`, and `else`, which are used for conditional execution of certain lines of\n", "code. For instance, say we want to print all strings in a list that contain the letter\n", "*i*. The `if` statement in the following code block executes the print function *on the\n", "condition* that the current string in the loop contains the letter *i*:" ] }, { "cell_type": "code", "execution_count": 11, "id": "796a702f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fix\n", "things\n" ] } ], "source": [ "words = [\"move\", \"slowly\", \"and\", \"fix\", \"things\"]\n", "for word in words:\n", "    if \"i\" in word:\n", "        print(word)" ] }, { "cell_type": "markdown", "id": "61afbde0", "metadata": {}, "source": [ "#### Importing modules\n", "\n", "```{margin}\n", "For an overview of all packages and modules in Python's standard library, see\n", "https://docs.python.org/3/library/. 
For an overview of the various built-in functions,\n", "see https://docs.python.org/3/library/functions.html.\n", "```\n", "\n", "Python provides a tremendous range of additional functionality through modules in its\n", "standard library. We assume you know about the concept of \"importing\" modules and\n", "packages, and how to use the newly imported functionality. For example, to import the\n", "module `math`, we write the following:" ] }, { "cell_type": "code", "execution_count": 12, "id": "b7ceaefc", "metadata": {}, "outputs": [], "source": [ "import math" ] }, { "cell_type": "markdown", "id": "b7bdc100", "metadata": {}, "source": [ "The `math` module provides access to a variety of mathematical functions, such as `log()` (to\n", "compute the natural logarithm of a number), and `sqrt()` (to compute the square root of a\n", "number). These functions can be invoked as follows:" ] }, { "cell_type": "code", "execution_count": 13, "id": "5b759486", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.0000066849139877\n", "1.4142135623730951\n" ] } ], "source": [ "print(math.log(2.7183))\n", "print(math.sqrt(2))" ] }, { "cell_type": "markdown", "id": "97414399", "metadata": {}, "source": [ "#### Defining functions\n", "In addition to using built-in functions and functions imported from modules, you should be\n", "able to define your own functions (or at least recognize function definitions). 
For\n", "example, the following function takes a list of strings as its argument and returns the number\n", "of strings that end with the substring *ing*:" ] }, { "cell_type": "code", "execution_count": 14, "id": "3f0bf631", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n" ] } ], "source": [ "def count_ing(strings):\n", "    count = 0\n", "    for string in strings:\n", "        if string.endswith(\"ing\"):\n", "            count += 1\n", "    return count\n", "\n", "words = [\n", "    \"coding\", \"is\", \"about\", \"developing\", \"logical\", \"event\", \"sequences\"\n", "]\n", "print(count_ing(words))" ] }, { "cell_type": "markdown", "id": "0b226cf5", "metadata": {}, "source": [ "#### Reading and writing files\n", "You should also have basic knowledge of how to read files (although we will discuss this\n", "in reasonable detail in chapter {ref}`chp-getting-data`). An example is given below, where\n", "we read the file `data/aesop-wolf-dog.txt` and print its contents to our screen:" ] }, { "cell_type": "code", "execution_count": 15, "id": "29163ea2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "THE WOLF, THE DOG AND THE COLLAR\n", "\n", "A comfortably plump dog happened to run into a wolf. The wolf asked the dog where he had\n", "been finding enough food to get so big and fat. 'It is a man,' said the dog, 'who gives me\n", "all this food to eat.' The wolf then asked him, 'And what about that bare spot there on\n", "your neck?' The dog replied, 'My skin has been rubbed bare by the iron collar which my\n", "master forged and placed upon my neck.' The wolf then jeered at the dog and said, 'Keep\n", "your luxury to yourself then! 
I don't want anything to do with it, if my neck will have to\n", "chafe against a chain of iron!'\n", "\n" ] } ], "source": [ "f = open(\"data/aesop-wolf-dog.txt\")  # open a file\n", "text = f.read()  # read the contents of a file\n", "f.close()  # close the connection to the file\n", "print(text)  # print the contents of the file" ] }, { "cell_type": "markdown", "id": "271b1c67", "metadata": {}, "source": [ "Even if you have mastered all these programming concepts, it is inevitable that you will\n", "encounter lines of code that are unfamiliar. We have done our best to explain the code\n", "blocks in great detail. So, while this book is *not* an introduction to the basics of\n", "programming, it does increase your understanding of programming, and it prepares you to\n", "work on your own problems related to data analysis in the humanities.\n", "\n", "### Packages and data\n", "The code examples used later in the book rely on a number of established and\n", "frequently used Python packages, such as NumPy, SciPy, Matplotlib, and Pandas. All these\n", "packages can be installed through the Python Package Index (PyPI) using the ``pip``\n", "tool, which ships with Python. We have taken care to use packages which are mature and\n", "actively maintained. 
Required packages can be installed by executing the following command\n", "on the command-line:\n", "\n", "```\n", "python3 -m pip install \"numpy<2,>=1.13\" \"pandas~=1.1\" \"matplotlib<4,>=2.1\" \"lxml>=3.7\" \"nltk>=3.2\" \"beautifulsoup4>=4.6\" \"pypdf2>=1.26\" \"networkx<2.5,>=2.2\" \"scipy<2,>=0.18\" \"cartopy>=0.19\" \"scikit-learn>=0.19\" \"xlrd<2,>=1.0\" \"mpl-axes-aligner<2,>=1.1\"\n", "```\n", "\n", "macOS users *not* using the Anaconda distribution will need to install a few additional\n", "dependencies through the package manager for macOS, Homebrew:\n", "\n", "```\n", "# First, follow the instructions on https://brew.sh to install homebrew\n", "# After a successful installation of homebrew, execute the following commands:\n", "brew install geos proj\n", "```\n", "\n", "In order to install `cartopy`, Linux users not using the Anaconda distribution will need to install two dependencies via their package manager. On Debian-based systems such as Ubuntu, ``sudo apt install libgeos-dev libproj-dev`` will install these required libraries. If you encounter trouble, try installing a version which is known to work with `python3 -m pip install \"cartopy==0.19.0.post1\"`.\n", "\n", "Datasets featured in this and subsequent chapters have been gathered together and\n", "published online. The datasets are associated with the DOI ``10.5281/zenodo.891264`` and\n", "may be downloaded at the address https://doi.org/10.5281/zenodo.891264. All chapters\n", "assume that you have downloaded the datasets and have them available in the current\n", "working directory (i.e., the directory from which your Python session is started).\n", "\n", "### Exercises\n", "\n", "Each chapter ends with a series of exercises of increasing difficulty. First,\n", "there are \"Easy\" exercises, in which we rehearse some basic lessons and programming skills\n", "from the chapter. 
Next are the \"Moderate\" exercises, in which we ask you to deepen the\n", "knowledge you have gained in a chapter. In the \"Challenging\" exercises, finally, we\n", "challenge you to go one step further, and apply the chapter's concepts to new problems and\n", "new datasets. It is okay to skip certain exercises in the first instance and come back to\n", "them later, but we recommend that you do all the exercises in the end, because that is the\n", "best way to ensure that you have understood the materials.\n", "\n", "(sec-cooking-chp-introduction)=\n", "## An Exploratory Data Analysis of the United States' Culinary History\n", "\n", "In the remainder of this chapter we venture into a simple form of exploratory data analysis, serving\n", "as a primer for the chapters to follow. The term \"exploratory data analysis\" is attributed to\n", "mathematician John Tukey, who characterizes it as a research method or approach that encourages the\n", "exploration of data collections using simple statistics and graphical representations. These\n", "exploratory analyses serve the goal of obtaining new perspectives, insights, and hypotheses about a\n", "particular domain. Exploratory data analysis is a well-known term, which Tukey (deliberately) describes\n", "only vaguely as an analysis that \"does not need probability, significance or confidence\", and\n", "\"is actively incisive rather than passively descriptive, with real emphasis on the discovery of the\n", "unexpected\" {cite:p}`jones:1986`. Thus, exploratory data analysis provides considerable freedom as to which\n", "techniques should be applied. 
This chapter will introduce a number of commonly used exploratory\n", "techniques (e.g., plotting raw data, plotting simple statistics, and combining plots), all of which\n", "aim to assist us in the discovery of patterns and regularities.\n", "\n", "As our object of investigation, we will analyze a dataset of seventy-six cookbooks, the\n", "*Feeding America: The Historic American Cookbook* dataset {cite:p}`feeding-america`. Cookbooks\n", "are of particular interest to humanities scholars, historians, and sociologists, as they\n", "serve as an important \"lens\" into a culture's material and economic landscape\n", "{cite:p}`mitchell:2001,abala:2012`. The *Feeding America* collection was compiled by the\n", "Michigan State University Libraries Special Collections (2003), and holds a representative\n", "sample of the culinary history of the United States of America, spanning the late eighteenth to\n", "the early twentieth century. The oldest cookbook in the collection is Amelia Simmons's *American\n", "Cookery* from 1796, which is believed to be the first cookbook written by someone from and\n", "*in* the United States. While many recipes in Simmons's work borrow heavily from\n", "predominantly British culinary traditions, the book is best known for its introduction of\n", "American ingredients such as corn. Note that almost all of these books were written\n", "by women; it is only since the end of the twentieth century that men have entered the\n", "cookbook scene in any numbers. Until the outbreak of the American Civil War in 1861, cookbook production\n", "increased sharply, with publishers in almost all major cities of the United States. The\n", "years following the Civil War saw a second rise in the number of printed cookbooks,\n", "which, interestingly, exhibit increasing influence of foreign culinary traditions as a\n", "result of the \"new immigration\" of the 1880s, which brought, among others, Catholic and Jewish immigrants\n", "from Italy and Russia. 
A clear example is the youngest cookbook in the collection, written\n", "by Bertha Wood in 1922, whose aim, as Wood explains in the preface, \"was to compare the foods\n", "of other peoples with that of the Americans in relation to health\". The various dramatic\n", "events of the early twentieth century, such as World War I and the Great Depression, have\n", "further left their mark on the development of culinary America (see {cite:t}`longone:2003` for a\n", "more elaborate discussion of the *Feeding America* project and the history of\n", "cookbooks in America).\n", "\n", "While necessarily incomplete, this brief overview already highlights the complexity of\n", "America's cooking history. The main goal of this chapter is to shed light on some\n", "important cooking developments by employing a range of exploratory data analysis\n", "techniques. In particular, we will address the following two research questions:\n", "\n", "1. Which ingredients fell out of fashion and which became popular in the nineteenth\n", " century?\n", "2. Can we observe the influence of immigration waves in the *Feeding America* cookbook\n", " collection?\n", "\n", "Our corpus, the *Feeding America* cookbook dataset, consists of seventy-six files encoded\n", "in XML with annotations for \"recipe type\", \"ingredient\", \"measurements\", and \"cooking\n", "implements\". Since processing XML is an involved topic (which is postponed to chapter\n", "{ref}`chp-getting-data`), we will make use of a simpler, preprocessed comma-separated\n", "version, allowing us to concentrate on the basics of performing an exploratory data analysis\n", "with Python. The chapter will introduce a number of important libraries and packages for\n", "doing data analysis in Python. While we will cover just enough to make all Python code\n", "understandable, we will gloss over quite a few theoretical and technical details. 
We ask\n", "you not to worry too much about these details, as they will be explained much more\n", "systematically and rigorously in the coming chapters.\n", "\n", "## Cooking with Tabular Data\n", "\n", "The Python Data Analysis Library \"Pandas\" is the most popular and well-known Python library for\n", "(tabular) data manipulation and data analysis. It is packed with features designed to make data\n", "analysis efficient, fast, and easy. As such, the library is particularly well-suited for exploratory\n", "data analysis. This chapter will merely scratch the surface of Pandas' many functionalities, and we\n", "refer the reader to chapter {ref}`chp-working-with-data` for detailed coverage of the library. Let us\n", "start by importing the Pandas library and reading the cookbook dataset into memory:" ] }, { "cell_type": "code", "execution_count": 16, "id": "ff695cf4", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(\"data/feeding-america.csv\", index_col='date')" ] }, { "cell_type": "markdown", "id": "bc317804", "metadata": {}, "source": [ "If this code block appears cryptic, rest assured: we will guide you through it step by\n", "step. The first line imports the Pandas library. We do that under an alias, `pd` (read:\n", "\"import the pandas library *as* pd\"). After importing the library, we use the function\n", "`pandas.read_csv()` to load the cookbook dataset. The function `read_csv()` takes a string\n", "as argument, which represents the file path to the cookbook dataset. The function returns\n", "a so-called `DataFrame` object, consisting of columns and rows---much like a\n", "spreadsheet table. This data frame is then stored in the variable `df`.\n", "\n", "To inspect the first five rows of the returned data frame, we call its `head()` method:" ] }, { "cell_type": "code", "execution_count": 17, "id": "3ce21435", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | book_id | \n", "ethnicgroup | \n", "recipe_class | \n", "region | \n", "ingredients | \n", "
---|---|---|---|---|---|
date | \n", "\n", " | \n", " | \n", " | \n", " | \n", " |
1922 | \n", "fofb.xml | \n", "mexican | \n", "soups | \n", "ethnic | \n", "chicken;green pepper;rice;salt;water | \n", "
1922 | \n", "fofb.xml | \n", "mexican | \n", "meatfishgame | \n", "ethnic | \n", "chicken;rice | \n", "
1922 | \n", "fofb.xml | \n", "mexican | \n", "soups | \n", "ethnic | \n", "allspice;milk | \n", "
1922 | \n", "fofb.xml | \n", "mexican | \n", "fruitvegbeans | \n", "ethnic | \n", "breadcrumb;cheese;green pepper;pepper;salt;sar... | \n", "
1922 | \n", "fofb.xml | \n", "mexican | \n", "eggscheesedairy | \n", "ethnic | \n", "butter;egg;green pepper;onion;parsley;pepper;s... | \n", "
\n", " | butter | \n", "salt | \n", "water | \n", "flour | \n", "nutmeg | \n", "pepper | \n", "sugar | \n", "lemon | \n", "mace | \n", "egg | \n", "... | \n", "tomato in hot water | \n", "farina cream | \n", "pearl grit | \n", "chicken okra | \n", "tournedo | \n", "avocado | \n", "rock cod fillet | \n", "perch fillet | \n", "lime yeast | \n", "dried flower | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
1803 | \n", "0.570796 | \n", "0.435841 | \n", "0.409292 | \n", "0.351770 | \n", "0.272124 | \n", "0.267699 | \n", "0.205752 | \n", "0.205752 | \n", "0.188053 | \n", "0.150442 | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
1807 | \n", "0.357374 | \n", "0.349839 | \n", "0.395048 | \n", "0.219591 | \n", "0.132400 | \n", "0.194833 | \n", "0.274489 | \n", "0.104413 | \n", "0.134553 | \n", "0.177610 | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
1808 | \n", "0.531401 | \n", "0.371981 | \n", "0.391304 | \n", "0.352657 | \n", "0.260870 | \n", "0.149758 | \n", "0.396135 | \n", "0.115942 | \n", "0.140097 | \n", "0.294686 | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
1815 | \n", "0.398551 | \n", "0.315217 | \n", "0.322464 | \n", "0.431159 | \n", "0.152174 | \n", "0.083333 | \n", "0.387681 | \n", "0.018116 | \n", "0.036232 | \n", "0.347826 | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
1827 | \n", "NaN | \n", "0.066667 | \n", "0.600000 | \n", "NaN | \n", "0.033333 | \n", "0.033333 | \n", "0.400000 | \n", "0.200000 | \n", "NaN | \n", "0.033333 | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
5 rows × 3532 columns
\n", "