{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "5d716a01", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.style.use(\"../styles/hda.mplstyle\")" ] }, { "cell_type": "markdown", "id": "7dd83aad", "metadata": {}, "source": [ "(chp-topic-models)=\n", "# A Topic Model of United States Supreme Court Opinions, 1900--2000\n", "\n", "(sec-topic-models-introduction)=\n", "## Introduction\n", "\n", "\n", "\n", "In this chapter we will use an unsupervised model of text---a mixed-membership model or \"topic\n", "model\"---to make visible trends in the texts of decisions issued by the United States Supreme Court.\n", "As with many national courts, the decisions issued by the Court tend to deal with subjects which can be\n", "grouped into a handful of categories such as contract law, criminal procedure, civil rights,\n", "interstate relations, and due process. Depending on the decade, the Court issues decisions related\n", "to these areas of law at starkly different rates. For example, decisions related to criminal\n", "procedure (e.g., rules concerning admissible evidence and acceptable police practices) were common\n", "in the 1970s and 1980s but are rare today. Maritime law, as one might anticipate, figured in far more\n", "cases before 1950 than it does now. A topic model can be used to make these trends visible.\n", "\n", "\n", "\n", "```{margin}\n", "Prominent commercial providers of\n", "discrete labels for legal texts include Westlaw (owned by Thomson Reuters) and LexisNexis (owned by\n", "RELX Group, né Reed Elsevier).\n", "```\n", "This exploration of trends serves primarily to illustrate the effectiveness of an unsupervised\n", "method for labeling texts. Labeling which areas of law are discussed in a given Supreme Court\n", "decision has historically required the involvement of legal experts. As legal experts are typically\n", "costly to retain, these labels are expensive. 
Perhaps more importantly, the process by which these\n", "labels are arrived at is opaque to non-experts and to experts other than those doing the labeling.\n", "Being able to roughly identify the subject(s) discussed in a decision without manually labeled texts\n", "therefore holds considerable attraction for scholars in the field.\n", "\n", "\n", "\n", "This chapter describes how a mixed-membership model can roughly identify the subject(s) of decisions\n", "without direct supervision or labeling by human readers. To give some sense of where we are headed,\n", "consider {numref}`fig-topic-models-discrimination-topic` below, which shows for each year\n", "between 1903 and 2008 the proportion of all words in opinions related to a \"topic\"\n", "characterized by the frequent occurrence of words such as *school*, *race*, *voting*,\n", "*education*, and *minority*. (The way the model identifies these particular words is described in\n", "section {ref}`sec-topic-models-parameter-estimation`.) Those familiar with United States history will\n", "not be surprised to see that the number of decisions associated (in a manner to be described\n", "shortly) with this constellation of frequently co-occurring words\n", "increases dramatically in the late 1950s. The orange vertical line\n", "marks 1954, the year of the decision *Brown v. Board of Education of Topeka* (347 U.S.\n", "483). In this decision the Court ruled that a school district (the governmental entity responsible for education\n", "in a region of a US state) may not establish separate schools for black and white students. *Brown\n", "v. 
Board of Education of Topeka* was one among several decisions related to minorities' civil rights\n", "and voting rights. The 1960s and the 1970s witnessed multiple legal challenges to two signature laws\n", "addressing concerns at the heart of the civil rights movement in the United States: the Civil Rights\n", "Act of 1964 and the Voting Rights Act of 1965.\n", "\n", "```{figure} figures/discrimination-topic.png\n", "---\n", "name: fig-topic-models-discrimination-topic\n", "width: 70%\n", "---\n", "\n", "Yearly proportion of all words in opinions related to the topic characterized by words such as\n", "*school*, *race*, and *voting*. The vertical line marks 1954, the year of the decision *Brown v. Board of Education\n", "of Topeka* (347 U.S. 483).\n", "```\n", "\n", "```{note}\n", "An example of a challenge to the Voting Rights Act brought by a white-majority state government is *South Carolina v. Katzenbach* (383 U.S. 301). In *South Carolina v. Katzenbach*, South Carolina argued that a provision of the Voting Rights Act violated the state's right to regulate its own elections. Prior to the Voting Rights Act, states such as South Carolina had exercised their \"rights\" by discouraging or otherwise blocking non-whites from voting through literacy tests and poll taxes. In 2013, five Republican-appointed justices on the Supreme Court weakened an essential provision of the Voting Rights Act, paving the way for the return of voter discouragement measures in states such as Georgia {cite:p}`williams2018georgia`.\n", "```\n", "\n", "To understand how an unsupervised, mixed-membership model of Supreme Court opinions permits us to identify both\n", "semantically related groupings of words and trends in the prevalence of these groupings over time, we\n", "start by introducing a simpler class of unsupervised models that is an essential building\n", "block of the mixed-membership model: the mixture model. After introducing the mixture model we will\n", "turn to the mixed-membership model of text data, the model colloquially known as a topic model. 
By\n", "the end of the chapter, you should have learned enough about topic models to use one to model any large text corpus." ] }, { "cell_type": "code", "execution_count": 2, "id": "e79e8c8f", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# HIDE THIS CELL\n", "# NOTE: fixed random seed for normal mixture model\n", "import random\n", "import numpy.random\n", "random.seed(1)  # seed Python's built-in generator\n", "numpy.random.seed(1)  # seed NumPy's global generator" ] }, { "cell_type": "markdown", "id": "566f1b58", "metadata": {}, "source": [ "(sec-topic-models-mixture-models)=\n", "## Mixture Models: Artwork Dimensions in the Tate Galleries\n", "\n", "A mixture model is the paradigmatic example of an *unsupervised model*. Unsupervised models, as the\n", "name indicates, are not supervised models. Supervised models, such as nearest neighbors classifiers\n", "(cf. chapters {ref}`chp-vector-space-model` and {ref}`chp-stylometry`) or logistic regression,\n", "\"learn\" to make correct predictions in the context of labeled examples and a formal description of a\n", "decision rule. They are, in this particular sense, supervised. These supervised models are typically evaluated in terms of the predictions they make: give them\n", "an input and they'll produce an output or a distribution over outputs. For example, if we have a model that\n", "predicts the genre (tragedy, comedy, or tragicomedy) of a seventeenth-century French play, the input we\n", "provide the model is the text of the play and the output is a genre label or a probability\n", "distribution over labels. If we were to give as input the text of Pierre Corneille's *Le Cid*\n", "(1636), the model might predict tragedy with probability 10 percent, comedy with 20 percent, and\n", "tragicomedy with 70 percent. 
*Le Cid* is traditionally classified as a tragicomedy (cf.\n", "chapter {ref}`chp-vector-space-model`).\n", "\n", "Unsupervised models, by contrast, do not involve decision rules that depend on labeled data.\n", "Unsupervised models make a wager that patterns in the data are sufficiently strong that different\n", "latent classes of observations will make themselves \"visible\". (This is also the general intuition\n", "behind cluster analysis; see section {ref}`sec-stylometry-hierarchical-clustering`.) We will make\n", "this idea concrete with an example. The classic unsupervised model is the normal (or Gaussian) *mixture model*, and a typical setting for this model is multimodal data. In this section we estimate the parameters of a normal mixture model using\n", "observations of the dimensions of ca. 63,000 two-dimensional artworks in four art museums in the United Kingdom. Doing so will not take us far from topic\n", "modeling---mixtures of normal distributions appear in many varieties of topic models---and should\n", "make clear what we mean by an unsupervised model.\n", "\n", "We start our analysis by verifying that the dimensions of artworks from the four museums\n", "(the Tate galleries) are conspicuously multimodal. First, we need to load the data. A CSV\n", "file containing metadata describing artworks is stored in the `data` folder in compressed\n", "form as `tate.csv.gz`. We load it and inspect the first two records with the following lines of\n", "code:" ] }, { "cell_type": "code", "execution_count": 3, "id": "b3de16b3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "<table>\n", "  <thead>\n",
"    <tr><th></th><th>artist</th><th>acquisitionYear</th><th>accession_number</th><th>medium</th><th>width</th><th>height</th></tr>\n",
"    <tr><th>artId</th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>1035</th><td>Blake, Robert</td><td>1922.0</td><td>A00001</td><td>Watercolour, ink, chalk and graphite on paper....</td><td>419</td><td>394</td></tr>\n",
"    <tr><th>1036</th><td>Blake, Robert</td><td>1922.0</td><td>A00002</td><td>Graphite on paper</td><td>213</td><td>311</td></tr>\n",
"  </tbody>\n", "</table>\n", "