Harper Lee and Other People: A Stylometric Diagnosis.

There are few books with as enduring a grasp on the American mind and heart as Harper Lee's To Kill a Mockingbird. Cherished by readers worldwide, the novel also continues to be the subject of extensive academic interest. (1) The sheer sales figures speak volumes about the book: HarperCollins boasts that more than thirty million copies of To Kill a Mockingbird have been sold over the years and that it has been translated into forty languages (Cavoto 418). The enduring popularity of the novel has been sensational in itself, but the film adaptation redoubled its success. The movie diminished the role of Scout and the story of the children, focusing more attention on Oscar-winning Gregory Peck as Atticus Finch, (2) a character often controversially idealized as a paragon of morality and a respectful champion of the oppressed.

But the matter at hand is not only considering the book in comparison to the movie. Harper Lee, a writer who delivered a compelling portrayal of her southern hometown, was in the spotlight, to some extent exactly because of her attempts to remain out of it. A number of biographical or semi-biographical publications on To Kill a Mockingbird prove that the story surrounding Lee's writing of the novel has a life of its own, interwoven with the cultural functioning of the book. Her troubled friendship with Truman Capote; her relationship with her father, Amasa Coleman Lee, upon whom the character of Atticus Finch is largely modeled; her reclusiveness after the astounding success of her debut--all these elements have attracted the interest of the general public almost as much as the fictional story she authored.

When news of the pending publication of the second novel by Harper Lee was released in July 2015, the nation's fascination with To Kill a Mockingbird was reignited on a massive scale. Generations of readers who grew up with a copy of To Kill a Mockingbird in their hands and the image of Gregory Peck as Atticus Finch in their minds eagerly looked forward to learning what happened to Scout later in her life. It was clear that Go Set a Watchman was not a sequel, but the initial version of Lee's debut book, whose plot is set later in the timeline than To Kill a Mockingbird; nonetheless, the majority of readers would find it hard not to read it as the continuation of their favorite book and not to despair over the metamorphosis of Atticus from broad-minded to bigoted.

With the publication of Go Set a Watchman, notorious doubts and reservations concerning the book resurfaced, among them those attributing credit for Harper Lee's writing to Truman Capote, her lifelong friend and an acclaimed writer. While the question of Lee's authorship of To Kill a Mockingbird has been settled by both legal and stylometric means and reported in the media (Gamerman D5), the controversy has shown the need to study in detail the possible affinities between the two most famous inhabitants of Monroeville, Alabama.

The aim of this study is not to question the authorship of Go Set a Watchman, To Kill a Mockingbird, or In Cold Blood using quantitative methods of attribution. This matter has already been discussed elsewhere (Eder and Rybicki "Go Set"). Rather, one goal is to complement literary history with stylometry and biography with statistics; the other is to discuss the various quantitative results that make too much sense in time-proven hermeneutic interpretations to be discarded as coincidences. This article seeks to answer a question of greater import than gossip with more or less implausible attribution, namely: How, if at all, is the common background of the two writers visible in the way they use language throughout their respective careers?

"Tru" & "Nelle"

The troubled relationship between Truman Capote and Harper Lee is widely discussed by both authors' biographers (Shields; Clarke), but the two authors also famously commemorated each other in their fiction. (3) As children, "Tru" and "Nelle" were inseparable and shared a "common anguish" (Shields 26-27), a desire to express themselves in writing. At the age of twenty-five, Capote published "Miriam," a dreamlike short story that caught the attention of publishers and resulted in a contract for his first novel, Other Voices, Other Rooms, which appeared in 1948. With the publication of The Grass Harp two years later, his literary career began accelerating. In the 1950s, Capote gained the status of a celebrity as he engaged in a number of Broadway projects, including the musical House of Flowers (1954), for which he co-authored the book and lyrics. His iconic novel Breakfast at Tiffany's appeared four years later.

While Capote's creativity seemed to be on a constant rise, Lee was agonizing over her debut novel. The manuscript she delivered in 1957 to the J. B. Lippincott Company was much too fragmented. For Tay Hohoff, Lee's editor, it resembled "more a series of anecdotes than a fully conceived novel" (Shields 87). (4) Later on, after the publication of To Kill a Mockingbird in 1960 and the enormous success that followed, rumors surfaced that Lee had relied heavily on Capote's guidance in the midst of the turbulent creative process. As a much more experienced writer, he was rumored to have actually helped her extensively in the composition of the final draft. Given the intimacy of their friendship at that time, it is likely that he read the manuscript and offered Lee suggestions for revision, but for him to have authored some fragments of the novel seems doubtful. As a rather garrulous type, he had problems concealing any project he was even partially engaged in, even if asked to keep it to himself. Still, as observed by Shields, "Even now, nearly 50 years after To Kill a Mockingbird appeared, the rumor persists that Nelle Harper Lee didn't write the novel herself. Truman Capote, so goes the whisper campaign, wrote large portions--or maybe all of it" (98). Interestingly, Capote, "characteristically neurotic when it came to awards" (Schultz 101), grew increasingly envious of Lee's success, especially after she won the Pulitzer Prize, and never denied the rumor decisively (Shields 189). What is known for sure is that he enjoyed being represented by one of the characters, as he stressed in a letter to the film producer David O. Selznick, a friend. Describing the book as "delightful," Capote wrote that To Kill a Mockingbird is "going to be a great success. In it, I am the character called 'Dill'--the author being a childhood friend" (Capote 284).

In 1959, even before To Kill a Mockingbird was finally published and hit the New York Times and Chicago Tribune bestseller lists, Capote and Lee began cooperating on a nonfiction crime book. They conducted research together, Lee acting effectively as Capote's assistant. In the spring of 1960, Lee presented Capote with 150 pages of typed notes organized by topic, including descriptions of the local landscape and of the people involved in the crime. In Cold Blood was published in 1966 and soon turned out to be a great success, paving the way for the emergence of the "true crime" genre. However, for all the critical acclaim, the book failed to bring Capote either the Pulitzer Prize or the National Book Award. As Schultz argues, Capote viewed this as a failure and a setback (101). He was not the only person to have emerged disappointed from the In Cold Blood project: Lee soon discovered that Capote had failed to acknowledge her contributions and merely dedicated the book to her.

In spite of the recurrent news that Lee was working on her second book, she withdrew slowly into reclusiveness over the years and did not publish it until 2015. Meanwhile, from the beginning of the 1970s, Capote suffered from creative burnout. In the wake of "myriad health issues," feeling like an "insolvable jigsaw puzzle" (Schultz 103), he started abusing alcohol and drugs, gradually slipping into personal disarray. In 1984, he was found dead in Bel Air, Los Angeles, at the home of Joanne Carson, an old friend.

Capote's death did not terminate his literary output, however, as a few of his works were released posthumously. First, Answered Prayers: The Unfinished Novel was released by Hamish Hamilton in 1986, two years after Capote's death. Earlier, Capote had managed to publish four chapters of the book in Esquire, causing a social uproar. At that time, the book seemed like a "literary Sasquatch" (Schultz 107): everyone talked about it, but very few people actually saw even its fragments. After publication, some of Capote's friends recognized themselves in the characters depicted leading lascivious lives and began to ostracize him socially. In 2005, another incomplete work by Capote, Summer Crossing, was released by Random House. It was produced from a manuscript that was recovered by a house sitter in the apartment in Brooklyn Heights where Capote lived in 1950. Finally, in 2015, Random House released fourteen previously unpublished stories written by Capote when he was only a teenager. The manuscripts had been discovered in the New York Public Library Archives two years earlier. All these recent publications provide valuable material for a comparative, hermeneutic, and stylometric diagnosis of the relationship between Lee and Capote.

The Method

While we believe that the hermeneutic paradigm continues to be, with good reason, the core of contemporary literary criticism, we also see the potential of enhancing the methodology with quantitative approaches to literature. In examining Capote and Lee, we take advantage of stylometry, or statistical analysis of style, as a way of identifying similarities between literary texts via the usage of particular words. There are at least three reasons why quantitative methods prove attractive in solving literary questions. First, stylometry focuses on uncovering stylistic (or linguistic) patterns that can hardly be spotted by the naked eye. A good example here is authorship attribution based on quantitative evidence. Even if a skilled and trained reader of literature is typically able to identify prevailing differences between authors, a vast majority of minute lexical, or even grammatical, idiosyncrasies remain hidden to human perception (Love). The advantages of stylometric methods lie in their ability to accumulate dozens of such subtle peculiarities of individually weak discriminative power into meaningful quantitative evidence. (5) Second, quantitative analysis of large text collections, or at least computing at once all the passages from all the books by Lee and Capote, expands the scale of interpretation considerably, by covering amounts of textual data that could not be handled by single scholars, even if they were trained to memorize large portions of text. This capacity allows for the exploration of literary and linguistic patterns on an unprecedented scale--for example, analyzing a particular literary period in its entirety (Dexter et al.), sometimes even hundreds of voluminous works at a time. 
The third reason quantitative approaches may be attractive is that they reduce the nebulous phenomenon of style to a number of quantifiable features, such as word frequencies, making it possible to perform a series of measurements or statistical tests that are, by definition, reproducible. These advantages come with corresponding downsides: detecting hidden patterns by definition ignores striking ones; a large-scale perspective essentially sacrifices intimate interpretations of literary masterpieces; and reducing style to a mathematical model leaves no room for a legion of rare yet eye-catching stylistic features not measurable by quantitative means. We are aware that the price to be paid would be rather high if stylometry simply replaced the hermeneutic paradigm; we believe, however, that the two perspectives prove useful when considered in tandem.

In its considerably long history dating back to the nineteenth century, stylometry has developed the notion of the stylistic fingerprint, or stylome, which usually refers to a number of linguistic features that are unique to particular authors (van Halteren et al. 65). On theoretical grounds, they include syntactic structures, characteristic phrases, and usage of certain words, but in practice the usage of ostensibly frequent grammatical words (function words) has proven to be a surprisingly effective measure. The list of these words is generated for each selection of texts, and it usually opens with such lexical items as the, and, to, a, of, I, in, was, or it, followed by morphologically more and more complex words as the list descends. However, while an author's vocabulary usually consists of thousands of words that are responsible for shaping a text's meaning, the information about idiosyncratic features of the authorial style can be traced back to the most rudimentary building blocks of text: function words.

How function words are used in a text is relatively difficult to quantify; hence, the standard stylometric procedure involves splitting a running text into distinct words and then counting those words individually. The result is a numeric measure of frequency, or how many times each function word occurs in the studied text. In spite of being a reduction--in the sense that the original order of words is lost--the frequency of words is convenient for analysis using various statistical methods. The techniques employed in stylometry, however, are usually those that can analyze a dozen or more frequencies at a time; thus, they are referred to as multivariate, or multidimensional. The number of patterns studied in this way far exceeds the limits of the human reader's computational perception.
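
The counting step described above can be sketched in a few lines of Python. This is an illustration only: the study itself relies on the stylo package for R, and the function names, the tokenization rule, and the sample sentence below are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a word is a run of letters, optionally followed by an
    apostrophe and more letters (so "don't" stays one token).
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def relative_frequencies(text):
    """Return each word's share of all tokens in the text, in per cent."""
    tokens = tokenize(text)
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {word: 100.0 * n / len(tokens) for word, n in counts.items()}


# A made-up sentence, used only to show the shape of the result.
sample = "The cat sat on the mat, and the dog sat by the door."
freqs = relative_frequencies(sample)

# List the three most frequent words with their relative frequencies.
for word in sorted(freqs, key=freqs.get, reverse=True)[:3]:
    print(word, round(freqs[word], 1))
```

Relative rather than raw counts are used here so that texts of different lengths remain comparable; word order is discarded, as the paragraph above notes.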

The practice of comparing texts by using multivariate analyses of word frequencies has been around for more than half a century, since Mosteller and Wallace released their authorship study of the Federalist Papers (1964), but at least equally important was John F. Burrows's application of similar methods to works by Jane Austen (1987). Burrows's main finding was that Austen used dialogue for characterization such that similar characters have similar "idiolects." Heroines such as Elizabeth Bennet and Elinor Dashwood employ similar proportions of most-frequent words; their respective beaus, Darcy and Edward, and the villains of both love stories, Wickham and Willoughby, are also similar to each other in the distribution of most frequent words. Burrows demonstrates his claims by using Principal Component Analysis of the relative frequency of up to one hundred words most commonly used. Burrows also emphatically remarks that "It is a truth not generally acknowledged that, in most discussions of works of English fiction, we proceed as if a third, two-fifths, a half of our material were not really there. For Jane Austen, that third, two-fifths, or half comprises the twenty, thirty, or fifty most common words of her literary vocabulary," most of these being function words (modals, articles, prepositions) rather than any traditionally understood content-bearing words (1). Thus, the sheer size of the data--as opposed to the relatively infrequent appearance of individual "content" words--makes it possible to discern patterns of similarity and difference in texts that might make sense from a purely literary point of view. A later paper co-authored by Burrows attempts to explain this mechanism:
    The possibility of using such simple evidence for such large
    purposes rests upon the fact that words do not function as discrete
    entities. Since they gain their full meaning through the different
    sorts of relationships they form with each other, they can be seen
    as markers of those relationships and, accordingly, of everything
    that those relationships entail. (McKenna et al. 152)

There are other reasons for the growing number of studies adopting the stylometric approach. To one of its main proponents, David L. Hoover, they "represent elements or characteristics of literary texts numerically, applying the powerful, accurate, and widely accepted methods of mathematics to measurement, classification, and analysis." Hoover goes on to say that the "availability of large numbers of electronic literary texts and huge natural language corpora has increased the attractiveness of quantitative approaches as innovative ways of 'reading' amounts of text that would overwhelm traditional modes of reading" (517). Indeed, stylometrics has been recently going hand in hand with a new approach to textual criticism called "distant reading" (Moretti), which finds patterns in large collections of texts instead of employing the traditional close reading of individual works of literature. Much of the macroanalysis of literature and literary phenomena done by Matthew Jockers is based on integrating the distant approach with analysis of most frequent words. This and many other studies have shown that mapping texts in such a way produces visualizations that make sense from the point of view of traditional literary studies. These maps demonstrate a coexistence of numerous signals: that of the author, chronology, theme, genre, or gender. Despite the lack of straightforward theoretical explanation, the empirical evidence is compelling and makes it at least interesting to apply these methods to concrete literary comparisons such as the one presented in this study. What seems to be particularly promising in the context of literature is the fact that the combination of the stylometric signals above can reveal intertextual relations between analyzed texts. 
In the case of Lee and Capote, even if the question of authorship of To Kill a Mockingbird seems to be solved, one might still be interested in discovering the extent to which the style of the Pulitzer-winning novel resembles that of, say, In Cold Blood.

This study relies on a well-established stylometric procedure based on counting frequencies of various numbers of most frequent words (MFWs) in textual corpora. For each of these, a ranking list of words and their frequency is produced; then, the frequency of a selected number of these words (usually between 100 and 5,000) is established for each of the texts. Consequently, each text is represented by a row of frequency calculations that constitutes its stylistic profile. The individual profiles are then compared, so that for each pair of texts, a measure of similarity, usually referred to as "distance," is calculated. The underlying idea is rather straightforward, namely, the greater the distance between two given texts, the less similar they are to each other. Computed distances can then be used as input for various multivariate procedures.
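
To make the procedure concrete, the following Python sketch builds a most-frequent-word list for a toy corpus, turns each text into a frequency profile, and computes pairwise distances. The paragraph above does not name a particular distance measure; Burrows's Delta (the mean absolute difference of z-scored word frequencies) is assumed here as one widely used option. The three miniature "texts", their labels, and all function names are invented for the illustration.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def mfw_list(texts, n):
    """The n most frequent words in the whole corpus, most frequent first."""
    corpus_counts = Counter()
    for text in texts.values():
        corpus_counts.update(tokenize(text))
    return [word for word, _ in corpus_counts.most_common(n)]


def profile(text, words):
    """Relative frequencies (per cent) of the chosen words in one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return [100.0 * counts[w] / len(tokens) for w in words]


def delta_distances(texts, n):
    """Pairwise Burrows's Delta over the n most frequent words (assumed measure)."""
    words = mfw_list(texts, n)
    names = list(texts)
    profiles = {name: profile(texts[name], words) for name in names}
    # Standardize each word's frequency across the corpus (z-scores),
    # so that very common words do not dominate the comparison.
    z = {name: [] for name in names}
    for i in range(len(words)):
        column = [profiles[name][i] for name in names]
        mu, sigma = mean(column), pstdev(column)
        for name in names:
            z[name].append((profiles[name][i] - mu) / sigma if sigma else 0.0)
    return {
        (a, b): mean([abs(x - y) for x, y in zip(z[a], z[b])])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Invented miniature corpus: A1 and A2 share a "style", B1 does not.
texts = {
    "A1": "the sea and the sky and the sand",
    "A2": "the sky and the sea and the wind",
    "B1": "of a man of a town of a time it was",
}
for pair, distance in delta_distances(texts, 4).items():
    print(pair, round(distance, 3))
```

In this toy corpus the two "A" texts come out closer to each other than either is to "B1", which is the behaviour the distance is meant to capture: the greater the distance, the less similar the profiles.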

Building on the approach described above, we count word frequencies and introduce a few tailored techniques inspired by time-proven classification methods, such as hierarchical clustering and support vector machines. (6) Bootstrap Consensus Tree (Eder et al.) is a method that aggregates partial results produced by classical hierarchical clustering using different input parameters and builds a tree-like plot of "nearest neighbors," which brings together the texts that share the strongest similarity. The resulting tree consists of "leaves" and "branches" (groups of texts exhibiting stylistic similarities) (see Figure 1). Sufficient in many cases, the Consensus Tree method also exhibits some downsides, including the fact that it focuses on very strong similarities and filters out the weak ones. To overcome this limitation, an extension of the method involving the advances of network analysis was used. Bootstrap Consensus Network (Eder "Visualization"), then, represents and visualizes "nearest neighbor" similarities in the form of a network between texts. The algorithm establishes robust network connections only between the pairs of texts that exhibit strong similarities, while keeping ethereal relations--usually indirect intertextual traces--as light connections of the network (see Figure 2). The results are then further processed through network analysis with the Force Atlas 2 algorithm, which presents a spatial balance of data points based on the degree of similarity between pairs of texts (Jacomy et al.).
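
The idea of aggregating results obtained under different input parameters can be illustrated with a deliberately simplified Python sketch. It is not the Bootstrap Consensus Tree or Network algorithm of the cited papers: instead of hierarchical clustering it records, for several most-frequent-word settings, which text is each text's nearest neighbor, and keeps only the links that recur in a chosen share of the runs. The distance (plain Manhattan distance on relative frequencies), the threshold, the toy texts, and all names are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def distances(texts, n):
    """Manhattan distance between relative frequencies of the n most frequent words."""
    tokens = {name: tokenize(t) for name, t in texts.items()}
    corpus = Counter(tok for toks in tokens.values() for tok in toks)
    words = [w for w, _ in corpus.most_common(n)]
    prof = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        prof[name] = [counts[w] / len(toks) for w in words]
    names = list(texts)
    return {
        frozenset((a, b)): sum(abs(x - y) for x, y in zip(prof[a], prof[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


def consensus_links(texts, mfw_settings, threshold=0.5):
    """Keep pairs that are nearest neighbors in at least `threshold` of the runs."""
    votes = Counter()
    for n in mfw_settings:
        dist = distances(texts, n)
        links = set()
        for name in texts:
            nearest = min(
                (other for other in texts if other != name),
                key=lambda other: dist[frozenset((name, other))],
            )
            links.add(frozenset((name, nearest)))
        votes.update(links)  # each pair counts at most once per run
    runs = len(mfw_settings)
    return {pair: count / runs for pair, count in votes.items()
            if count / runs >= threshold}


# Invented miniature corpus with two stylistic "camps".
texts = {
    "A1": "the sea and the sky and the sand",
    "A2": "the sky and the sea and the wind",
    "B1": "of a man of a town of a time it was",
    "B2": "of a day of a year of a life it is",
}
links = consensus_links(texts, mfw_settings=(2, 4, 6))
for pair, share in links.items():
    print(sorted(pair), share)
```

The share attached to each surviving link plays the role of a connection weight: links that recur under every setting would be drawn as strong ties, while links that appear only occasionally would be drawn lightly or dropped, depending on the threshold.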

The third method we use relies on the assumption that literary texts do not (or at least do not have to) exhibit their stylistic profile evenly throughout the entire plot development. Rather, some textual idiosyncrasies might be expected. To allow insight into local fluctuations of style, the method Rolling Classify (Eder, "Rolling") divides input texts into equal-sized blocks, and then assesses those blocks sequentially. Each block is stylometrically compared with a reference corpus in order to trace its nearest neighbor. The final stage involves visualizing the results sequentially block by block (see Figure 3).
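
A block-by-block comparison of this kind can be sketched in Python as follows. The sketch is a stand-in rather than the Rolling Classify implementation: it assigns each block to the reference text with the nearest frequency profile (Manhattan distance), whereas other classifiers could be plugged in at that step. The block size, the toy texts, and the function names are assumptions for the example.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def profile(tokens, words):
    """Relative frequencies of the chosen words in a list of tokens."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in words]


def rolling_classify(target, references, block_size, n_words):
    """Label each consecutive block of `target` with its nearest reference text."""
    ref_tokens = {name: tokenize(t) for name, t in references.items()}
    corpus = Counter(tok for toks in ref_tokens.values() for tok in toks)
    words = [w for w, _ in corpus.most_common(n_words)]
    ref_profiles = {name: profile(toks, words) for name, toks in ref_tokens.items()}

    tokens = tokenize(target)
    labels = []
    # Step through the text in non-overlapping, equal-sized blocks;
    # a final remainder shorter than block_size is dropped.
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = profile(tokens[start:start + block_size], words)
        nearest = min(
            ref_profiles,
            key=lambda name: sum(abs(x - y) for x, y in zip(block, ref_profiles[name])),
        )
        labels.append((start, nearest))
    return labels


# Invented reference "authors" and a target whose style changes halfway.
references = {
    "A": "the sea and the sky and the sand " * 3,
    "B": "of a man of a town of a time " * 3,
}
target = "the sky and the sea and the wind of a day of a year of a"
for start, label in rolling_classify(target, references, block_size=8, n_words=4):
    print(start, label)
```

Plotting the resulting labels in sequence, one colored segment per block, would give a stripe of the kind the paragraph above describes.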

All the methods discussed above--from electronic text input through final results and visualization--are performed with the stylo package (Eder et al.) for R, the statistical programming environment (R Core Team). The network layout is computed and graphically rendered by the Gephi network analysis program (Bastian et al.).

The Results

The nearest neighbor tree diagram (Fig. 1) places Capote and Lee in the context of other writers of the American South. It is a good example of the potential of the function words in authorship attribution. All texts in the graph are first grouped by author before they are linked with other writers. Capote's early writings stand out here since they form a pair which shows no stylometric likeness to his later works. Also, the authorial signal is so strong in Lee's editor Tay Hohoff that it does not matter that her two texts, Cats and Other People and A Ministry to Man, represent different genres. The former is a collection of short stories, the latter a biography of John Lovejoy Elliott, whose possible influence on the Hohoff-influenced To Kill A Mockingbird version of Atticus has been discussed elsewhere (Mahler CI). Interestingly, Harper Lee's closest neighbor is neither Capote nor her editor; instead, she is closest to Eudora Welty.

The network map (Fig. 2), which is not limited to the strongest similarities, confirms the mutual independence of Capote and Lee. In this visualization, texts are shown to be more similar the closer they are placed to each other and when they are connected by thicker lines. Consequently, the output of the two authors is anything but similar: they are placed at two extremities of the map. There is a very close connection between the two Lee novels, possibly the strongest of all one-on-one comparisons in this corpus, and Lee again seems to have much more in common with Welty; at the same time, another of her favorites, Faulkner, is also not far away. Significantly, Welty was writing Losing Battles, the 1970 novel that seems to bear the strongest affinity with Lee's texts, at the time of the soaring popularity of To Kill a Mockingbird. Such a result encourages further examination of the stylistic kinship between Welty and Lee. A detailed exploration of these affinities goes beyond the scope of this paper; however, the resemblance visible in the graph produced for this study suffices to indicate that the two female authors share some relevant aspects of style as well as a command of language. Their common knack for anecdotal humor and their depictions of the Depression-era South based on personal experience could also explain some similarities between the two.

Capote, much more prolific than Lee in terms of sheer literary output, registers somewhat differently in this visualization in that he presents a more evolutionary line. It starts in the bottom-center of the network with his newly discovered teenage stories, treated here as a single body of work and (pre-)dated at 1940. His somewhat later story, "Miriam," takes him further left, as does his next work and first novel, Other Voices, Other Rooms (1948); and the main body of his (later) work becomes entrenched there for good. It is interesting to note that his posthumous novel is very much part of that cluster, despite its turbulent history and the piecemeal nature of its creation.

The only connection between Lee and Capote in this diagram is the very thin line joining To Kill a Mockingbird with The Early Stories. This might suggest that while the two writers have little in common in terms of most frequent words in their overall outputs, there might be affinities between smaller fragments of their texts. These connections are clearly visible in Figure 3, which compares samples of selected writings from this corpus against one another. By dividing each text into fragments of 4,000 words (this size was chosen because the shortest text among those compared, "Miriam," is only slightly longer than this), consecutive individual fragments of To Kill a Mockingbird from the first to the last page can be scanned for the "signals" of other texts using Support Vector Machines as the main statistical tool. The fragments of To Kill a Mockingbird were compared sequentially against a selection of texts including two books by Hohoff, eight by Capote, and Go Set a Watchman as the bearer of the original Lee "signal." The most similar text is always indicated by the corresponding color of the lower stripe; the second-closest, by the upper stripe.

For the one hundred most frequent words, this diagram shows an overall dominance of the "original" and Hohoff-untainted Harper Lee pattern. Still, there are interesting exceptions. The initial thirty thousand words or so of To Kill a Mockingbird--roughly corresponding to the novel's first and expository chapter--betray a strong signal of Hohoff derived from her Cats and Other People (green); it continues to be visible as second-strongest in chapter two. This could suggest that the editor's impact on the complete rewrite of Go Set a Watchman was the strongest in the first two chapters of the resulting masterpiece--a reasonably plausible scenario given that Hohoff's focus on initial fragments is well-known.

Of the six instances when the Capote signal (in his Early Stories variety) is dominant, one is particularly interesting: the final and longest occurrence of the signal towards the end of To Kill a Mockingbird at the dramatic climax in chapter twenty-eight, that is, Bob Ewell's attack on Scout and Arthur "Boo" Radley's intervention. It is obviously a tempting idea to associate this variation of the stylometric signal with the heightened physical drama of the chapter, or at least with the change of pace of the action. The temptation becomes even greater when one realizes that the other two major Capotean segments correspond to chapters five and six, featuring the children's mission "to give somethin' [a letter] to Mr. [Boo] Radley" (54) and his father's mistaking them for intruders and firing a gun. As announced by one of the gathered neighbors: "Mr. [Arthur] Radley shot at a Negro in his collard patch" (61). Now, there is no reason to imagine that whenever events in her novel become more violent, Lee starts writing like Capote; but she certainly does write differently then--or at least she uses her most frequent words in a different way. And she does so in a way that resembles Capote's writing--as long as Capote is one of the points of comparison. Having said this, it is hard to ignore another Capotean trace in the two chapters in question: this is where Dill has much to say. It is worth mentioning that other--but only second-strongest--Capote signals visible in Mockingbird are those of In Cold Blood.

The signals of other authors recede to second-strongest at best when the same analysis is made for five hundred most frequent words (Fig. 4); the increased length allows for some impact of content words rather than function words that dominate the first one hundred words of every text and corpus in every natural language. While the entire length of To Kill A Mockingbird is now purely Go Set A Watchman in terms of most frequent words, the second-strongest signals are those of the same texts that were visible in the previous analysis: Cats and Other People, Early Stories, and In Cold Blood.


"I shall be brief, but I would like to use my remaining time with you to remind you that this case is not a difficult one, it requires no minute sifting of complicated facts," Atticus says in his opening speech at Tom Robinson's trial (Mockingbird 230-31). Irrespective of whether the trial is eventually lost in To Kill a Mockingbird or won in Go Set a Watchman, the statement perhaps applies to this strange case of combining supposedly objective lexical statistics with supposedly subjective critical readings of the works of Harper Lee and Truman Capote. The two very different approaches seem to make sense together. There is no sign here that someone else really authored Lee's Pulitzer Prize-winning bestseller; or that the future author of In Cold Blood wrote that violent episode in To Kill a Mockingbird; or even that his early stories are someone else's hoax. Literary history is clear on all those matters, and no serious practitioner of stylometry should try to reopen resolved matters in this way; however, there are plenty of unresolved issues waiting to be addressed. Indeed, in an earlier study, two of us concentrated not on Capote but on Tay Hohoff and her editorial impact on To Kill a Mockingbird, and we found that while it was considerable in individual fragments, it was not enough to change the fact that, overall, authorial attribution by stylometry pointed to Lee (Eder and Rybicki, "Go Set"). Nor is it our aim--or even our focus--to establish whether Capote wrote any particular fragments for Lee, or whether she wrote parts of In Cold Blood for him. The main interest, with respect to all of the questions above, is that there are some patterns of similarity and difference between entire books and between their fragments. These patterns occur in places that make sense from a close reading perspective. 
Even if we assume that stylometry based on most frequent words measures only similarity and difference in different texts and different authors, which has no bearing on the human reception of a literary work, empirical studies such as this one provide interesting insights into the possible connections between such linguistic questions as lexicon and grammar and the literary questions of style, content, and intertextuality. In other words, stylometry in its distant-reading variant may be used to point out elements, passages, and fragments that could provide new perspectives for traditional close reading--perhaps similar to how lab results help a physician make a diagnosis.


Jagiellonian University


Polish Academy of Sciences


Jagiellonian University

Works Cited

The Author and His Audience: With a Chronology of Major Events in the Publishing History of J. B. Lippincott Company. Philadelphia: J. B. Lippincott Co., 1967.

Bastian, Mathieu, Sebastien Heymann, and Mathieu Jacomy. "Gephi: An Open Source Software for Exploring and Manipulating Networks." Proceedings of the International AAAI Conference on Weblogs and Social Media, San Jose, CA, 2009. Accessed 23 May 2019.

"Being Atticus Finch: The Professional Role of Empathy in To Kill a Mockingbird." Harvard Law Review 117.5 (2004): 1682-1702.

Burrows, John F. Computation into Criticism: A Study of Jane Austen's Novels and an Experiment in Method. Oxford: Clarendon P, 1987.

Capote, Truman. In Cold Blood. New York: Random House, 1966.

--. Too Brief a Treat: The Letters of Truman Capote. Ed. Gerald Clarke. New York: Random House, 2004.

Cavoto, Janice E. "Harper Lee's To Kill A Mockingbird." The Oxford Encyclopedia of American Literature. Vol. 2. Oxford: Oxford UP, 2004. 418-21.

Clarke, Gerald. Capote: A Biography. New York: Simon and Schuster, 1988.

Dexter, Joseph P., et al. "Quantitative Criticism of Literary Relationships." Proceedings of the National Academy of Sciences. 2017. Accessed 23 May 2019.

Eder, Maciej. "Rolling stylometry." Digital Scholarship in the Humanities 31.3 (2016). Accessed 23 May 2019. doi:

--. "Visualization in Stylometry: Cluster Analysis Using Networks." Digital Scholarship in the Humanities 32.1 (2017): 50-64. Accessed 23 May 2019. doi:

Eder, Maciej, and Jan Rybicki. "Go Set A Watchman While We Kill the Mockingbird in Cold Blood, with Cats and Other People." Digital Humanities 2016: Conference Abstracts. Krakow: Jagiellonian University & Pedagogical University, 2016. 184-86.

Eder, Maciej, Jan Rybicki, and Mike Kestemont. "Stylometry with R: A Package for Computational Text Analysis." The R Journal 8.1 (2016): 107-21.

Gamerman, Ellen. "Data Miners Dig Into 'Go Set a Watchman.'" Wall Street Journal (17 July 2015): D5.

Haggerty, Andrew. Harper Lee: To Kill a Mockingbird. New York: Marshall Cavendish, 2010.

Herrmann, J. Berenike, Karina van Dalen-Oskam, and Christof Schoch. "Revisiting Style, a Key Concept in Literary Studies." Journal of Literary Theory 9.1 (2015): 25-52.

Hohoff, Tay. "We Get a New Author." Literary Guild Book Club Magazine 8 (1960): 3-4.

Hoover, David L. "Quantitative Analysis and Literary Studies." A Companion to Digital Literary Studies. Ed. Ray Siemens and Susan Schreibman. Oxford: Blackwell, 2007. 517-33.

Jacomy, Mathieu, Tommaso Venturini, Sebastien Heymann, and Mathieu Bastian. "ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software." PLoS ONE 9.6 (2014): e98679. doi: 10.1371/journal.pone.0098679

Jockers, Matthew L. Macroanalysis: Digital Methods and Literary History. Urbana: U of Illinois P, 2013.

Johnson, Claudia Durst. To Kill a Mockingbird: Threatening Boundaries. New York: Twayne, 1994.

Lee, Harper. "Christmas to Me." McCall's (Dec. 1961): 63.

--. Go Set a Watchman. New York: HarperCollins, 2015.

--. To Kill a Mockingbird. 1960. New York: HarperCollins, 2002.

Love, Harold. Attributing Authorship: An Introduction. Cambridge: Cambridge UP, 2002.

Lubet, Steven. "Reconstructing Atticus Finch." Michigan Law Review 97.6 (1999): 1339-62.

Mahler, Jonathan. "Invisible Hand That Nurtured an Author and a Literary Classic." New York Times (13 July 2015): C1.

McKenna, Wayne, John F. Burrows, and Alexis Antonia. "Beckett's Trilogy: Computational Stylistics and the Nature of Translation." Revue Informatique et Statistique dans les Sciences humaines 35(1-4): 151-71.

Moretti, Franco. Distant Reading. New York: Verso Books, 2013.

Mosteller, Frederick, and David L. Wallace. Inference and Disputed Authorship: The Federalist. Reading, MA.: Addison-Wesley, 1964.

Murray, Jennifer. "More Than One Way to (Mis)Read a Mockingbird." Southern Literary Journal 43.1 (2010): 75-91.

Neri, Greg. Tru & Nelle. New York: Houghton Mifflin, 2016.

Norden, Eric. "Playboy Interview: Truman Capote." Playboy 15 (March 1968): 51+.

Petry, Alice Hall. "Harper Lee, the One-Hit Wonder." On Harper Lee: Essays and Reflections. Ed. Alice Hall Petry. Knoxville: U of Tennessee P, 2007. 143-64.

R Core Team. R: A language and environment for statistical computing. 2014. Accessed 23 May 2019.

Sarat, Austin, and Martha Merrill Umphrey, eds. Reimagining To Kill a Mockingbird: Family, Community, and the Possibility of Equal Justice Under Law. Boston: U of Massachusetts P, 2013.

Schultz, William Todd. Tiny Terror: Why Truman Capote (Almost) Wrote Answered Prayers. Oxford: Oxford UP, 2011.

Shields, Charles J. I am Scout: The Biography of Harper Lee. New York: Henry Holt, 2008.

van Halteren, Hans, et al. "New machine learning methods demonstrate the existence of a human stylome." Journal of Quantitative Linguistics 12.1 (2005): 65-77.

(1) Jennifer Murray points to a surge of interest in To Kill a Mockingbird spanning from Claudia D. Johnson's To Kill a Mockingbird: Threatening Boundaries (1994) to On Harper Lee: Essays and Reflections (2007). The interest in Harper Lee continues, not only with the publication of To Kill a Mockingbird: New Essays (2010) and Reimagining To Kill a Mockingbird: Family, Community, and the Possibility of Equal Justice Under Law (2013), but also with the organization of conferences and academic sessions dedicated solely to Lee. The stylometric study discussed in this article was first presented at one such conference, "Harper Lee: Revision," organized in 2016 at Ludwig-Maximilians-Universität Munich.

(2) Shields observes that "Peck positioned himself firmly and prominently at the center of the film" (168) and thus, while in the novel only fifteen percent of the content is devoted to Tom Robinson's trial, the film version dedicates more than thirty percent of the running time to the trial.

(3) Capote's Idabel Thompkins from his first book, Other Voices, Other Rooms (1948), unmistakably calls to mind a young Harper Lee, while Lee's portrayal of Dill from To Kill a Mockingbird (1960) as "a pocket Merlin" with the knack for storytelling can hardly be based on anyone else but Truman Capote. In his recent story of their childhood friendship, Tru & Nelle (2016), Greg Neri builds a narrative from their first encounter in Monroeville, Alabama, when she was six and he seven, up to a startling scene when Ku Klux Klan members disrupt a Halloween party Truman organized.

(4) Petry describes how Lee was pressured to modify her draft extensively, rewriting and reorganizing considerable sections of the text (160-61). The pervasive change of genre understandably affected the book, throwing the narrative voice out of balance. To Murray, however, Petry's account of the unification process fails to fully embrace the importance of Lee's own artistic preference for writing short stories rather than novels (77-78). See also The Author and His Audience 27-29 for a description of the progression of the manuscript from submission to publication.

(5) See for example Mosteller and Wallace's analysis of the authorship of The Federalist.

(6) The mathematics of multivariate methods, in their different flavors, is discussed at length in several studies as well as in textbooks. A few recent books include: G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with Applications in R (2014); P. Flach, Machine Learning: The Art and Science of Algorithms That Make Sense of Data (2015); R. H. Baayen, Analyzing Linguistic Data: A Practical Introduction to Statistics Using R (2008), especially the chapter "Clustering and Classification."

Caption: Figure 1. Cluster analysis bootstrap consensus tree showing the nearest neighbor patterns between selected writers of the American South.

Caption: Figure 2. Network analysis of selected writers of the American South.

Caption: Figure 3. Rolling Classify analysis of the usage of 100 most frequent words in Lee's To Kill a Mockingbird, compared against Go Set a Watchman and several books by Tay Hohoff and Truman Capote (TC).

Caption: Figure 4. Rolling Classify analysis of the usage of 500 most frequent words in Lee's To Kill a Mockingbird, compared against Go Set a Watchman and several books by Tay Hohoff and Truman Capote (TC).
