A Project Gutenberg Poetry Corpus

Critical Writing
Record Status: 
Abstract (in English): 

In this paper, I present the Gutenberg Poetry Corpus: a corpus of over three million lines of poetry (in annotated JSON format) automatically curated from Project Gutenberg. Project Gutenberg, a collection of machine-readable texts in the public domain, began in the early 1970s with a hand-typed copy of the US Declaration of Independence. Now driven by the volunteer efforts of a decentralized group of proofreaders, Project Gutenberg consists of more than 54,000 texts, mostly English-language literature from the 18th and 19th centuries. Researchers in the humanities and in computational linguistics have made use of Project Gutenberg for decades, and more recently its use in data-driven computational creativity has grown. I describe the methodology used to automatically filter and identify lines of poetry from the larger Gutenberg corpus, discuss the potential of this corpus for research and creative work, and then present a series of my own experiments that use this corpus as their primary source material.

The permanent URL of this page: 
Record posted by: 
Susanne Årflot ...