The Distant Reader

The Distant Reader is a tool for… distant reading. Given an almost arbitrary number of files of just about any type, the Reader applies text mining and natural language processing to the input and outputs a data set amenable to computation. These data sets are designed to stand the test of time, meaning they are independent of any particular operating system or network. They are akin to library collections, and examples include: 1) the complete works of Jane Austen, 2) 20,000 articles on the topic of COVID-19, or 3) digitized Catholic pamphlets. The Reader not only creates these data sets, but it also provides the ability to analyze, or model, them. Some of the out-of-the-box modeling techniques include: 1) frequencies of extracted features (ngrams, parts-of-speech, named entities, computed keywords), 2) topic modeling, 3) semantic indexing, 4) Linked Data, and 5) sentence extraction based on given grammars. More recently, work has been done to apply large language models to the data sets. Just as importantly, one does not need to use the Reader to do analysis; a great deal of analysis can be done against these data sets using off-the-shelf GUI applications or any computer programming language.
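As one illustration of the kind of analysis a general-purpose programming language makes possible, the following is a minimal Python sketch that counts ngram frequencies in a plain-text string. It is not the Reader's own implementation and does not read a Reader data set; the function name `ngram_frequencies`, the simple tokenization rule, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def ngram_frequencies(text, n=2):
    """Count n-grams (tuples of lowercased words) in a plain-text string.

    Hypothetical helper for illustration only; the tokenizer is a
    deliberately simple regular expression, not a full NLP pipeline.
    """
    # Split the text into lowercase word tokens (letters and apostrophes).
    tokens = re.findall(r"[a-z']+", text.lower())
    # Slide a window of size n across the token list.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)


# Sample input: the opening sentence of Pride and Prejudice.
sample = (
    "It is a truth universally acknowledged, that a single man in "
    "possession of a good fortune, must be in want of a wife."
)

# List the three most frequent bigrams in the sample.
for ngram, count in ngram_frequencies(sample, n=2).most_common(3):
    print(" ".join(ngram), count)
```

The same function could be applied to any plain-text file in a collection by reading the file into a string first; changing `n` switches between unigrams, bigrams, and longer ngrams.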

Speaker(s)

Eric is a librarian who works at the University of Notre Dame.

Eric Lease Morgan

May 14th

3:30 PM

TBD