OCR and PDFs - the possibilities and limitations for accessibility

To serve its patrons, NYPL is constantly finding new ways to make more content easily accessible through digitization. To expand our research catalog and get the most out of digitization, we have started applying Optical Character Recognition (OCR) to scanned books, not only to provide digital copies but also to improve accessibility with tagged and searchable PDFs of our research materials. In this talk, we’ll explore the technical details of our new PDF creation pipeline, as well as the possibilities and limitations we’ve encountered in using OCR data.

Speaker(s)

Sarang Joshi

May 14th

11:05 AM

10 minutes