Published in

SPIE Newsroom

DOI: 10.1117/2.1201109.003833

Towards free and searchable historical census images

Journal article published in 2011 by Kenton Mchenry, Luigi Marini, Mayank Kejriwal, Rob Kooper, Peter Bajcsy
This paper is available in a repository.

Abstract

Combining automation and crowd sourcing will provide access to archived handwritten forms.

Individuals and humanities researchers alike recognize the benefits of search services for censuses, which contain important information on ancestral populations [1]. In April 2012, the raw US census data from 1940 will be made available to the public for the first time in digital format. The census is being digitized by the National Archives and Records Administration and the US Census Bureau. Consisting of digitally scanned microfilm rolls, nearly 3.25 million photographs of the original census forms will be released (see Figure 1). Transcribing, organizing, and searching this very large (>18TB) corpus of images remains a resource-intensive task for other federal agencies. With databases of this type, a Soundex index, which encodes words based on how they sound to enable homophone matching, is often compiled. However, producing such an index is a tedious and time-consuming process, and one will not be released with the 1940 data. On the day of the data release, various commercial entities will also begin transcribing the handwritten content of the images, a task that will take thousands of trained laborers anywhere between 6 and 12 months. As a result, these companies will charge the public for access to the searchable, transcribed data. Here, we describe our approach to image-based information retrieval, which avoids the costly transcription process [2, 3].

Our goal is to minimize the manual labor needed to transcribe handwritten entries in the census images and to deliver a system capable of computationally scalable search services. Understanding the achievable accuracy and levels of automation depends on solving several problems related to scalability and data management. We endeavor to provide a completely automated search capability that can build more accurate transcriptions over time using passive and active crowd sourcing (see Figure 2).
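
To make the Soundex idea concrete, here is a minimal sketch of how a name can be reduced to a phonetic code so that similar-sounding spellings collide. It assumes the common American Soundex variant (first letter plus three digits); the article does not say which variant or implementation is used for census indexes, and the function name `soundex` is only illustrative.

```python
# Minimal sketch of American Soundex (an assumed variant; not taken from the article).
# Consonants with similar sounds share a digit; vowels are dropped.
CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code for a name, or "" if it has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")     # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # H and W do not separate equal codes
        code = CODES.get(ch, "")         # vowels (and Y) map to ""
        if code and code != prev:
            encoded += code
        prev = code                      # a vowel resets prev, allowing repeats
    return (encoded + "000")[:4]         # pad with zeros, truncate to 4 chars


print(soundex("Robert"), soundex("Rupert"))   # both names share one code
print(soundex("Smith") == soundex("Smyth"))   # homophone spellings match
```

An index built this way would map each code to the records whose surnames produce it, so a query for one spelling also retrieves its homophones. Producing the codes is trivial once names are transcribed; the expensive part is the transcription itself.
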
Commercial entities typically outsource manual transcription of census forms and host text-based search services over those transcribed entries. In contrast, our approach uses form segmentation, handwritten text indexing, and web-based search and crowd sourcing to minimize the manual transcription of the images.

Figure 1. A digitized census form from the 1930 US census.

Figure 2. The architecture of our proposed hybrid automated/crowd-sourcing system to provide access to content within scanned census forms. Image pyramids are used to pre-process the larger images to access small areas more efficiently.
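
The image pyramid named in the Figure 2 caption can be illustrated with a short sketch. The article does not describe how its pyramids are built, so the following assumes the simplest construction: each level halves the width and height of the previous one by averaging 2x2 pixel blocks of a grayscale image stored as a list of rows. The names `downsample` and `build_pyramid` are hypothetical.

```python
# Minimal sketch of an image pyramid (assumed 2x2 block averaging; not the
# authors' implementation). An image is a list of rows of integer gray values.

def downsample(image):
    """Halve width and height by averaging each 2x2 block of pixels."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [
        [
            (
                image[2 * r][2 * c]
                + image[2 * r][2 * c + 1]
                + image[2 * r + 1][2 * c]
                + image[2 * r + 1][2 * c + 1]
            ) // 4
            for c in range(w)
        ]
        for r in range(h)
    ]


def build_pyramid(image, min_size=1):
    """Return [full resolution, half, quarter, ...] down to min_size pixels per side."""
    levels = [image]
    while len(levels[-1]) // 2 >= min_size and len(levels[-1][0]) // 2 >= min_size:
        levels.append(downsample(levels[-1]))
    return levels


# A toy 4x4 "scan" with pixel values 0..15.
scan = [[r * 4 + c for c in range(4)] for r in range(4)]
pyramid = build_pyramid(scan)
print([(len(level), len(level[0])) for level in pyramid])
```

A viewer or search front end can serve a coarse level for an overview and fetch full-resolution pixels only for the small region (for example, one form cell) a user inspects, which is what makes access to small areas of a large scan cheaper.
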