The Center for Data Innovation spoke with Matt Knutzen, a Geospatial Librarian at the New York Public Library (NYPL). Knutzen discussed recent efforts to create structured data from maps and how the practice of librarians is changing in response to increasing demands for immediately actionable data.
This interview has been lightly edited.
Travis Korte: First, can you give me an overview of the sorts of things you work on as a geospatial librarian at the NYPL?
Matt Knutzen: My job at the New York Public Library is to oversee the Lionel Pincus & Princess Firyal Map Division here, but there are lots of interesting things I end up doing at the library. A typical week might include a visit from an elementary school group, where I give students who are studying cartography and exploration an interactive show and tell of paper maps from the 16th and 17th centuries and demonstrate our latest endeavors in crowdsourcing and tell them about the fun we’ve been having here turning historic maps into historic layers in Minecraft; selection and acquisition of new antiquarian maps for the collection; and researching and writing labels for an upcoming exhibition marking the 200th anniversary of the end of the first Creek War in 1814.
TK: You recently worked to make a portion of NYPL’s map collection available for download. Who is the target audience for this release? Do you know of specific research projects or applications that plan to draw from this collection?
MK: We really didn’t have a target audience in mind when we recently made 20,000 historical maps available for free hi-res download through our Map Warper site. Unless you could call “anyone who wants them” or “the general public” a target audience. We knew that making these maps more easily accessible would certainly help meet the needs of the researchers and patrons who use our collections the most, like those interested in, but certainly not limited to, the study of genealogy (e.g. “my great grandfather lived here”), environmental remediation (e.g. superfund sites), environmental history, architectural history, urban planning and comparative urban analysis, and historical site surveying (e.g. people rebuilding/redeveloping/digging into a site and needing to know its history). But I never assume to know all the possible uses of our collection, especially when we are enabling such broad access. What’s been really exciting and fun is to see how people use and engage with the materials in ways I couldn’t have ever imagined.
One example that immediately comes to mind is the response to Hurricane Sandy. In the aftermath of the storm, the New York State Department of Homeland Security contacted us to locate information that could dramatically illustrate and give historical context to past incidents of flooding in New York City. We selected a series of maps from our collections showing both lower Manhattan over 400 years and the Rockaway Peninsula over 160 years. In the case of Manhattan, the maps showed extensive and progressive landfill, much reclaimed temporarily during Sandy; the historical shoreline almost exactly matched the high water mark. In the Rockaways, nautical charts told a dynamic story of coastal geomorphology, of a barrier island that shifted and grew nearly five miles to the southwest and was hardened into place with asphalt and concrete just like neighborhoods built on bedrock. Ultimately, this historical data can be used to help inform policy makers on disaster resiliency and planning and, we hope, mitigate and absorb these types of powerful events in the future.
TK: Tell me a little bit about Map Warper. Are there any other new technologies your department is working with to add value to map collections?
MK: The Map Warper is a tool we have developed for the past five years with an open source GIS firm called Topomancy. It’s a very customized version of existing software libraries, built into a public-facing user interface, and connected, really for the first time, to a large institutional repository. What the Warper does is correlate the pixels depicting a place on an image of an old map to the corresponding geographic location, as in latitude and longitude on a virtual map, and then georectifies (normalizing or “warping” maps to align geographically at a uniform scale) the old image of a map and gives it spatial context. Think of Google Earth: when you zoom out, it looks sort of like a patchwork quilt of aerial and satellite images. These are all georectified images. What the Warper does, then, is allow users, including members of the public who create an account on the system, to transform and add value (in the form of geographic context) to our historical map collections. Essentially then, all of those nuanced decisions made by cartographers to rotate a map to fit the page, or shrink the scale of a map to make the right number of pages in an atlas, can be normalized across geographic space. An architectural scale map can be easily compared to and overlaid by a topographic map of a city or regional scale. This provides tremendous value to our users who have done this kind of work traditionally either using a photocopier (enlarging or shrinking), or in their minds (it takes a high degree of skilled mental gymnastics to perform this kind of task). Georectification also prepares maps to be transcribed at a uniform scale and undergo the kind of data mining processes we are subjecting them to through the Building Inspector. [See below].
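To make the idea of correlating pixels with latitude and longitude concrete, here is a minimal sketch in Python. It is not the Map Warper's code: it assumes the simplest possible case, an affine transform fitted exactly from three control points, and the function names and coordinates are hypothetical, chosen only for illustration. Real georectification tools typically accept many more control points and offer more flexible warps.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(m: List[List[float]], v: List[float]) -> List[float]:
    """Solve the 3x3 linear system m * x = v with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("control points are collinear; pick three that form a triangle")
    solution = []
    for col in range(3):
        # Replace one column of m with v, then take the ratio of determinants.
        replaced = [row[:col] + [v[r]] + row[col + 1:] for r, row in enumerate(m)]
        solution.append(det3(replaced) / d)
    return solution


def fit_affine(pixels: List[Point], coords: List[Point]):
    """Fit lon = a*x + b*y + c and lat = d*x + e*y + f from three control points.

    pixels: (x, y) positions on the scanned map image.
    coords: matching (longitude, latitude) pairs on the modern map.
    """
    if len(pixels) != 3 or len(coords) != 3:
        raise ValueError("this sketch needs exactly three control points")
    m = [[x, y, 1.0] for x, y in pixels]
    lon_params = solve3(m, [lon for lon, _ in coords])
    lat_params = solve3(m, [lat for _, lat in coords])
    return lon_params, lat_params


def pixel_to_lonlat(transform, x: float, y: float) -> Point:
    """Apply the fitted transform to any pixel on the scanned map."""
    (a, b, c), (d, e, f) = transform
    return a * x + b * y + c, d * x + e * y + f


# Hypothetical control points: three image corners matched to made-up coordinates.
pixels = [(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0)]
coords = [(-74.02, 40.72), (-74.00, 40.72), (-74.02, 40.70)]
transform = fit_affine(pixels, coords)
lon, lat = pixel_to_lonlat(transform, 500.0, 500.0)
```

Once the transform is fitted, every pixel on the old map has a geographic position, which is what lets maps drawn at different scales and orientations be laid over one another.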
TK: The holy grail of open data releases in a lot of formats is machine readability, but I imagine this is even more difficult to achieve with maps. Can you talk a little bit about your work in this area with the NYC Historical GIS project? What’s the state of the art for turning paper maps into GIS data? Are there plans to make other map collections machine readable in the near future?
MK: As you suggest, there’s really no good means to reliably extract geographic features from scanned maps. This is in part because maps are really complex documents with overlapping visual and textual elements. Cartographers certainly maintain and follow standards, but geography itself is actually quite messy and not at all homogeneous. The decisions that flow from this lead mapmakers to stretch and kern a text label for a mountain range across an entire page, or compress thousands of names into very dense street networks. Within an atlas, cartographers may work at varying scales, or orient north not-at-top in order to better fit the map on the page, or generalize many geographic features out of their cartographic existence, if for example, the map area is simply too dense. This is all to say that the variability of cartographic elements is so great that building algorithmic means to extract geographic features through computer vision, i.e. reverse engineering maps, isn’t really possible; there is no map OCR. I’d call that holy grail “automated geographic feature extraction.”
We have only begun to address these issues. For starters, we just completed the three-year, National Endowment for the Humanities-funded NYC Historical GIS project, which enabled us to scan upwards of 12,000 historical maps of New York City and transform them (using tools at maps.nypl.org), first through georectification and next, through tracing data from maps using a drawing tool (also using maps.nypl.org). The project produced a valuable collection of high resolution, public domain images and some highly useful vector data documenting New York City’s historical built environment, which scholars are now beginning to use. The project also provided us with tremendous insight into the process; namely, it highlighted every granular step along a vectorization workflow and suggested to us that those steps could be reverse engineered to be done better, faster, and easier, and that the work could actually be fun.
My colleagues at NYPL Labs, our experimental design and technology team, have pushed the needle a little further to automated feature extraction in creating the Map Vectorizer and Building Inspector. With the former, a computer vision algorithm designed to see and trace outline shapes on maps, we can trace tens of thousands of geographic features from hundreds of maps in hours instead of months or years. The resulting data is pretty good, but definitely not perfect, so we created the mobile-friendly web app Building Inspector to enlist the help of the public to fix, augment, and validate the automatically generated work. The tasks are dead simple, fun and quite addictive, and the results are passed in front of many users to achieve a consensus. To date, we have had more than half a million tasks, or very small contributions, completed by readers in about a six-week period. The maps we’re working on are fire insurance maps, which document the built environment of urban spaces at a very large, nearly architectural scale. Users of Building Inspector help us check if the computer vision algorithm worked by telling us if a building looks good, needs work or needs to be pitched out, all by simply tapping a “yes,” “fix,” or “no” button. And beyond validating the shapes themselves, users can encode details from the maps like the color schema (pink=brick, yellow=wood) by clicking corresponding buttons, transcribe the building addresses, or correct the shapes for those buildings that other users suggested should go into the “fix” queue.
By atomizing the work tasks, cobbling them together into a larger workflow and distributing them to users in a user-friendly, game-like environment, we drastically lowered the barrier for participation and are getting a lot of work done collectively. This data is all available via API under a Creative Commons CC0 1.0 Universal Public Domain Dedication, the same dedication we apply to our scanned map images.
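The consensus step described above, where each building outline is shown to many users who tap “yes,” “fix,” or “no,” can be sketched in a few lines of Python. This is an illustration only, not Building Inspector's actual logic: the function name, the minimum of three votes, and the 75 percent agreement threshold are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Optional


def consensus(votes: List[str], min_votes: int = 3, min_share: float = 0.75) -> Optional[str]:
    """Return the agreed label for one building outline, or None if undecided.

    votes: labels tapped by different users, each "yes", "fix", or "no".
    min_votes and min_share are hypothetical thresholds for this sketch.
    """
    if len(votes) < min_votes:
        return None  # not enough people have looked at this shape yet
    label, count = Counter(votes).most_common(1)[0]
    # Accept the most common label only if enough voters agree on it.
    return label if count / len(votes) >= min_share else None


# Three of four users agree, so the shape is accepted as "yes".
accepted = consensus(["yes", "yes", "yes", "no"])
# Opinions are split, so the shape stays in the queue for more votes.
undecided = consensus(["yes", "no", "fix", "yes"])
```

Shapes that reach a “fix” consensus would then move on to the correction queue, while undecided shapes keep circulating until enough users agree.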
TK: Computer science and library science have had significant overlap for a long time, but do you find that the role of librarians is changing with the increased demand for machine readable access? Is curating data different from curating ordinary digital materials?
MK: I absolutely think the role of librarians is changing as users come to expect information to be actionable and immediately consumable via download or API. We require new types of technical proficiency that kind of straddle the world of analog collections and the world of digital information. There is a need for librarians to know, in the abstract at the very least, the way applications work with data and how that might inform the way we migrate our collections into data. It really helps also when librarians take on a transformative mindset, that is, instead of thinking about digitizing as a simple act of reformatting and multiplying the access to collections through the web, to see a variety of potential for that information to become something more than it was before. Take as an example a large corpus of books, which, once scanned and transformed into machine data, can be interrogated in ways not possible except in the aggregate. Users know of this potential, whether they’re scholars champing at the bit to have a new collection converted into data or casual users simply needing to ask questions of digitized collections. Understanding, then, what it takes to move from paper to data that can be queried, mixed and aggregated is really crucial for librarians as we try to anticipate our growing user demands.