Published on January 26th, 2014 | by Travis Korte
Liberating Government Data from PDFs
Since its release in 1993, the Portable Document Format (PDF) standard has become a popular way for organizations to store reports, forms, and other documents. The major benefit of PDF files is that documents look the same on screen and in print, regardless of the type of device used to access the document or the software used to create the file. PDFs are supported on a wide range of operating systems, take up relatively little space on a computer, and can be encrypted or digitally signed. PDFs are the de facto standard for many government agencies.
Unfortunately, data scientists sometimes have difficulty performing data analysis on documents stored in the PDF format because the text, tables, and images stored in the file are not always easily extracted in a structured format. While the format has evolved in recent years to allow for text and images to be marked up, many older PDFs are simply scanned images from which text must be extracted as a first step toward analysis. As a result, government agencies cannot easily track trends and gain insights from their data when it is stored exclusively as PDFs. Moreover, some government agencies, such as the U.S. Agency for International Development (USAID), house PDFs in systems that do not always give external users the option to download large quantities of PDF files in bulk, which is essential for large-scale data analysis.
To help develop open-source tools for working with PDFs and the databases that house them, the Sunlight Foundation and others hosted the PDF Liberation Hackathon at six locations around the country on January 18-19, 2014. The hackathon’s participants took a wide range of approaches to preparing PDFs for computer-aided analysis, including optical character recognition technologies used to extract text, specialized software for identifying and reproducing data tables in PDF documents, and scripts to automatically download large numbers of PDFs from government databases that were not designed to enable bulk downloads.
One dataset that hackathon participants worked with in Washington, D.C., is USAID’s Development Experience Clearinghouse, the largest online resource for documentation on USAID-funded international development efforts. The database contains around 170,000 documents, of which around 150,000 are available for download. As a USAID representative present at the event explained, access to the data contained within the PDFs would help development experts conduct deeper analysis on foreign aid and look granularly at which interventions are most effective. Unfortunately, the database only allows users to view a small subset of the documents at once, meaning that an analyst would potentially have to navigate through hundreds or thousands of pages of results by hand even to begin analyzing the data. Hackathon participants wrote a program to automate this process and demonstrated a simple application to visualize the number of documents contained in the database for each year.
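The general idea of paging through results automatically and tallying documents by year can be sketched in a few lines of Python. This is not the participants' program, and it does not use the Clearinghouse's real interface: the `fetch_page` callable, the record fields, and the sample records are all hypothetical stand-ins chosen for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, List


def iter_records(fetch_page: Callable[[int], List[dict]],
                 max_pages: int = 10_000) -> Iterator[dict]:
    """Request result pages one after another until an empty page comes back."""
    for page_number in range(1, max_pages + 1):
        records = fetch_page(page_number)
        if not records:
            break  # an empty page is assumed to mean "no more results"
        yield from records


def count_by_year(records: Iterable[dict]) -> Counter:
    """Tally records by their 'year' field, skipping records without one."""
    counts = Counter()
    for record in records:
        year = record.get("year")
        if year is not None:
            counts[year] += 1
    return counts


# Made-up records standing in for a real database query (assumption:
# each result page is a list of dicts with "title" and "year" keys).
_SAMPLE_PAGES = {
    1: [{"title": "Report A", "year": 1998}, {"title": "Report B", "year": 2004}],
    2: [{"title": "Report C", "year": 2004}, {"title": "Report D"}],
}


def fetch_sample_page(page_number: int) -> List[dict]:
    """Hypothetical page fetcher; a real one would make an HTTP request."""
    return _SAMPLE_PAGES.get(page_number, [])


if __name__ == "__main__":
    for year, total in sorted(count_by_year(iter_records(fetch_sample_page)).items()):
        print(year, total)
```

Passing the fetcher in as a function keeps the paging and counting logic separate from any particular database, so the same loop could in principle be pointed at a different source by swapping in another fetcher.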
The hackathon did not produce any analysis, since it was focused on tool development, but the future applications of PDF data liberation may be widespread. Local governments can benefit from using PDF-bound crime and unemployment data to improve policing and social services programs. Many city and county budget files are also kept in PDFs, so an easy means of data extraction would enable analysts to probe for fraud, waste and abuse.
But the impact of PDF liberation tools would extend far beyond local governments. Many non-profit organizations also publish documents using the PDF format. Amnesty International’s State of Human Rights reports, for example, are stored as PDFs; automated access to that data would allow international human rights organizations to better model and predict torture violations. While open data advocates would like to see future data published in machine-readable formats, there is a considerable amount of data still stored as PDFs across government agencies at home and abroad, so generic tools for working with the format promise to pay dividends well into the future.
Photo: The Sunlight Foundation