The first U.S. patent (above) was granted on July 31, 1790. It was issued to one Samuel Hopkins, for a process to make potash (a chemical used in fertilizer), and it was signed by George Washington himself. The original piece of paper still exists, and its information is logged in the databases of the U.S. Patent and Trademark Office (USPTO).
Since that day in 1790, the USPTO and its antecedents have been diligently collecting data on all of this country’s patent activity. It is a venerable information processing organization, and its objectives of making prior art accessible and encouraging innovation by simplifying the patent-granting process have not changed much over its history. The means it uses to achieve these objectives, on the other hand, have changed dramatically, and although it has made great strides in digitization and electronic filing, the USPTO and its international counterparts stand to benefit greatly from advanced data science initiatives.
The USPTO houses a wealth of valuable data in its patent library that is critical for businesses, researchers, and local inventors. This information used to be locked up in specially designated Patent and Trademark Resource Centers, local libraries that housed copies of U.S. patent and trademark materials. Consulting them was a tedious (and expensive) operation, so in recent years the USPTO has been active in releasing its data in machine-readable formats, with several dozen datasets available on data.gov. Many of the pre-digital filings have been digitized and transcribed by optical character recognition (OCR) algorithms, but some of the oldest documents are available only as scanned images; it comes as no surprise that 200-year-old handwriting can be difficult for a computer to parse.
Third parties have tried cleaning up the patent data and making it more usable, with some success. Google Patent search, which has applied its own OCR to much of the patent application data, remains a work in progress: despite its broad coverage, it displays some diagrams and data tables in non-machine-readable formats, and its coverage is weaker among earlier patents. Samuel Hopkins and his potash process, for example, are nowhere to be found.
The Lens, an interactive patent database and information portal developed by Australian nonprofit Cambia, includes much of what USPTO has released, along with similar data from other countries. The European Patent Office (EPO), from which The Lens also collects data, is itself undergoing a series of modernizing reforms, including a partnership with USPTO to standardize certain kinds of data classification. Although still in beta, The Lens adds a powerful UI and a modern, relevance-based search engine to this U.S. and international data, improving on the crude search available on the USPTO and EPO sites.
Another group that has made use of USPTO’s data is Cambridge, Mass.-based IPVision, which bills itself as a “patent analytics and IP strategy” company, and provides, among other things, a data-visualization platform that can be used to track patent grants over time in a given topic area. IPVision Executive Vice President Alex Butler says that the company’s technology compares patent portfolios “Across dozens and dozens of vectors…how do they compare and contrast? Are they higher in quality or lower in quality? Are they just incremental improvements? You can translate those into numerical perspectives.”
Besides attracting third-party professionals, USPTO has also been proactive about promoting its data among hobbyist developers. Last year, the office held a coding competition to develop computer vision software for digitizing patent documents. And earlier this month, the “USPTO Innovation Challenge” was held under the auspices of the National Day of Civic Hacking to develop an app that could aid in analysis of the office’s recently released trademark case file dataset (more on that below).
For those looking to understand the USPTO itself, rather than the actual patent and trademark data, the office provides an extensive set of dashboards. These provide a dynamic display of such metrics as total pendency, which indicates the average length of time between the filing of a patent application and the final grant-or-reject decision. The dashboards are a useful tool for businesses and inventors who need to plan around those timelines.
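To make the pendency metric concrete, here is a minimal sketch of how an average filing-to-decision time could be computed. The function names and the sample dates are invented for illustration; this is not the USPTO's own calculation, which may define the start and end events differently.

```python
from datetime import date
from statistics import mean

# Average number of days in a month, used only to express days as months.
DAYS_PER_MONTH = 365.25 / 12


def total_pendency_days(records):
    """Mean number of days between filing and final decision.

    `records` is a list of (filing_date, decision_date) pairs.
    """
    return mean((decided - filed).days for filed, decided in records)


def total_pendency_months(records):
    """The same average, expressed in months."""
    return total_pendency_days(records) / DAYS_PER_MONTH


# Hypothetical applications; the dates are made up for this example.
sample = [
    (date(2010, 3, 1), date(2012, 9, 15)),
    (date(2010, 6, 10), date(2013, 1, 20)),
    (date(2011, 1, 5), date(2013, 11, 30)),
]

print(f"Average pendency: {total_pendency_months(sample):.1f} months")
```

A real dashboard would also have to decide how to treat applications that are still pending, which this sketch ignores.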
Also in the interest of reducing application processing times and costs, USPTO provides an electronic filing system, which has helped electronic filing grow from less than 50% adoption in 2007 to over 97% adoption in 2012. Such a system lowers both administrative and data processing costs, since the machine-readable applications do not need to be processed with OCR and patent examiners can work more efficiently.
The aforementioned USPTO Trademark Case Files Dataset, released in January, presents a major opportunity for new analyses; it dates back to 1870, and contains nearly seven million trademark applications and registrations. Macroeconomic and social science research is only beginning to make use of the dataset, and the inaugural conference on quantitative studies of the trademark data will be held this September. (Proposals are still being accepted through June 24.) Stuart Graham, the USPTO’s Chief Economist, has noted that “trademark data is largely terra incognita,” and that “unlike the patent literature, [it] has not had decades of rigorous work.”
A dataset of its size and novelty is sure to draw scholarly attention, particularly after some of the intriguing (but still embryonic) work on intellectual property analytics in recent years. The NBER’s patent citation database, for example, was taken up in 2012 by an international team of scientists and IP scholars in a paper called “Prediction of emerging technologies based on analysis of the US patent citation network” (preprint available freely here). The authors describe the patent citation network as an “evolving graph, which provides a representation of the innovation process,” and they use cluster-detection algorithms adapted from network theory to model citations and anticipate new ones. A reliable mechanism for predicting emerging technologies could have broad implications in technology and trade policy, but the methods in this field are still quite young.
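As a rough illustration of what treating citations as a graph means, the sketch below groups patents using the crudest possible notion of a cluster: connected components, ignoring citation direction. This is a deliberately simple stand-in and not the cluster-detection method used in the paper; the patent identifiers are hypothetical.

```python
from collections import defaultdict


def citation_clusters(citations):
    """Group patents into connected components of the citation graph.

    `citations` is a list of (citing_patent, cited_patent) pairs.
    Direction is ignored: two patents land in the same cluster if any
    chain of citations links them.
    """
    neighbors = defaultdict(set)
    for citing, cited in citations:
        neighbors[citing].add(cited)
        neighbors[cited].add(citing)

    seen = set()
    clusters = []
    # Visit patents in sorted order so the output order is deterministic.
    for start in sorted(neighbors):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        clusters.append(component)
    return clusters


# Hypothetical citation pairs, invented for this example.
pairs = [("P3", "P1"), ("P3", "P2"), ("P5", "P4")]
for cluster in citation_clusters(pairs):
    print(sorted(cluster))
```

Real citation networks are usually one giant connected component, which is why research in this area relies on finer-grained community detection and on how clusters change over time.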
Other opportunities for innovation with USPTO data may be found in patent search, where machine learning and natural language processing algorithms could enable “content-aware” search and Netflix-style recommendation systems. This could dramatically decrease the time would-be inventors spend in the discovery phase of their applications, by surfacing results that are similar in substance to the original query even when they do not use the same language. This would both lower the costs of invention and shorten the time it takes to get products to market, a boon for consumers. (This approach was proposed in a GigaOm article over a year ago, but nothing of the sort has been implemented yet, at least publicly.)
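The idea of matching on meaning rather than exact wording can be sketched in a few lines. The example below maps words to shared concept labels through a small hand-written table and then ranks documents by cosine similarity. The `CONCEPTS` table, the function names, and the sample texts are all hypothetical; a real system would presumably learn such relationships from the patent corpus rather than hard-code them.

```python
import math
from collections import Counter

# Hypothetical concept table: different words that should count as a match.
CONCEPTS = {
    "potash": "potassium",
    "potassium": "potassium",
    "manure": "fertilizer",
    "fertilizer": "fertilizer",
}


def concept_vector(text):
    """Count the concepts in a text, falling back to the raw word."""
    return Counter(CONCEPTS.get(word, word) for word in text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return documents ordered from most to least similar to the query."""
    q = concept_vector(query)
    return sorted(documents, key=lambda d: cosine(q, concept_vector(d)), reverse=True)


# Invented document snippets for illustration.
docs = ["process for making potash", "steam engine valve"]
print(rank("potassium production", docs))
```

Here the query never mentions "potash", yet the potash document ranks first because both map to the same concept. That is the behavior a content-aware search would aim for at scale.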
In general, intellectual property data offers a great deal of low-hanging fruit for innovators; as long as the USPTO and related organizations continue to encourage its exploration, it stands to promote new insights and greater efficiency far beyond the patent office itself.