Crowdsourced question-and-answer website Quora has published a dataset of different questions users ask on its website that could be similar to each other to spur the development of algorithms that can help curate redundant questions. For example, the questions “Should I learn Python or Java first?” and “If I had to choose between learning Java and Python, what should I choose to learn first?” are semantically equivalent—though they use different words and structures, they are asking the same thing—but identifying this can be challenging to automate, causing users to frequently post the same questions on the website without realizing there is already an answer available. The dataset contains 400,000 lines of these potentially similar questions and indicates whether or not they are semantically similar, so developers could build natural language processing systems that could help Quora redirect users to posts where their questions have already been answered.
Learning to Understand What People Ask on the Internet
Joshua New was a senior policy analyst at the Center for Data Innovation. He has a background in government affairs, policy, and communication. Prior to joining the Center for Data Innovation, Joshua graduated from American University with degrees in C.L.E.G. (Communication, Legal Institutions, Economics, and Government) and Public Communication. His research focuses on methods of promoting innovative and emerging technologies as a means of improving the economy and quality of life.
View all posts by Joshua New