Help Needed: Where can I download a machine-readable database of publications in computer science or especially empirical software engineering? Or is this a job for Amazon Mechanical Turk?
At a small gathering of junior computer scientists today we had an involved discussion about the reputability of claims in scientific publishing.
A few key agreements from my perspective:
- Sometimes a lemma is more interesting than the main theorem in a publication. Future citations then cite the paper for the lemma rather than for its main result. More generally, sometimes we cite tangential results, not the main result of a paper.
- Sometimes a paper’s research does not truly support its claims. Future citations of the paper lose this information.
- Sometimes a paper’s research does not support its claims as strongly or as generally as citations of it seem to imply.
- Sometimes researchers will repeat a citation of a paper without reading the paper. Thus imperfect paraphrasings propagate unchecked.
- Sometimes good research is built on foundations of less rigorous research.
It would be nice to build a database of publications with metadata for the claims each one supports. Each entry should include the strength and method of support for the claim. Then we could compare citations of a paper against the claims the paper actually supports.
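To make the idea concrete, here is a minimal sketch of what one entry in such a database might look like, and how a citation could be checked against it. Everything here is an assumption for illustration: the `Support` scale, the field names, and the premise that a human annotator has already matched each citing sentence to a claim and judged how strongly the sentence implies the claim is supported.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Support(Enum):
    """Hypothetical ordinal scale for how strongly a claim is supported."""
    ANECDOTAL = 1
    CASE_STUDY = 2
    CONTROLLED_EXPERIMENT = 3
    FORMAL_PROOF = 4


@dataclass
class Claim:
    claim_id: str
    text: str
    support: Support
    method: str  # free-text description of how the claim was supported


@dataclass
class Paper:
    key: str
    title: str
    claims: Dict[str, Claim] = field(default_factory=dict)

    def add_claim(self, claim: Claim) -> None:
        self.claims[claim.claim_id] = claim


@dataclass
class Citation:
    citing_key: str
    cited_key: str
    sentence: str
    claim_id: Optional[str]   # claim the annotator matched, if any
    implied_support: Support  # strength the citing sentence implies


def check_citation(citation: Citation, papers: Dict[str, Paper]) -> str:
    """Compare one citation against what the cited paper actually supports."""
    paper = papers[citation.cited_key]
    claim = paper.claims.get(citation.claim_id) if citation.claim_id else None
    if claim is None:
        return "unsupported"  # the paper makes no such claim
    if citation.implied_support.value > claim.support.value:
        return "overstated"   # cited more strongly than the evidence allows
    return "ok"
```

The hard part is of course the annotation, not the lookup. The sketch only shows that once the claims are recorded, the comparison itself is mechanical.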
This method would apply even in mathematical fields where the purity of knowledge is higher. For example, we could imagine a proof P that relies on theorem T. A paper presenting P would cite a paper proving T. If P fails when T is false, then it is interesting to know how thoroughly the proof of T has been checked. Has it been glanced over casually by two overworked reviewers and accepted on trust and reputation? Has it been checked formally by computer? Our database would include this kind of information in its entry for the paper proving T.
We should also consider the importance of a given citation to the citing paper.
Some citations are critical. In this situation a paper would lose its force if the cited paper were incorrect.
Some citations are informational. In this situation a paper is cited because it is considered interesting and relevant. But a later debunking would not invalidate the citing paper.
Some citations are methodological. In this situation a paper is cited because its method has been copied or tweaked. This leads us to another important consideration I will not discuss today: The (lack of) standardization of methodology in empirical software engineering.
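As a rough sketch of why the kind of citation matters, suppose each citation edge is labeled with one of the three roles above. Then we can ask which papers would be at risk if a given paper were debunked, following only the edges that carry risk. The `Role` enum, the edge format, and the choice to propagate through critical citations only are my assumptions here; whether methodological citations should propagate is an open question, so it is left as a parameter.

```python
from collections import defaultdict, deque
from enum import Enum


class Role(Enum):
    CRITICAL = "critical"
    INFORMATIONAL = "informational"
    METHODOLOGICAL = "methodological"


def papers_at_risk(edges, debunked, propagating=frozenset({Role.CRITICAL})):
    """Return the papers that transitively depend on `debunked`.

    `edges` is an iterable of (citing, cited, role) tuples. Only edges
    whose role is in `propagating` are followed.
    """
    # Index the graph in the direction risk travels: cited -> citing.
    cited_by = defaultdict(list)
    for citing, cited, role in edges:
        if role in propagating:
            cited_by[cited].append(citing)

    at_risk = set()
    queue = deque([debunked])
    while queue:  # breadth-first walk over dependent papers
        current = queue.popleft()
        for citing in cited_by.get(current, ()):
            if citing != debunked and citing not in at_risk:
                at_risk.add(citing)
                queue.append(citing)
    return at_risk


edges = [
    ("B", "A", Role.CRITICAL),
    ("C", "B", Role.CRITICAL),
    ("D", "A", Role.INFORMATIONAL),
    ("E", "D", Role.CRITICAL),
]
at_risk = papers_at_risk(edges, "A")  # B and C, but not D or E
```

Paper D cites A only for interest, so neither D nor E, which depends on D, is affected.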
An achievable near-term goal would be to gather a large number of machine-readable papers, build a dependency graph of citations, and compare the citing sentences. It is common practice to tag a citing sentence with a reference. We could extract these sentences programmatically. Thus we could gather a large number of citing sentences and compare them to each other and to the actual claims in the paper. I suspect we would not like the smell of what we found.
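The extraction step could start out as something like the sketch below. It assumes the paper is already available as plain text and uses numeric bracket references such as [1] or [2, 3] placed before the sentence's closing punctuation. Author-year styles, ranges like [4-6], and abbreviations such as "et al." would need more care than this naive sentence splitter provides.

```python
import re
from collections import defaultdict

# Matches [12] or [3, 7]; other citation styles are not handled.
CITATION = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")
# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def citing_sentences(text):
    """Map each reference number to the sentences that cite it."""
    by_ref = defaultdict(list)
    for sentence in SENTENCE_END.split(text):
        for match in CITATION.finditer(sentence):
            for number in match.group(1).split(","):
                by_ref[int(number)].append(sentence.strip())
    return dict(by_ref)


sample = (
    "Pair programming reduces defects [1]. "
    "Others report mixed results [2, 3]. "
    "This sentence cites nothing. "
    "Defects fall by half [1]."
)
found = citing_sentences(sample)
# found[1] holds both sentences citing reference 1,
# which can then be compared side by side.
```

Running this over many papers that cite the same work would give the pile of citing sentences to compare against each other and against the cited paper's actual claims.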
Help Needed (repeat): Where can I download a machine-readable database of publications in computer science or especially empirical software engineering? Or is this a job for Amazon Mechanical Turk?
 Not a real citation.