20

Is there any software (or pseudo-code) which can automatically scan a piece of text (either pasted into the tool, or read from a .doc/.pdf) and identify citation data using standard formats? The data would then be split up into its constituent fields and exported in XML, CSV, or some other structured data format. I have looked at cb2Bib but it was only able to extract the year from Harvard-style references, which is insufficient.

8 Answers

4

At the moment (2017), the most active open-source project implementing this seems to be AnyStyle Parser (last version 07-2016). It can be used through a web interface or an API, or downloaded as a RubyGem.

They explicitly mention on their website that the implementation is inspired by ParsCit (last version 2013?) and FreeCite (last commit 2009).

Also from their website:

AnyStyle Parser uses powerful machine learning heuristics based on Conditional Random Fields that can be trained by everyone using our built-in editor.

That is a really cool feature, and it makes this the most interesting implementation (imho). Training seems to be pretty straightforward, as explained in the API documentation: you provide some manually corrected results and run the Anystyle.parser.train command. I am not sure whether ParsCit and FreeCite also support this, but if they don't, this seems like a huge feature difference to me.

Wouter
  • 1,701
4

Take a look at this list of Citation Parsers that can generate XML from input text:

http://paracite.eprints.org
http://aye.comp.nus.edu.sg/parsCit (in maintenance mode as of Aug 1, 2012)
http://opcit.eprints.org
http://search.cpan.org/~mjewell/Biblio-Citation-Parser-1.10

dourouc05
  • 103
KEG
  • 41
2

Try a tool such as RegexBuddy or Expresso.

If you're not a programmer, regular expressions may be a bit intimidating, but they're really not that hard, especially with a decent tool like one of the above.

Here's an example of someone using Regular Expressions for extracting citations:

Citation parsing regular expression
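To give a feel for the regular-expression approach, here is a minimal sketch in Python. It assumes one common Harvard-style layout ("Authors (Year) Title. Rest"); the pattern, the field names, and the helpers `parse_reference` and `to_csv` are illustrative assumptions, not taken from the linked example, and real reference lists will need more patterns than this.

```python
import csv
import io
import re

# Assumed layout: "Smith, J. and Jones, A. (2005) Title of the work. Rest of entry."
HARVARD = re.compile(
    r"""^(?P<authors>.+?)\s*        # author list, lazily up to the year
        \((?P<year>\d{4})[a-z]?\)   # year in parentheses, e.g. (2005) or (2005a)
        [.,]?\s*
        (?P<title>[^.]+)\.          # title runs to the first full stop
        \s*(?P<rest>.*)$            # journal/publisher details, left unparsed
    """,
    re.VERBOSE,
)

FIELDS = ["authors", "year", "title", "rest"]


def parse_reference(text):
    """Return a dict of fields, or None if the line does not fit the pattern."""
    match = HARVARD.match(text.strip())
    return match.groupdict() if match else None


def to_csv(records):
    """Serialise parsed references to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = ("Smith, J. and Jones, A. (2005) Citation parsing in practice. "
              "Journal of Examples, 12(3), pp. 45-67.")
    record = parse_reference(sample)
    print(to_csv([record] if record else []))
```

A title that itself contains a full stop, or a reference with no parenthesised year, will not match; that fragility is the main reason the trained parsers mentioned in other answers tend to do better on messy input.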

Ash
  • 2,851
1

Mendeley should be able to do this. It can import PDFs and then export the metadata to BibTeX, RIS and EndNote XML. It is free to download and is cross-platform.

Edit: I tested this on a few documents. The PDF import does seem to work well for references that are formatted correctly. For a document I created using LaTeX, all of the references with the author in the form "Smith, J." or "J. Smith", etc., were imported fine. If the author is a company (a single word), or the reference is incomplete, it does not work as well. The extracted references can easily be edited and exported to BibTeX, etc.

sblair
  • 12,757
1

Try http://www.crossref.org/guestquery/#stqsearch

This one can automatically parse your reference text and offers a link to the online article.

anton
  • 11
1

I've seen a Westlaw program do that for legal citations, but that's probably not what you're looking for. Reference Manager might do something like that for academic formats, but I've never used it.

Kaypro II
  • 1,482
0

This probably belongs more as a comment to @Abhinav, but Zotero definitely only handles structured data, as described here:

http://www.zotero.org/support/getting_stuff_into_your_library#importing_records_from_other_reference_tools

An interesting hack might be to write a program that uses each citation as a search query in your favorite database, then uses something like Zotero to generate the reference information. You could also download structured information from services like CiteULike. Let me know if you end up doing something like that! (Put it up on GitHub if you do ;).
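A rough sketch of the search-query idea, assuming Crossref's public REST API stands in for "your favorite database". The `query.bibliographic` parameter and the `message.items` response layout are my understanding of that API and should be checked against its documentation; the helper names `build_query_url`, `best_match`, and `lookup` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: Crossref's public works search.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(citation, rows=1):
    """Turn a free-text citation into a search URL."""
    params = urllib.parse.urlencode(
        {"query.bibliographic": citation, "rows": rows}
    )
    return f"{CROSSREF_WORKS}?{params}"


def best_match(response):
    """Pick the DOI and title of the top hit from a decoded JSON response."""
    items = response.get("message", {}).get("items", [])
    if not items:
        return None
    top = items[0]
    titles = top.get("title") or [None]  # assumed: title is a list of strings
    return {"doi": top.get("DOI"), "title": titles[0]}


def lookup(citation):
    """Query the service over the network and return the top hit, if any."""
    with urllib.request.urlopen(build_query_url(citation), timeout=10) as resp:
        return best_match(json.load(resp))
```

Calling `lookup("Smith, J. (2005) Citation parsing in practice")` would return the top-ranked record, which is not guaranteed to be the right one, so the result still needs a sanity check against the original text. A DOI found this way can then be handed to a reference manager to fetch the full structured record.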

Dav Clark
  • 173
0

Zotero is a plugin for Firefox which does this for web content. Not sure if there is a similar tool for documents/PDFs.

Abhinav
  • 2,040