How to collect bibliographic references using cb2bib

While browsing the Web you will often come across a useful bibliographic reference: in a newsgroup, in an on-line article, etc. If you want to add it to your collection of references, you can do so in a (semi-)automated way with a small but very handy utility, cb2bib. As the program’s home page puts it, cb2bib “is a tool for rapidly extracting unformatted, or unstandardized bibliographic references from email alerts, journal Web pages, and PDF files.” The name stands for “clipboard to bibliography (entry)” and reflects the program’s modus operandi: cb2bib reads the text that the user has copied to the clipboard and compares it with a set of predefined patterns; if one of them matches, the clipboard text is converted directly into a bibtex entry according to that pattern. Let’s see an example.

Note: cb2bib is available for both Windows and Linux operating systems (you can download it here), but the following screenshots refer to the OS I normally use, i.e. Linux.


Installing cb2bib is very easy under Windows, and also under Linux if you are using an RPM-based distribution. If you use a Debian or Debian-derived Linux distro, or MacOS X, you might have to compile and install the software on your own. Comprehensive instructions are available at this URL.

A Simple Example

Open the Examples page on cb2bib’s website, select and then copy to the clipboard the second example, the one labelled as “PNAS Table of Contents Alert”.

Note: to copy text to the clipboard under Linux you can simply select it, or use the CTRL+C key combination; under Windows, press CTRL+C.

As you can see in the following screenshot (Fig. 1), the selected text has been automatically converted to a structured bibliographic entry, which you can now save as a bibtex entry: just click on the next-to-last icon on the right (the one showing a floppy disk with a pencil over it), or press CTRL+S, and the entry will be added to the file shown in the text field immediately above it.

Fig. 1 – A sample entry

By default this file is called references.bib and lies in the cb2bib folder, but you can change both the path and the file name; for instance, you might choose C:\Documents\collected-refs.bib.

cb2bib will also retrieve the abstract, add the relevant keywords to the entry, and even download the PDF version of the article if there is a URL pointing to it and access is free! All of this happens automatically.

Once you have collected and, if necessary, edited your reference, click on the Save button (the second from the right), or press CTRL+S, to save it in the references file. Delete the cb2bib_query_tmp_pdf file if present; otherwise you won’t be able to download the PDF of the next reference you process, even if it includes a link to one.
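
If you would rather script this clean-up step, a minimal Python sketch could look like the one below. It is only an illustration: the text does not say in which folder the temporary file is created, so the folder is left as a parameter, and the function name remove_stale_query_file is made up for this example.

```python
from pathlib import Path

# File name as given in the text above.
TMP_NAME = "cb2bib_query_tmp_pdf"


def remove_stale_query_file(folder):
    """Delete the leftover temporary file in `folder`, if there is one.

    `folder` is an assumption: pass whichever directory the file
    actually appears in on your system.
    Returns True if a file was removed, False otherwise.
    """
    target = Path(folder) / TMP_NAME
    if target.is_file():
        target.unlink()
        return True
    return False
```

You could call remove_stale_query_file(".") from the relevant folder before processing the next reference.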

To learn more about cb2bib’s features, read the very nice overview page here.

Configuring cb2bib

Before exploring cb2bib’s capabilities further, it is a good idea to check the program settings: click on the second icon from the left, the one showing a wrench, and the configuration window will open (Fig. 2).

Fig. 2 – Configuring cb2bib

It is split into about half a dozen tabs. It is important that you check the paths in the first one and enable network queries in the third one (Fig. 3).

Fig. 3 – Network queries

Since the program’s author is especially interested in scientific publications, you will probably have to modify the regexps.txt file to obtain automatic format detection and field formatting for other kinds of references. To do this you need to understand the file structure, which isn’t too difficult, and add patterns for specific bibliographic styles. Here is an example of a pattern that identifies and automatically formats MLA-style entries:


cb2Bib 0.3.6  Pattern:
MLA-style article 1
author title journal volume number year pages
^(.+), "(.+)," (.+) (\d+):(\d+) \((\d\d\d\d)\), ([\d|\-|\s]+)\.$
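
How such a pattern does its job can be sketched in a few lines of Python. This is only an illustration, not cb2bib’s own code. It assumes that the quantifiers in the regular expression read `+`, and that the line of field names lists the capture groups in order. The sample reference is invented, and the helpers parse_reference and to_bibtex are hypothetical names.

```python
import re

# Field names and regular expression taken from the pattern above
# (quantifiers written as "+", which is an assumption).
FIELDS = "author title journal volume number year pages".split()
PATTERN = re.compile(
    r'^(.+), "(.+)," (.+) (\d+):(\d+) \((\d\d\d\d)\), ([\d|\-|\s]+)\.$'
)

# An invented MLA-style reference, for illustration only.
SAMPLE = 'Doe, Jane, "A Made-Up Title," Journal of Examples 12:3 (2005), 45-67.'


def parse_reference(text):
    """Map each capture group to its field name; None if nothing matches."""
    match = PATTERN.match(text.strip())
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


def to_bibtex(fields, key):
    """Format the captured fields as a bibtex @article entry."""
    lines = ["@article{%s," % key]
    lines += ["  %s = {%s}," % (name, fields[name]) for name in FIELDS]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    parsed = parse_reference(SAMPLE)
    if parsed is not None:
        print(to_bibtex(parsed, "hypothetical2005"))
```

The first group captures everything up to the opening quote as the author, the second the quoted title, and so on; text that does not follow this layout simply fails to match, so another pattern can be tried.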

Since this can be a time-consuming task, please share your regexps.txt files, so that everybody can benefit from your work and add or mix your patterns into their own configuration.

Import references from a PDF file

If you have one or more PDF documents containing a good number of bibliographic references, you can import them using cb2bib: click on the third icon from the left, and then on “Select files” to choose the PDF file(s) (Fig. 4).

Fig. 4 – Extracting references from a PDF document

Once your list of files is ready, click on “Process” to have the program read them and extract the references. If nothing happens, it’s because you haven’t specified a PDF importer in the last tab of the configuration window. Unfortunately this feature still leaves room for improvement: results can range from a useful list of references to useless, mangled text.

Export bibtex references to HTML

There are many programs that let you export your collection of bibtex references to HTML: many of them are simple command-line tools, like bibtex2html or bib2html; if you want, you can also export them to XML using bibtexml. But you could also take advantage of more sophisticated bibliographic software, like Tellico (Linux) or Pybliographer (Linux): they offer reference management, network queries, export to several different formats, and much more.
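
To give an idea of what such an export involves, here is a minimal Python sketch that turns very simple bibtex entries into an HTML list. It does not reproduce how bibtex2html or any of the tools above work. It assumes flat entries with one brace-delimited field per line and the closing brace on a line of its own; the sample entry and all function names are invented for this example.

```python
import html
import re

# Matches "@type{key, ...fields... }" with the closing brace on its own line.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,(.*?)\n\}", re.DOTALL)
# Matches one "name = {value}," field per line (no nested braces).
FIELD_RE = re.compile(r"(\w+)\s*=\s*\{(.*?)\}\s*,?\s*$", re.MULTILINE)

# An invented entry, for illustration only.
SAMPLE = """@article{hypothetical2005,
  author = {Doe, Jane},
  title = {A Made-Up Title},
  journal = {Journal of Examples},
  year = {2005},
}
"""


def entry_to_html(fields):
    """Render one entry as an HTML list item, skipping missing fields."""
    parts = [fields.get(name, "") for name in ("author", "title", "journal", "year")]
    text = ". ".join(html.escape(part) for part in parts if part)
    return "<li>%s.</li>" % text


def bibtex_to_html(bibtex_text):
    """Convert every entry found in `bibtex_text` to an HTML unordered list."""
    items = []
    for _kind, _key, body in ENTRY_RE.findall(bibtex_text):
        fields = {name.lower(): value for name, value in FIELD_RE.findall(body)}
        items.append(entry_to_html(fields))
    return "<ul>\n%s\n</ul>" % "\n".join(items)


if __name__ == "__main__":
    print(bibtex_to_html(SAMPLE))
```

Real bibtex files use nested braces, quoted values, string macros and multi-line fields, which is why a dedicated tool remains the better choice for anything beyond a quick preview.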