How to collect bibliographic references using cb2bib

While browsing the Web you will often come across a useful bibliographic reference: in a newsgroup, while reading an online article, etc. If you want to add it to your collection of references, you can do so in a (semi-)automated way using a small but very handy utility, cb2bib. As the program’s home page puts it, cb2bib “is a tool for rapidly extracting unformatted, or unstandardized bibliographic references from email alerts, journal Web pages, and PDF files.” The name stands for “clipboard to bibliography (entry)” and reflects the program’s modus operandi: cb2bib reads the text that the user has copied to the clipboard and compares it with a set of predefined patterns; if a match is found, the clipboard text is converted directly into a BibTeX entry on the basis of the matching pattern. Let’s see an example.

Note: cb2bib is available for both Windows and Linux operating systems (you can download it here), but the following screenshots refer to the OS I normally use, i.e. Linux.


Installing cb2bib is very easy under Windows, and also under Linux if you are using an RPM-based distribution. If you use a Debian or Debian-derived Linux distro, or Mac OS X, you might have to compile and install the software on your own. Comprehensive instructions are available at this URL.

A Simple Example

Open the Examples page on cb2bib’s website, select and then copy to the clipboard the second example, the one labelled as “PNAS Table of Contents Alert”.

Note: to copy text to the clipboard under Linux you can simply select it, or you can use the CTRL C key combination; under Windows press CTRL C.

As you can see from the following screenshot (Fig. 1), the selected text has been automatically converted to a structured bibliographic entry, which you can save now as a bibtex entry: just click on the icon next to last on the right (the one showing a floppy disk with a pencil over it), or press CTRL S, and the entry will be added to the file shown in the text field immediately above it.

Fig. 1 – A sample entry

By default the file that receives the entries is called references.bib and lies in the cb2bib folder, but you can change both the path and the file name; for instance, you might choose C:\Documents\collected-refs.bib.

cb2bib will also retrieve the abstract, add the relevant keywords to the entry, and even download the PDF version of the article if there is a URL pointing to it and access is free! All of this happens automatically.

Once you have collected and/or modified your reference, click on the Save button (the second from right), or press CTRL S, to save it in the references file. Delete cb2bib_query_tmp_pdf if present; otherwise you won’t be able to download the PDF of the next reference you process, should it contain a link to one.

To learn more about cb2bib’s features, read the very nice overview page here.

Configuring cb2bib

Before exploring cb2bib’s capabilities further, it is a good idea to check the program settings: click on the second icon from the left, the one showing a wrench, and you will see the configuration window (Fig. 2).

Fig. 2 – Configuring cb2bib

The window is split into about half a dozen tabs. It is important that you check the paths in the first tab and enable network queries in the third one (Fig. 3).

Fig. 3 – Network queries

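Since the program’s author is especially interested in scientific publications, the patterns shipped with cb2bib are geared towards them; for other kinds of references you will probably have to modify the regexps.txt file to obtain automatic format detection and field formatting. To do this you will have to understand the file structure, which isn’t too difficult, and add patterns for specific bibliographic styles. Each pattern consists of a header line, a description, the list of fields to fill, and a regular expression whose capture groups supply those fields in the same order. This is an example of a pattern which can identify and automatically format MLA-style entries: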


cb2Bib 0.3.6  Pattern:
MLA-style article 1
author title journal volume number year pages
^(.+), "(.+)," (.+) (\d+):(\d+) \((\d\d\d\d)\), ([\d|\-|\s]+)\.$
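
A pattern like this one can be tried out before it goes into regexps.txt. The following is only a minimal Python sketch, not cb2bib's own code: it assumes that every group in the regular expression carries a + quantifier and that the expression behaves the same way under Python's re module as it does inside cb2bib. The sample reference, the citation key and the helper names match_reference and to_bibtex are invented for illustration.

```python
import re

# Field names and regular expression as in the MLA-style pattern
# (assumption: each group is followed by a "+" quantifier).
FIELDS = "author title journal volume number year pages".split()
PATTERN = re.compile(
    r'^(.+), "(.+)," (.+) (\d+):(\d+) \((\d\d\d\d)\), ([\d|\-|\s]+)\.$'
)

# Hypothetical reference, invented for this example.
SAMPLE = (
    'Doe, Jane, "On Collating Witnesses," '
    "Journal of Examples 12:3 (2004), 101-120."
)


def match_reference(text):
    """Return a dict mapping field names to captured text, or None."""
    match = PATTERN.match(text.strip())
    if match is None:
        return None
    # Capture groups are paired with field names by position.
    return dict(zip(FIELDS, match.groups()))


def to_bibtex(fields, key="doe2004"):
    """Format the captured fields as a BibTeX @article entry."""
    lines = [f"  {name} = {{{value}}}" for name, value in fields.items()]
    return "@article{" + key + ",\n" + ",\n".join(lines) + "\n}"


if __name__ == "__main__":
    fields = match_reference(SAMPLE)
    print(to_bibtex(fields) if fields else "no pattern matched")
```

The point of the sketch is the positional link between the field list and the capture groups: the first group fills author, the second title, and so on, so the two must always be kept in step when you write a new pattern.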

Since this can be a time-consuming task, please share your regexps.txt files, so that everybody can benefit from your work and add or mix patterns into their own configuration.

Import references from a PDF file

If you have one or more PDF documents holding a good number of bibliographic references, you can import them using cb2bib: click on the third icon from the left, and then on “Select files” to choose the PDF file(s) (Fig. 4).

Fig. 4 – Extracting references from a PDF document

Once you have a list of files ready, click on “Process” to have the program read them and extract the references. If nothing happens, it’s because you haven’t specified a PDF importer in the last tab of the configuration window. Unfortunately this feature still leaves room for improvement: results can vary from a useful list of references to useless, mangled text.

Export bibtex references to HTML

There are many programs that allow you to export your collection of BibTeX references to HTML: many of these are simple command-line tools, like bibtex2html or bib2html; if you want, you can also export them to XML using bibtexml. But you could also take advantage of more sophisticated bibliographic software, like Tellico (Linux) or Pybliographer (Linux): they offer reference management, network queries, export to several different formats, and much more.
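
To give an idea of what such an export involves, here is a minimal Python sketch that turns very simple BibTeX entries into an HTML list. It is not how bibtex2html or any of the tools above work, and it is no substitute for them: it assumes one field per line, values wrapped in a single pair of braces, and a closing brace on its own line. The sample entry and the function names are invented for illustration.

```python
import html
import re

# One simple entry: @type{key, ...fields... } with "}" on its own line.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,(.*?)\n\}", re.DOTALL)
# One field whose value sits in a single, non-nested pair of braces.
FIELD_RE = re.compile(r"(\w+)\s*=\s*\{([^{}]*)\}")

# Hypothetical entry, invented for this example.
SAMPLE_BIB = """@article{doe2004,
  author = {Doe, Jane},
  title = {On Collating Witnesses},
  journal = {Journal of Examples},
  year = {2004}
}
"""


def parse_bibtex(text):
    """Parse simple BibTeX text into a list of field dictionaries."""
    entries = []
    for entry_type, key, body in ENTRY_RE.findall(text):
        fields = {
            name.lower(): value.strip()
            for name, value in FIELD_RE.findall(body)
        }
        fields["type"] = entry_type.lower()
        fields["key"] = key
        entries.append(fields)
    return entries


def entries_to_html(entries):
    """Render the parsed entries as an HTML unordered list."""
    items = []
    for entry in entries:
        # Escape every value so that &, < and > cannot break the markup.
        key = html.escape(entry["key"])
        author = html.escape(entry.get("author", ""))
        title = html.escape(entry.get("title", ""))
        journal = html.escape(entry.get("journal", ""))
        year = html.escape(entry.get("year", ""))
        items.append(
            f'  <li id="{key}">{author}. <em>{title}</em>. '
            f"{journal} ({year}).</li>"
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    print(entries_to_html(parse_bibtex(SAMPLE_BIB)))
```

Real BibTeX files contain nested braces, quoted values, string macros and cross-references, which is exactly why a dedicated tool is the better choice for anything beyond a quick test.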


PhiloLogic is an open source full-text search, retrieval and analysis tool developed by the ARTFL Project and the Digital Library Development Center (DLDC) at the University of Chicago. It is designed to work with a variety of data encoding specifications, most importantly TEI-Lite (XML/SGML) and other TEI variants (such as MEP and CES), as well as some support for plaintext, docbook, and ATE (Dublin Core, HTML and some extensions).

Further information, sample databases, downloads, documentation and our wiki are available at:


We currently have French and English messages. If you are using PhiloLogic and want to help by translating the interface into other languages, please let us know and we will be happy to assist you in any way that we can. We are particularly interested in Spanish, Italian, Portuguese, and German. Latin would be fun.  :-)

A couple of caveats: Please note that we have NOT translated system generated search forms. We have found that search forms and headers are frequently heavily modified by users and administrators. We have also opted not to support dynamic selection in the distribution, but this would be a trivial function. If we find we need to do it, we will add the patch to the PhiloLogic wiki.



COLLATE Text editing software

Collate was developed by Peter Robinson for the collation, analysis and publication of texts preserved in multiple witnesses. The current version of the software can handle up to 2000 different versions of a text. Collate has a regularization tool which can be used to produce a file containing word equivalences without altering the original transcription files. The software uses a light tagging system which can, at a later stage, be converted to XML. Collate can produce output files for paper-based editions or electronic publications. The following are examples of projects which are currently using Collate:

  • Canterbury Tales Project, directed by Peter Robinson.
  • Monarchia Project, directed by Prue Shaw.
  • Commedia Project, directed by Prue Shaw.
  • Cancioneros Project, directed by Dorothy Severin.
  • Nestle-Aland 28 (the electronic version of the Nestle-Aland Greek New Testament), based at the Institut für neutestamentliche Textforschung INTF.
  • Parzival Project, directed by Michael Stolz.

Source(s): http://arts-itsee.bham.ac.uk/itseeweb/software/collate/


eLaborate is a content management system (CMS) for collaborative work on digital editions of texts. In addition to the web editing functionality commonly found in content management systems, eLaborate offers specialized content objects for creating transcriptions of uploaded facsimiles and annotations on the transcribed text.

eLaborate lets individual users or user groups create collaborative projects around digitized texts. To apply for a new project space please write to joris_van_zundert@huygensinstituut_knaw_nl. (Note: for anti-spam reasons you will need to replace all underscores (‘_’) in the email address with ‘.’).

eLaborate is an initiative of the Huygens Institute for literary and intellectual history in the Low Countries and is funded by the Dutch Royal Academy for Arts and Sciences (KNAW).

eLaborate was created as a 100% online application. This means that there’s no need for additional installs or offline components besides a common web browser (preferably Firefox 1.x, but Internet Explorer 5.0+ is also fully supported).

eLaborate may be used from its originating server at http://www.e-laborate.nl (mind the dash in the web address). Alternatively, the eLaborate software and components may be installed on any server to function as a separately administered instance of the collaboratory.

eLaborate has been created and will be further developed using only open source components. Currently eLaborate is a Unix/Java/MySQL/AJAX (Web 2.0) solution. Whenever possible and applicable we adhere to common open standards (XML, OAI etc.). eLaborate is built using an agile development process closely oriented towards eXtreme Programming.




Multidoc SGML Browser

The Multidoc SGML browser was a commercial browser by Citec that was used to display and style SGML documents on the fly. The browser was discontinued in 2000, when the licence for the Synex Viewport SGML/HyTime browser engine upon which it was based expired. Before it was discontinued, the browser was used by several humanities computing projects, including the first two CD-ROMs in SEENET’s Piers Plowman Electronic Archive.

The browser was quite advanced for its time. It had sophisticated searching and styling capabilities: searching could be done by text or SGML context; SGML documents were styled using external stylesheets that were themselves SGML documents (thus anticipating subsequent developments in XSL). Moreover, elements in the SGML document were styled using a template model and could be controlled using variable and contextual expressions, including a primitive precursor to XPath. Although it was not originally designed to do so, the stylesheet language and engine proved adept at effecting SGML to HTML translations (see http://people.uleth.ca/~daniel.odonnell/offPrints/multidoc/multidoc.htm for a description of this method).

Although it can be considered in some sense an SGML precursor of XSL, the Multidoc Stylesheet model differed from the later language in several important respects. Like CSS, it was conceived of primarily as a means of associating style with specific elements. It did not construct a model of the input or export documents and, as a result, could not be used to effect true ‘transformations’: with a few exceptions (mainly for note-type elements), elements could not be moved, copied, or otherwise reordered from their position in the input document. There was also no requirement that the output text be valid SGML, XML, or HTML, nor indeed any direct method of exporting output in any of these formats (in actual practice, SGML, XML, or HTML could be exported by printing to file from a generic/plain text print driver and saving the result with the correct extension).