COLLATE Collaboratory for annotation, indexing, and retrieval

From the project home page

[COLLATE is] a Web-based collaboratory for archives, researchers and end-users working with digitized historic material. It… offers new ways of document-centered knowledge work to distributed user groups. European film heritage and censorship processes in the 1920s and 1930s were chosen as an example domain for the project. The developed COLLATE technologies, however, can easily be adapted to other application domains and usage contexts which are similarly information-intensive.

The current COLLATE collection of rare historic documents was provided by three major film and national archives from Germany, Austria and the Czech Republic. It consists of about 20000 digitized document pages describing film censorship procedures related to historic films and enriched context documentation including press material and digitized photos and film fragments. Members of these institutions – film historians and archivists – worked as pilot users, employing the COLLATE system for detailed cataloguing of the document collection and for in-depth content indexing and annotation of relevant sub-collections.

At the end of the project we had established both an innovative Web-based collaboratory, with a comfortable work environment for in-depth knowledge work with the material, and a comprehensive, selected digitized collection of rare historic documents on European historic film that was interpreted and annotated by a multinational team of film experts.

Since the end of the project the achieved results have been maintained and made available to the public. The project partners plan to further promote the system, i.e. both the technologies and contents (first of all Fraunhofer IPSI as the coordinator and major technology developer and the Deutsches Filminstitut – DIF as coordinator of the content providers).

Source(s): Web-based solutions

DigiPal Launch Party

Date: Tuesday 7th October 2014
Time: 5.45pm until the wine runs out
Venue: Council Room, King’s College London, Strand WC2R 2LS
Co-sponsor: Centre for Late Antique & Medieval studies, KCL
Register your place at 

After four years, the DigiPal project is finally coming to an end. To celebrate this, we are having a launch party at King’s College London on Tuesday, 7 October. The programme is as follows:

  • Welcome: Stewart Brookes and Peter Stokes
  • Giancarlo Buomprisco: “Shedding Some Light(box) on Medieval Manuscripts”
  • Elaine Treharne (via Skype)
  • Donald Scragg: “Beyond DigiPal”
  • Q & A with the DigiPal team

If you’re in the area then do register and come along for the talks and a free drink (or two) in celebration. Registration is free but is required to manage numbers and ensure that we have enough drink and nibbles to go around.

If you’re not familiar with DigiPal already, we have been developing new methods for the analysis of medieval handwriting. There’s much more detail about the project on our website, including a post on the DigiPal project blog which summarises the website and its functionality. Quoting from that, you can:

 Do have a look at the site and let us know what you think. And – just as importantly – do come and have a drink on us if you are in London on Tuesday!

The DigiPal Team


Source: From a description published in The XML Journal (

Anastasia is designed for handling large and highly-complex XML documents, where extremely precise control is required over the presentation of these. It can create output in any format, and it is optimized for HTML output direct to web browsers. Anastasia permits you to publish documents in identical form both on CD-ROM (Macintosh and Windows) and on the internet, from identical scripts. It includes full support for all valid XML and SGML documents, and a fully XML-aware search engine. Anastasia’s ease of use makes it suitable for small publishers with comparatively fewer computer resources, while its power fits it for large publishing enterprises. Anastasia is open source.

Home page:



How to collect bibliographic references using cb2bib

While browsing the Web you often come across a useful bibliographic reference: in a newsgroup, in an on-line article, etc. If you want to add it to your collection of references, you can do so in a (semi-)automated way using a small but very handy utility, cb2bib. As the program’s home page puts it, cb2bib “is a tool for rapidly extracting unformatted, or unstandardized bibliographic references from email alerts, journal Web pages, and PDF files.” The name stands for “clipboard to bibliography (entry)” and reflects the program’s modus operandi: cb2bib reads the text the user has copied to the clipboard and compares it with a set of predefined patterns; if a match is found, the clipboard text is converted directly into a bibtex entry on the basis of the matching pattern. Let’s see an example.
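Before turning to the program itself, the matching step just described can be sketched in a few lines of Python. This is only an illustration of the idea, not cb2bib’s actual code: the two patterns, the PATTERNS list, the to_bibtex function and the "ref1" entry key are all hypothetical.

```python
import re

# Hypothetical patterns: each pairs a list of field names with a regular
# expression whose capture groups appear in the same order.
PATTERNS = [
    (["author", "title", "year"], r'^(.+?)\. "(.+?)\." \((\d{4})\)$'),
    (["title", "author"], r"^Title: (.+); Author: (.+)$"),
]


def to_bibtex(clipboard_text, key="ref1"):
    """Return a bibtex entry for the first matching pattern, else None."""
    text = clipboard_text.strip()
    for fields, regex in PATTERNS:
        match = re.match(regex, text)
        if match is None:
            continue  # this pattern does not fit, try the next one
        lines = [f"@article{{{key},"]
        for name, value in zip(fields, match.groups()):
            lines.append(f"  {name} = {{{value}}},")
        lines.append("}")
        return "\n".join(lines)
    return None  # no pattern recognised the text


print(to_bibtex("Title: On Scripts; Author: A. Writer"))
```

The patterns are tried in order and the first one that matches decides how the text is split into fields, so the order of the list matters when two patterns could fit the same text.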

Note: cb2bib is available for both Windows and Linux operating systems (you can download it here), but the following screenshots refer to the OS I normally use, i.e. Linux.


Installing cb2bib is very easy under Windows, and under Linux too if you are using an RPM-based distribution. If you use Debian or a Debian-derived Linux distro, or MacOS X, you might have to compile and install the software yourself. Comprehensive instructions are available at this URL.

A Simple Example

Open the Examples page on cb2bib’s website, select and then copy to the clipboard the second example, the one labelled as “PNAS Table of Contents Alert”.

Note: to copy text to the clipboard under Linux you can simply select it, or you can use the CTRL+C key combination; under Windows press CTRL+C.

As you can see from the following screenshot (Fig. 1), the selected text has been automatically converted to a structured bibliographic entry, which you can now save as a bibtex entry: just click on the second-to-last icon on the right (the one showing a floppy disk with a pencil over it), or press CTRL+S, and the entry will be added to the file shown in the text field immediately above it.

Fig. 1 – A sample entry

The file is called references.bib and sits in the cb2bib folder, but you can change both the path and the file name; for instance, you might choose C:\Documents\collected-refs.bib.

cb2bib will also retrieve the abstract, add the relevant keywords to the entry, and even download the PDF version of the article if there is a URL pointing to it and access is free! All of this automatically.

Once you have collected and, if necessary, modified your reference, click on the Save button (the second from right), or press CTRL+S, to save it in the references file. Delete the cb2bib_query_tmp_pdf if present; otherwise you won’t be able to download the PDF file for the next reference you process, even if it contains a link to the article.

To learn more about cb2bib’s features, read the very nice overview page here.

Configuring cb2bib

Before exploring cb2bib’s capabilities further, it is a good idea to check the program settings: click on the second icon from left, the one showing a wrench, and you will see the configuration window (Fig. 2).

Fig. 2 – Configuring cb2bib

It is split into about half a dozen tabs; be sure to check the paths in the first one and to enable the network queries in the third one (Fig. 3).

Fig. 3 – Network queries

cb2bib’s author is especially interested in scientific publications, so for other kinds of references you will probably have to modify the regexps.txt file to obtain automatic format detection and field formatting. To do this you will have to understand the file structure, which isn’t too difficult, and add patterns for specific bibliographic styles. This is an example of a pattern which can identify and automatically format MLA-style entries:


cb2Bib 0.3.6  Pattern:
MLA-style article 1
author title journal volume number year pages
^(.+), "(.+)," (.+) (\d+):(\d+) \((\d\d\d\d)\), ([\d|\-|\s]+)\.$
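To see how the field line and the regular expression of such a pattern fit together, here is a minimal Python sketch. It assumes that the quantifiers in the expression are "+" and that each capture group corresponds, in order, to one name on the field line. cb2bib itself does not use Python, so the re module here only stands in for the program’s own regular expression engine, and the sample reference is made up.

```python
import re

# The pattern block, without its version header. The "+" quantifiers are an
# assumption about the intended expression.
BLOCK = r'''MLA-style article 1
author title journal volume number year pages
^(.+), "(.+)," (.+) (\d+):(\d+) \((\d\d\d\d)\), ([\d|\-|\s]+)\.$'''

# Line 1 is the pattern name, line 2 the field names, line 3 the expression.
name, field_line, regex = BLOCK.split("\n")
fields = field_line.split()

# Made-up reference laid out the way the pattern expects.
sample = ('Smith, John, "Medieval Handwriting," '
          'Journal of Palaeography 12:3 (2005), 101-120.')

match = re.match(regex, sample)
# Pair each field name with the capture group in the same position.
entry = dict(zip(fields, match.groups())) if match else {}

for field in fields:
    print(f"{field}: {entry.get(field)}")
```

A reference that deviates from this layout, for example one with a missing issue number, will not match at all, which is why each style variant tends to need a pattern of its own.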

Since this can be a time-consuming task, please share your regexps.txt files, so that everybody can benefit from your work and add your patterns to their own configuration.

Import references from a PDF file

If you have one or more PDF documents holding a good number of bibliographic references, you can import them using cb2bib: click on the third icon from left, and then on “Select files” to choose the PDF file(s) (Fig. 4).

Fig. 4 – Extracting references from a PDF document

Once you have a list of files ready, click on “Process” to have the program read them and extract the references. If nothing happens, it’s because you haven’t specified a PDF importer in the last tab of the configuration window. Unfortunately, this feature still needs improvement: results can vary from a useful list of references to uselessly mangled text.

Export bibtex references to HTML

Many programs allow you to export your collection of bibtex references to HTML. Many of these are simple command-line tools, like bibtex2html or bib2html; if you want, you can also export your references to XML using bibtexml. You could also take advantage of more sophisticated bibliographic software, like Tellico (Linux) or Pybliographer (Linux): these programs offer reference management, network queries, export to several different formats, and much more.
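To give an idea of what such an export involves, here is a minimal Python sketch that renders a few references as an HTML list. It does not reproduce how bibtex2html, bib2html or any of the other tools mentioned above work: the REFERENCES data, the references_to_html function and the output layout are all hypothetical, and the entries are assumed to have been parsed into dictionaries already.

```python
from html import escape

# Hypothetical, already-parsed references (one dict per bibtex entry).
REFERENCES = [
    {"author": "Smith, John", "title": "Medieval Handwriting",
     "journal": "Journal of Palaeography", "year": "2005"},
    {"author": "Doe, Jane", "title": "Archives & Annotation",
     "journal": "Digital Studies", "year": "2003"},
]


def references_to_html(references):
    """Render the references as an HTML unordered list."""
    items = []
    for ref in references:
        # Escape every value so characters like "&" are valid in HTML.
        safe = {key: escape(value) for key, value in ref.items()}
        items.append(
            "  <li>{author}, <em>{title}</em>, {journal} ({year}).</li>"
            .format(**safe)
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


print(references_to_html(REFERENCES))
```

The dedicated tools do far more than this, but the sketch shows the core of the task: mapping each bibtex field to a place in an HTML template.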


PhiloLogic is an open source full-text search, retrieval and analysis tool developed by the ARTFL Project and the Digital Library Development Center (DLDC) at the University of Chicago. It is designed to work with a variety of data encoding specifications, most importantly TEI-Lite (XML/SGML) and other TEI variants (such as MEP and CES), and it has some support for plain text, DocBook, and ATE (Dublin Core, HTML and some extensions).

Further information, sample databases, downloads, documentation and our wiki are available at:

We currently have French and English messages. If you are using PhiloLogic and want to help by translating the interface into other languages, please let us know and we will be happy to assist you in any way that we can. We are particularly interested in Spanish, Italian, Portuguese, and German. Latin would be fun.  :-)

A couple of caveats: Please note that we have NOT translated system-generated search forms. We have found that search forms and headers are frequently heavily modified by users and administrators. We have also opted not to support dynamic selection in the distribution, although this would be trivial to implement. If we find we need to do it, we will add the patch to the PhiloLogic wiki.