TUSTEP (TUebingen System of TExt processing Programs)

TUSTEP is a professional toolbox for the scholarly processing of textual data (including texts in non-Latin scripts), with a strong focus on humanities applications.

Designed in cooperation with many humanities projects by the Division of Literary and Documentary Data Processing at the Computing Center of the University of Tübingen and first implemented more than 25 years ago, TUSTEP is constantly being improved and expanded to support solutions for new problems and to take advantage of new hardware and operating systems. It contains modules for all stages of scholarly text data processing, from data capture through information retrieval, text collation, text analysis, sorting and ordering, and rule-based text manipulation, to output in electronic or conventional form (including professional-quality typesetting).
Beyond the University of Tübingen, TUSTEP is currently used at roughly 100 other universities and research institutions (a list is available on the web page of the International TUSTEP User Group, ITUG).

Articles and tutorials

Modularity, Professionality, Integration: Design principles for TUSTEP

Text data processing with TUSTEP: overview, hints

Current Version


Home pages

German: http://www.uni-tuebingen.de/zdv/tustep/tustep.html

English: http://www.uni-tuebingen.de/zdv/tustep/tustep_eng.htm

T-PEN (Transcription for Paleographical and Editorial Notation)

T‑PEN (transcription for paleographical and editorial notation) is a web-based tool for working with images of manuscripts. Users attach transcription data (new or uploaded) to the actual lines of the original manuscript in a simple, flexible interface.




T-PEN automatically recognizes columns and lines. Users can modify this automatic layout segmentation before transcribing.



T-PEN:
  • is an open and general tool for scholars of any level of technical expertise
  • allows transcriptions to be created, manipulated, and viewed in many ways
  • supports collaboration with others through simple project management
  • exports transcriptions as PDF or XML (plain text) for further processing, or contributes them to a collaborating institution with a click
  • respects existing and emerging standards for text, image, and annotation data storage
  • avoids prejudice in data, allowing users to find new ways to work

As of April 2014, it provided access to more than 4,000 manuscripts (for example through links with e-codices), either publicly available or restricted to specific projects.


T-PEN version 2.0 was launched in May 2012 with the following new features: (1) users can upload their own image sets for transcription; (2) T-PEN fully supports crowd-sourcing projects; (3) T-PEN provides access to support tools for transcribers; (4) an additional, still experimental feature, glyph matching, integrates a paleographical analysis tool into T-PEN.

TILE (Text-Image Linking Environment)

The Text-Image Linking Environment (TILE) is a web-based tool for creating and editing image-based electronic editions and digital archives of humanities texts.


TILE 1.0 supports the following tools and functionality:

  • Image markup tool
  • Annotate regions of an image by drawing rectangles, polygons, and ellipses; apply labels to selections; and manually create links between sections of an image and transcript lines.
  • Importing and exporting tools
  • Import TEI P5 or JSON data directly into TILE or create a script to import from various XML formats.
  • Export data as TEI or JSON; scripts can generate output in any XML, HTML, or text-based format. Additional import/export tools can be developed as plugins.

Semi-automated line recognizer

Implemented in JavaScript, the TILE semi-automated line recognizer annotates images by detecting individual lines on an image and selecting regions of the image based on those lines.
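
To make the idea of line detection concrete, here is a minimal sketch of one common technique, a horizontal projection profile, written in Python rather than JavaScript. It is not TILE's actual algorithm, which this page does not describe; the function name `detect_line_boxes`, the `min_ink` threshold, and the tiny sample image are all illustrative assumptions.

```python
from typing import List, Tuple

# (left, top, right, bottom), inclusive pixel coordinates
Box = Tuple[int, int, int, int]

# Hypothetical binarized page: 1 = ink, 0 = background.
PAGE = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
]


def detect_line_boxes(image: List[List[int]], min_ink: int = 1) -> List[Box]:
    """Return one bounding box per text line, scanning rows top to bottom.

    A row counts as part of a line when it holds at least `min_ink` ink
    pixels (min_ink is assumed to be >= 1). Consecutive inked rows form
    one line; its box is trimmed to the leftmost and rightmost inked columns.
    """
    if not image:
        return []
    width = len(image[0])
    boxes: List[Box] = []
    start = None  # top row of the line currently being scanned, if any
    # The extra empty row closes a line that touches the bottom edge.
    for y, row in enumerate(image + [[0] * width]):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            band = image[start:y]
            cols = [x for x in range(width) if any(r[x] for r in band)]
            boxes.append((cols[0], start, cols[-1], y - 1))
            start = None
    return boxes


if __name__ == "__main__":
    for box in detect_line_boxes(PAGE):
        print(box)
```

A real recognizer would also have to cope with skewed or touching lines and with multiple columns, which is presumably why the TILE tool is semi-automated and leaves the resulting regions open to manual correction.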

Plugin architecture

Extend the core functionality of TILE by creating a plugin that can manipulate TILE’s interface, filter and process data, and connect to other tools.

Development and versions

TILE is a collaboration between the Maryland Institute for Technology in the Humanities (Doug Reside, Dave Lester) and Indiana University (Dot Porter, John Walsh), and is supported by an NEH Preservation and Access grant. Its team has partnered with Editing Modernism in Canada and is looking for additional partners to support and extend the software.

TILE version 1.0 was released in July 2011. It added an interface for tagging and annotating manuscript images (Image Annotation), an interface for automatically tagging lines using basic image analysis tools (Auto Line Recognizer), dialog tools for loading and saving data, and support for TEI P5 formatted XML data, among other features.


Project home page

Image Markup Tool

The Image Markup Tool is open-source software for annotating images, creating TEI files, and generating ready-to-publish XHTML files.


The software is fully TEI (Text Encoding Initiative) compliant. Its features include:

  • File format based on the TEI P5 schema.
  • RelaxNG schema.
  • Basic export/re-import in DocBook format.
  • Loading and display of a wide variety of image formats.
  • Insertion of resizable, movable annotation areas on the image.
  • Association of a title and text (TEI XML code) with each annotation area.
  • Configurable categories for sorting annotations into groups.
  • Configurable colour for each category and configurable display shape for each category.
  • Display of a list of annotation areas by title, for ease of navigation.
  • Re-ordering of annotations and categories.
  • Display/editing of the teiHeader element.
  • Syntax highlighting for XML editing (in the Annotation Text and teiHeader editing boxes), using the UniSysEdit open-source control.
  • Switch from @facs to a combination of @facs (for transcriptional annotations) and @corresp (for non-transcriptional annotations) as the attribute used to link annotation divs to their corresponding zones on the image, in line with 2008 changes to TEI P5.
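
The linking of annotation divs to image zones through @facs and @corresp can be illustrated with a short Python sketch. The sample document below is a simplified assumption about how such a file might look (zones with `xml:id` and TEI corner coordinates, divs pointing at them with `#id` references); it is not a file produced by the Image Markup Tool, and the helper name `link_annotations` is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, hand-written TEI sample (an assumption for illustration only).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface>
      <graphic url="page1.jpg"/>
      <zone xml:id="zone1" ulx="10" uly="20" lrx="110" lry="60"/>
      <zone xml:id="zone2" ulx="10" uly="80" lrx="110" lry="120"/>
    </surface>
  </facsimile>
  <text><body>
    <div facs="#zone1"><head>Line 1</head></div>
    <div corresp="#zone2"><head>Stain</head></div>
  </body></text>
</TEI>"""


def link_annotations(tei_xml: str):
    """Return (title, linking attribute, zone box) for each annotation div.

    The zone box is (ulx, uly, lrx, lry), or None if the target is missing.
    """
    root = ET.fromstring(tei_xml)
    # Index every zone by its xml:id so divs can be resolved by reference.
    zones = {
        zone.get(XML_ID): tuple(int(zone.get(k)) for k in ("ulx", "uly", "lrx", "lry"))
        for zone in root.iter(f"{{{TEI_NS}}}zone")
    }
    links = []
    for div in root.iter(f"{{{TEI_NS}}}div"):
        title = div.findtext(f"{{{TEI_NS}}}head", default="")
        # @facs marks transcriptional annotations, @corresp the others.
        for attr in ("facs", "corresp"):
            target = div.get(attr)
            if target and target.startswith("#"):
                links.append((title, attr, zones.get(target[1:])))
    return links


if __name__ == "__main__":
    for title, attr, box in link_annotations(SAMPLE):
        print(f"{title}: linked via @{attr} to zone {box}")
```

Keeping the attribute name in the result lets a consumer treat transcriptions and other annotations differently while resolving both against the same set of zones.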

Development and versions

The software is being developed by Martin Holmes and a team at the University of Victoria.

Version released in June 2012.


Marjorie Burghart (EHESS, France) has created some online self-correcting exercises designed to teach students how to read medieval scripts. In the process, she has also created a package of materials and instructions to help others to create similar materials.


Project Homepage

Fiche Plume

Poliqarp for DjVu

Poliqarp for DjVu is an open-source search engine for DjVu corpora, available under the GNU GPL license and developed by Janusz S. Bień at the University of Warsaw. It relies on the DjVu format and makes it possible to present end users with the results of advanced language technologies.

Conceived as a modification of the Poliqarp (Polyinterpretation Indexing Query and Retrieval Processor) corpus query tool, it inherits powerful search facilities based on two-level regular expressions, which can be used in queries to work around OCR errors, as well as the ability to represent low-level ambiguities and other linguistic phenomena. It delivers highlighted results and KWIC (keyword in context) search results.
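
The general idea of using regular expressions to work around OCR errors and of presenting hits as KWIC lines can be sketched in a few lines of Python. This is not Poliqarp's query language and covers only character-level matching, not its two-level expressions; the confusion table, the function names, and the sample text are assumptions made for illustration.

```python
import re

# Hypothetical table of characters that OCR tends to confuse
# (for example long s read as "f"). Not taken from Poliqarp.
CONFUSIONS = {"s": "[sſf]", "u": "[uvn]", "i": "[il1]"}


def tolerant_pattern(word: str) -> "re.Pattern[str]":
    """Build a whole-word regex that accepts the listed OCR confusions."""
    body = "".join(CONFUSIONS.get(ch, re.escape(ch)) for ch in word.lower())
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


def kwic(text: str, word: str, width: int = 20):
    """Return (left context, match, right context) for every hit."""
    rows = []
    for m in tolerant_pattern(word).finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows


if __name__ == "__main__":
    # Invented sample with two long-s misreadings.
    sample = "Item dominus epifcopus dedit ecclefie sue"
    for left, hit, right in kwic(sample, "episcopus", width=8):
        print(f"{left:>8} [{hit}] {right}")
```

Searching for the normalized form "episcopus" finds the misread "epifcopus" because each character of the query is widened to its confusion class before matching.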

Although at present the tool is used mainly to facilitate access to the results of dirty OCR, it can also handle more sophisticated output of linguistic technologies.

Poliqarp for DjVu is used in particular for a non-medieval corpus of historical Polish (1570 to 1756), which raises issues similar to those of medieval corpora (spelling, abbreviations, etc.).

The software can be used for the scans of “Lexicon Mediae et Infimae Latinitatis Polonorum” (http://rcin.org.pl/publication/15584), prepared at the cost of the European Fund of Regional Development under the framework of Operational Programme – Innovative Economy, Priority Axis 2, Investment projects relating to development of information infrastructure of science, within sub-action 2.3.2, The Projects in the area of development of information resources of science in a digital form. Unfortunately, although the scans were at first freely available “to all for their own use, for scientific, educational or teaching purposes”, access to them has been severely limited since February 2013: “Publication accessible in the Institute of Polish Language of the Polish Academy of Sciences network for their [?] own use, for scientific, educational or teaching purposes”.
Source(s): Software solutions

Presentation of the tool