Use our language resources

Data

All the data you need to give you the edge for your translation work within the South African language sphere

Corpora

Data available for all official South African languages

PARALLEL CORPORA
These collections are aligned with English at sentence level, using a combination of automatic and manual alignment techniques. The data was sourced mainly from the South African government domain.
DOWNLOAD
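As an illustration of what sentence-level alignment means in practice, here is a minimal Python sketch that pairs English segments with their counterparts in another language. It assumes the two sides are available as line-aligned sequences (line N on one side corresponds to line N on the other); the actual file layout of the downloads may differ, and the function name `pair_aligned_segments` is hypothetical.

```python
def pair_aligned_segments(source_lines, target_lines):
    """Pair up two line-aligned sequences of segments.

    Assumption: both sides hold exactly one segment per line, in the same order.
    """
    source = [line.rstrip("\n") for line in source_lines]
    target = [line.rstrip("\n") for line in target_lines]
    if len(source) != len(target):
        # A length mismatch means the alignment cannot be trusted.
        raise ValueError(
            f"segment counts differ: {len(source)} source vs {len(target)} target"
        )
    # Skip pairs where either side is empty; they carry no translation evidence.
    return [(s, t) for s, t in zip(source, target) if s.strip() and t.strip()]


# Usage with two tiny in-memory samples (English and Afrikaans).
english = ["Thank you\n", "Goodbye\n"]
afrikaans = ["Dankie\n", "Totsiens\n"]
pairs = pair_aligned_segments(english, afrikaans)
```

Checking the segment counts before pairing is a cheap way to catch a truncated or shifted file early.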
MONOLINGUAL CORPORA
These collections are raw unaligned data from the previous five Autshumato projects. The main source is the South African government domain.
DOWNLOAD

Machine Translation Evaluation Data Sets

EVALUATION DATA SETS
Comparable evaluation data for use in automatic machine translation evaluations. The evaluation set consists of 500 sentences translated separately by four different professional human translators for each of the 11 official South African languages. This yields four reference translations for each of the 11 languages, and any of these texts can also serve as the input text. As a result, the evaluation set can be used to evaluate machine translation between any two of the 11 languages.
DOWNLOAD
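To make the "any two of the 11 languages" point concrete, the following Python sketch enumerates the possible translation directions and scores a hypothesis against several references at once. The language codes are standard ISO 639 codes chosen here for illustration, and `multi_reference_unigram_precision` is a deliberately simplified stand-in (clipped unigram precision, one ingredient of BLEU-style metrics), not the metric used by the Autshumato project.

```python
from collections import Counter
from itertools import permutations

# Illustrative ISO 639 codes for the 11 languages covered by the set.
LANGUAGES = ["af", "en", "nr", "nso", "ss", "st", "tn", "ts", "ve", "xh", "zu"]


def translation_directions(languages):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))


def multi_reference_unigram_precision(hypothesis, references):
    """Fraction of hypothesis tokens found in the references.

    Each token is credited at most as often as it occurs in the single
    reference that contains it most often (the usual clipping rule).
    """
    hyp_counts = Counter(hypothesis.split())
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    max_ref_counts = Counter()
    for reference in references:
        for token, count in Counter(reference.split()).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    matched = sum(min(count, max_ref_counts[token])
                  for token, count in hyp_counts.items())
    return matched / total


directions = translation_directions(LANGUAGES)
score = multi_reference_unigram_precision(
    "the cat sat",
    ["the cat sat down", "a cat was sitting"],
)
```

With several references, a hypothesis is not penalised for choosing wording that only one of the translators happened to use.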

TRANSLATION MEMORIES & GLOSSARIES

A wide variety of translation memories and glossaries are available for download and can be used free of charge. These resources are in a standard translation tool format and can be used with the Autshumato ITE or any other TMX-enabled software package.
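For readers who want to inspect a TMX file outside a translation tool, here is a minimal Python sketch using only the standard library. It assumes a typical TMX layout (`tu` units containing `tuv` variants with an `xml:lang` attribute and a `seg` child); the sample document and the function name `read_tmx_pairs` are made up for illustration and are not taken from the downloadable resources.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Thank you</seg></tuv>
      <tuv xml:lang="af"><seg>Dankie</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Goodbye</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx_pairs(tmx_text, source_lang, target_lang):
    """Extract (source, target) segment pairs from TMX markup."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for unit in root.iter("tu"):
        segments = {}
        for variant in unit.findall("tuv"):
            # Older TMX files may use a plain "lang" attribute instead.
            lang = variant.get(XML_LANG) or variant.get("lang") or ""
            seg = variant.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        # Keep only units that contain both requested languages.
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs


tmx_pairs = read_tmx_pairs(SAMPLE_TMX, "en", "af")
```

Units that lack one of the two requested languages are skipped rather than padded, so the result contains only complete pairs.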

The TMG is a crowd-sourced platform through which translation resources can be supplied and obtained. By sharing collective translation resources, everyone can benefit. Sharing translation resources between various affiliations (translation units) and freelance translators can ensure better consistency and increased productivity throughout translation projects, which in turn can give everyone more access to information in their native language.

Users can rate and comment on resources to give others an indication of the quality of a specific resource. The system also remembers the resources that you have uploaded and downloaded, and can serve as a cloud storage facility for your translation resources. Should you lose all your translation resources for any reason, you will be able to recover them easily.

The TMG also makes provision for managers to manage their personnel and their resources on the system.

GLOSSARIES

The data is provided as a single UTF-8 text file, with each segment on a new line. The dataset contains existing data sourced for the DAC-funded Autshumato project as well as new data sourced for the SADiLaR: Parallel corpora for English into isiXhosa projects.
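A file in this shape (UTF-8, one segment per line) can be loaded with a few lines of Python. The sketch below is a minimal example under that assumption; the function name `read_segments` is hypothetical, and any path passed to it is a placeholder rather than the name of the actual download.

```python
def read_segments(path):
    """Read a UTF-8 text file and return its non-empty segments, one per line."""
    # "utf-8-sig" reads plain UTF-8 and also strips a byte-order mark if present.
    with open(path, encoding="utf-8-sig") as handle:
        return [line.strip() for line in handle if line.strip()]
```

Blank lines are dropped here for convenience. If line numbers need to stay in step with another file, keep the empty lines instead.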