UTX and Quick-input Convertibility
Click to download example utx file
Click to download example txt file
TBX-Glossary and convert_glossary were developed for UTX-Simple 1.00. Since then, the UTX-S standard has advanced to version 1.10, gaining features that are not compatible with our software. To help users avoid problems, we describe the incompatibilities here:
First, convert_glossary is not prepared to skip additional, descriptive lines in the UTX header. It must find the column definitions in the second header line, which must also be the last. However, additional descriptions can be included as a glossary-wide note, as described below.
Second, UTX-S 1.10 provides for bidirectionality, grouping by concept ID, and term status. None of these data categories are convertible according to our current design, which was based on UTX-S 1.00, and all of this information will be lost in conversion. Entries are converted as though they were all separate concepts, all monodirectional, and all approved terms. In order to produce a file with no forbidden terms, with only approved terms, etc., one must pre-filter the UTX.
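The pre-filtering step mentioned above might be sketched as follows in Python. This is an illustration only, not part of convert_glossary. The function name filter_by_status, the default column name 'term status', and the status value 'approved' are assumptions that should be checked against the UTX-S 1.10 files actually in use. The sketch leaves the header and columns unchanged, so any other adjustments needed for conversion are not covered.

```python
def filter_by_status(lines, status_column="term status", keep=("approved",)):
    """Return the header lines plus only those entries whose status is in `keep`.

    `status_column` and the values in `keep` are assumed names; adjust them
    to match the file being filtered.
    """
    # The header is taken to be the leading run of lines that start with '#'.
    n_header = 0
    while n_header < len(lines) and lines[n_header].startswith("#"):
        n_header += 1
    if n_header == 0:
        raise ValueError("no header lines found")

    # The last header line holds the tab-separated column names.
    columns = lines[n_header - 1].lstrip("#").split("\t")
    if status_column not in columns:
        return list(lines)  # no status column, so nothing to filter on
    index = columns.index(status_column)

    kept = list(lines[:n_header])
    for line in lines[n_header:]:
        cells = line.split("\t")
        status = cells[index].strip() if index < len(cells) else ""
        # Rows with a blank status are dropped unless "" is listed in `keep`.
        if status in keep:
            kept.append(line)
    return kept
```

Passing keep=("approved", "") would also retain entries whose status cell is blank, which may be preferable if blank is meant to imply approval in a given glossary.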
In the remainder of this documentation, 'UTX' refers to UTX-Simple 1.00.
This document also describes our quick-input format, illustrated by another sample, data.txt. This format is identical to UTX, except (a) that quick-input files may omit any UTX element not mentioned below, even if it is mandatory in proper UTX, and (b) that quick-input provides several conveniences for entering the mandatory part-of-speech data.
For both UTX and quick-input, the converter relaxes one requirement of the UTX specification: lines may be terminated with whatever end-of-line code the local Perl recognizes, not solely with the canonical carriage return and line feed. On Unix-like systems, where the local end-of-line code is line feed alone, it will also accept the canonical version. The converter does not, however, waive UTX's prohibition of files starting with a byte-order mark.
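The line-ending and byte-order-mark rules can be illustrated with a short Python sketch. It is not the converter's own (Perl) implementation. The function name split_utx_lines is hypothetical, and the sketch assumes the file is UTF-8 and checks only for the UTF-8 byte-order mark.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def split_utx_lines(raw: bytes) -> list[str]:
    """Split raw file bytes into lines, accepting CRLF or LF line endings."""
    # A leading byte-order mark is prohibited, so reject it outright.
    if raw.startswith(UTF8_BOM):
        raise ValueError("file must not start with a byte-order mark")
    text = raw.decode("utf-8")  # assumption: the file is UTF-8 encoded
    # Normalize the canonical CRLF to LF, then split on LF.
    lines = text.replace("\r\n", "\n").split("\n")
    # A final line terminator leaves one empty trailing element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```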
Source and target languages are expressed in the first header line, as in all UTX.
The subject field is expressed in the first header line as an optional field indicated by the key word 'subject'. Although optional in UTX, it is mandatory for convertibility.
A glossary-wide note is expressed in the first header line, as an optional field with the key word 'comment'.
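As a rough illustration of the first-header-line requirements above, the following Python sketch extracts the language pair, subject, and comment. It assumes that header fields are separated by semicolons, that the second field holds the language pair as source/target, and that optional fields take the form 'key: value'. These assumptions and the function name parse_first_header are ours and should be verified against the UTX specification and the sample file.

```python
def parse_first_header(line: str) -> dict:
    """Extract languages, subject, and comment from the first header line."""
    # Assumed layout: '#<format id>; <src>/<tgt>; ...; key: value; ...'
    fields = [field.strip() for field in line.lstrip("#").split(";")]
    if len(fields) < 2 or "/" not in fields[1]:
        raise ValueError("expected a source/target language pair in the second field")
    source_lang, target_lang = fields[1].split("/", 1)
    header = {"source_lang": source_lang.strip(), "target_lang": target_lang.strip()}

    # Pick out only the two optional fields that matter for conversion.
    for field in fields[2:]:
        key, separator, value = field.partition(":")
        if separator and key.strip() in ("subject", "comment"):
            header[key.strip()] = value.strip()

    # 'subject' is optional in UTX but mandatory for convertibility.
    if "subject" not in header:
        raise ValueError("a 'subject' field is required for conversion")
    return header
```

Because the sketch splits on every semicolon, a comment that itself contains a semicolon would be cut short.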
Source and target terms are expressed as in all UTX.
In convertible UTX, source part-of-speech is expressed as in all UTX. In the quick-input format, it can be implied: a blank in the src:pos column, or that column's entire absence, indicates that the source term is a noun.
Target part-of-speech is mandatory for convertibility. In convertible UTX, it may be expressed explicitly in a tgt:pos column, or implicitly: A blank in the tgt:pos column, or that column's absence, indicates that the part of speech is the same as in the source language. In the quick-input format, a third option joins these two: The 'note' field can contain the tag 'tgt:pos:' followed by a part of speech. This special note formatting will override the implicit same-as-source assumption (but will not override an explicit tgt:pos in its proper place). This is designed to allow the quick-input user to avoid keying a tgt:pos column; implicit same-as-source covers the most common case, and special note formatting covers the exceptions. (The tgt:pos portion of the note field is removed before the note is processed further.)
The convertible part-of-speech values are adjective, adverb, noun, properNoun, and verb. Sentence is not a convertible part of speech.
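The part-of-speech rules above can be summarized in a small Python sketch. It is a restatement of the rules for illustration, not the converter's code. The function name resolve_pos, the representation of an entry as a dictionary keyed by column name, and the exact tag syntax ('tgt:pos:' followed by optional spaces and a single word) are assumptions.

```python
import re

CONVERTIBLE_POS = {"adjective", "adverb", "noun", "properNoun", "verb"}
# Assumed tag syntax: 'tgt:pos:', optional spaces, then one word.
TGT_POS_TAG = re.compile(r"tgt:pos:\s*(\w+)")


def resolve_pos(row: dict, quick_input: bool = False) -> tuple[str, str, str]:
    """Return (source_pos, target_pos, note_without_tag) for one entry."""
    src_pos = (row.get("src:pos") or "").strip()
    tgt_pos = (row.get("tgt:pos") or "").strip()
    note = row.get("note") or ""

    # Quick-input only: a blank or absent src:pos means the term is a noun.
    if not src_pos:
        if not quick_input:
            raise ValueError("src:pos is mandatory in UTX")
        src_pos = "noun"

    # Quick-input only: a 'tgt:pos:' tag in the note supplies the target
    # part of speech unless an explicit tgt:pos column value is present.
    if quick_input:
        match = TGT_POS_TAG.search(note)
        if match:
            if not tgt_pos:
                tgt_pos = match.group(1)
            # Strip the tag so the rest of the note can be processed normally.
            note = (note[:match.start()] + note[match.end():]).strip()

    # Both formats: a blank or absent tgt:pos means same as the source.
    if not tgt_pos:
        tgt_pos = src_pos

    for pos in (src_pos, tgt_pos):
        if pos not in CONVERTIBLE_POS:
            raise ValueError(f"part of speech {pos!r} is not convertible")
    return src_pos, tgt_pos, note
```

For example, a quick-input entry with no part-of-speech columns and the note 'informal tgt:pos: verb' would resolve to source 'noun', target 'verb', and the note 'informal'.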
The remaining data categories are convertible but not mandatory: note on an entry (in the source language), definition, source of definition, contextual example, and source of contextual example. They must appear in columns headed by the correct abbreviations, as seen in the sample file. Per the UTX standard, columns after the mandatory three may appear in any order so long as they are consistent within a file.
When UTX is selected as the output format, the converter will produce files conforming to the UTX-Simple 1.00 specification and the above requirements, with this exception: language tags in the RFC 4646 format will be neither expanded nor reduced to conform to the narrower xx-XX format shown in the UTX spec. Such tags may be adjusted by hand if desired.