Article for translators: A new look at OmegaT

By Jean-Christophe Helary

fusion@mx6.tiki.ne.jp
http://www.eskimo.com/~helary/OmegaT/omegat_review.html

This article was first published last year in Corinne McKay's Open Source Update. Corinne has recently put all the OSU articles together into a book available from Lulu, "Open Source Update 2005: A guide to free and open source software for translators", where you will also find a pre-release review of OmegaT 1.6 with updated information. Even though most of the information in this article is still relevant, OmegaT 1.6 is a major upgrade that addresses a number of issues such as sentence segmenting, html attribute support and improved support for OpenOffice.org files.

OmegaT is a Computer Assisted Translation (CAT) tool. OmegaT is free. You can use OmegaT, modify it and distribute it according to the terms of the GPL licence.

This article is an updated version of the one I wrote in February for Corinne McKay's Open Source Update Issue #4. It has a totally rewritten section on Unicode and a new section about OmegaT's segmentation function (adapted from a mail I sent to Lantra-L at the end of February.)

I refer to OmegaT version 1.4.4.02 (localized to Belorussian, English, Esperanto, French, Japanese, Spanish, Russian), and I also sometimes mention the current Release Candidate version: 1.4.5RC4. I intend to update this article as OmegaT changes, so keep it in your bookmarks!

I. Introduction:

OmegaT has been around on the Computer Assisted Translation tools market for a little while now, steadily evolving and answering user needs. Its latest release has been available for a few months now (1.4.4.02) and has been reviewed a number of times already ¹.
I am extremely happy to write this short article because, both as a member of the "development" team (as a non-programmer) and as a freelance Japanese-to-French translator, I find that this new version has the potential to bring new groups of users to this extremely potent application. Before entering the core of the discussion, I will give a few general indications of what OmegaT is, and what it is not.

OmegaT in a few words...

OmegaT is a computer assisted translation tool based on the “translation memory” concept ².

OmegaT works on any operating system ³, it is free ⁴ and it has reached a level of maturity that allows it to be used in a lot of professional environments.

What has not changed from the previous releases is the following:

OmegaT works seamlessly with the OpenOffice.org file format ⁵, html and its variant xhtml files, as well as text only and Java properties files. It supports the UTF-8 flavor of Unicode ⁶ in source files, it works with the translation memory standard TMX 1.1 at level 1, and accepts glossary files for terminology checks.

The major improvements in the most recent version ⁷, which make OmegaT usable by a much wider range of potential users, are the following:

-A new segment matching engine allows much more accurate matches in general but especially for languages that do not have space separated words (Japanese and Chinese for example).

-Unicode support for glossary files allows for work with language pairs whose characters are not covered in the same character set (Japanese in target and French in source for example).

-Translations of the user interface and help/manual files.

-Other less visible new features that will facilitate further developments.

OmegaT is not a "machine translation" tool: it will not produce a translation for you. It helps you remember the way you (or others) translated segmented parts of the text you are working on. For more information about OmegaT, including a recent and still relevant introduction written by Samuel Murray (another non-programmer team member) please check either the official home page ⁸, the included user manual ⁹ or the yahoo group page ¹⁰.

II. OmegaT 1.4.4's new features:

1- The new matching engine:

Up to now, OmegaT's match engine just plainly did not work with languages having non space-separated words. It would very seldom find correct segment matches in Japanese texts for example and since that is equivalent to not recalling the "right" memories, its translation memory function for such languages would be extremely limited, and mostly non functional.

Let's take the following example (from a simple Japanese-French project that I created for the occasion):

The sentence "私は日本人です。" is not matched to anything in OmegaT 1.4.3 while it is a 75% match to "彼は日本人ですか。", a 66% match to "あなたは日本人です。" (yet grammatically and semantically closer than the previous match) and a 40% match to "私はりんごが好きです。" in OmegaT 1.4.4.

There are obvious semantic issues here, but keep in mind that a translation memory tool is not supposed to actually understand the texts that are fed to it. It just displays matching segments according to a computation that produces the closest possible results.
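To illustrate why a matcher that relies on spaces finds nothing in Japanese, here is a minimal Python sketch. It is not OmegaT's matching engine: it uses the standard difflib module as a stand-in, and the function names are mine. The percentages it prints will therefore differ from the ones quoted above; the point is only the contrast between word-based and character-based comparison.

```python
from difflib import SequenceMatcher


def word_tokens(segment: str) -> list[str]:
    """Split on whitespace, as a space-oriented matcher might."""
    return segment.split()


def char_tokens(segment: str) -> list[str]:
    """Treat every character as a token, so no spaces are needed."""
    return list(segment)


def similarity(a: list[str], b: list[str]) -> float:
    """Similarity ratio between 0 and 1, computed by difflib (a stand-in metric)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


source = "私は日本人です。"
memory = ["彼は日本人ですか。", "あなたは日本人です。", "私はりんごが好きです。"]

for candidate in memory:
    by_word = similarity(word_tokens(source), word_tokens(candidate))
    by_char = similarity(char_tokens(source), char_tokens(candidate))
    print(f"{candidate}  word-based: {by_word:.0%}  character-based: {by_char:.0%}")
```

Without spaces, each Japanese sentence is a single "word", so two different sentences have nothing in common for the word-based comparison, while the character-based comparison still finds the shared parts.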

A language dependent tuning ought to be an option but we leave that for volunteer developers!

Right now, the new match engine does not affect glossary item matches, which is a little frustrating, but this will hopefully be corrected in a future version of OmegaT.

2- Unicode support for glossary files:

The following introductory explanations are just here to put in simple words the gist of the whole issue (all my gratitude goes to Jukka K. Korpela for kindly rewriting this section, although I take full responsibility for end version inaccuracies.) People who already know what Unicode is about will want to directly jump to the "Back to OmegaT" part.

Benefits of Unicode:

Computers internally work on numbers. This means that characters need to be coded as numbers. A typical arrangement is to use numbers from 0 to 255, because that range fits into a basic unit of data storage and transfer, called "(8-bit) byte" or "octet".

When you define how those numbers correspond to characters, you define a character code. There is a large number of character codes defined and used. Most of them have the same assignments for numbers 0 to 127, used for characters that appear in English as well as in many other languages: the letters a-z plus their capital equivalents, the digits 0-9 and a few punctuation marks. Many of the numbers in this so-called ASCII set of characters are used for various technical purposes.

For French texts, for example, you need additional characters such as accented letters (é, ô, etc.). This can be handled by using code numbers in the range 128-255, and there is room for letters used in many other Western European languages as well. Thus, you can use the same character code, Latin 1, even for a text containing a mixture of English, French, and German.

However, you run out of numbers if you try to cover too many languages within a total of 256 characters. For this reason, different character codes have been developed. For example, Latin 1 is for Western European languages, Latin 2 for several languages spoken in Central/Eastern Europe, and different character codes exist for Greek, Cyrillic, Arabic, etc. When only one language is used, you can usually pick a suitable character code and use it. In fact, someone probably did that for you when designing the particular computer system (including software) that you use. You might have used a particular character code for a long time without knowing anything about it.

Things change when you need to combine languages in one document and the languages are fundamentally different in their use of characters. In an English-German or French-Spanish glossary or other bilingual data, you can use Latin 1. In English-Greek data, you can use one of the character codes developed for Greek, since these codes contain the ASCII characters. But what about French-Greek? That's not possible the same way, since the character codes discussed above do not support such a combination. A code either has accented letters in the "upper half" (range 128 - 255), or it has Greek letters there.
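The French-Greek problem can be checked with a few lines of Python. This is only a sketch: the helper name and the sample words are mine, and "latin-1" and "iso8859_7" are Python's names for Latin 1 and for one of the 8-bit Greek codes.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


french = "élève"
greek = "μαθητής"
bilingual_line = f"{french}\t{greek}"

for codec in ("latin-1", "iso8859_7", "utf-8"):
    print(codec,
          "| French:", can_encode(french, codec),
          "| Greek:", can_encode(greek, codec),
          "| both on one line:", can_encode(bilingual_line, codec))
```

Each 8-bit code handles one of the two words, but only a Unicode encoding (UTF-8 here) can hold both in the same file.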

As some of you may know, the number of characters needed for Chinese, Korean and Japanese is very large. They just would not fit into a set with only 256 characters. Therefore, different strategies are used. For example, two bytes (octets) instead of one might be used for one character. This would give 65,536 possible numbers for a character. But such character codes do not contain all the characters used in the world.

The solution to such problems, and many other problems in the world of growing information exchange, is the introduction of a character code that gives all characters of all languages a unique number. This number does not depend on the language used in the text, or the font used to display the character, or the software used, or operating system, or device. It is universal and kept unchanged.

The solution is called Unicode, and it gives anyone the opportunity to say "I want this character displayed and the number is ..." and have himself understood by all systems that support Unicode. This does not always guarantee a success in displaying the character, due to lack of a suitable font, but such technical problems are manageable.
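As a small illustration of "one number per character", the following Python sketch prints the Unicode number of a few characters together with the bytes used to store them in UTF-8. The function name is mine and the characters are arbitrary examples.

```python
def describe(char: str) -> tuple[int, bytes]:
    """Return the Unicode number of a character and its UTF-8 bytes."""
    return ord(char), char.encode("utf-8")


for char in ("A", "é", "α", "私"):
    number, utf8_bytes = describe(char)
    print(f"{char}  U+{number:04X} (decimal {number})  "
          f"stored in UTF-8 as {len(utf8_bytes)} byte(s): {utf8_bytes.hex(' ')}")
```

The number is the same on every system; only the way it is stored as bytes depends on the chosen Unicode flavor.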

Unicode has been supported by widely used software, such as Microsoft Windows, for quite some time. However, to utilize Unicode, all the relevant components must be "Unicode enabled". For example, although Windows "knows Unicode", an application program used on a Windows system might not.

Back to OmegaT:

Until version 1.4.3, OmegaT assumed that glossaries were saved in the operating system's default character code. This meant MacRoman (close to Latin 1) for Mac users working in Roman-script languages, MacJapanese (close to Shift-JIS) for Japanese Mac users, and the corresponding (and numerous) codes for Windows users, etc.

Obviously, translators working with code incompatible language pairs (like French and Japanese or similar) were not able to make good use of the glossary function.

OmegaT 1.4.4 has the Unicode solution.

OmegaT as an application supports Unicode natively (in its UTF-8 flavor; Unicode has several encoding flavors, but they all represent the same set of characters). Only the glossary was an exception. With version 1.4.4, glossary files can be saved as Unicode files and are then correctly interpreted, even for traditionally problematic language pairs.
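The difference between the old and the new behavior can be sketched in Python. This is an illustration, not OmegaT's code: it assumes a glossary stored as plain text with one "source term, tab, target term" entry per line, and the file name, function name and sample entries are hypothetical.

```python
import tempfile
from pathlib import Path


def load_glossary(path: Path, encoding: str = "utf-8") -> dict[str, str]:
    """Read a tab-separated glossary into a {source term: target term} dict."""
    glossary = {}
    for line in path.read_text(encoding=encoding).split("\n"):
        fields = line.split("\t")
        if len(fields) >= 2:
            glossary[fields[0]] = fields[1]
    return glossary


with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "glossary.txt"  # hypothetical file name
    path.write_text("élève\t生徒\npomme\tりんご\n", encoding="utf-8")

    # Read as Unicode (UTF-8): French and Japanese terms are both intact.
    print(load_glossary(path))
    # Read with a legacy operating system code: the terms come out garbled.
    print(load_glossary(path, encoding="mac_roman"))
```

The same bytes give usable entries when they are interpreted as UTF-8 and unusable ones when they are interpreted with a single-language code, which is why glossary matching could not work before for such language pairs.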

Et voilà !

With that level of Unicode support, OmegaT is now able to be used with any possible combination of languages.

3- Localizations:

For many reasons 2004 was a great year for OmegaT. Most importantly, we found a new lead developer, a very responsive and able Java programmer who happened to start working with OmegaT to translate a Java application into Russian. His involvement stirred up a lot of activity in the user/helper group. Following all the buzz (and the new ability to have OmegaT parse and translate its own interface, see below), it was progressively suggested that we produce a set of translations to enable non-English speakers to use OmegaT on a daily basis.

At the time of writing OmegaT 1.4.4 includes Belorussian, English, Esperanto, French, Japanese, Russian and Spanish. That's for the whole set, including the user interface (all the menus, buttons and messages) and the user manual. We are working on 3 new localizations for the next release: German, Italian and Turkish, as well as Vietnamese if we make it on time.

Other languages

Hopefully the list will grow, and the existing translations (that do not claim to be perfect) will improve.

OmegaT is now able to use Java localization files (files with the .properties suffix) as source files. Right now this is the only software localization file format supported by OmegaT, but the development team is already working on parsers for other formats, such as the gettext .po file format. We will probably see new file parsers in the months to come.

Since OmegaT is a Java application, the ability to translate Java bundle .properties files means that one is able to use OmegaT to translate OmegaT's GUI strings.

We are setting up an OmegaT translation project where anybody can download the TMX files of the existing translations in order to review them.¹¹ The original manual, in English, is also available for modification, by the way. The manual (officially called “help files”) has found a coordinator and we are actively working on updating, reformatting and renewing its contents.

4- Other new features

OmegaT 1.4.4 is a big step toward even more mature versions. The code itself has been partially rewritten to allow for simpler access to it.

Not only can you see that in the localization process (see above), but in the field of "file parsers" OmegaT is also moving away from the original design to open itself up (you can see the first steps of this evolution in the 1.4.5 Release Candidate version available on SourceForge)¹².

File parsers are the core of OmegaT. Without file parsers OmegaT would not be able to understand the file formats you feed it.

OmegaT now understands plain text in either the operating system's native encoding or UTF-8 (it simply interprets a paragraph as a segment), it understands html and xhtml (by considering block level tags as segment boundaries and by converting inline tags to OmegaT tags for editing purposes) and it understands the OpenOffice.org file format (an xml based format, now accepted as a world standard with the OASIS initiative, in a way similar to html/xhtml ¹³). It also understands the Java localizable file format.

All this understanding comes from file parsers that give OmegaT access to the text strings within the file to be translated. As mentioned above, new file parsers are being developed and the existing ones can be modified to fit your needs as well (this feature is being tested in OmegaT 1.4.5 Release Candidate version).

The Yahoo OmegaT support group was created in March 2004 ¹⁴. It has now reached a very good level of maturity, with extremely knowledgeable people exchanging ideas, tips and thoughts on how to make OmegaT a better tool for translators. The group has a “links” page that puts most of the relevant information within easy reach and a “file” page that gives access to the current version as well as some less recent versions. Its mail archives are accessible to list members.

OmegaT support group (http://groups.yahoo.com/group/OmegaT/)

Most of the questions you may have should already be answered in the user manual. In case you need something more you can always ask the group. As in any open source community group, members are never obliged to answer and do so because they feel the need to help.

A number of related groups will give you information on related aspects of translation with OmegaT or similar tools. A lot of members of the OmegaT group are also members of the following groups:

5- OmegaT and text segmentation: an old feature that needs some explanations

A lot of would-be users are disappointed when they hear that OmegaT segments texts at the "paragraph" level instead of segmenting at the "sentence" level.

I wrote a mail a while ago on a translators list to clarify what OmegaT was doing in terms of source text segmentation. The following section is an adaptation of the original mail:

OmegaT presegments the text when the project is opened. You cannot move segment boundaries from within OmegaT.

If you are confronted with bad segmenting, you need to close the project, modify the file and re-open the project.

Segmenting in OmegaT depends on the type of file you have in your project.

a- Text files

OmegaT segments text files at the line breaks.

Line breaks are sometimes found at the end of sentences, sometimes at the end of paragraphs.

OmegaT will create paragraph-level segments, sentence-level segments or segments that include sentence parts depending on how your text file is formatted.

You can therefore force segmenting at any level you want by inserting line breaks in your file before starting to work on the project.

If you suspect that your text might contain manual line breaks in inconvenient places, it is therefore worth editing it before opening the project. If you are confronted with bad segmenting (e.g. a line break in the middle of a sentence) whilst you are working in OmegaT, you can remove the break from the source text (for example in a text editor or word processor). You will not see the change in OmegaT until you reload the project, however.
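The effect of line breaks on text file segmenting can be sketched in Python. This is an illustration of the principle, not OmegaT's parser: the function names are mine, and the rule used to insert breaks (after ".", "!" or "?" followed by a space) is a deliberately naive assumption that will also cut after abbreviations such as "e.g.".

```python
import re


def segments_from_text(text: str) -> list[str]:
    """One segment per non-empty line, mimicking segmenting at line breaks."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def break_after_sentences(text: str) -> str:
    """Insert a line break after '.', '!' or '?' when followed by spaces."""
    return re.sub(r"([.!?])[ \t]+", r"\1\n", text)


paragraph = "OmegaT is free. You can modify it! Can you distribute it? Yes.\n"

# The file as received: the whole paragraph is one segment.
print(segments_from_text(paragraph))
# The same file after preparation: one segment per sentence.
print(segments_from_text(break_after_sentences(paragraph)))
```

A preparation step like this has to be run on the source file before the project is opened (or reloaded), since the segments are fixed at that moment.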

b- Formatted files

By "formatted" files we mean (x)html files as well as OpenOffice.org/StarOffice format files.

Formatted files generally include 2 types of tags: block level tags and inline tags.

Inline tags generally define the style of the text while block level tags generally describe its logical structure.

OmegaT segments formatted files at the block level tags.

To modify the segmenting of a formatted file, you can use OpenOffice.org conversion macros. These macros convert sentences (a stylistic unit) to paragraphs (a logical unit) so that when the file is opened in OmegaT, its segments include sentences rather than whole paragraphs. The macros can be downloaded from the OmegaT web site.
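Segmenting at block level tags can be sketched with Python's standard html.parser module. This is not OmegaT's parser: the list of block level tags is an illustrative subset, the numbered placeholders (<t0>, </t0>) are a hypothetical stand-in for OmegaT's own tags, and the class and function names are mine.

```python
from html.parser import HTMLParser

# Illustrative subsets only; a real parser would need the complete lists.
BLOCK_TAGS = {"html", "body", "title", "h1", "h2", "h3", "p", "div",
              "ul", "ol", "li", "table", "tr", "td", "th"}
VOID_TAGS = {"br", "img", "hr"}


class BlockSegmenter(HTMLParser):
    """Collect one segment per run of text between block level tags."""

    def __init__(self) -> None:
        super().__init__()
        self.segments: list[str] = []
        self._current: list[str] = []
        self._open_inline: list[int] = []
        self._counter = 0

    def _flush(self) -> None:
        text = "".join(self._current).strip()
        if text:
            self.segments.append(text)
        self._current, self._open_inline, self._counter = [], [], 0

    def _next_id(self) -> int:
        self._counter += 1
        return self._counter - 1

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()                      # a block tag closes the segment
        elif tag in VOID_TAGS:
            self._current.append(f"<t{self._next_id()}/>")
        else:                                  # inline tag: keep a placeholder
            number = self._next_id()
            self._open_inline.append(number)
            self._current.append(f"<t{number}>")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag not in VOID_TAGS and self._open_inline:
            self._current.append(f"</t{self._open_inline.pop()}>")

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def segment_html(markup: str) -> list[str]:
    parser = BlockSegmenter()
    parser.feed(markup)
    parser.close()
    return parser.segments


page = "<h1>OmegaT</h1><p>OmegaT is <b>free</b>. You can <i>modify</i> it.</p>"
print(segment_html(page))
```

The two sentences inside the same paragraph tag end up in a single segment, which is exactly the situation the conversion macros are meant to change.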

c- Other files

Besides text files and tagged files, OmegaT 1.4.4 supports Java properties files.

Java properties files use one line for each item to be translated. Their segmenting is thus very similar to that of text files, except that there is no need to modify the file's formatting.
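A simplified Python sketch shows why each line of such a file is naturally one segment. It is not OmegaT's parser and it only handles the basic "key=value" form; the real .properties format also allows other separators, continuation lines and \uXXXX escapes, which are ignored here. The function name and the sample keys are hypothetical.

```python
def properties_segments(text: str) -> list[tuple[str, str]]:
    """Return (key, value) pairs, one per translatable line."""
    pairs = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "!")):
            continue  # blank line or comment: nothing to translate
        key, separator, value = stripped.partition("=")
        if separator:
            pairs.append((key.strip(), value.strip()))
    return pairs


sample = """# Menu labels (hypothetical keys)
menu.file=File
menu.file.open=Open...
"""

for key, value in properties_segments(sample):
    print(f"{key}: segment to translate = {value!r}")
```

Only the value is a segment; the key has to be kept unchanged so that the application can find the translated string.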

d- Conclusion

It is sometimes said that OmegaT does paragraph segmenting. It is not exactly true.

OmegaT considers that segments must not be delimited by the usual punctuation marks (. ! ; etc.) but by marks that show the logical structure of the file.

Besides, talking in terms of "paragraph segmenting" in a spreadsheet/table context has little meaning. Similarly, "paragraphs" in a visual presentation do not carry the same meaning as in a literary text. Technical manuals usually group textual data in short paragraphs (or long sentences) and the difference between "paragraph" or "sentence" segmenting does not mean much anymore then.

OmegaT allows for segmenting modifications before you start working on a project. Segmentation fine tuning within OmegaT is a long expected feature that will come in future versions.

III. What you can do with OmegaT now...

What kind of user are you?

Are you somebody who likes to use an application without thinking too much, beyond what is required to understand the manual?

If you are of this kind you'll be able to use OmegaT as a translation memory tool for what it is advertised for: text files, (x)html files, OpenOffice.org files, Java properties files. With or without glossaries, with or without TMX 1.1 translation memories. Besides, OmegaT will reproduce a project with a complex folder structure by mirroring the source structure in the target structure, enabling you to easily translate complex home pages, for example.
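The mirroring idea can be sketched in a few lines of Python. This is an illustration of the principle, not of how OmegaT implements it: the function name and the folder and file names are hypothetical, and a function that converts the text to capitals stands in for the translation step.

```python
import tempfile
from pathlib import Path


def mirror_tree(source_root: Path, target_root: Path, translate) -> list[Path]:
    """Recreate the folder layout of source_root under target_root."""
    written = []
    for source_file in sorted(source_root.rglob("*")):
        if not source_file.is_file():
            continue
        # Same relative path, different root folder.
        target_file = target_root / source_file.relative_to(source_root)
        target_file.parent.mkdir(parents=True, exist_ok=True)
        text = source_file.read_text(encoding="utf-8")
        target_file.write_text(translate(text), encoding="utf-8")
        written.append(target_file)
    return written


with tempfile.TemporaryDirectory() as folder:
    source, target = Path(folder) / "source", Path(folder) / "target"
    (source / "products").mkdir(parents=True)
    (source / "index.html").write_text("<p>Welcome</p>", encoding="utf-8")
    (source / "products" / "list.html").write_text("<p>Products</p>", encoding="utf-8")

    # str.upper stands in for the translation step.
    for path in mirror_tree(source, target, str.upper):
        print(path.relative_to(target))
```

Because every translated file keeps its relative path, links between the pages of a site keep working in the translated copy.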

You can do a lot by just using OmegaT as is.

Are you a would-be power user who likes to see how to use an application within a complex work flow and take advantage of the possibilities offered by combinations with other applications in the workflow?

If you are of this kind (like me), OmegaT is a boon for you. Think about OpenOffice.org and its compatibility with Microsoft Office. You can open MS Office files within OpenOffice.org, convert them to the OpenOffice.org file format, translate them seamlessly in OmegaT, convert them back and deliver them as MS files. There are other ways to integrate OmegaT into your work flow; it all depends on the tools you use.

You can do a huge lot more by combining OmegaT with OpenOffice.org or similar applications (NeoOffice/J for OSX) and others (xml/xliff editors, aligner tools etc.)

Some OmegaT “advanced” users have started to create small tools in easy to understand and easy to use languages (namely Tcl/Tk, Python, OpenOffice.org macro language Basic...) that allow you to get a better control of OmegaT segmenting, get access to external spell checkers, align text files to produce OmegaT compatible TMX files etc...
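To give an idea of how small such a tool can be, here is a Python sketch of a line-by-line aligner that writes its result as a TMX document. It is not one of the existing user tools. It assumes that the source and target files have exactly one segment per line and the same number of lines, and the header values (tool name, segment type, etc.) are illustrative placeholders that should be checked against the TMX specification and against what OmegaT actually accepts.

```python
import xml.etree.ElementTree as ET


def align_by_line(source_text: str, target_text: str) -> list[tuple[str, str]]:
    """Pair the non-empty lines of two files, in order."""
    source_lines = [line for line in source_text.splitlines() if line.strip()]
    target_lines = [line for line in target_text.splitlines() if line.strip()]
    if len(source_lines) != len(target_lines):
        raise ValueError("the two files do not have the same number of lines")
    return list(zip(source_lines, target_lines))


def build_tmx(pairs, source_lang: str, target_lang: str) -> str:
    """Build a minimal TMX document from aligned (source, target) pairs."""
    tmx = ET.Element("tmx", version="1.1")
    ET.SubElement(tmx, "header", {
        "creationtool": "align-sketch",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "paragraph",
        "o-tmf": "plain text",
        "adminlang": "EN-US",
        "srclang": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source_segment, target_segment in pairs:
        unit = ET.SubElement(body, "tu")
        for lang, text in ((source_lang, source_segment),
                           (target_lang, target_segment)):
            variant = ET.SubElement(unit, "tuv", lang=lang)
            ET.SubElement(variant, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


japanese = "私は日本人です。\n私はりんごが好きです。\n"
french = "Je suis japonais.\nJ'aime les pommes.\n"
print(build_tmx(align_by_line(japanese, french), "JA", "FR"))
```

Real texts rarely line up this neatly, so an aligner of this kind is only useful once both files have been prepared to contain the same segments in the same order.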

Are you of the kind who loves to fiddle with the system and other things?

Well, I don't belong there (yet) but by reading what is exchanged on the OmegaT mailing lists (users and developers) I can see that OmegaT is a real neat application for them too, with limitations, but with huge possibilities for workarounds and developments.

IV. What's next, and how you fit in the picture:

You can now see that OmegaT has greatly improved with version 1.4.4. Much is to be expected in the versions to come. You can see what Maxym, the new lead developer, has in mind by reading his ideas on the SourceForge site dedicated to the OmegaT project ¹⁵. You can participate in OmegaT's future by identifying unexpected or faulty behavior (what people call bugs) and by submitting suggestions to further improve the application. All these reporting functions are centralized on the same SourceForge OmegaT project page. You can also suggest uses and tricks on the Yahoo OmegaT mailing list, etc.

But what could be next, if you think OmegaT could have a "next", also depends very much on you. That's the nice thing about Open Source and community development...

Are you a translator using OmegaT either for testing or for real life jobs?

Try OmegaT and report any flaw. Create test projects for a language you use and see how the match engine behaves. Right-to-left languages have yet to be tested. The match engine was reworked after flaws in its handling of Japanese text were discovered (thanks to the very simple 20-line project I introduced above; it took me barely 10 minutes to create it and identify the issues). Do all the experimenting you want and report!

Do you enjoy writing?

OmegaT's documentation is modified all the time to match the changes in the application. The "User Manual" (what you get when you press F1) is the basic text and is needed for daily use. Marc Prior, the original writer and project coordinator, is also working on an advanced manual (the ASAD Manual).

OmegaT needs writers, proof-readers, editors. Following what is discussed on the lists, putting that on "electronic" paper and proposing it as documentation is a necessary task not always well managed by the current community.

User manual modifications are currently being coordinated by Raymond Martin, a Linux professional who has very recently decided to give some of his time for the project.

Do you feel like translating OmegaT's interface and documentation?

Once you are familiar with the interface and the process, just get the OmegaT localization file package and translate it into a language you are familiar with (Chinese, Korean, other Asian languages, African languages, Indian languages, Portuguese, Arabic, EU official languages, etc.; the list is open). While translating, keep note of all the problems you find (either in the application itself or in the documentation) and report the documentation anomalies you found (they do sometimes translate to application bugs).

Are you also a computer hobbyist?

Write a small documented script in an easy language to improve OmegaT's integration with its environment (link OmegaT to a spell checker, to a segmenter, to a TMX converter or a glossary creator, or make OpenOffice.org macros to modify the text before sending it to OmegaT: the possibilities for simple tools are just endless).

Marc Prior, OmegaT's project coordinator, has started what is called OmegaTk: a Tcl/Tk extension project for OmegaT. Some of the above mentioned ideas have found an implementation in Tcl/Tk.

Since Tcl/Tk is a language available for most common operating systems, the tools are readily available for your use.

You can try things in any other language: lisp, python, ruby, perl, anything you like. If you document your mini application somebody will find it useful.

Are you a developer interested in translation issues?

We need Java coders (OmegaT is a Java application)! It is as simple as that.

The more coders the better. Maxym is doing an amazing job of keeping a steady pace, answering and implementing all our whims, but I can feel that we have so many requests and there are so many needs that any interested Java programmer would be seen as a savior!

Even if you are not a Java coder you can still help. (Obj)C(++) tools can be compiled to work on any platform.

Lisp tools (I'm a little bit hooked to lisp right now...) would probably be real nice too... Anything that fits in or around OmegaT in any language would do (at OmegaT we really don't discriminate with language, we are translators, remember?)

My personal request (as a MacOSX user) would be to take advantage of all the hard work Apple has accomplished in putting Java on OSX: by allowing OmegaT to access the OS spell checking service, to say the first thing that comes to my mind.

I am already "responsible" for the OSX bundle. It is fun and does not require more knowledge of the tools than what it takes to create a zip file, but a bundle only goes so far in "OSXizing" a Java application...

We are starting to have people interested in creating versions to be included in mainstream Linux distributions (Debian ¹⁶, Fedora, etc...), but even if you are a Windows only person there is room for you since a big part of OmegaT's user base uses Windows only...

OmegaT users need an application that looks and feels good within the environment they are familiar with (Windows and MacOSX have relatively clearly defined application interfaces, and each offers specific functions embedded in the operating system).

Other things are much more needed, of course:

-increased TMX compatibility (take a look at the TMX standards ¹⁷ to see what that means).

-creation of file parsers for other "industry" standards (XLIFF but also all the file formats used by commercial applications: Trados among others, but also MS Office, to have direct access to that format without needing the OpenOffice.org workaround, and many other common documentation file formats like TeX, RTF etc...).

-creation of an embedded alignment tool to work with existing translations so as to create translation memories for future projects.

-allowing for segment modification from within OmegaT's UI, so as not to have to edit the source file and reload the project to have the new segments taken into account.

Things are discussed very actively on the SourceForge development list. Take a look at it and jump in! The future of OmegaT is only limited by your imagination and your willingness to contribute your time and skills...

V. Thanks

To conclude this already too long article, I would like to thank all the folks at OmegaT, starting with Keith, who made it all possible, and including Marc, Maxym, Samuel, Dmitri and Raymond and all the other participants too numerous to name here, also Corinne McKay for giving me the idea to write this article and for the original proofreading, Maxym Mykhalchuk and Arnold Wiegert for helping with the rewrite, and Jukka K. Korpela for the Unicode part.

OmegaT is the product of the developers' hard work, but would not exist without you, the translators who use it.

Jean-Christophe Helary (helary at eskimo dot com)

Kokubunji, February 4th 2005.
Last modifications: May 17th 2005.

Footnotes:

1 See Corinne McKay's Open Source Update Issue #3 at: http://www.translatewrite.com/osupdate3.html
also Dmitri Popov's article at: http://software.newsforge.com/article.pl?sid=05/02/11/1831257&from=rss
Slightly less recent:
Sylvain Galibert's review at: http://www.proz.com/howto/189
Samuel Murray Smit's introduction at: http://leuce.com/translate/omegat.html
And for historical purposes, the very first one, by Marc Prior, at: http://www.accurapid.com/journal/23linux.htm

2 Translation memories are meant to remind you of the way you, or somebody else, previously translated similar text segments.

3 A recent version (1.4 and above) of the Java Virtual Machine is necessary to run the 1.4 series. Linux and Windows users may need to download and install one; MacOSX users have it pre-installed.

4 You can download the application from the SourceForge site at: http://sourceforge.net/projects/omegat
You can use the application without paying any licensing fee, and you are free to modify the code itself, and thus the application, under the terms of the GPL license.

5 OpenOffice.org 2.0 (and its OpenDocument file format) will be natively supported in the future 1.4.5 version. For 1.4.4 users, check the manual to modify the preference settings.

6 For more about Unicode see http://www.unicode.org where you'll find many introductory tutorials.

7 1.4.4.02 at the time of this writing. Versions 01 and 02 are bug fixes of version 1.4.4.
1.4.5 exists as a Release Candidate version for testers; some new features of 1.4.5 are also mentioned in this article.

8 http://www.omegat.org/

9 After downloading the application, unpack it and open the contents of the /doc directory in your favorite browser.

10 http://groups.yahoo.com/group/OmegaT/

11 Translation project on SourceForge at: http://sourceforge.net/projects/omegat

12 Release Candidate version on SourceForge at: http://sourceforge.net/projects/omegat

13 http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office

14 http://groups.yahoo.com/group/OmegaT/

15 SourceForge site at http://sourceforge.net/projects/omegat

16 See also the article on how to install OmegaT on Debian at: http://www.proz.com/howto/317

17 http://www.lisa.org/tmx/