
Stopping the Word Count Insanity








By Andrzej Zydron

In the localization industry, there is a total lack of consistency among word and character counts, not only between rival products, but even among different versions of the same product. The same can be said for word processing software: word and character counts differ among vendors and versions. An additional problem is that none of this software provides a verifiable specification of how the metrics are actually determined. You have to accept them as they are.

This is effectively the same situation that existed for weights and measures before the French Revolution established a sane and uniform system that everyone could agree upon, one that we still use today (with minor exceptions).

Trying to establish a measure for the size of a given localization task poses a real problem for the professional who is trying to calculate a price. The differences in word and character counts among different translation or word processing tools can be as much as 20 percent. And such a gap can mean the difference between profitability and loss.
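To make the effect of a 20 percent gap concrete, here is a small worked sketch. The per-word price, the per-word cost, and the word counts are made-up figures chosen for illustration; they do not come from any real job.

```python
# Hypothetical figures, in cents, for illustration only.
PRICE_PER_WORD = 10   # assumed price charged per counted word
COST_PER_WORD = 9     # assumed cost of translating one real word

def margin(billed_words: int, actual_words: int) -> float:
    """Profit or loss, in currency units, when the quote is based on one
    tool's count but the work is driven by the real volume."""
    return (billed_words * PRICE_PER_WORD - actual_words * COST_PER_WORD) / 100

actual = 10_000
undercount = 8_000    # a tool that reports 20 percent fewer words

print(margin(actual, actual))      # quote based on an accurate count
print(margin(undercount, actual))  # quote based on the lower count
```

With these assumed numbers, the same job moves from a modest profit to a loss of the same size purely because of the tool used to count.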

Realizing that this problem needed to be addressed by an independent industry body, LISA OSCAR undertook the task, in 2004, of establishing a standard that everyone can agree on and that can be independently verified.

Nearly three years later, we finally have a far-reaching and considerably reviewed approach to this problem. The core of the new standard comes under the umbrella concept of Global Information Management Metrics Exchange or GMX for short.

We all know that word and character counts are not the only measure of a given localization task. Thus, GMX comprises three standards:

  • GMX-V (for volume)
  • GMX-Q (for quality)
  • GMX-C (for complexity)

    GMX-V is the first of the three standards to be completed. Work will commence in 2007 on GMX-Q and GMX-C. Quality (GMX-Q) will deal with the level of quality required for a task. For example, the quality required for the translation of a legal document is much higher than that for technical documentation that will have a relatively small audience. Complexity (GMX-C) will take into consideration the source and format of the original document and its subject matter. For example, a document dealing with a narrow, specialized domain is far more complex to translate than user instructions for a simple consumer device.

    All of the GMX family of standards relies on an XML vocabulary for the exchange of metric data. Using the three standards together, it will be possible to have a uniform measure for defining the specific aspects of a localization task, to a point where one can completely automate all the pricing aspects of the task and exchange this data electronically.

    GMX-V

    GMX-V is designed to fulfill two primary roles:

    • Establish a verifiable way of calculating the primary word and character counts for a given electronic document.
    • Establish a specific XML vocabulary that enables the automatic exchange of metric data.

    As with all good standards, GMX-V is itself based on other well established standards:

    • Unicode 5.0 normalized form
    • Unicode Technical Report 29 – Text Boundaries
    • OASIS XML Localization Interchange File Format (XLIFF) 1.2
    • LISA OSCAR Segmentation Rules Exchange (SRX) 2.0
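Normalization matters because the same visible text can be stored as different sequences of code points. The sketch below uses Python's standard unicodedata module and assumes NFC as the normalized form; the article does not say which form GMX-V uses, and Python ships its own Unicode version rather than 5.0.

```python
import unicodedata

composed = "caf\u00e9"      # "é" stored as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

def char_count(text: str) -> int:
    """Count code points after NFC normalization (NFC is an assumption)."""
    return len(unicodedata.normalize("NFC", text))

# The raw lengths differ even though both strings display as "café".
print(len(composed), len(decomposed))
# After normalization both strings yield the same character count.
print(char_count(composed), char_count(decomposed))
```

Two tools that skip this step can report different character counts for text that looks identical on screen.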

    WORDS AND CHARACTERS

    GMX-V mandates both word and character counts. Character counts convey the most precise definition of a localization task, whereas word counts are the most commonly used metric in the industry.

    OTHER METRICS

    The XML exchange notation of GMX-V allows for the exchange of all metrics relating to a given localization task, such as page counts, file counts, screen shot counts, etc.
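As a rough illustration of what exchanging metrics in XML could look like, the sketch below serializes a few counts and reads them back. The element and attribute names (metrics, count, type, value) are hypothetical placeholders, not the actual GMX-V vocabulary, and the figures are invented.

```python
import xml.etree.ElementTree as ET

def build_metrics(counts: dict) -> str:
    """Serialize named counts into a small XML document.
    Element and attribute names are illustrative, not the GMX-V schema."""
    root = ET.Element("metrics")
    for name, value in counts.items():
        ET.SubElement(root, "count", {"type": name, "value": str(value)})
    return ET.tostring(root, encoding="unicode")

def read_metrics(xml_text: str) -> dict:
    """Recover the named counts on the receiving side."""
    return {c.get("type"): int(c.get("value")) for c in ET.fromstring(xml_text)}

sent = {"words": 1200, "characters": 6900, "screen-shots": 14}
xml_text = build_metrics(sent)
print(xml_text)
print(read_metrics(xml_text) == sent)
```

The point is only that both parties read the same figures from the same file, whatever tools they use internally.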

    CANONICAL FORM

    One of the main problems with calculating word and character counts is the sheer range of differing proprietary file formats. Trying to establish a standard that addresses all formats is impossible. GMX-V therefore requires a canonical form that effectively levels the playing field. Such a common format is available through the OASIS XLIFF standard, which is now supported by all of the localization tool providers.

    Within XLIFF, inline codes are interpreted as inline XML elements. The inline elements are not included in the word and character counts, but form a separate inline element count of their own. The frequency of inline elements can have an impact on the translation workload, so a separate count is useful when sizing a job. Punctuation and white space characters are also featured as additional categories.
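The separation of text from inline elements can be sketched as follows. This is a simplified illustration, not a conforming GMX-V counter: it ignores XML namespaces, assumes that the listed XLIFF 1.2 inline elements carry native markup rather than translatable text, counts every inline element individually, and uses a crude whitespace split for words.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Inline elements assumed to wrap native markup rather than translatable text.
CODE_ELEMENTS = {"ph", "bpt", "ept", "it", "x", "bx", "ex"}

def collect(node, parts):
    """Append translatable text to parts; return the number of inline elements."""
    inline = 0
    for child in node:
        inline += 1
        if child.tag not in CODE_ELEMENTS:
            parts.append(child.text or "")   # e.g. text wrapped by <g>
            inline += collect(child, parts)
        parts.append(child.tail or "")       # text that follows the element
    return inline

def analyse(source_xml):
    source = ET.fromstring(source_xml)
    parts = [source.text or ""]
    inline = collect(source, parts)
    text = "".join(parts)
    return {
        "text": text,
        "words": len(text.split()),  # crude whitespace split, for illustration
        "inline_elements": inline,
        "punctuation": sum(unicodedata.category(c).startswith("P") for c in text),
        "white_space": sum(c.isspace() for c in text),
    }

SAMPLE = ('<source>Click <g id="1">Save</g> to continue'
          '<ph id="2">&lt;br/&gt;</ph>.</source>')
result = analyse(SAMPLE)
print(result)
```

Here the markup inside the ph element never reaches the text that is counted, while the two inline elements are reported as a figure of their own.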

    GMX-V addresses all issues related to counting words and characters in the XLIFF canonical format. Since the sentence is the commonly accepted atomic unit for translation, it proposes sentence-level granularity for counting purposes within XLIFF.

    GMX-V does not preclude producing metrics directly from non-XLIFF files, as long as the format for counting is based on the XLIFF canonical form for each text unit being counted. This can be done dynamically on the fly, and it requires an audit file for verification purposes.

    WORDS

    GMX-V uses “Unicode Technical Report 29 (TR29-9) – Text Boundaries” to define words and characters. This provides a clear and unambiguous definition of word and character (“grapheme”) boundaries.
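The following sketch shows why an agreed boundary definition is needed: three plausible counting rules give three different word counts for one sentence. The third rule only approximates the spirit of the TR29 word rules with a regular expression (apostrophes and periods between word characters do not split a word, hyphens do); a real implementation would rely on a full TR29 library such as ICU rather than this pattern.

```python
import re

SENTENCE = "A state-of-the-art tool can't fail."

def whitespace_words(text):
    """Split on white space only."""
    return text.split()

def letters_only_words(text):
    """Treat every run of ASCII letters as a word."""
    return re.findall(r"[A-Za-z]+", text)

def boundary_words(text):
    """Rough approximation of TR29-style word boundaries (an assumption)."""
    return re.findall(r"\w+(?:['’.]\w+)*", text)

for rule in (whitespace_words, letters_only_words, boundary_words):
    print(rule.__name__, len(rule(SENTENCE)))
```

For this sentence the three rules report 5, 9, and 8 words respectively, a spread of the kind the standard is meant to eliminate.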

    LOGOGRAPHIC SCRIPTS

    Word counts have little relevance for Chinese, Japanese, Korean (CJK) and Thai source text. For these languages, GMX-V recommends using only character counts.

    There is a proposal before ISO TC 37, submitted by Professor Sun Maosong, relating to the automatic identification of word boundaries for CJK languages. Should this recommendation become a standard, GMX-V should reference it for the provision of CJK word counts.

    QUANTITATIVE AND QUALITATIVE MEASUREMENTS

    GMX-V counts fall into two categories: how many and what type. The primary count is unqualified. For example, how many characters and words are in the file? This is the minimal conformance level proposed for GMX-V.

    A typical translatable document will contain a variety of text elements. Some of these elements will contain non-translatable text, some will have been matched from translation memory, and some will have been fuzzy matched by the customer. Therefore, it is important to be able to categorize the word and character counts according to type, in order to provide a figure in words and characters for a given localization task. GMX-V also provides an extension mechanism that enables user defined categories.

    COUNT CATEGORIES

    Apart from the total-word-count and total-character-count values, GMX-V also includes these count categories:

    • In-context exact matches – An accumulation of the word and character count for text units that have been matched unambiguously with a prior translation and that require no translator input.
    • Leveraged matches – An accumulation of the word and character count for text units that have been matched against a leveraged translation memory database.
    • Repetition matches – An accumulation of the word count for repeating text units that have not been matched in any other form. Repetition matching is deemed to take precedence over fuzzy matching.
    • Fuzzy matches – An accumulation of the word and character count for text units that have been fuzzy matched against a leveraged translation memory database.
    • Alphanumeric-only text units – An accumulation of the word and character counts for text units that have been identified as containing only alphanumeric words.
    • Numeric-only text units – An accumulation of the word and character counts for text units that have been identified as containing only numeric words.
    • Punctuation characters – An accumulation of the punctuation characters.
    • White Spaces – An accumulation of white space characters.
    • Measurement-only – An accumulation of the word and character count from measurement-only text units.
    • Other Non-translatable words – An accumulation of other non-translatable word and character counts.
    • Automatically treatable text – A count of automatically treatable inline elements, such as date, time, measurements, or simple and complex numeric values.
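Accumulating counts by category might be sketched as below. The category labels, the whitespace word split, and the way a repetition is detected (an identical text unit seen earlier in the same job) are simplifying assumptions for illustration; the one rule taken from the list above is that repetition takes precedence over fuzzy matching.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class TextUnit:
    text: str
    category: str   # illustrative labels: "exact", "fuzzy", "no-match"

def tally(units):
    """Accumulate word and character counts per category."""
    totals = defaultdict(lambda: {"words": 0, "characters": 0})
    seen = set()
    for unit in units:
        category = unit.category
        # A repeated unit with no exact match is counted as a repetition,
        # even if it would otherwise be a fuzzy match.
        if unit.text in seen and category in ("fuzzy", "no-match"):
            category = "repetition"
        seen.add(unit.text)
        totals[category]["words"] += len(unit.text.split())
        totals[category]["characters"] += len(unit.text.replace(" ", ""))
    return dict(totals)

units = [
    TextUnit("Open the file.", "exact"),
    TextUnit("Save the file.", "fuzzy"),
    TextUnit("Save the file.", "fuzzy"),       # second occurrence
    TextUnit("Close all windows.", "no-match"),
]
result = tally(units)
for name, counts in result.items():
    print(name, counts)
```

Each category can then be priced at its own rate, which is what makes the breakdown useful when sizing a job.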

    VERIFIABILITY

    Any measurement standard must have a reference implementation, as well as an authoritative body that tests and validates the measuring instruments. In the US, this is provided by the National Institute of Standards and Technology. In order to be successful, GMX-V must provide for a certification authority that will (1) maintain reference documents with known metrics and (2) provide an online facility to test given XLIFF documents. In this way, both customers and suppliers can be confident that GMX-V provides an unambiguous and reliable way of quantifying a localization or global-information-management task.
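In practice, checking a tool against a reference document comes down to comparing its reported figures with the known ones. The sketch below assumes a hypothetical reference document with invented metrics; the function name and metric names are illustrative only.

```python
def verify(reference: dict, reported: dict) -> list:
    """Return a description of every metric that differs from the reference."""
    problems = []
    for name, expected in reference.items():
        actual = reported.get(name)
        if actual != expected:
            problems.append(f"{name}: expected {expected}, got {actual}")
    return problems

# Known metrics for a hypothetical reference document.
reference = {"words": 1200, "characters": 6900}

print(verify(reference, {"words": 1200, "characters": 6900}))
print(verify(reference, {"words": 1200, "characters": 6850}))
```

A tool passes only when the list of problems is empty for every reference document it is tested against.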

    NON-VERIFIABLE METRICS AND EXCHANGE NOTATION

    There are many metrics that cannot be verified electronically, such as the number of screen shots or pages. GMX-V allows for the annotation and exchange of all relevant metrics for a given localization task.

    SUMMARY

    GMX-V has been widely peer reviewed and published for open public comment for eighteen months. Much valuable feedback has been submitted and incorporated into the standard. All major localization tool providers have been consulted, to ensure that there are no obstacles to implementing it. GMX-V also provides a specification that can be used by word processing tool vendors and localization tool suppliers. It provides a consistent and unambiguous common standard for word and character counts.

    Further details of GMX-V are available at the following URL: www.lisa.org/standards/gmx


    ClientSide News Magazine - www.clientsidenews.com








