Translating HTML files
How to translate HTML files correctly.
Translating Web Sites
Today, being able to translate HTML is crucial, for obvious reasons, and just about every translator will accept HTML files. Yet, although it's not politically correct to mention this here, the truth is that many translators don't know enough about HTML and websites to do a professional job.
There are LOTS of good HTML tutorials around, but they are all intended for wannabe webmasters or even professional webmasters, and they skip important issues a translator should be aware of. I hope this fills the gap and helps you do a better job.
If you are already familiar with HTML, keyword handling and style sheets, go straight to How to translate HTML for more on preparing an HTML file for translation and doing the translation itself.
What is HTML and how does it work?
HTML stands for HyperText Markup Language. Hypertext is text characterized by the presence of links. Take a book. You read from the beginning and move toward the end. With hypertext, you can jump immediately to the information you are looking for by clicking on links.
An HTML file is a simple text file with an htm or html extension. Try the following experiment: take a simple text file, "whatever.txt", and rename it to whatever.htm. Double-click on it and it will display in your default web browser. Now, you will note that there are no links. There is no bold, no underlining, no tables, no pictures and not even paragraph marks.
HTML is the "language" that you use to tell the browser (Internet Explorer, Netscape, Mozilla, Opera...) how the page should be displayed and what it should do in different situations (the user clicks on a link, the browser finds the page and displays it, for instance). To do that, it uses markup. A markup - or tag - is a small piece of code that provides this information. In HTML, tags are made of a < sign, some code and a > sign. Case is not important.
For instance, <b> tells the browser that whatever information follows that tag should be displayed in bold. Now, unless you want everything to be displayed in bold, there must be another tag to tell the browser where to stop displaying the text in bold. That tag is </b>. Note the / sign. The tag triggering the bold display (<b>) is called an opening tag. The tag canceling the action of the opening tag (</b>) is called a closing tag. There are tags for just about every formatting option: italics, underline, color, size... You will find them very easily on the net.
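As a quick illustration (the sentence itself is made up), here is how an opening and closing tag pair wraps the words it affects:

```html
<!-- Only "blue widgets" is displayed in bold; the rest stays normal. -->
We sell <b>blue widgets</b> all over the world.
```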
There are other types of tags in an HTML document. For instance, there are tags detailing the structure of the page and its general behavior. An HTML page usually looks as follows:
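A minimal sketch of that usual structure follows. The title, description and text are placeholders, not taken from a real page:

```html
<html>
<head>
  <!-- Header: information about the page, not displayed in the page body -->
  <title>Blue widgets</title>
  <meta name="description" content="A page about blue widgets.">
</head>
<body>
  <!-- Body: everything the visitor actually sees -->
  <h1>Blue widgets</h1>
  <p>All about blue widgets.</p>
</body>
</html>
```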
You need not change the structure tags when you translate.
Another type of tag is the Meta tag. Meta tags are located in the header and give information on the page that is used mostly by search engines: keywords, description of the page, author, copyrights... You will need to translate the contents of some of these tags. Bearing in mind that these tags are mostly intended for search engines, you have to translate the keywords and description using words that people will actually use to find the web site. It's not a matter of just translating them.
You have to think a little bit about which terms are applicable to the page and will be the most popular. You are likely to find misspellings in the Meta tags. They are there on purpose, so that people who misspell their search terms in the search engine find the page anyway. If so, misspell too. Google listed the misspellings it found for Britney Spears. There are hundreds, and they have been searched for by thousands of people, so misspellings of popular searches could amount to significant traffic.
If you find well-thought-out descriptions and several typos in the Meta tags, be extra careful, for this is evidence that your customer has attempted some search engine optimization, and perhaps paid a lot of money to do so. Don't ruin it.
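To make this concrete, here is a sketch of what keyword and description Meta tags may look like. The site and its keywords are invented for illustration, and "wigdets" stands in for a deliberate misspelling:

```html
<!-- Only the values of the content attributes are candidates for translation -->
<meta name="keywords" content="blue widgets, blue widget, blue wigdets, cheap widgets">
<meta name="description" content="Hand-made blue widgets shipped worldwide.">
<!-- Usually left as it is: the author's name -->
<meta name="author" content="Whatever.inc">
```

The tag names and the name attributes stay untouched; only the text inside content="..." is rewritten with the search terms people would use in the target language.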
There is one other important item in the Meta tags: the charset. It tells the browser which character set is used in the page. If you translate from a language whose character encoding is different from that of your target language, you may have to change the encoding for the page to display properly. Here is what that Meta tag looks like:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
The TITLE tag, <title>, sits in the header, and its content shows in the title bar of the web browser when you display the page. THIS is the single most important piece of text in your web page. Why? Because search engines value it above everything else when they analyze the page. "Welcome to Whatever.inc" is probably the most stupid title you can come up with. A title should contain the keywords that will be used to find the page. If the page talks about blue widgets, the title should have "blue widgets" in it! Now, of course, you are translating. That means you have to follow the original web page, and if the original title is "Welcome to Whatever.inc", then keep it. But if you can see that the author has put some thought into the title to include keywords in a specific sequence, give it some thought yourself.
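For comparison, here are two hypothetical titles for the same page about blue widgets. The first says nothing about the content; the second carries the keywords:

```html
<!-- Weak: no keywords at all -->
<title>Welcome to Whatever.inc</title>

<!-- Better: the main keyword comes first -->
<title>Blue widgets - hand-made blue widgets by Whatever.inc</title>
```

If the source title looks like the second one, the order and choice of words are probably deliberate and worth preserving in the translation.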
Links. In HTML, a link looks like this:
<a href="http://www.website.com" title="Good web site">Web Site</a>
a stands for Anchor, and href tells the browser where the link points (here, http://www.website.com). The title attribute gives a title for the link, so that when you pass the mouse over the link, a small note will display ("Good web site", in this example). You have to translate it. "Web Site" is the text of the link. You may or may not have to translate it. </a> is the closing tag.
Images. Although you see images in web pages, they are not really inside the HTML document. It's a simple text file, right? In fact, you have a tag that tells the web browser where the picture is stored and how to display it (what size, with or without a border, where on the screen...). The image tag is <img src="http://www.website.com/image.jpg" alt="Picture of a blue widget">. It has no closing tag. You should not change the image tag except for the content of the "alt" attribute. Alt stands for alternate text.
In the early days of the Internet, many browsers were not able to display pictures, or it was too slow, so many users disabled the pictures to surf faster. To enable those users to understand what picture should be there, the alt text is displayed instead. Even if the image is displayed, in some browsers the alt text shows when you move the mouse over the image. You have to translate it.
The "alt" and the "title" are usually loaded with keywords for the search engines. If this is the case, make sure that the translation is loaded with keywords the same way.
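Here is a sketch of the link and image examples above before and after translation, assuming French as the target language (the French wording is only illustrative). Only the link text and the title and alt values change:

```html
<!-- Source -->
<a href="http://www.website.com" title="Good web site">Web Site</a>
<img src="http://www.website.com/image.jpg" alt="Picture of a blue widget">

<!-- Translation: same tags, same addresses, translated text -->
<a href="http://www.website.com" title="Bon site web">Site web</a>
<img src="http://www.website.com/image.jpg" alt="Photo d'un widget bleu">
```

The href and src values stay as they are unless the client supplies localized addresses or images.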
HTML has evolved a lot from the first version. Nowadays, a web designer can decide exactly the size of the text, create styles (a concept similar to styles in a word processor; more on that later), set the position and so on. But in the early days, HTML was much more frugal.
The web was used for text. You had a series of tags to identify the document's hierarchy, called the heading tags: <h1>, <h2>, <h3> and their closing tags, </h1>, </h2>, </h3>. H1 is the main heading. It's big, bold, often too big, in fact. H2 is a secondary heading, slightly smaller. H3 is smaller again... You get the idea.
Although there are much better ways in current HTML to arrange the display, the H tags have remained and are used by search engines when they analyze a page, the rationale being that if a word is in a heading, it is more relevant to the page content. This is the main reason why many web sites still use those tags even if that means a little bit more work. As a translator, these tags tell you that you are translating a heading, and its position in the document's hierarchy.
They are also a warning: you have to be aware that the words inside these tags are... Exactly. Keywords. Usually, you will see the same keywords used in the H tags and in the keywords Meta tag. Make sure that you use the same keywords. Search engines analyze, amongst other things, the number of times a specific keyword appears compared to the total number of words in the page, and where. Try to keep the same proportion as the original document, and if a keyword is in a heading, make sure your translation leaves a keyword in that same heading.
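A sketch of how the same keywords may show up in the keywords Meta tag and in the headings (the page content is invented for illustration):

```html
<meta name="keywords" content="blue widgets, widget repair">

<!-- Each heading repeats one of the keywords declared above -->
<h1>Blue widgets</h1>
<p>Our blue widgets are made by hand.</p>
<h2>Widget repair</h2>
<p>We also repair widgets of any color.</p>
```

Whatever terms you pick for "blue widgets" and "widget repair" in the Meta tag should be the ones that appear in the translated headings.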
For the same reason, HTML contains a number of redundant tags, like <b> and <strong>, or old ones that you almost don't see anymore, like <big> (self-explanatory, I think). Look for these. It is too easy to concentrate on the standard <b>, <i>... and forget to handle those old things. You may need to move them, too.
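As an illustration (the sentences are made up), the pairs below display the same way by default in most browsers, so a page can mix them freely:

```html
<!-- <b> and <strong> both show as bold; <i> and <em> both show as italics -->
Our <b>blue widgets</b> are <strong>hand-made</strong>.
They are <i>small</i> but <em>sturdy</em>.
<!-- An old tag you may still run into -->
<big>Special offer</big>
```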
Next, styles and style sheets. A style is a series of attributes defined in advance, either in the header of the document, or in a separate file called a style sheet.
To understand styles, you need to understand what problems they resolve:
Suppose you want the big titles in your web site to be bold, italic, blue, and centered. In good old HTML, you would write:
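Something along these lines, with a placeholder title (a sketch of old-style markup, which would have to be repeated for every single title in the site):

```html
<center><font color="blue"><b><i>Big Title</i></b></font></center>
```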
Pretty clumsy, isn't it? And that's just 4 simple attributes. The solution is to define a style with all these specifications: it's bold, it's blue, it's centered, and you give it a name, e.g. bbc (for Bold Blue Centered. Just an example. It's normally named so that one remembers easily what it is). Then, you don't need to write it every time. In the header of the page, you write:
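A sketch of such a definition, following the bbc name (bold, blue, centered):

```html
<style type="text/css">
  /* One named style holding all the attributes */
  .bbc { font-weight: bold; color: blue; text-align: center; }
</style>
```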
Then, any time you have a title, you write:
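For instance, assuming a style named bbc as described above and a placeholder title:

```html
<h1 class="bbc">Big Title</h1>
```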
But the best part is that if, after all is done, you decide that it would be nicer in red, or that italics would be cool, you don't have to look all over the document and change all the tags, each time. You simply change one word in the style definition and every instance changes at once. This not only saves a lot of time when you design the page, but also makes the page smaller, and thus faster to load.
Now, if you want to use a style in several pages, or even the whole site, you have to copy the same styles in the header of each page. Not too smart. The solution was to write all the styles in a separate file, called a style sheet, then to link each page to the style sheet. That way, you write the styles only one time, and in each page, you have a link in the header that looks like this:
<link href="/stylesheet.css" rel="stylesheet" type="text/css">
A style sheet file's extension is *.css. Now, as a translator, this is relatively important to know because it determines how the text will be displayed and where. The same page can look completely different with and without the style sheet. With experience, you can look at the source code and see the page (No, this ain't the Matrix yet ;-). That helps a lot, because you don't need to check out the page in the browser every few minutes.
Anyway, this should cover the basic HTML you need to translate. When you get a bit more time, pick one of the many HTML tutorials on the Web and learn about tables and frames.
How to translate HTML
There are two reliable, proven methods and many wrong methods. Amongst the wrong methods, the most popular are:
The correct methods
Preparing the text for translation:
What do I mean by "Preparing the text for translation"? For translation purposes, there are 2 types of tags:
Overall, there are very few tags that you may need to delete during the translation process.
"Preparing files" means modifying the files so that they can be translated easily using a CAT. What follows is a description of a file prepared for Wordfast/Trados, a tagged file, in translator lingo. Since Trados is/was widely used, most professional CATs can handle this type of file, with more or less success. However, if you own and use another CAT (SDLX, DV...), please check your CAT's documentation. As you will use a CAT to work on the tagged file, I assume that you are familiar with the basic concepts. (If not, please read the following pages of this web site before going further: What are CATs? and First translation)
A tagged file is an RTF file containing the source code (meaning, tags + text) of the original HTML file. The tags are identified using 2 styles: tw4winInternal and tw4winExternal. Without getting into details, the tw4winInternal style is red, and the tw4winExternal is light grey. Whenever you receive a file with tags in red and grey, it's almost a given that the file has been tagged. Although the handling is very similar, beware that HTML files are not the only tagged files, and many more exotic formats are tagged for use with CATs, like SGML, XML, QuarkXpress, FrameMaker, etc.
All tags are protected against deletion by default, to avoid you deleting one by mistake. Tags that you may need to move, like <b> (bold), are in tw4winInternal. Internal because they will be included in the segment you have to translate. They are in red. Tags that you don't need to change or to be concerned about during the translation process (like <p> (paragraph mark), <body>...) are in tw4winExternal and are in grey. A tag in tw4winExternal style will end a segment automatically.
Here is an example:
Correct: You are learning to translate <b>Web Sites</b></p>Bla bla bla
(<b> and </b> are in tw4winInternal, </p> is in tw4winExternal.)
By now, you should know that "Web Sites" is in bold, and that the </p> marks the end of a paragraph. When you open that sentence with Wordfast (or Trados), the segment will end just after the </b>, although there is no period, because </p> is in tw4winExternal style.
Incorrect: You are learning to translate <b>Web Sites</b></p>Bla bla bla
(<b> is in tw4winExternal: the segment would stop right after "translate".)
Incorrect: You are learning to translate <b>Web Sites</b></p>Bla bla bla
(</p> is not in tw4winExternal: the segment would include everything.)
Incorrect: You are learning to translate <b>Web Sites</b></p>bla bla bla
(The tags have no tw4win style at all: the segment would include everything and the tags are not protected.)
2. Tagging an HTML file
If you open the source code of virtually any HTML file, you will see there are a LOT of tags. So changing the styles manually is just not workable. You need to use another program to tag (prepare) the file. It's rather easy to do for HTML, and other relatively common formats like XML and SGML. My personal preference goes to a program called Rainbow (freeware). There are other possibilities like +Tools (also freeware).
The process is rather simple and well explained in the documentation of both programs, so I won't overdo it. In Rainbow (once installed), you click on "Add", select the HTML files you need to prepare, go to the "Tools" menu, select "Prepare for translation", fill out the needed options, and under the "Package" tab, you select where the tagged files should be created.
Some stuff may look complex, but frankly it's a no-brainer when all you have to do is prepare an HTML file.
Find your files, open the rtf file in Word, and you are ready to translate.
3. Translating a tagged file.
This depends on your CAT. In Wordfast, start the translation as usual, with your TM and glossaries, the lock bolt on the door, gaffer tape across the neighbor's kid's mouth, Mozart playing (or AC/DC, your call)... whatever your set-up usually is when you translate. ;-)
Tags in tw4winInternal are treated as placeables. You can select them in the source segment using Ctrl + Alt + Left/Right, and Ctrl + Alt + Down will copy the selected tag into the target segment, at the insertion point. Type your translation in the target and bring down the tags at the appropriate points in the target sentence.
Use the tags to know what the text will look like and do not hesitate to refer to the original HTML file when in doubt. As explained before, keep keywords in mind and balance the text to match the original's proportions as closely as possible. (Of course, if the page is not meant for the general public but for an Intranet, that becomes much less important.)
Please refer to the tagged files section of your Wordfast manual. In summary, you have to make sure that you do not forget tags (Wordfast has settings to remind you), that you keep the internal tags in the tw4winInternal style and the translatable text in whatever style was originally used.
You are translating an <b>HTML</b>
4. Done. Now what?
When your translation is done and the file cleaned (meaning all source segments and segment delimiters have been deleted), you have a nice RTF file. If neither the source nor the target language requires Unicode and you have no special characters in the file, save it as txt (or copy all the code into Notepad) and change the extension to *.htm or *.html. If you use a language that requires Unicode (Chinese, Japanese, Russian, Thai...), save the file with the appropriate encoding and modify the charset information in the file header to reflect the new encoding (e.g. UTF-8). See the HTML links to find out more about encodings and file formats.
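As a sketch, here is the charset Meta tag shown earlier, before and after a translation saved as UTF-8 (any other encoding name would go in the same place):

```html
<!-- Original page -->
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<!-- Translated page, saved as UTF-8 -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
```

The declared charset has to match the encoding the file was actually saved in; changing only one of the two is what produces garbled characters in the browser.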
If you have respected the tags, the file should look about right in the browser. However, the translation is seldom the same length as the original text, so you may have to make a few adjustments to make it fit nicely. If you are lucky, everything can stay the same.
You are through. I hope this information will help you tackle HTML files in a professional manner and feel confident with them. As you can see, there is nothing really hard in HTML files, but they do require some extra attention. If it's HTML, it's not just text.
At times the client wants you to translate the text with no consideration for the HTML or a potential use on the net. That's all right. If so, skip everything and ask the client to provide a regular *.doc file, or open the HTML in Word and save it as *.doc.
Good luck. ;-)
© Sylvain Galibert.