Translating HTML files
By
Sylvain Galibert,
Ampur Muang Chiang Mai,
THAILAND,
English to French translation
Translator and owner of
www.your-translations.com
How to translate HTML files correctly: how HTML works, basic tags, style sheets, the issues a
translator should be aware of, how to prepare (tag) an HTML file for translation, and what to
watch for when translating a website.
Translating Web Sites
Today, being able to
translate HTML is crucial, for obvious reasons, and just about every translator will accept
HTML files. Yet, although it's not politically correct to mention this here, the truth is that
many translators don't know enough about HTML and websites to do a professional job.
There are LOTS of good
HTML tutorials around, but they are all intended for would-be or even
professional webmasters, and skip important issues a translator should be aware of. I hope
this article fills the gap and helps you do a better job.
If you are already
familiar with HTML, keyword handling and style sheets, go straight to "How to
translate HTML" for more on preparing an HTML file for translation and doing the
translation itself.
HTML issues
(Basic and not so basic)
What is HTML and how
does it work? HTML stands for HyperText Markup
Language. Hypertext is text characterized by the presence of links. Take
a book: you read from the beginning and move toward the end. With hypertext, you can
immediately access the information you are looking for by clicking on links.
An HTML file is
a simple text file with an htm or html extension. Try the
following experiment: take a simple text file, "whatever.txt", and rename it to
"whatever.htm". Double-click on it and it will display in your default web
browser. Now, you will note that there are no links. There is no bold, no underlining, no
tables, no pictures and not even paragraph breaks.
HTML is the
"language" that you use to tell the browser (Internet Explorer, Netscape,
Mozilla, Opera...) how the page should be displayed and what it should do in different
situations (the user clicks on a link, the browser finds the page and displays it, for
instance). To do that, it uses markups. A markup, or tag, is a small piece
of code that provides this information. In HTML, tags are made of a < sign,
some code and a > sign. Case is not important.
For instance,
<b> tells the browser that whatever information follows that tag should
be displayed in bold. Now, unless you want everything to be displayed in bold, there must
be another tag to tell the browser where it should stop displaying the text in bold. That
tag is </b>. Note the / sign. The tag triggering the bold
display (<b>) is called an opening tag. The tag canceling the
action of the opening tag (</b>) is called a closing tag. There are
tags for just about every formatting option: italics, underline, color, size...
You will find lists of them very easily on the net.
There are other
types of tags in an HTML document. For instance, there are tags detailing the structure
of the page and its general behavior. An HTML page is usually structured as follows:
<HTML> (To tell the browser that this page is in HTML)
<HEAD> (Header. Contains information about the page that will not be displayed, but
can nevertheless influence the display.)
</HEAD> (Closes the <head> tag. Most tags should be opened and
closed.)
<BODY> (The actual page. This is what you see when you open the page in the browser)
</BODY> (Closing tag for <body>)
</HTML> (Closing tag for <html>)
You need not change the
structure tags when you translate.
Another type of tag is
the Meta tag. These are located in the header and give information on the
page, used mostly by search engines, like keywords, description of the page, author and
copyrights...
You will need to translate the contents of some of these tags. Bearing in
mind that these tags are mostly intended for search engines, you have to translate the keywords
and description using words that people will use to find the web site.
It's not a matter of just translating them word for word.
You have to think a
little bit about which terms are applicable to the page and will be the most popular. You
are likely to find misspellings in the Meta tags. They are there on purpose, so that
people who misspell their search terms in the search engine find the page anyway. If so,
misspell too. Google listed the misspellings it found for "Britney Spears".
There are hundreds, and they have been searched for by thousands of people, so misspellings
of popular searches can amount to significant traffic.
If you find well-thought-out
descriptions and several typos in the Meta tags, be extra careful, for
this is evidence that your customer has attempted some search engine optimization, and
perhaps paid a lot of money to do so. Don't ruin it.
There is one other
important item in the Meta tags: the charset. It tells the browser which
character set is used in the page. If you translate from a language whose character
encoding differs from that of your target language, you may have to change the encoding for the
page to display properly. Here is what that Meta tag looks like:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
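If you want to check quickly which charset a page declares before you start working on it, a few lines of Python can do it. This is only a minimal sketch: the function name declared_charset and the regular expression are my own assumptions for illustration, and the pattern is a simplification, not a real HTML parser.

```python
import re

# Matches the http-equiv form shown above as well as the shorter
# <meta charset="..."> form. A simplification, not a full HTML parser.
CHARSET_RE = re.compile(r"""<meta[^>]*?charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)

def declared_charset(html_source):
    """Return the charset named in the page's Meta tag, or None if there is none."""
    match = CHARSET_RE.search(html_source)
    return match.group(1).lower() if match else None

sample = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
print(declared_charset(sample))
```

If the function returns None, the page declares no charset at all, which is worth mentioning to the customer before delivering a translation in a different encoding.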
The TITLE tag
(<title>, in the header; it shows in the title bar of the web browser when you display the page).
THIS is the single most important piece of text in your web page.
Why? Because search engines value it above everything else when they analyze the page.
"Welcome to Whatever.inc" is probably the most stupid title you can come up
with. A title should contain the keywords that will be used to find the page. If the page
talks about blue widgets, the title should have "blue widget" in it! Now, of
course, you are translating. That means you have to follow the original web page, and if
the original title is "Welcome to Whatever.inc", then keep it, but if you can see that
the author has put some thought into the title to include keywords in a specific sequence,
give it some thought yourself.
Links.
In HTML, a link looks like this:
<a href="http://www.website.com" title="Good web site">Web Site</a>
"a" stands for
"anchor", and "href" tells the browser where that anchor
points (here, http://www.website.com). "title" gives a title for
the link, so that when you pass the mouse over the link, a small note will display,
"Good web site" in this example. You have to translate it. "Web Site"
is the text of the link. You may or may not have to translate it. </a>
is the closing tag.
Images.
Although you see images in web pages, they are not really inside the HTML document.
It's a simple text file, right? In fact, you have a tag that tells the web browser
where the picture is stored and how to display it (what size, with or without a border,
where on the screen...). The image tag is
<img src="http://www.website.com/image.jpg" alt="Picture of a blue widget">.
It has no closing tag. You should not change the image tag except for
the content of the "alt" attribute. "Alt" stands for "alternate
text".
In the early days of
the Internet, many browsers were not able to display pictures, or it was too slow, so many
users disabled the pictures to surf faster. To enable those users to understand what
picture should be there, the alt text is displayed instead. Even if the image is
displayed, some browsers show the alt text when you move the mouse over the image. You have to
translate it.
The "alt" and
the "title" texts are usually loaded with keywords for the search engines. If this is
the case, make sure that the translation is too.
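Title, Meta, alt and title texts are easy to overlook because most of them never show on the page itself. A short Python script can list them so that none is forgotten. This is a sketch only: the class name HiddenTextCollector and the sample page are made up for illustration, while html.parser is part of Python's standard library.

```python
from html.parser import HTMLParser

class HiddenTextCollector(HTMLParser):
    """Collects text that needs translating but is easy to overlook."""

    def __init__(self):
        super().__init__()
        self.found = []          # list of (where, text) pairs
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        meta_name = (attrs.get("name") or "").lower()
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and meta_name in ("keywords", "description"):
            self.found.append(("meta " + meta_name, attrs.get("content") or ""))
        # alt text on images, title text on links or on any other tag
        for name in ("alt", "title"):
            if attrs.get(name):
                self.found.append((tag + " " + name, attrs[name]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.found.append(("title", data.strip()))

page = """<html><head><title>Blue widgets</title>
<meta name="keywords" content="blue widget, widgets">
</head><body>
<a href="http://www.website.com" title="Good web site">Web Site</a>
<img src="http://www.website.com/image.jpg" alt="Picture of a blue widget">
</body></html>"""

collector = HiddenTextCollector()
collector.feed(page)
collector.close()
for where, text in collector.found:
    print(where + ": " + text)
```

The list it prints is a checklist, nothing more. You still translate these texts in the file itself, with the keywords in mind.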
HTML has evolved a lot
since the first version. Nowadays, a web designer can decide exactly the size of the text,
create styles (a concept similar to styles in a word processor; more on that later),
set the position and so on. But in the early days, HTML was much more frugal.
The web was used for
text. You had a series of tags to identify the document's hierarchy, called the
heading tags: <h1>, <h2>, <h3>...
and their closing tags,
</h1>, </h2>, </h3>. H1 is the main heading. It's big, bold, often too
big, in fact. H2 is a secondary heading, slightly smaller. H3 is smaller again... You get
the idea.
Although there are much
better ways in current HTML to arrange the display, the H tags have remained and are used
by search engines when they analyze a page, the rationale being that if a word is in a
heading, it is more relevant to the page content. This is the main reason why many web
sites still use those tags even if that means a little bit more work. As a translator,
these tags tell you that you are translating a heading, and its position in the document's
hierarchy.
They are also a warning
that the words inside these tags are, exactly, keywords. Usually,
you will see the same keywords used in the H tags and in the keywords Meta
tag. Make sure that you use the same keywords in both. Search engines analyze, amongst other
things, the number of times a specific keyword appears compared to the total number of
words in the page, and where. Try to keep the same proportion as the original document,
and if a keyword is in a heading, make sure your translation leaves a keyword in that same
heading.
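To compare the keyword proportion of the original and of your translation, you can count rather than guess. Here is a rough Python sketch. The function name keyword_density and the sample sentences are invented for the example, it expects plain text with the tags already removed, and search engines certainly use more elaborate measures than this one.

```python
import re

def keyword_density(text, keyword):
    """Share of the words in `text` taken up by occurrences of `keyword`."""
    words = re.findall(r"\w+", text.lower())
    key_words = re.findall(r"\w+", keyword.lower())
    if not words or not key_words:
        return 0.0
    n = len(key_words)
    # Count every position where the keyword's words appear in sequence.
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == key_words)
    return hits * n / len(words)

source = "Blue widgets for sale. Our blue widgets are the best widgets."
target = "Widgets bleus à vendre. Nos widgets bleus sont les meilleurs widgets."

source_density = keyword_density(source, "blue widgets")
target_density = keyword_density(target, "widgets bleus")
print(round(source_density, 2), round(target_density, 2))
```

If the two figures are far apart, the translation has probably diluted or overloaded the keyword compared with the original.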
For the same reason,
HTML contains a number of redundant tags, like <b> and <strong>, or old ones
that you almost don't see anymore, like <big> (self-explanatory, I
think). Look for these. It is too easy to concentrate on the standard <b>,
<i>... and forget to handle those old things. You may need to move them, too.
Next, styles and
style sheets. A style is a series of attributes defined in advance,
either in the header of the document, or in a separate file
called a style sheet.
To understand styles,
you need to understand what problems they solve:
Suppose you want the big
titles in your web site to be bold, blue, and centered. In good old HTML, you
would write:
<h1><b><center><font color=blue>Title 1</font></center></b></h1>
<h1><b><center><font color=blue>Title 2</font></center></b></h1>
<h1><b><center><font color=blue>Title 3</font></center></b></h1>
<h1><b><center><font color=blue>Title 4</font></center></b></h1>
...
<h1><b><center><font color=blue>Title 356</font></center></b></h1>
Pretty clumsy,
isn't it? And that's just a few simple attributes. The solution is to define a style with
all these specifications (it's bold, it's blue, it's centered) and give it a
name, e.g. bbc (for Bold Blue Centered. Just an example. It's normally named so that
one easily remembers what it is). Then, you don't need to write it every time. In the
header of the page, you write:
<style type="text/css">
<!--
.bbc {
text-align: center;
font-weight: bold;
color: blue;
}
-->
</style>
Then, anytime you have a
title, you write
<h1 class="bbc">Title 1</h1>
<h1 class="bbc">Title 2</h1>
<h1 class="bbc">Title 3</h1>
But the best part is that if,
after all is done, you decide that it would be nicer in red, or that italics would be
cool, you don't have to look all over the document and change all the tags, each
time. You simply change one word in the style definition and every instance changes at once.
This not only saves a lot of time when you design the page, but also makes the page
smaller, and thus faster to load.
Now, if you want to use
a style in several pages, or even the whole site, you have to copy the same styles in the
header of each page. Not too smart. The solution was to write all the styles in a separate
file, called a style sheet, then to link each page to the style sheet.
That way, you write the styles only one time, and in each page, you have a link in the
header that looks like this:
<link href="/stylesheet.css" rel="stylesheet" type="text/css">
A style sheet
file's extension is *.css. Now, as a translator, this is relatively
important to know because it determines how the text will be displayed and where. The same
page can look completely different with and without the style sheet. With experience, you
can look at the source code and see the page (no, this ain't the Matrix
yet ;-). That helps a lot, because you don't need to check out the page in the
browser every few minutes.
Anyway, this should
cover the basic HTML you need to translate. When you get a bit more time, pick one of the
many HTML tutorials on the Web and learn about tables and frames.
How to translate HTML
There are two reliable,
proven methods and many wrong methods. Amongst the wrong methods, the
most popular are:
Opening the HTML file in Word, working there and saving it as a web page.
This changes the code and turns it into a complete mess that is twice the size of the
original page, causes endless display issues and is about as popular with search engines as a
dead cat at a wedding. If you want to hear a knowledgeable customer scream, go ahead.
Translating in other WYSIWYG (What You See Is What You Get) editors. They usually mess up
the code as well, although I don't know of any as bad as Word in that respect,
save perhaps FrontPage. Dreamweaver is an exception to that rule, but a costly one if you
are simply translating.
Using translation software that hides the tags. That can be very attractive for
beginners, but if you understood the section above properly, you will see why this is not
a good solution at all. An example of such software is Catscraddle. That software is very
smooth but will cause problems because you don't know what is what, and the sentences are
cut midway if the page uses formatting. If it did the job correctly, I would be the first
to use it because I love the interface and it's very fast. Unfortunately, the basic
concept is VERY flawed and if you want to do a professional job, just don't.
The correct methods include:
Opening the page in an HTML editor, preferably one that supports color coding of the
tags. There are many free ones. I like AceHTML very much, but that's far from the only one
available. Either way, translate the text and move the tags as needed. For example:
English: John's <i>girlfriend</i> is quite cute.
French: La <i>petite amie</i> de John est plutôt mignonne.
As you can see, you have to decide where the tags should be in the target language.
Working that way can be a pain, but if you know your code and are careful, the output will
be irreproachable. However, you must stay very alert not to forget or erase tags by
mistake.
Preparing the file, then using a CAT tool like Wordfast or Trados to translate it, then
restoring the HTML format. Not all CAT tools work the same way, but remember that professional
handling of web site translation *requires* quick access to the tags. The ability to
move, edit or delete tags is not optional, it's a must. With Trados, you can also use
TagEditor, although you may miss the flexibility that comes with working in Word.
Moving or deleting tags can be quite clumsy in TagEditor.
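Whichever method you use, forgotten or erased tags are the classic accident, and it is one that can be checked mechanically. Below is a minimal Python sketch, not a feature of any editor or CAT tool: the function name tag_differences is my own, and it compares tag names only, so translated attributes such as alt or title do not trigger false alarms.

```python
import re
from collections import Counter

# Captures the tag name with its optional leading slash, e.g. "i" or "/i".
TAG_RE = re.compile(r"<(/?[A-Za-z][A-Za-z0-9]*)")

def tag_differences(source, target):
    """Return (missing, extra): tags lost from, or added to, the translation."""
    src = Counter("<" + name.lower() + ">" for name in TAG_RE.findall(source))
    tgt = Counter("<" + name.lower() + ">" for name in TAG_RE.findall(target))
    return sorted((src - tgt).elements()), sorted((tgt - src).elements())

english = "John's <i>girlfriend</i> is quite cute."
french_ok = "La <i>petite amie</i> de John est plutôt mignonne."
french_bad = "La <i>petite amie de John est plutôt mignonne."

print(tag_differences(english, french_ok))
print(tag_differences(english, french_bad))
```

The check only counts tags. It cannot tell you whether a tag sits at the right place in the target sentence, which remains your decision.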
Preparing the text for translation:
1. What are tagged files?
What do I mean by
"preparing the text for translation"? For translation purposes, there are 2
types of tags:
Tags that you may need to move or edit and that are, or could be, located in the middle
of a segment
Tags that you will almost never change and that are not (and should not be) in the middle of
a segment
Overall, there are very
few tags that you may need to delete during the translation process.
"Preparing
files" means modifying the files so that they can be translated easily using a CAT tool.
What follows is a description of a file prepared for Wordfast/Trados, a "tagged
file" in translator lingo. Since Trados is/was widely used, most professional
CAT tools can handle this type of file, with more or less success. However, if you own and use
another CAT tool (SDLX, DV...), please check its documentation. As you will use a
CAT tool to work on the tagged file, I assume that you are familiar with the basic concepts.
(If not, please read the following pages of this web site before going further: "What are CATs?" and
"First translation".)
A tagged file is an RTF
file containing the source code (meaning, tags + text) of the original HTML file. The tags
are identified using 2 styles: tw4winInternal and tw4winExternal. Without getting into
details, the tw4winInternal style is red, and the tw4winExternal is light grey. Whenever
you receive a file with tags in red and grey, it's almost a given that the file has
been tagged. Although the handling is very similar, beware that HTML files are not the
only tagged files, and many more exotic formats are tagged for use with CAT tools, like SGML,
XML, QuarkXpress, FrameMaker, etc.
All tags are protected
against deletion by default, to prevent you from deleting one by mistake. Tags that you may need
to move, like <b> (bold), are in tw4winInternal: "internal" because they
will be included in the segment you have to translate. They are in red. Tags that you
don't need to change or to be concerned about during the translation process, like <p>
(paragraph mark) or <body>, are in tw4winExternal and are in grey. A
tag in tw4winExternal style will end a segment automatically.
Here is an example:
Correct (<b> and </b> in tw4winInternal, </p> in tw4winExternal): You
are learning to translate <b>Web Sites</b></p>Bla bla bla
By now, you should know
that "Web Sites" is in bold, and that the </p> shows the end of a
paragraph. When you open that sentence with Wordfast (or Trados), the segment will end
just after the </b>, although there is no period, because </p> is in
tw4winExternal style.
Incorrect (<b> in tw4winExternal): You are
learning to translate <b>Web Sites</b></p>Bla bla bla
(The segment would stop
right after "translate".)
Incorrect (</p> not in tw4winExternal): You are
learning to translate <b>Web Sites</b></p>Bla bla bla
(The segment would
include everything.)
Incorrect (tags in neither tw4win style): You are
learning to translate <b>Web Sites</b></p>bla bla bla
(The segment would
include everything and the tags are not protected.)
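The rule that an external tag ends a segment while an internal tag stays inside it can be imitated in a few lines of Python. This is an illustration of the principle only, not how Wordfast or Trados work internally: the INTERNAL_TAGS list is my own guess at which tags would be marked tw4winInternal, and real tools also cut segments at sentence-ending punctuation, which this sketch ignores.

```python
import re

# Assumption for illustration: these inline formatting tags play the role of
# tw4winInternal; every other tag is treated as tw4winExternal.
INTERNAL_TAGS = {"b", "i", "u", "strong", "em", "font", "a", "span"}
TAG_RE = re.compile(r"<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

def split_segments(html_source):
    """Split source code into segments; an external tag ends the segment."""
    segments, start = [], 0
    for match in TAG_RE.finditer(html_source):
        if match.group(1).lower() in INTERNAL_TAGS:
            continue                      # internal tag: stays inside the segment
        piece = html_source[start:match.start()].strip()
        if piece:
            segments.append(piece)
        start = match.end()               # the external tag itself is left out
    tail = html_source[start:].strip()
    if tail:
        segments.append(tail)
    return segments

segments = split_segments("You are learning to translate <b>Web Sites</b></p>Bla bla bla")
for segment in segments:
    print(segment)
```

With the sentence from the example, the first segment keeps its <b> tags and stops at </p>, and "Bla bla bla" becomes a segment of its own.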
2. Tagging an HTML file?
If you open the source
code of virtually any HTML file, you will see there are a LOT of tags. So changing the
styles manually is just not workable. You need to use another program to tag (prepare)
the file. It's rather easy to do for HTML and other relatively common formats like
XML and SGML. My personal preference goes to a program called Rainbow (freeware). There
are other possibilities like +Tools (also
freeware).
The process is rather
simple and well explained in the documentation of both programs, so I won't belabor it. In
Rainbow (once installed), you click on "Add", select the HTML files you need to
prepare, go to the "Tools" menu, select "Prepare for translation", fill out the
needed options, and under the "Package" tab, you select where the tagged files
should be created.
Some stuff may look
complex, but frankly it's a no-brainer when all you have to do is prepare an HTML
file.
Find your files, open
the RTF file in Word, and you are ready to translate.
3. Translating a tagged file.
This depends on your
CAT tool. In Wordfast, start the translation as usual, with your TM and glossaries, the lock
bolt on the door, gaffer tape across the neighbor's kid's mouth, Mozart playing (or
AC/DC, your call)...
whatever your set-up usually is when you translate. ;-)
Tags in tw4winInternal
are considered as placeables. You can select them in the source segment using Ctrl +
Alt + Left/Right and Ctrl + Alt + Down will copy it inside the target
segment, at the insertion point. Type your translation in the target and bring down the
tags at the appropriate points in the target sentence.
Use the tags to know what
the text will look like and do not hesitate to refer to the original HTML file when in
doubt. As explained before, keep keywords in mind and balance the text to match the
original's proportions as closely as possible. (Of course, if the page is not meant
for the general public but for an intranet, that becomes much less important.)
Please refer to the
"tagged files" section of your Wordfast manual. In summary, you have to
make sure that you do not forget tags (Wordfast has settings to remind you), that you keep
the internal tags in the tw4winInternal style and the translatable text in whatever style was
originally used.
Example:
You are translating an <b>HTML</b> file!
Vous êtes en train de traduire un fichier <b>HTML</b> !
4. Done, now, what?
When your translation is
done and the file cleaned (meaning all source segments and segment delimiters have been
deleted), you have a nice
RTF file. If neither the source nor the
target language requires Unicode and you do not have special characters in the
file, save it as txt (or copy all the code into Notepad) and change the extension to
*.htm or *.html. If you use a language that requires Unicode
(Chinese, Japanese, Russian, Thai,...), save the file with the appropriate encoding and
modify the charset information in the file header to match (e.g.
UTF-8). See the HTML links to find out more about encodings and file formats.
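Saving with the right encoding and updating the charset declaration go together, so it makes sense to do both in one step. Here is a hedged Python sketch of that step: the function name save_as_utf8_html and the sample page are invented for illustration, UTF-8 is just the example target encoding, and the regular expression only handles a charset written inside a Meta tag.

```python
import re
import tempfile
from pathlib import Path

# Group 1 keeps everything up to "charset=" so only the charset name is replaced.
CHARSET_RE = re.compile(r"""(<meta[^>]*?charset\s*=\s*["']?)[\w-]+""", re.IGNORECASE)

def save_as_utf8_html(translated_source, out_path):
    """Point the charset Meta tag at UTF-8 and write the file in that encoding."""
    updated = CHARSET_RE.sub(lambda m: m.group(1) + "utf-8", translated_source, count=1)
    Path(out_path).write_text(updated, encoding="utf-8")
    return updated

sample = ('<html><head><meta http-equiv="Content-Type" '
          'content="text/html; charset=iso-8859-1"></head>'
          '<body>สวัสดี</body></html>')

# Write to a temporary folder so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "page.html"
    updated = save_as_utf8_html(sample, target)
    round_trip = target.read_text(encoding="utf-8")

print(updated)
```

If the page has no charset declaration at all, this sketch writes the file unchanged apart from the encoding, so you would still need to add the Meta tag by hand.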
If you have respected
the tags, the file should look about right in the browser. However, the translation is
seldom the same size as the original text, so you may have to make a few
adjustments to make it fit nicely. If you are lucky, everything can stay the same.
You are through. I hope
this information will help you tackle HTML files in a professional manner and feel
confident with them. As you can see, there is nothing really hard in HTML files, but they
do require some extra attention. If it's HTML, it's not just text.
At times the client
wants you to translate the text with no consideration for the HTML or a potential use on
the net. That's all right. If so, skip everything and ask the client to provide a regular
*.doc file, or open the HTML in Word and save it as *.doc.
Good luck. ;-)
Sylvain
©
Sylvain Galibert.
This article is a courtesy of www.your-translations.com, professional
English to French translation services.
Your-translations.com offers professional translation
services and translation project management.
More articles from the same author can be found
there.