[Kbabel] [Fwd: [l10n-dev] [Fwd: [Freecats-Dev] The other side: "Commercial" TM tools (cont.)]]

Gudmund Areskoug fta@algonet.se
Wed, 05 Feb 2003 15:52:45 +0100


Hi,

Stanislav Visnovsky wrote:
> On Wed, 5 Feb 2003, Gudmund Areskoug wrote:
> 
>>Hi,
>>
>>Stanislav Visnovsky wrote:
>>
>>>On Tue, 4 Feb 2003, Dwayne Bailey wrote:
>>>
>>>>Sorry if you've seen this already.  They're attempting to create a set 
>>>>of tools that would eventually allow translators using commercial 
>>>>translation products to work with our free software translation formats.
>>>>
>>>>Interesting to me is the discussion around features and needs of the tool.
>>>
>>>I'm quite puzzled about the features. I've never seen a commercial CAT 
>>>product, but the web pages for the commercial tools do not seem that great 
>>>to me.
>>
>>I've been using DejaVu from Atril as my "everyday tool" for some 
>>years now, so I can provide some input. It should rather be viewed 
>>as a toolbox than as a single tool.
>>
>>In short:
>>
>>It is project oriented and uses three distinct data sources 
>>(databases), that are/can be built up "on the fly" during translation.
>>
>>They're handled differently according to intended purpose: MDB for 
>>whole segments, TDB for terminology and shorter segments suited for 
>>assembly, and Lexicon, for temporary glossaries, project specific 
>>glossaries/resources that can be built and edited (making n-grams, 
>>sorting and filtering on e. g. frequency etc.) from all the files in 
>>a project according to your settings, and for language pairs that 
>>should override any other matches.
> 
> I see. But does that need to be that complicated? How hard is it to manage 
> all these databases? You only need to setup reasonable thresholds etc
> to identify the particular purpose of the text to be stored in a correct 
> database?

Not sure I understand the question. If you're asking whether it's 
manageable: yes, it's very doable.

If you meant "why make it that complicated": the point is of course 
that the three data sets are used in different ways and for 
different purposes, not that the data has to be stored in different 
files, which is rather academic - but perhaps easier to handle 
mentally for the user.

The approach has proven very flexible and efficient, with some 
unexpected results in how the system is actually used, thanks to its 
toolbox nature.

DV uses the (proprietary) Dewey system for subject classification. 
I've been looking around for an open source classification system, 
but so far haven't found any I thought was usable enough.

The thing about the Dewey system is its hierarchical nature, that 
allows for a priority order, so that in a project that was assigned, 
say, subject 1234.5678, perfect matches would be selected like this:

match1	1234.5678 = first choice for suggestion
match2	1234.567x = second choice for suggestion
match3	1234.56xx = third choice for suggestion

...and so on.
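
To make the priority order concrete, here is a rough Python sketch. It 
is only an illustration, not how DV implements it: the function names 
and the idea of simply counting shared leading digits are assumptions.

```python
def subject_affinity(project_subject: str, match_subject: str) -> int:
    """Count the leading digits two subject codes share (the dot is ignored)."""
    a = project_subject.replace(".", "")
    b = match_subject.replace(".", "")
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared


def rank_matches(project_subject, matches):
    """Sort (subject, translation) pairs so the closest subject comes first."""
    return sorted(
        matches,
        key=lambda m: subject_affinity(project_subject, m[0]),
        reverse=True,
    )


# Hypothetical perfect matches for the same source segment,
# stored under different subject codes.
candidates = [
    ("1234.5611", "third choice"),
    ("1234.5678", "first choice"),
    ("1234.5670", "second choice"),
]
ranked = rank_matches("1234.5678", candidates)
for subject, text in ranked:
    print(subject, text)
```

Any hierarchical code where "more shared prefix" means "closer subject" 
would work the same way, which is why an open classification could 
replace Dewey here.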

Something similar, but open source, could easily be set up for KDE, 
Gnome, almost anything GNU. As long as it stays more or less purely 
within the software domain, it doesn't have to be too hard to set 
up. I've started asking at the institute for archive and library 
science (or the like in Swedish) for some general non-proprietary 
system to use, and will go on to check with the computer 
linguistics department and a few others.

Such a system could be used for keeping track of when and where 
global terms (e. g. KDE-wide "Yes" -> "Ja", "No" -> "Nej", "File" -> 
"Arkiv", "Save as..." -> "Spara som...", "Accept" -> "Verkställ" 
etc.) should be used, and where local terms (e. g. "Accept" -> "Ja, 
starta") should be used instead.
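
A minimal sketch of such a global/local lookup, in Python. The scope 
paths ("kde/someapp/startdialog") and the table layout are made up for 
illustration; no existing KDE tool is assumed to work this way.

```python
# Project-wide terms, used unless something more specific applies.
GLOBAL_TERMS = {"Yes": "Ja", "No": "Nej", "Accept": "Verkställ"}

# Local overrides, keyed by a hypothetical hierarchical scope path.
LOCAL_TERMS = {
    "kde/someapp/startdialog": {"Accept": "Ja, starta"},
}


def lookup_term(term, scope):
    """Walk from the most specific scope up to the root, then fall back
    to the global table. Returns None if the term is unknown."""
    path = scope
    while path:
        local = LOCAL_TERMS.get(path, {})
        if term in local:
            return local[term]
        # Drop the last path component: "a/b/c" -> "a/b", "a" -> "".
        path = path.rpartition("/")[0]
    return GLOBAL_TERMS.get(term)


print(lookup_term("Accept", "kde/someapp/startdialog"))
print(lookup_term("Accept", "kde/otherapp"))
```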

This could probably be extended to prevent shortcut and menu 
conflicts and the like.

I started setting up a dummy KDE and Gnome terminology 
hierarchy, along with a shortcut/fastkey list, but "All work and no 
play"...

>>How these data sources are used is fairly highly configurable.
>>
>>One decisive thing that DV has is subject hierarchy handling for 
>>improving selection of found matches. Along with the client setting, 
>>it's used for selecting what match is most likely to be correct, and 
>>offered as first alternative.
>>
>>Another thing it has, is its assembly feature, that puts together 
>>suggested translations from the (mostly) smaller snippets in the TDB 
>>and Lexicon.
> 
> We have single words only, which is pretty unusable IMHO.

Yes, it has to be whole snippets.

>>Add a terminology and figures check feature, and you've got the 
>>overview, although lots more could be said.
> 
> What is a terminology check? We do not work with figures, so this one is 
> out of the question.

You have the program go through the file you're working on, or the 
whole project, to find any row (string pair) where a source term 
from the database occurs but its corresponding target term is 
missing from the translation. IMHO, it should be complemented by a 
reverse check, to see if the same target term has been used for 
different source terms.
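
Roughly, both checks could look like this in Python. This is a sketch 
under simplifying assumptions, not DV's algorithm: the term database is 
a plain dict, and terms are found by naive case-insensitive substring 
search (so "File" would also hit "Profile").

```python
def terminology_check(rows, termbase):
    """Flag rows where a source term occurs but its target term is missing.
    Returns a list of (row_index, source_term, expected_target_term)."""
    problems = []
    for i, (source, target) in enumerate(rows):
        for src_term, tgt_term in termbase.items():
            if (src_term.lower() in source.lower()
                    and tgt_term.lower() not in target.lower()):
                problems.append((i, src_term, tgt_term))
    return problems


def reverse_check(rows, termbase):
    """Flag rows where a target term occurs although its source term does
    not, i.e. the target term was probably used for some other source term."""
    problems = []
    for i, (source, target) in enumerate(rows):
        for src_term, tgt_term in termbase.items():
            if (tgt_term.lower() in target.lower()
                    and src_term.lower() not in source.lower()):
                problems.append((i, src_term, tgt_term))
    return problems


termbase = {"File": "Arkiv", "Save as": "Spara som"}
rows = [
    ("Open File", "Öppna arkiv"),       # consistent
    ("Save as...", "Spara under..."),   # target term missing
    ("Open Archive", "Öppna arkiv"),    # "Arkiv" reused for another source term
]
print(terminology_check(rows, termbase))
print(reverse_check(rows, termbase))
```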

The figures checking finds pairs where figures in the target string 
are different from the ones in the source string.
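
A figures check can be sketched in a few lines of Python. The regular 
expression is an assumption about what counts as a "figure"; a real 
tool would also have to cope with locale differences such as "3.5" 
versus "3,5", which this sketch would report as a mismatch.

```python
import re
from collections import Counter

# Digits, optionally grouped with "." or "," (e.g. 10, 3.5, 1,000).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def figures(text):
    """Return the figures in a string as a multiset (order is ignored)."""
    return Counter(NUMBER.findall(text))


def figures_check(rows):
    """Return indexes of (source, target) rows whose figures differ."""
    return [i for i, (source, target) in enumerate(rows)
            if figures(source) != figures(target)]


rows = [
    ("Copy 3 files", "Kopiera 3 filer"),
    ("Wait 10 seconds", "Vänta 100 sekunder"),   # typo in the target
    ("Page 2 of 5", "Sida 2 av 5"),
]
print(figures_check(rows))
```

Comparing multisets rather than lists means a translation may reorder 
the figures without being flagged.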

>>Taken by themselves, these features don't seem impressive, but put 
>>together and including the rules that can be set for how they're 
>>used and interact, they make a substantial difference in speed and - 
>>above all - translation quality and consistency.
>>
>>Unfortunately, they don't want to port to Linux until there's a 
>>demand. That app (along with business bookkeeping and tax software) 
>>is the major reason I can't switch completely to Linux any time soon :(.
> 
> I've checked out their web page and the new version seems to support PO 
> files.

Guess why :). ...but I want to dump Windows altogether, if (when) I can.

>>>Probably the key area is a large-project management features support. But 
>>>I'd like to hear more on this topic. At least, KBabel should be able to 
>>>work as a GUI client for their server.
> 
> 
> According to atril.com, it seems like KBabel is a MAHT tool 
> (Machine-Assisted Human Translation), right?

Yes. No pure machine translation, virtually no linguistic 
intelligence. It's essentially a database tool.

>>The major showstopper for professional translators in OpenSource 
>>CAT's, is lack of support for oodles of file formats, many of them 
>>proprietary monsters.
> 
>>I'm trying to remedy that now by starting a project for a general, 
>>freely customizable non-lossy two-way file filter with exchangeable 
>>filter profiles (for sharing and spreading filtering improvements) 
>>in the local LUG.  Lack of time has prevented it from taking off 
>>long ago.
>>
>>Any pointers to existing projects of that kind are welcome :).
> 
> What about po4a project? https://savannah.nongnu.org/mail/?group=po4a

Thanks, checking it!

> Thank you for information. Still, I'm missing the big picture how these 
> tools really work.

Download and try it, if you have access to a Windoze machine?

Haven't used Trados, DV's biggest competitor (M$ owns a large part 
of Trados, it's completely dependent on M$ Word...), but this is what 
the workflow can look like in DV (cutting out some to keep it short):

- You set up a project, assigning it languages (you typically work 
with one pair at a time, as a single user), the databases that 
should be used for this project, where the translatable files are 
found and where to export the finished result, subject and client 
settings etc.

- You (batch) import the translatable files into the project. You 
can either work on all files at once, or one at a time.

- You (optionally, if you like to set a terminology beforehand) 
build or import a lexicon and resolve it against the database(s).

- You (optionally) run a pretranslate, which is configurable (use 
newer matches over older, only allow perfect matches, etc. etc.).

- You go through the rows (translatable segments) and translate, 
optionally with DV autoassembling and inserting stuff as 
suggestions, optionally with DV showing any matches it finds in any 
of the DB's along with all additional info on the side.

- As soon as you leave a row, it is saved. Good for the ever-crashing 
Windows environment.

- You can set different statuses for rows, like pending, finished, 
locked, don't send to database etc. These and other things can be 
used as filtering criteria to produce different project views to work in.

On top of this, there are some additional general settings like 
fuzziness level, sentence delimiters, etc.

- You can run QA checks on them, like the terminology check I 
mentioned.

- When a file/all files are ready, you export the finished translation.
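
The pretranslate step in the list above, with its "perfect matches 
only" and "newer over older" settings, might be sketched like this in 
Python. It is a guess at the general idea, not DV's matching 
algorithm: similarity comes from the standard library's difflib, and 
the memory is a plain list of (source, target, timestamp) tuples.

```python
from difflib import SequenceMatcher


def pretranslate(segment, memory, min_similarity=1.0):
    """Return the best stored translation for a segment, or None.

    min_similarity=1.0 allows perfect matches only; lower values act
    as a fuzziness level. Among equally good matches, the newest wins.
    """
    best = None
    for source, target, stamp in memory:
        score = SequenceMatcher(None, segment, source).ratio()
        if score < min_similarity:
            continue
        key = (score, stamp)          # better score first, then newer
        if best is None or key > best[0]:
            best = (key, target)
    return best[1] if best else None


# Hypothetical memory entries; the last field is just a year.
memory = [
    ("Save file", "Spara fil", 2001),
    ("Save file", "Spara arkiv", 2002),
    ("Save files", "Spara filer", 2003),
]
print(pretranslate("Save file", memory))
print(pretranslate("Save the file", memory))
print(pretranslate("Save the file", memory, min_similarity=0.7))
```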

Things can be filtered and exported for distributing work, for 
sending only segments you're unsure of for QA etc.
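
As an illustration of row statuses and filtered views, here is a small 
Python sketch. The class and field names are invented for the example 
and do not reflect DV's internal data model.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    FINISHED = "finished"
    LOCKED = "locked"


@dataclass
class Row:
    source: str
    target: str = ""
    status: Status = Status.PENDING
    send_to_database: bool = True   # False = "don't send to database"


def view(rows, *statuses):
    """Return only the rows whose status is one of the given statuses."""
    wanted = set(statuses)
    return [row for row in rows if row.status in wanted]


project = [
    Row("Open", "Öppna", Status.FINISHED),
    Row("Save as...", "Spara som...", Status.LOCKED),
    Row("Accept", "Verkställ", Status.PENDING, send_to_database=False),
    Row("Quit"),
]

# A view of everything still to be worked on or sent out for QA.
for row in view(project, Status.PENDING):
    print(row.source, "->", row.target or "(untranslated)")
```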

All DB searches are fairly quick.

> Looks like a cultural mismatch. :-(

Yes, definitely. When I first stumbled onto the KDE i18n list with 
CAT suggestions, it wasn't very popular. Going in the other 
direction to get Windozed people to understand the benefits of 
things free (GNU, KDE, ...) is at least as difficult.

Most 'dozers aren't as technically oriented, whereas Linux people 
mostly are, and often have their own pet solutions to things :).

And I'm trying to help bridge the gap.

BR,
Gudmund