A Shoujo & General Webpage

Machine Translation and OCR - FAQs

Read the FAQs, then the original messages that follow.

Is it possible to get software that performs machine translation of Japanese texts?

Yes. See Neocor, Dragon Writer, etc.

Is it possible to scan Japanese text and thus input it into a computer as electronically coded text?
Yes. This is a two-part process: image scanning and optical character recognition (OCR). See Kanji-Scan and OmniPage.

So can I scan a page from a Japanese magazine, and get a machine translation of the text?
Yes.

So how well does this work?
That's really a multi-part question; let's ask it again...

How well does scanning and OCR work in practice?
This depends on several factors.
Scanning: the quality of the original print is very important.
Clear black-and-white print on glazed paper is good; small characters on cheap paper are not good; photocopies are not good; colour and patterned backgrounds are very bad; Japanese handwriting is hopeless.

More about scanning?
You can use either your own (scanner-bundled) scan software or the scan software that is part of the Japanese OCR package. There are lots of variables to tweak that affect the final result. It may prove best to scan with your own software, using an option like "B/W Line Art"; allowing the OCR software to convert from colour to B/W may prove nasty.
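
For readers wondering what a "B/W Line Art" setting does, the Python sketch below shows the simplest version of the idea: every grey pixel is forced to pure black or pure white by comparing it with a cut-off value. This is an illustration only. The function name, the cut-off of 128 and the tiny sample patch are assumptions made up for the example, and real scanner software may well use something smarter.

```python
def to_line_art(gray_rows, threshold=128):
    """Convert 8-bit grayscale rows (0 = black, 255 = white) to 1-bit line art.

    Returns rows of 0/1 values, where 1 means ink (black).
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray_rows]


# A made-up 3x4 patch: dark strokes on a greyish, cheap-paper background.
patch = [
    [200, 40, 35, 210],
    [190, 30, 180, 205],
    [185, 45, 50, 195],
]

for row in to_line_art(patch):
    # Draw ink as '#' and paper as '.'.
    print("".join("#" if bit else "." for bit in row))
```

Choosing the cut-off yourself at scan time is one of the "variables to tweak": set it too low and thin strokes vanish, too high and the paper background turns into ink.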

What about the OCR?
Rather surprisingly, the results with Kanji can be startlingly good, with a very high percentage correct achieved at a very high speed. The results with hiragana/katakana are generally disappointing, with a high rate of errors, not helped by the variable size of kana characters. Performance is severely affected by the quality of the print being scanned.
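
If you want to check the kanji-versus-kana difference on your own material, one approach is to correct a sample by hand and then count the errors per script. The Python sketch below is only an illustration: it assumes the OCR output and the corrected text have the same length (no dropped or inserted characters), and the function names and sample strings are made up for the example.

```python
from collections import Counter


def script_of(ch):
    """Classify a character by Unicode block."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def accuracy_by_script(reference, ocr_output):
    """Fraction of correctly recognised characters, per script of the reference."""
    if len(reference) != len(ocr_output):
        raise ValueError("this sketch needs texts of equal length")
    total = Counter()
    correct = Counter()
    for ref_ch, ocr_ch in zip(reference, ocr_output):
        script = script_of(ref_ch)
        total[script] += 1
        if ref_ch == ocr_ch:
            correct[script] += 1
    return {script: correct[script] / total[script] for script in total}


# Hand-corrected text versus a made-up OCR result that lost one voicing mark.
result = accuracy_by_script("地球がまわる", "地球かまわる")
print(result)
```

A real comparison would need proper alignment to cope with characters the OCR drops or inserts; position-by-position matching is the simplest thing that shows the idea.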

Error correcting?
You always have to do this. If there are a lot of errors, it may well take longer to manually correct the text than it would take to type it in from scratch.
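
To get a rough feel for where the break-even point lies, you can put numbers on it. The Python sketch below uses a deliberately crude model: you read every character once while proofreading, and every wrong character costs a fixed amount of time to repair. All the figures in it (typing speed, reading speed, repair time) are made-up assumptions, so substitute your own.

```python
def typing_minutes(chars, typing_cpm):
    """Minutes to type the text from scratch at typing_cpm characters per minute."""
    return chars / typing_cpm


def correction_minutes(chars, error_rate, read_cpm, fix_minutes_per_error):
    """Minutes to proofread OCR output and repair its errors.

    Every character is read once at read_cpm characters per minute, and each
    wrong character costs fix_minutes_per_error to look up and retype.
    """
    return chars / read_cpm + chars * error_rate * fix_minutes_per_error


def break_even_error_rate(typing_cpm, read_cpm, fix_minutes_per_error):
    """Error rate above which typing from scratch beats correcting."""
    return (1 / typing_cpm - 1 / read_cpm) / fix_minutes_per_error


# Made-up figures: 20 characters a minute typing Japanese, 100 characters a
# minute proofreading, half a minute to repair each error.
page = 1000
print(f"typing:     {typing_minutes(page, 20):.0f} min")
print(f"correcting: {correction_minutes(page, 0.05, 100, 0.5):.0f} min at 5% errors")
print(f"break-even: {break_even_error_rate(20, 100, 0.5):.0%} errors")
```

With these particular figures the break-even comes out at 8 percent wrong characters. A slower typist or a quicker proofreader pushes it higher; poor print pushes the actual error rate towards or past it.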

How well does machine translation work?
Results vary from the impressive (usually with carefully prepared text, as in software suppliers' demos) to the unintelligible. Overall, what you should expect is a translation which, while not proper English, does give you a fair idea of what the original text said. You can then decide if you are sufficiently interested to have it translated properly by a human (with the aid of the machine draft). This is how the programs are actually employed in commerce and academia.

Results seem to be better with regular Japanese (as in company notices) rather than with slangy or colloquial texts like manga or anime magazines.

So is it worth trying an OCR on a manga?
At the present state of the art it probably isn't. The scanner and OCR performance will be so poor that you could enter the text quicker by hand. We tried it on a regular size tankoubon (Video Girl Ai) and found that the scanner couldn't resolve the kanji properly, because of the poor print, and that it took far longer to correct the errors in the kana than it would have taken to type them.

Is it worth trying an OCR on a page from an anime magazine?
On balance, yes, if you have the budget. In a few hours a person with elementary knowledge of Japanese should get a result that would take days for them to obtain by hand.

Is it worth using a machine translator on a manga after entering the text by hand?

Somebody try it and let us know.

Can Web pages be entered and translated?

We tried using Typhoon and found that it worked well enough on corporate Web sites to show that we were looking at employees' terms and conditions (for instance). It also worked on an Evangelion anime site but not quite so well.

Do I need a Japanese operating system?
Hopefully, no. Software is available that runs on US/English Windows. Installing Japanese Windows or dual boot Japanese/US Windows is for propeller heads only.

Are there any budget or shareware packages?
Sorry, I can't recommend anything from personal experience. There are a few things from US$99 upwards. I tried the Word Translator beta but couldn't get it to work properly.

Useful URLs

Shoudouka Launchpad (displays Japanese text)
Japanese software
EJ Bilingual
Viewing Japanese
OmniPage
Japanese Software Digest
eTypist
Word Translator
Web translator (translates Japanese WWW pages online)

Original Messages

X-Sender: dhauck@nala.qualcomm.com
Date: Mon, 06 Oct 1997 10:58:11
To: geoffc@shoujo.powernet.co.uk (Geoff Cowie)
From: Dan Hauck
Subject: Re: Electronic translation of Manga into English
X-UIDL: a8937e04f75c8f31190eb3b0befdf304
At 07:00 PM 10/4/97 +0100, you wrote:
>I thought this posting might be of interest to Shoujo ML members. If you
>have any hands-on experience, please let me know!
All I can say is that it's really hard. I tried to write a translator once
using UNIX's lexical (.lex) libraries and it's just too unpredictable and
there's too much there. So much of it is just context. There are shorter
words that can mean just about anything, and this is where it goes bonkers.
>
>I have, together with a friend, started to evaluate programs that will
>a) scan Japanese text and perform OCR
It's much harder than english / ASCII chars.
>b) machine translate Japanese electronic text into English.
I have and probably always will use JDIC (you've probably found this). It's
a huge dictionary, if you type in something romanized it'll give you just
about every possible definition of that word (in english).
>The potential of such programs for translating manga and Japanese anime/
>manga related magazines, e.g Animage, is obvious.
Maybe...
>
>It seems there are several programs available that will machine translate
>Japanese text, at prices ranging from shareware through to many hundreds of
>dollars.
I haven't looked in about a year. Is there anything that actually works out
there? At least for manga. There are differences between manga and ordinary
"Japanese"..., i.e. the kind you learn from reading a book or taking a class.
>This still leaves the problem of input. It is possible to type in Japanese
Have you really found something that will go from romanized text to an
okay english translation? I can romanize the stuff in my sleep, that's how
my electronic dictionary takes its input. It's not very hard, just
memorization. Since manga always have furigana (hiragana next to any kanji),
at least all the manga I read does.
>using either the interface of one of the afore-mentioned programs, or a
>stand-alone wordprocessor such as JWP. This is easier than translating it,
>but...
>We know of one program that will scan Japanese text, perform OCR and output
>an electronic Japanese text. This is KanjiScan from Neocor (www.neocor.com)
Japanese text in manga and other books is hard because there's no concept
of a space " ". It makes word recognition harder.
>Test report:
>We had a quick look at the shareware machine translator (called Kanji-Word,
>if I remember rightly) and at Neocor's TYPHOON and KANJI-SCAN.
>Frankly, after a hard business day I couldn't make any sense of the
>share-ware one, which is in fact an add-on to programs such as Microsoft
>WORD etc.
>The Typhoon demo seemed to work, though we couldn't figure how to feed a
>Japanese WWW page into it. It looks like some skill in text preparation may
>be required for the most intelligible result.
>The Kanji-scan certainly seemed to work, though again the completed
>electronic text seemed to need corrections. To have something that
>recognises Kanji that quickly is pretty amazing.
>Needless to say, you can transfer text from Kanji-scan to Typhoon at the
>click of a mouse.
>(BTW, once you have Kanji in electronic form it's not hard to look up the
>meaning in an electronic dictionary, like that in JWP for instance)
>
>If you are at all interested in this field, please E-mail me with your own
>experiences, and I will reply with a better bibliography of programs, and
>more test reports.
Please. I tried a year or so ago to find anything and all I found was JDIC,
which is useful but it's only a dictionary. If there's anything out there
that can do romanized Japanese text (I can do this, no problem) to some
semblance of English that you could tell me about I'd be very grateful.
Also I might be able to translate the rest of hime-chan.
---
Dan
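
Dan's point about the missing spaces can be made concrete. Before a program can look words up, it has to decide where one word ends and the next begins. The Python sketch below uses greedy longest-match against a word list, which is about the simplest approach there is. It is an illustration only: the word list is a toy made up for the example, and nothing here is claimed to be how JDIC, Typhoon or any other program mentioned on this page works.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation of unspaced text.

    At each position, take the longest dictionary word that fits; characters
    that start no known word are emitted one at a time.
    """
    longest = max((len(word) for word in dictionary), default=1)
    words = []
    pos = 0
    while pos < len(text):
        for length in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                pos += length
                break
    return words


# Toy word list, written in kana only to keep the example short.
toy_dictionary = {"わたし", "は", "ねこ", "が", "すき", "です"}
print(segment("わたしはねこがすきです", toy_dictionary))
```

Greedy matching goes wrong whenever a longer dictionary entry happens to span a real word boundary, which is one reason short, ambiguous words cause so much trouble.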


X-Originating-IP: [165.196.197.16]
From: "Baka-sama"
To: geoffc@shoujo.powernet.co.uk
Subject: Greetings
Date: Mon, 06 Oct 1997 17:21:08 PDT
X-UIDL: 74006ad0d31b118893ce0ff27e477239
Greetings, I am definitely interested. I've been doing some computer
translation work, but it's kind of tedious transcribing and looking up
all those characters. Using a scanner would be great.
I've mostly been translating whatever I can get my hands on. I'm
currently working, very slowly, on my friend's copy of Sailor V v.1.
I'm a fan of Fushigi Yuugi, Maison Ikkoku, VGAi, KOR, etc.
Please get back to me. Thanks.
Steve "Baka-sama" McIntosh
baka_sama@hotmail.com


To: geoffc@shoujo.powernet.co.uk
Subject: Re: Computerized translation of manga into English
References: <3439506f.0@news.power.net.uk>
From: Jeffrey Rowe
Date: 09 Oct 1997 10:16:10 -0700
Lines: 37
X-UIDL: cdab7aaa1a561d23f5d051ff17ead44b
geoffc@shoujo.powernet.co.uk (Geoff Cowie) writes:
> Test report:
> We had a quick look at the shareware machine translator (called
> Kanji-Word, if I remember rightly) and at Neocor's TYPHOON and
> KANJI-SCAN.
> Frankly, after a hard business day I couldn't make any sense of the
> share-ware one, which is in fact an add-on to programs such as
> Microsoft WORD etc.
> The Typhoon demo seemed to work, though we couldn't figure how to feed
> a Japanese WWW page into it. It looks like some skill in text
> preparation may be required for the most intelligible result.
>
I too have tried Neocor's TYPHOON demo. In particular, I've tried translating
personal letters from my in-laws into English. I find, however, that the
translation of the often colloquial Japanese is nearly incomprehensible.
Typically, several alternate choices are offered during the translation
process, but selection requires knowledge of the original meaning which isn't
always obvious. The entertainment value from some of the mistranslations,
however, shouldn't be discounted :).
On the technical side though, Japanese text input for me is a simple
cut-and-paste operation from my mail reader to TYPHOON's text buffer.
Cheers,
Jeff Rowe


In article <34360c89.0@news.power.net.uk>, geoffc@shoujo.powernet.co.uk wrote:
> This still leaves the problem of input. It is possible to type in
> Japanese using either the interface of one of the afore-mentioned
> programs, or a stand-alone wordprocessor such as JWP. This is easier
> than translating it, but...
Once you get up to a couple hundred kanji, you can input the "easy" stuff
by its reading using an input method like Kotoeri from the Mac JLK.
Assuming you can already touch-type, that is. :) Leave some particular
character in the place of kanji you don't know the reading for, then go
all over all of them afterward, using whatever lookup method is easiest
first. And cheating (such as typing the wrong reading or a jukugo you
know and then deleting the excess) IS allowed.
Manga should be a LOT easier typing-wise, because there is so much less of
it per page. And a lot of it has furigana. Of course that DOES tend to
leave the "double-reading" (a planet name over "chikyuu" kanji, etc.)
problem open... :) (And it's kinda hard to find a double-byte enabled WP
that supports arbitrary furigana unless it's made for Japanese text!)
> We know of one program that will scan Japanese text, perform OCR and
> output an electronic Japanese text. This is KanjiScan from Neocor
> (www.neocor.com)
Mangajin had a review of an OCR program a couple of months back. Seems
these things, while great at Kanji, have some real problems recognizing
hiragana!
If an OCR program isn't accurate enough, you might as well not bother,
because the time to proofread could be more than the time to type it.
Except in the case of kanji, it's knowing the reading you need to type it
that is the troublesome part.


geoffc@shoujo.powernet.co.uk (Geoff Cowie) writes:
>I have, together with a friend, started to evaluate programs that
>will
>a) scan Japanese text and perform OCR
>b) machine translate Japanese electronic text into English.
>The potential of such programs for translating manga and Japanese
>anime/ manga related magazines, e.g Animage, is obvious.
Hmmm. Potential, yes. Problems, many. MT is hard enough, but attacking
the highly colloquial mangaese will be a fine challenge.
>It seems there are several programs available that will machine
>translate Japanese text, at prices ranging from shareware through to
>many hundreds of dollars.
The best affordable one I have encountered is Neocor's Tsunami.
It is several hundred dollars.
>This still leaves the problem of input. It is possible to type in
>Japanese using either the interface of one of the afore-mentioned
>programs, or a stand-alone wordprocessor such as JWP. This is easier
>than translating it, but...
It certainly is possible. Tsunami has a crudish WP builtin. JWP, NJSTAR
or any Japanese WP would do.
>We know of one program that will scan Japanese text, perform OCR and
>output an electronic Japanese text. This is KanjiScan from Neocor
>(www.neocor.com)
>Test report:
>We had a quick look at the shareware machine translator (called
>Kanji-Word, if I remember rightly) and at Neocor's TYPHOON and
>KANJI-SCAN.
>Frankly, after a hard business day I couldn't make any sense of the
>share-ware one, which is in fact an add-on to programs such as
>Microsoft WORD etc.
As I understand it, Kanji-Word is a word/phrase glosser, not an MT
system. There are other such systems. Really they are translation aids.
>The Typhoon demo seemed to work, though we couldn't figure how to feed
>a Japanese WWW page into it. It looks like some skill in text
>preparation may be required for the most intelligible result.
Save the page to a text file. Open up a blank Tsunami document (they
call them projects) and import the text file.
>The Kanji-scan certainly seemed to work, though again the completed
>electronic text seemed to need corrections. To have something that
>recognises Kanji that quickly is pretty amazing.
>Needless to say, you can transfer text from Kanji-scan to Typhoon at
>the click of a mouse.
>(BTW, once you have Kanji in electronic form it's not hard to look up
>the meaning in an electronic dictionary, like that in JWP for
>instance)
I suspect that for Manga, this is what you'll find most useful.
>If you are at all interested in this field, please post a response, or
>E-mail me with your own experiences, and I will reply with a better
>bibliography of programs, and more test reports.
--
Jim Breen Department of Digital Systems
Email: j.breen@dgs.monash.edu.au Monash University
http://www.dgs.monash.edu.au/~jwb/ Clayton VIC 3168 Australia


Geoff,
Testing "yonde!!koko" of A.I.S. (Nagano-ken) I find it increasingly useful as far as I am able to produce original print quality.
I am mainly translating technical Japanese into German. The Japanese are very fond of text in tables.
I would scan in all the text tables and then perform a good part of the translation work simply by the find-and-replace function of my Japanese textprocessor program, because much of the text table contents is of non-grammatical nature.
The grammatical sections, i.e. the sentences, I will translate manually (with the OCRed text assisting me in looking up vocabulary by "drag-and-drop" in the very useful "kanjikai" dictionaries) and then erase the OCR sentences successively.
Life would be easier for me if my OCR could read bad-quality copies as well as I can.
Wolfgang
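
Wolfgang's find-and-replace method for table contents can be sketched in a few lines of Python. The idea is a glossary of fixed terms applied to each table cell, with longer terms tried first so that a short term does not clobber part of a longer one. The function name and the three-entry Japanese-to-German glossary are made up for illustration; this is not a description of his word processor.

```python
import re


def apply_glossary(text, glossary):
    """Replace every glossary term found in text, preferring longer terms."""
    if not glossary:
        return text
    # Longest terms first: regex alternation takes the first alternative that matches.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))
    return pattern.sub(lambda match: glossary[match.group(0)], text)


# Toy glossary for the sort of non-grammatical labels found in technical tables.
glossary = {
    "電圧": "Spannung",
    "最大電圧": "Maximalspannung",
    "温度": "Temperatur",
}
cells = ["最大電圧", "電圧", "温度"]
translated = [apply_glossary(cell, glossary) for cell in cells]
print(translated)
```

This only suits text without grammar, as Wolfgang says; full sentences still need a human or a proper translator.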


Geoff:
Thanks for your reply. I have sent the following message to Richard; perhaps you might be interested in it. It is taken from a recent NIHONGO mailing list.
>From Sabolc@compuserve.com Wed Sep 17 19:17:17 1997
Date: Sun, 14 Sep 1997 10:07:05 -0400
From: Szabolcs Varga
To: "INTERNET:NIHONGO@UTKVM1.UTK.EDU"
Subject: Japanese OCR
Greg Dabelstein wrote:
> Does anyone know of any Japanese capable OCR software
> for either Japanese or English Win 95???
There is a multitude of them here in Japan (for Japanese Windows), their
price ranging from about $100 to $1300. I can only refer to my experiences
with one program called E.typist. (I got it with my HP ScanJet 4c scanner I
bought in Japan. I also tried Autotype which promptly died on my machine.
So much about TWAIN compatibility.) Nothing fancy, does the job fairly
well. I did have some problems with it so I set out for getting something
better, but I was told that basically all the Japanese OCR programs have
exactly the same problems, namely:
1. It does make a lot of mistakes, but not where we gaijin would expect.
It recognizes the difficult kanjis (OK, not the extremely old ones but all
of the JIS 2-suijun) for it uses a dictionary and there is a very limited
number of jukugo with very difficult kanji. So rest assured that it reads
"kikai", "yuuutsu" and the like. But apparently it has a lot of problems
making the distinction between the small and big "ya", "yu" and "yo", and
especially with the voiced kana. It has to be a very neat printout to be
able to have all the "daku-on" recognized.
2. It kicks the bucket with furigana (and texts with all kinds of sizes).
Apparently it messes up something in its "genkoo yoshi"-like mind. I hope
there are already better ones.
I personally think that yes, it is faster to scan in and OCR a Japanese
text than to type it in, but it is significantly slower than to do the
same with English. To start with, the OCR process itself is a lot slower,
in my case (Pentium 100) about 100 char/sec, and then of course
proofreading is more difficult. To give an idea: I am still below the
Nooryoku Shiken Level 1 and I tried to scan in a few pages from "Chijin no
ai" by Tanizaki Jun'ichiro, from an ugly second-hand paperback edition.
There are about 5 mistaken characters in every hundred so -- for me --
proofreading one page takes about 20-30 minutes. I had a lot better
results with better quality texts but still I managed a few times to make
a fool of myself when making a presentation and the material I prepared
(in haste, with OCR) included some tiny mistakes (like one dot missing,
but a completely stupid word). Proofreading seems to be an ugly job.
HTH.
Szabolcs Varga
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Paul Naitoh 12-Oct-1997, @ 12:50:18 PDT
Internet e-mail pnaitoh@electriciti.com
Using Windows NavCIS PRO 1.77
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
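
The kana confusions listed in point 1 of the message above (small versus full-size "ya", "yu", "yo", and lost voicing marks) are regular enough that a proofreading aid could flag them. The Python sketch below classifies a single mismatch between the corrected character and the OCR character. It relies on the fact that Unicode's NFD normalisation splits a voiced kana into its base kana plus a combining mark; the function names and the category labels are made up for the example.

```python
import unicodedata

# Small kana and their full-size forms (only the ya/yu/yo set mentioned above).
SMALL_TO_LARGE = {"ゃ": "や", "ゅ": "ゆ", "ょ": "よ", "ャ": "ヤ", "ュ": "ユ", "ョ": "ヨ"}
# Combining dakuten and handakuten.
VOICING_MARKS = {"\u3099", "\u309a"}


def strip_voicing(ch):
    """Return the kana without its voicing mark, e.g. 'が' becomes 'か'."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if c not in VOICING_MARKS)


def confusion_kind(ref_ch, ocr_ch):
    """Classify how an OCR character differs from the correct one."""
    if ref_ch == ocr_ch:
        return None
    if strip_voicing(ref_ch) == strip_voicing(ocr_ch):
        return "voicing"
    if SMALL_TO_LARGE.get(ref_ch, ref_ch) == SMALL_TO_LARGE.get(ocr_ch, ocr_ch):
        return "size"
    return "other"


for ref_ch, ocr_ch in [("が", "か"), ("ょ", "よ"), ("き", "さ")]:
    print(ref_ch, ocr_ch, confusion_kind(ref_ch, ocr_ch))
```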

Date: Mon, 3 Nov 1997 07:17:10 -0500
From: Szabolcs Varga
Subject: Re: OCR & Translation Software.
Sender: Szabolcs Varga
To: Geoff Cowie

Greetinx,

> Is it possible to get software that performs machine translation
> of Japanese texts?
> Yes. See Neocor, & Dragon Writer, etc.
I would rather recommend either Logovista E/J (and J/E), Atlas from Fujitsu
or Pensee from OKI. They run under Japanese OS only, though.

> How well does scanning and OCR work in practice?
> photocopies not good
Not exactly true. Depends mostly on what you photocopy.

> Japanese handwriting hopeless.
Not true again. There is here one piece of software (of course I don't
remember the name) which does exactly that... more than that, it is the
main product of its maker. Results are still mixed, but not hopeless.

> More about scanning?
I cannot recommend HP scanners enough. They have a feature called
AccuPage that does wonders on stained and cheap material like most of the
manga books. Honestly, wonders. I tried it with a coffee-stained page.

> How well does machine translation work?
This is my field so let me bark in.
> Results vary from the impressive (usually with carefully prepared
> text, as in software suppliers' demos)
Technical material with no ellipses usually gives results close to perfect.

Funny thing, the more technical, and the longer the sentences are, the better
result as long as the original sentence had a meaning.

> to the unintelligible.
Spoken English/Japanese is still close to impossible to translate.

But, honestly, many times the problems are with the human, not with the
machine. I met a project in Denmark where they tried to translate patents
by machine and of course, the results were hopeless.
However, when I saw the original English text, I could not believe what I
saw: one sentence, more than half a page long, without a predicate. I
seriously doubt that any human would have done much better on that. After
breaking the text into about six sentences and providing the missing
predicates, the result was practically perfect. I mean, perfect.
> Results seem to be better with regular Japanese (as in
> company notices) rather than with slangy or colloquial texts
> like manga or anime magazines.
I know NO translation software that would translate "Hirumeshi kui ni ikoo
ze" correctly, with all the emphasis and roughness included. They are not
for that...
> So is it worth trying an OCR on a manga?
It works with good quality manga -- like Buichi Terasawa's Gokuh and Cobra
that I have.

> Is it worth using a machine translator on a manga after entering
> the text by hand?
Definitely NOT! Machine translation SW cannot handle most of the features
of the spoken language, most of all ELLIPSIS. If the grammar parser fails in
the translating process, do not expect anything worthy from the above
levels of parsing...

> Can Web pages be entered and translated?
The other way (E-J) seems to be a lot more flourishing. There are SW here
especially for that, like a special Internet version of Pensee.

> Do I need a Japanese operating system?
> Hopefully, no. Software is available that runs on US/English
> Windows. Installing Japanese Windows or dual boot Japanese/US
> Windows is for propeller heads only.
I don't want to hurt anybody but the software that runs on JP only are a
lot better quality. I mean quality. Both OCR and MT. They start at $250 though.

Live long and prosper.

Szabolcs

Compiled by G. Cowie, 25 Oct 1997.