The Transliteration from Alphabet Queries to Japanese Product Names
Rieko Tsujiᵃ, Yoshinori Nemotoᵃ, Wimvipa Luangpiensamutᵃ, Yuji Abeᵃ, Takeshi Kimuraᵃ, Kanako Komiyaᵃ, Koji Fujimotoᵇ, Yoshiyuki Kotaniᵃ
ᵃ Department of Computer and Information Science, Tokyo University of Agriculture and Technology / 2-24-16 Nakamachi Koganei-shi Tokyo JAPAN
ᵇ Tensor Consulting / 2-10-1 Koujimachi Chiyoda-ku Tokyo JAPAN
{Riekon.m, wimvipa, kittykimura}@gmail.com,
50012646127@st.tuat.ac.jp,wizdomowl@yahoo.co.jp,
koji.fujimoto@tensor.co.jp, {kkomiya, kotani}@cc.tuat.ac.jp
Copyright 2012 by Rieko Tsuji, Yoshinori Nemoto, Wimvipa Luangpiensamut, Yuji Abe, Takeshi Kimura, Kanako Komiya, Koji Fujimoto, and Yoshiyuki Kotani
26th Pacific Asia Conference on Language, Information and Computation, pages 456-462

Abstract

Non-Japanese buyers are sometimes unable to find the products they want on Japanese shopping websites because the sites require Japanese queries. To solve this problem, we propose transliterating the input of non-Japanese users, i.e., search queries written in the English alphabet, into Japanese Katakana. In this research, the training data consisted of pairs of a non-Japanese search query that failed to obtain the right match on a Japanese shopping website and its transcription provided by volunteers. Since this corpus includes noise for transliteration, such as free translations, we used two different filters to remove the query pairs that are not transliterated in order to improve the quality of the training data. In addition, we compared three methods, BIGRAM, HMM, and CRF, on these data to investigate which is best for query transliteration. The experiment revealed that the HMM was the best.

1 Introduction

In recent years, e-commerce has become widely used throughout the world, enabling people to purchase products from foreign countries. However, it is sometimes not easy for foreign buyers to find the products they want because of the language difference. In our case, the alphabetic queries input by non-Japanese buyers should be translated into Japanese to show the product pages they want to find.

There are many cases in which non-Japanese people get no results or wrong results from their search queries, and these can be classified into three cases. The first is the case where non-Japanese people write Japanese product names in the alphabet; we expected that this case would be solved by transliteration. The second is the case where non-Japanese people write English product names, which would be solved by translation. The last covers the others, for example, proper nouns such as the names of animation characters, and misspellings. Among them, we expected the first case to be the most frequent because 53.7% of the queries in the corpus could be fully transliterated. Hence, we propose transliteration from alphabetic queries to Japanese product names, e.g., from lunchbox to "ランチボックス" (translation into English: lunchbox, pronunciation in Japanese: ranchibokkusu).

Also, much research on transliteration has been carried out on clean data; however, as far as we know, there has been no research on transliteration for noisy query data. Thus, we investigated which method is best for query transliteration, using parallel data consisting of the alphabetic queries that did not return any products when non-Japanese people searched (i.e., the Alphabet Queries) and the Japanese queries transcribed from them (i.e., the Correct Queries). We refer to this parallel data as the pair corpus; Table 1 shows examples from it. Here, the Alphabet Queries are the keywords that were actually used by non-Japanese users on a Japanese website, and the Correct Queries were transcribed by volunteers. However, some pairs were not transliterated into Japanese phonograms, i.e., Katakana or Hiragana; they also contain free translations or Chinese characters. Instead of manually editing the raw data, we automatically filter out those word pairs using two filters: the Chinese character filter (CF) and the Chinese character and alphabet filter (CAF). The experiments revealed that the HMM worked best, giving a precision of 0.448 when the CF was used under the looser evaluation.

2 Related Works

Many works on transliteration have been carried out so far, including phonemic, orthographic, and rule-based approaches, as well as approaches that use machine learning. For example, Aramaki et al. (2009) presented a discriminative transliteration model using the CRF for English-to-Japanese transliteration. In other languages, Wang et al. (2011) worked on English-Korean translation. They compared four methods: grapheme substring-based, phoneme substring-based, rule-based, and a mixture of them. Jing et al. (2011) developed English-Chinese transliteration, which consists of many-to-many alignment and the CRF (conditional random fields) using accessor variety.

However, as far as we know, transliteration using noisy query data has not been attempted so far. Hence, we propose to transliterate the Alphabet Queries into the Correct Queries using the pair corpus, and we compared three transliteration methods to investigate which is best for query transliteration.

It is also possible to use dictionary-based approaches; however, the pair corpus includes many new words, such as the titles of comics and the names of animation characters, that are not listed in dictionaries. Therefore, the dictionary-based approach is not as powerful for transliteration as it is for translation. Thus, we employed the phonemic approach, and the probabilistic method or machine learning was used for the transliteration from phonemes to Japanese product names (i.e., the Correct Queries).

3 Transformation from the Alphabet Query to Phoneme

We employed the phonemic approach; the Alphabet Queries were transformed into phonemes and then transliterated. The transliteration was carried out as follows:

1. Transform the Alphabet Queries into phonemes using an English-Phoneme dictionary (Section 3.1)
2. Filter the Correct Queries to clean the noisy data (Section 3.2)
3. Calculate the translation probabilities from phonemes to Japanese characters (Section 3.3)
4. Align the phonemes and Japanese characters (Section 3.4)
5. Transliterate the phoneme queries into Japanese words using the probabilistic method or machine learning (Section 3.5)

The remainder of this section describes these five steps. Steps one to four were the generation phase of the training data, and step five was the transliteration phase.

3.1 Transform the Alphabet Queries

The CMU Pronunciation Dictionary¹ (CMUdict) was used for the transformation from the Alphabet Queries to phonemes. Thus, we targeted only the alphabetic queries that include at least one phoneme. We obtained 2833 Alphabet Queries after this process.

3.2 Filter

Since the pair corpus is noisy, the training data were narrowed down and refined using the following two different filters:
1 http://www.speech.cs.cmu.edu/cgi-bin/cmudict
method | BASE | BIGRAM | HMM | CRF
--- | --- | --- | --- | ---
system output | フャブーンク (fabuunku) | ファブリック (faburikku: the correct answer) | ファブリック (faburikku: the correct answer) | フブック (fubukku)
evaluation | 1 | 3 | 3 | 2

Table 2: The system output when the input was "fabric" (Alphabet Query) and evaluation
1. Chinese character filter (CF)
2. Chinese character and alphabet filter (CAF)

These two filters were compared to adjust the quality and the amount of the training data. The CF filtered out the pairs whose Correct Queries contain Chinese characters, and the CAF filtered out the pairs whose Correct Queries contain Chinese characters or alphabetic characters. In other words, the pairs remaining after the CAF have Correct Queries written only in Katakana and Hiragana.

Table 1 lists examples from the pair corpus and the characteristics of the Alphabet and Correct Queries. Here, we focused on the character type of the Correct Queries because of the characteristics of the pair corpus.

As shown in the table, although we want to use only the transliteration pairs as the training data, it is not easy to distinguish them. (The pair corpus consists of only the Alphabet and Correct Queries.) The first problem was that some Correct Queries are written not only in Japanese phonograms, i.e., Katakana or Hiragana, but also in ideograms, i.e., Chinese characters, which can be pronounced in many ways (e.g., Tokyo - 東京 (Tokyo, toukyou)). Thus, we filtered by character type to obtain as many transliteration pairs as possible. We expected that this process would improve the quality of the training data because, in many cases, Correct Queries in Katakana were transliterated. However, we have to keep in mind that Correct Queries in Katakana could be free translations, as shown in the second row of Table 1 (Miyazaki - ジブリ (translation into English: GHIBRI, pronunciation in Japanese: ziburi, meaning: a film studio name)).

Alphabet Query (type of query) | Correct Query (translation into English, pronunciation in Japanese) | Transliteration (L) or translation (T) | Type of characters of Correct Query
--- | --- | --- | ---
Doraemon (animation's character name) | ドラえもん (Doraemon, doraemon) | L | Katakana, Hiragana
Miyazaki (person's name) | ジブリ (GHIBRI, ziburi) | T | Katakana
AKB48 poster (pop group's name, poster) | AKB48 ポスター (AKB48 poster, eikeibii48 posutaa) | L | Katakana, Alphabet
Ufm rod (brand name, rod) | Ufm ロッド (Ufm rod, uefuemu roddo) | L | Katakana, Alphabet
Tokyo adidas (place name, brand name) | 東京 adidas (Tokyo adidas, toukyou adidasu) | L | Chinese character, Alphabet
Dress Tokyo (general noun, place name) | 原宿 ドレス (Harajuku dress, Harajuku doresu) | L, T | Chinese character, Katakana

Table 1: The example of the pair corpus and the characteristics of the Alphabet and Correct Queries

Here, we filtered out the pairs whose Correct Queries contain alphabetic or Chinese characters to refine the pair corpus further (CAF: the data shaded in light gray and the data shaded in gray were removed). However, if we filter out too many query pairs to improve the quality of the training data, we may not obtain enough training data for the probabilistic methods or machine learning. Therefore, we also filtered out only the pairs whose Correct Queries contain Chinese characters (CF: the data shaded in gray were removed). Namely, we used two kinds of filters to find out which is best for query transliteration.
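
The two filters can be illustrated with a minimal Python sketch. The paper does not state how character types were detected, so the following are assumptions made only for illustration: Chinese characters are taken to be the CJK Unified Ideographs block (U+4E00 to U+9FFF), alphabetic characters are taken to be ASCII letters, and the function names (has_chinese_character, has_alphabet, passes_cf, passes_caf) are hypothetical. The example pairs are those listed in Table 1.

```python
def has_chinese_character(text):
    # Assumption: "Chinese character" means the CJK Unified Ideographs block.
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def has_alphabet(text):
    # Assumption: "alphabet" means ASCII letters (digits are not checked here).
    return any(ch.isascii() and ch.isalpha() for ch in text)


def passes_cf(correct_query):
    # CF: drop pairs whose Correct Query contains a Chinese character.
    return not has_chinese_character(correct_query)


def passes_caf(correct_query):
    # CAF: drop pairs whose Correct Query contains a Chinese character
    # or an alphabetic character.
    return not (has_chinese_character(correct_query) or has_alphabet(correct_query))


# (Alphabet Query, Correct Query) pairs taken from Table 1.
pairs = [
    ("Doraemon", "ドラえもん"),
    ("Miyazaki", "ジブリ"),
    ("AKB48 poster", "AKB48 ポスター"),
    ("Ufm rod", "Ufm ロッド"),
    ("Tokyo adidas", "東京 adidas"),
    ("Dress Tokyo", "原宿 ドレス"),
]

cf_kept = [pair for pair in pairs if passes_cf(pair[1])]
caf_kept = [pair for pair in pairs if passes_caf(pair[1])]
```

Note that a character-type filter of this kind cannot detect free translations written in Katakana: the pair Miyazaki - ジブリ passes both filters in this sketch.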
We could use 78.5% and 25.2% of the pair corpus to calculate the translation probabilities by using the CF and the CAF, respectively.

3.3 Calculation of Translation Probabilities

The translation probabilities from the phonemes of the Alphabet Queries, which were transformed in Section 3.1, to the Correct Queries, which were filtered in Section 3.2, were calculated using the filtered pair corpus. We used the GIZA++² toolkit (Och and Ney, 2003) to calculate them. Here, we set phonemes as the source language and Japanese characters as the target language.

3.4 Alignment

The alignment of phonemes and Japanese characters, which is necessary before the transliteration, was carried out for each query pair. The Dijkstra algorithm was used to make the alignments. Fig. 1 shows the alignment of the phonemes of document and its transcribed word ドキュメント (document, dokyumento). In Fig. 1, the horizontal axis represents the phonemes of the Alphabet Queries and the vertical axis represents the Correct Queries. We used the negative logarithm of the translation probabilities (which are calculated in Section 3.3) as the costs of the alignment. Also, we set the logarithm of 10^-20 as the cost when no translation probabilities were obtained (the horizontal and vertical directions in Fig. 1 are such cases).

Figure 1: The alignment of the phonemes of document and its transcribed word ドキュメント (document, dokyumento)

Figure 2 shows the result of the alignment when the Alphabet Query was document and the Correct Query was ドキュメント (document, dokyumento). NULLJ and NULLP in Figure 2 represent the alignments in the horizontal and vertical directions, respectively.

[D - ド (do)]
[AA1 - NULLJ]
[K - キ (ki)]
[Y - ュ (yu)]
[AH0 - NULLJ]
[M - メ (me)]
[EH0 - NULLJ]
[N - ン (n)]
[T - ト (to)]

Figure 2: The result of the alignment of the phonemes of document and ドキュメント (document, dokyumento)

3.5 Transliteration

The transliteration was carried out using the probabilistic method or machine learning. We compared the following three approaches, which were applied based on the alignments obtained in Section 3.4:

1. BIGRAM: The Bigram Model
2. HMM: The Hidden Markov Model
3. CRF: The CRF model

We used NLTK³ for BIGRAM and the HMM and adopted the CRF++⁴ toolkit for the CRF. We trained the CRF models with the unigram, bigram, and trigram features. The features are as follows.

Unigram: s_-2, s_-1, s_0, s_1, and s_2
Bigram: s_-1 s_0 and s_0 s_1
Trigram: s_-2 s_-1 s_0, s_-1 s_0 s_1, and s_0 s_1 s_2

We set the parameters to f=50 and c=2. We set f=50 because the kinds of features were so variable.
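
The unigram, bigram, and trigram features of Section 3.5 can be pictured with a small Python sketch. It assumes that s_k denotes the input symbol (here, a phoneme) at offset k from the current position, and that positions outside the sequence are filled with a padding symbol; the function name window_features, the feature keys, and the "<PAD>" symbol are hypothetical and are not taken from the paper or from CRF++.

```python
def window_features(symbols, i, pad="<PAD>"):
    """Return the window features around position i as a dict."""

    def s(k):
        # Symbol at offset k from position i, or the padding symbol
        # when the offset falls outside the sequence.
        j = i + k
        return symbols[j] if 0 <= j < len(symbols) else pad

    features = {}
    # Unigram: s_-2, s_-1, s_0, s_1, s_2
    for k in (-2, -1, 0, 1, 2):
        features[f"U[{k}]"] = s(k)
    # Bigram: s_-1 s_0 and s_0 s_1
    features["B[-1,0]"] = f"{s(-1)}/{s(0)}"
    features["B[0,1]"] = f"{s(0)}/{s(1)}"
    # Trigram: s_-2 s_-1 s_0, s_-1 s_0 s_1, and s_0 s_1 s_2
    features["T[-2,-1,0]"] = f"{s(-2)}/{s(-1)}/{s(0)}"
    features["T[-1,0,1]"] = f"{s(-1)}/{s(0)}/{s(1)}"
    features["T[0,1,2]"] = f"{s(0)}/{s(1)}/{s(2)}"
    return features


# Phonemes of "document" as listed in Figure 2.
phonemes = ["D", "AA1", "K", "Y", "AH0", "M", "EH0", "N", "T"]
first = window_features(phonemes, 0)
middle = window_features(phonemes, 4)
```

Each position yields ten features (five unigrams, two bigrams, and three trigrams), so the number of distinct feature values grows quickly with the corpus size.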
2 http://www.fjoch.com/GIZA++.html
3 http://www.nltk.org/
4 http://crfpp.googlecode.com/svn/trunk/doc/index.html
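
The alignment step of Section 3.4 can be sketched in Python as a shortest-path search over a grid. This is a sketch under stated assumptions, not the authors' implementation: each grid node (i, j) means that i phonemes and j Japanese characters have been consumed, a diagonal move pairs a phoneme with a character at a cost of the negative logarithm of its translation probability, and horizontal and vertical moves (NULLJ and NULLP) as well as unseen pairs cost -log(10^-20). The function name align and the toy probabilities are hypothetical; in the paper the probabilities come from GIZA++.

```python
import heapq
import math

# Assumed cost when no translation probability is available.
NO_PROB_COST = -math.log(1e-20)


def align(phonemes, kana, prob):
    """Find the cheapest phoneme-to-character alignment with Dijkstra's algorithm."""
    n, m = len(phonemes), len(kana)
    start, goal = (0, 0), (n, m)
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, (i, j) = heapq.heappop(heap)
        if (i, j) == goal:
            break
        if d > dist[(i, j)]:
            continue  # stale heap entry
        moves = []
        if i < n and j < m:  # diagonal: pair a phoneme with a character
            p = prob.get((phonemes[i], kana[j]))
            cost = -math.log(p) if p else NO_PROB_COST
            moves.append(((i + 1, j + 1), cost, (phonemes[i], kana[j])))
        if i < n:  # horizontal: phoneme aligned to no character
            moves.append(((i + 1, j), NO_PROB_COST, (phonemes[i], "NULLJ")))
        if j < m:  # vertical: character aligned to no phoneme
            moves.append(((i, j + 1), NO_PROB_COST, ("NULLP", kana[j])))
        for node, cost, pair in moves:
            nd = d + cost
            if nd < dist.get(node, math.inf):
                dist[node] = nd
                prev[node] = ((i, j), pair)
                heapq.heappush(heap, (nd, node))
    # Walk back from the goal to recover the aligned pairs in order.
    pairs = []
    node = goal
    while node != start:
        node, pair = prev[node]
        pairs.append(pair)
    pairs.reverse()
    return pairs


# Toy probabilities (hypothetical values, for illustration only).
toy_prob = {
    ("D", "ド"): 0.9, ("K", "キ"): 0.8, ("Y", "ュ"): 0.7,
    ("M", "メ"): 0.9, ("N", "ン"): 0.9, ("T", "ト"): 0.9,
}
phonemes = ["D", "AA1", "K", "Y", "AH0", "M", "EH0", "N", "T"]
alignment = align(phonemes, list("ドキュメント"), toy_prob)
```

With these toy probabilities, the sketch pairs D, K, Y, M, N, and T with ド, キ, ュ, メ, ン, and ト and aligns the three vowels to NULLJ, which matches the alignment shown in Figure 2.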