Knowledge-Based Systems 17 (2004) 219–227
www.elsevier.com/locate/knosys
A multilingual text mining approach to web cross-lingual text retrieval
Rowena Chau*, Chung-Hsing Yeh
School of Business Systems, Faculty of Information Technology, Monash University, Clayton, Vic. 3800, Australia
Received 26 August 2003; accepted 6 April 2004
Available online 28 May 2004
Abstract
To enable concept-based cross-lingual text retrieval (CLTR) using multilingual text mining, our approach will first discover the
multilingual concept–term relationships from linguistically diverse textual data relevant to a domain. Second, the multilingual concept–term
relationships, in turn, are used to discover the conceptual content of the multilingual text, which is either a document containing potentially
relevant information or a query expressing an information need. When language-independent concepts hidden beneath both document and
query are revealed, concept-based matching is made possible. Hence, concept-based CLTR is facilitated. This approach is employed for
developing a multi-agent system to facilitate concept-based CLTR on the Web.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Multilingual text mining; Cross-lingual text retrieval; Agent; Fuzzy clustering; Fuzzy classification
* Corresponding author.
E-mail address: rowena.chau@infotech.monash.edu.au (R. Chau).
0950-7051/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.knosys.2004.04.001

1. Introduction

The exponential growth of the World Wide Web across the globe is the most influential factor that contributes to the increasing awareness of cross-lingual text retrieval (CLTR) in recent years. Relevant information exists in different languages. A user may want to find documents in languages other than the one the query is formulated in. Among various CLTR techniques developed recently, query translation is the most extensively studied one. Such CLTR approaches are developed mainly to facilitate term-based lexical transfer between a single pair of source and target languages. However, a bilingual lexical transfer is not sufficient for fully supporting the user’s need for multilingual information seeking.

Within a multilingual information community, users often rely on CLTR to explore global knowledge relevant to a certain topic/area. Instead of looking for some specific documents that can be characterized by a few translation equivalents of the query terms, users are often interested in a broader view of a particular domain. They are thinking in terms of concepts and expecting to receive all relevant documents existing in any language. In such cases, concept-based CLTR capable of identifying multilingual documents about the concept of a query is necessary.

Documents and queries about the same concept do not necessarily contain matching sets of translation equivalents of each other. Conceptual relevance between documents and queries cannot be determined in an explicit way. To realize concept-based CLTR, the development of a conceptual interlingua to support lexical transfer across multiple languages is required. To encode a conceptual interlingua, terms from multiple languages describing the same concept should be mapped to a language-independent scheme. In this way, it is possible to match a term to its corresponding counterparts in all other languages and to achieve concept-based CLTR.

A multilingual thesaurus (e.g. EuroWordNet) encoding conceptual relationships among multilingual terms is such a conceptual interlingua, and it has been used to achieve this goal [7]. However, the manual construction of multilingual thesauri is very labor-intensive and their coverage is not domain specific. An automatic and possibly unsupervised approach for generating such linguistic knowledge for CLTR, by discovering structures of lexical relationships among multilingual terms from the analysis of text in the relevant domain, is highly desirable.

To provide better support to CLTR, a knowledge discovery technology known as text mining looks promising for discovering this kind of in-depth multilingual linguistic knowledge. Typically, text mining concerns the discovery and extraction of hidden relationships, such as
conceptual associations, among textual items, including terms and documents.

To enable concept-based CLTR using multilingual text mining, our approach will first discover the multilingual concept–term relationships from linguistically diverse textual data relevant to a domain. Second, the multilingual concept–term relationships, in turn, are used to discover the conceptual content of the multilingual text, which can be either a document containing potentially relevant information or a query expressing an information need. When language-independent concepts hidden beneath both documents and queries are revealed, concept-based matching is made possible, thus facilitating concept-based CLTR. This approach is employed for developing a multi-agent system to facilitate concept-based CLTR on the Web.

2. Current CLTR techniques

Given a query expressed in one language, the objective of CLTR is to search for relevant documents in other languages. To break the language barrier, either document or query translation is required. As query translation is less resource demanding than document translation, it has proven to be a more feasible approach to CLTR. There are three major approaches to query translation: (a) machine translation, (b) knowledge-based methods using machine-readable dictionaries [2,8], and (c) corpus-based methods using parallel corpora [14].

Although translating a query using machine translation is straightforward, it is argued that machine translation and CLTR have divergent concerns [13]. Machine translation aiming at syntactically accurate translation is redundant to CLTR. Since a query is short, grammatically invalid and formulated with just a few terms, it offers little context for the machine translation system to translate accurately. Besides, machine translation always replaces the original query term with only one of its many possible synonymous translations in the target language. This prevents a query expansion by which all synonymous terms are considered to enhance recall.

A query can easily be translated by replacing every query term with a set of all its possible translations as encoded in a machine-readable dictionary. However, this approach is ineffective mainly due to the translation ambiguity of polysemous terms (i.e. terms with multiple meanings). A polysemous term may have several alternative translations carrying different senses (meanings) in any foreign language. Translating a query by including every possible translation of every query term can greatly increase the set of possible meanings in the translated query, thus contributing to poor precision. Moreover, inadequate coverage of specific terminology and phrases is also a serious shortcoming of such machine-readable dictionaries.

An alternative to a machine-readable dictionary is using a parallel corpus. A parallel corpus is a set of identical text written in multiple languages. Corpus-based query translation is based on the idea that terms are represented as points in a multi-dimensional semantic space, and terms (in different languages) mapped to the same set of points in that semantic space are used to describe the same concept. Geometric relationships between terms within the semantic space are automatically extracted by analyzing co-occurrence statistics of terms across a parallel corpus. By substituting every query term with its geometrically close translations in the semantic space, query translation is then facilitated [6,12]. The corpus-based approach is most effective for CLTR when the document collection is domain-specific. In this paper, a corpus-based approach to CLTR that applies multilingual text mining using a parallel corpus is proposed.

3. A multilingual text mining approach to cross-lingual text retrieval

Our work for enabling CLTR with multilingual text mining is focused on exploiting the knowledge discovery capability of text mining over multilingual text. This is a logical approach due to the complementary nature of these two areas. Both CLTR and multilingual text mining analyze multilingual textual data employing techniques from information retrieval, natural language processing and machine learning. In terms of the functions they perform, CLTR facilitates multilingual information access while multilingual text mining enables knowledge discovery from multilingual texts. The objective of CLTR is to locate relevant documents from a multilingual document collection in response to a query represented by a set of terms, while the objective of multilingual text mining is to reveal concepts and their relationships embedded within a collection of multilingual texts. To determine the conceptual relevance between documents and a query written in different languages, CLTR requires understanding of their semantics. Multilingual text mining has the potential to complement CLTR by discovering intrinsic meanings of multilingual texts. Our approach to concept-based CLTR with multilingual text mining is depicted in Fig. 1.

Within an integrated framework, multilingual text mining yields knowledge that supports CLTR. First, the multilingual concept–term relationships, which are necessary for a CLTR system to associate documents and query across languages, are mined from a parallel corpus. This is achieved by a fuzzy multilingual term clustering algorithm. By grouping conceptually related multilingual terms into clusters, the multilingual concept–term relationships are revealed. Second, using the conceptual relationship among multilingual terms discovered in the previous step as the linguistic knowledge base, conceptual content exhibiting ideas hidden beneath the multilingual texts is also mined. This is facilitated by a fuzzy multilingual text categorization algorithm. As a result, both documents and query in
different languages can then be encoded with language-independent concepts, instead of language-specific terms. As such, concept-based matching is made possible and concept-based CLTR is facilitated.

Fig. 1. A multilingual text mining approach to concept-based CLTR.

3.1. Mining the conceptual relationship of multilingual terms

Successful application of text mining in supporting monolingual information retrieval has been well reported [1]. To facilitate CLTR, our first multilingual text mining task is to discover the conceptual relationships among multilingual terms. Towards this end, a fuzzy multilingual term clustering algorithm is developed using a fuzzy clustering technique, known as fuzzy c-means [3]. Its purpose is to generate a partition of a set of multilingual terms for revealing their concept–term relationships with additional concept membership degrees. Application of the multilingual term clustering algorithm thus results in a collection of concepts represented by clusters of conceptually related multilingual terms. This collection of clusters, analogous to a multilingual thesaurus, represents a compression and reflection of the usage of multiple languages. Its importance in concept-based CLTR is in providing a concept-oriented frame of lexical reference. A cluster of conceptually related multilingual terms helps enormously in focusing solely on relevant lexical alternatives by establishing a virtual semantic domain.

Clustering is an unsupervised method for automatic class formation. It offers the advantage that a priori knowledge of classes is not required. Typically, clustering algorithms (e.g. k-means) [9] aim to maximize inter-cluster distances and minimize intra-cluster distances of some similarity measure. In the context of mining conceptual relationships among multilingual terms, clustering looks at building up clusters of semantically related multilingual terms.

As concepts tend to overlap in terms of meaning, crisp clustering algorithms like k-means that generate partitions such that each term is assigned to exactly one cluster are inadequate for representing the real textual data structure. In this aspect, fuzzy clustering methods that allow objects (terms) to be classified to more than one cluster with different membership values are more appropriate. With the application of fuzzy c-means, the resulting fuzzy multilingual term clusters, which are overlapping, will provide a more realistic representation of the multilingual semantic space.

The fuzzy c-means algorithm aims at minimizing the objective function

$$J(X, U, v) = \sum_{i=1}^{c} \sum_{k=1}^{n} (\mu_{ik})^{m} \, d^{2}(v_i, x_k)$$
under the constraints $\sum_{k=1}^{n} \mu_{ik} > 0$ for all $i \in \{1, \ldots, c\}$ and $\sum_{i=1}^{c} \mu_{ik} = 1$ for all $k \in \{1, \ldots, n\}$, where $X = \{x_1, \ldots, x_n\} \subseteq \mathbb{R}^p$ is the set of objects; $c$ the number of fuzzy clusters; $\mu_{ik} \in [0, 1]$ the membership degree of object $x_k$ to cluster $i$; $v_i$ the prototype (cluster center) of cluster $i$, and $d(v_i, x_k)$ the Euclidean distance between prototype $v_i$ and object $x_k$. The parameter $m > 1$ is the fuzziness index. For $m \to 1$, the clusters tend to be crisp, i.e. either $\mu_{ik} \to 1$ or $\mu_{ik} \to 0$; for $m \to \infty$, $\mu_{ik} \to 1/c$.

On the basis of the objective function optimization, fuzzy c-means is most suitable for finding optimal groupings of objects that best represent the structure of the data set. By minimizing the sum of within-group variance, the strength of associations of objects is maximized within clusters and minimized between clusters. In this aspect, fuzzy c-means is particularly useful in text mining applications, such as term clustering, where intrinsic conceptual structure and semantic relationships among terms must be revealed in order to gain new knowledge for better text categorization and retrieval.

Statistical analysis of parallel corpora has been proven to be an effective means of extracting useful multilingual lexical knowledge for CLTR and this has been successfully applied to the development of translation models for CLTR [12]. Text in parallel translation is increasingly available as a result of the global explosion of the World Wide Web. Toward using the World Wide Web as a source of parallel text, effective techniques for automatically identifying parallel translated documents on the Web have also been developed [4,15].

Based on the hypothesis that semantically related multilingual terms representing similar concepts tend to co-occur with similar inter- and intra-document frequencies across a parallel corpus, fuzzy c-means can be applied to sort a set of multilingual terms into clusters (concepts) such that terms belonging to any one of the clusters (concepts) should be as similar as possible while terms of different clusters (concepts) are as dissimilar as possible in terms of the concepts they represent.

To realize the idea of mining the multilingual concept–term relationship using fuzzy c-means, a fuzzy multilingual term clustering algorithm is developed. To begin with, a set of multilingual terms, which are the objects to be clustered, is first extracted from a parallel corpus of N parallel documents. Each term is then represented as an input vector of N features where each of the N parallel documents is regarded as an input feature with each feature value representing the frequency of that term in the nth parallel document. Details of the fuzzy multilingual term clustering algorithm are presented as follows:

The fuzzy multilingual term clustering algorithm:

1. Initialize the membership values $\mu_{ik}$ of the $K$ multilingual terms $x_k$ to each of the $c$ concepts (clusters), for $i = 1, \ldots, c$ and $k = 1, \ldots, K$, randomly such that

$$\sum_{i=1}^{c} \mu_{ik} = 1, \quad k = 1, \ldots, K \tag{1}$$

and

$$\mu_{ik} \in [0, 1], \quad i = 1, \ldots, c; \; k = 1, \ldots, K \tag{2}$$

2. Calculate the concept prototypes (cluster centers) $v_i$ using these membership values $\mu_{ik}$:

$$v_i = \frac{\sum_{k=1}^{K} (\mu_{ik})^{m} x_k}{\sum_{k=1}^{K} (\mu_{ik})^{m}}, \quad i = 1, \ldots, c \tag{3}$$

3. Calculate the new membership values $\mu_{ik}^{\mathrm{new}}$ using these cluster centers $v_i$:

$$\mu_{ik}^{\mathrm{new}} = \frac{1}{\sum_{j=1}^{c} \left( \frac{\| v_i - x_k \|}{\| v_j - x_k \|} \right)^{2/(m-1)}}, \quad i = 1, \ldots, c; \; k = 1, \ldots, K \tag{4}$$

4. If $\| \mu^{\mathrm{new}} - \mu \| > \varepsilon$, let $\mu = \mu^{\mathrm{new}}$ and go to step 2. Otherwise, stop.

5. Concept labeling. As a result of clustering, every multilingual term is assigned to various concepts (clusters) with various membership values. To apply these found clusters as a multilingual concept directory, concepts can be labeled by giving meaningful tags. This can be done manually using expert knowledge or by selecting the term being assigned the highest membership in each cluster for every language involved. As a result, a fuzzy partition of the multilingual term space acting as a multilingual linguistic knowledge base is now available for mining the conceptual content of all multilingual text.

3.2. Mining the conceptual content of multilingual text

Aiming at discovering the conceptual content of both multilingual document and query, our second multilingual text mining task concerns the mapping of multilingual text to concepts. This process is considered a text categorization task.

Text categorization is conducted based on the cluster hypothesis [16], which states that documents with similar contents are relevant to the same concept. To accomplish the task, the crisp k-nearest neighbor algorithm [5] is among the most widely used methods [11,17]. It determines the membership of an unclassified text d to a concept c by examining whether the k pre-classified texts which are closest to d have also been classified to c.
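
The crisp k-nearest neighbor decision described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `knn_concept`, the use of Euclidean distance between feature vectors, and the majority vote among the k neighbors are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_concept(d, labelled_texts, k=3):
    """Assign text d to one concept by a crisp k-nearest-neighbor vote.

    d              -- feature vector of the unclassified text
    labelled_texts -- list of (vector, concept) pairs, already classified
    k              -- number of neighbors consulted (hypothetical default)
    """
    # Rank the pre-classified texts by Euclidean distance to d (assumed metric).
    ranked = sorted(labelled_texts, key=lambda item: math.dist(d, item[0]))
    # Count the concepts of the k closest texts.
    votes = Counter(concept for _, concept in ranked[:k])
    # Crisp decision (assumption): d gets the single most frequent concept.
    concept, _ = votes.most_common(1)[0]
    return concept
```

Because the decision is crisp, a text is assigned to exactly one concept even when its neighbors are split between several, which loses the graded membership that the fuzzy term clusters of Section 3.1 provide.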
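
For reference, steps 1 to 4 of the fuzzy multilingual term clustering algorithm of Section 3.1 can be sketched as follows. This is a sketch under stated assumptions rather than the authors' implementation: the function name `fuzzy_c_means`, the defaults for `m`, `eps` and `max_iter`, the fixed random seed, and the use of the largest absolute membership change as the norm in step 4 are all hypothetical choices. Each term is a frequency vector over the N parallel documents.

```python
import math
import random


def fuzzy_c_means(terms, c, m=2.0, eps=1e-4, max_iter=100, seed=0):
    """Return (u, v): memberships u[i][k] and the prototypes v[i] used
    to compute them, for K term vectors and c clusters."""
    rng = random.Random(seed)
    K, dim = len(terms), len(terms[0])
    # Step 1: random memberships, scaled so each term's column sums to 1.
    u = [[rng.random() + 1e-9 for _ in range(K)] for _ in range(c)]
    for k in range(K):
        col = sum(u[i][k] for i in range(c))
        for i in range(c):
            u[i][k] /= col
    v = []
    for _ in range(max_iter):
        # Step 2: prototypes as membership-weighted means of the terms, Eq. (3).
        v = []
        for i in range(c):
            w = [u[i][k] ** m for k in range(K)]
            total = sum(w)
            v.append([sum(w[k] * terms[k][j] for k in range(K)) / total
                      for j in range(dim)])
        # Step 3: new memberships from distances to the prototypes, Eq. (4).
        new_u = [[0.0] * K for _ in range(c)]
        for k in range(K):
            dist = [math.dist(v[i], terms[k]) for i in range(c)]
            zeros = [i for i in range(c) if dist[i] == 0.0]
            for i in range(c):
                if zeros:
                    # Term sits exactly on a prototype: share membership there.
                    new_u[i][k] = 1.0 / len(zeros) if i in zeros else 0.0
                else:
                    new_u[i][k] = 1.0 / sum(
                        (dist[i] / dist[j]) ** (2.0 / (m - 1.0))
                        for j in range(c))
        # Step 4: stop once the largest membership change is at most eps
        # (max-norm is an assumption; the norm is not specified in the text).
        change = max(abs(new_u[i][k] - u[i][k])
                     for i in range(c) for k in range(K))
        u = new_u
        if change <= eps:
            break
    return u, v
```

Step 5 (concept labeling) could then pick, for each cluster `i` and each language, the term `k` with the largest `u[i][k]`, as described in the text.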