Czech Grammar Error Correction with a Large and Diverse Corpus
Jakub Náplava, Milan Straka, Jana Straková, Alexandr Rosen
Charles University, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics, Czech Republic
{naplava,straka,strakova}@ufal.mff.cuni.cz
Charles University, Faculty of Arts
Institute of Theoretical and Computational Linguistics, Czech Republic
alexandr.rosen@ff.cuni.cz
Abstract

We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC) with the aim to contribute to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high error density essays written by non-native speakers, to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline to future research. Finally, we meta-evaluate common GEC metrics against human judgments on our data. We make the new Czech GEC corpus publicly available under the CC BY-SA 4.0 license at http://hdl.handle.net/11234/1-4639.

1 Introduction

Representative data both in terms of size and domain coverage are vital for NLP systems development. However, in the field of grammar error correction (GEC), most GEC corpora are limited to corrections of mistakes made by foreign or second language learners even in the case of English (Tajiri et al., 2012; Dahlmeier et al., 2013; Yannakoudakis et al., 2011, 2018; Ng et al., 2014; Napoles et al., 2017). At the same time, as recently pointed out by Flachs et al. (2020), learner corpora are only a part of the full spectrum of GEC applications. To alleviate the skewed perspective, the authors released a corpus of website texts.

Despite recent efforts aimed to mitigate the notorious shortage of national GEC-annotated corpora (Boyd, 2018; Rozovskaya and Roth, 2019; Davidson et al., 2020; Syvokon and Nahorna, 2021; Cotet et al., 2020; Náplava and Straka, 2019), the lack of adequate data is even more acute in languages other than English. We aim to address both the issue of scarcity of non-English data and the ubiquitous need for broad domain coverage by presenting a new, large and diverse Czech corpus, expertly annotated for GEC.

Grammar Error Correction Corpus for Czech (GECCC) includes texts from multiple domains in a total of 83058 sentences, being, to our knowledge, the largest non-English GEC corpus, as well as being one of the largest GEC corpora overall.

In order to represent a diversity of writing styles and origins, besides essays of both native and non-native speakers from Czech learner corpora, we also scraped website texts to complement the learner domain with supposedly lower error density texts, encompassing a representation of the following four domains:
• Native Formal – essays written by native students of elementary and secondary schools
• Native Web Informal – informal website discussions
• Romani – essays written by children and teenagers of the Romani ethnic minority
• Second Learners – essays written by non-native learners

Using the presented data, we compare several state-of-the-art Czech GEC systems, including some Transformer-based.

Finally, we conduct a meta-evaluation of GEC metrics against human judgments to select the most appropriate metric for evaluating corrections on the new dataset. The analysis is performed across domains, in line with Napoles et al. (2019).
Transactions of the Association for Computational Linguistics, vol. 10, pp. 452–467, 2022. https://doi.org/10.1162/tacl_a_00470
Action Editor: Alice Oh. Submission batch: 6/2021; Revision batch: 11/2021; Published 4/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Language   Corpus           Sentences  Err. r.  Domain                                       # Refs.
English    Lang-8           1147451    14.1%    SL                                           1
English    NUCLE            57151      6.6%     SL                                           1
English    FCE              33236      11.5%    SL                                           1
English    W&I+LOCNESS      43169      11.8%    SL, native students                          5
English    CoNLL-2014 test  1312       8.2%     SL                                           2, 10, 8
English    JFLEG            1511       -        SL                                           4
English    GMEG             6000       -        web, formal articles, SL                     4
English    AESW             over 1M    -        scientific writing                           1
English    CWEB             13574      ~2%      web                                          2
Czech      AKCES-GEC        47371      21.4%    SL essays, Romani ethnolect of Czech         2
German     Falko-MERLIN     24077      16.8%    SL essays                                    1
Russian    RULEC-GEC        12480      6.4%     SL, heritage speakers                        1
Spanish    COWS-L2H         12336      -        SL, heritage speakers                        2
Ukrainian  UA-GEC           20715      7.1%     natives/SL, translations and personal texts  2
Romanian   RONACC           10119      -        native speakers transcriptions               1
Table 1: Comparison of GEC corpora in size, token error rate, domain, and number of reference
annotations in the test portion. SL = second language learners.
Our contributions include (i) a large and diverse Czech GEC corpus, covering learner corpora and website texts, with unified and, in some domains, completely new GEC annotations, (ii) a comparison of Czech GEC systems, and (iii) a meta-evaluation of common GEC metrics against human judgment on the released corpus.

2 Related Work

2.1 Grammar Error Correction Corpora

Until recently, attention has been focused mostly on English, while GEC data resources for other languages were in short supply. Here we list a few examples of English GEC corpora, collected mostly within an English-as-a-second-language (ESL) paradigm. For a comparison of their relevant statistics see Table 1.

Lang-8 Corpus of Learner English (Tajiri et al., 2012) is a corpus of English language learner texts from the Lang-8 social networking system.

NUCLE (Dahlmeier et al., 2013) consists of essays written by undergraduate students of the National University of Singapore.

FCE (Yannakoudakis et al., 2011) includes short essays written by non-native learners for the Cambridge ESOL First Certificate in English.

W&I+LOCNESS is a union of two datasets, the W&I (Write & Improve) dataset (Yannakoudakis et al., 2018) of non-native learners' essays, complemented by the LOCNESS corpus (Granger, 1998), a collection of essays written by native English students.

The GEC error annotations for the learner corpora above were distributed with the BEA-2019 Shared Task on Grammatical Error Correction (Bryant et al., 2019).

The CoNLL-2014 shared task test set (Ng et al., 2014) is often used for GEC systems evaluation. This small corpus consists of 50 essays written by 25 South-East Asian undergraduates.

JFLEG (Napoles et al., 2017) is another frequently used GEC corpus with fluency edits in addition to usual grammatical edits.

To broaden the restricted variety of domains, focused primarily on learner essays, a CWEB collection (Flachs et al., 2020) of website texts was recently released, aiming at contributing lower error density data.

AESW (Daudaravicius et al., 2016) is a large corpus of scientific writing (over 1M sentences), edited by professional editors.

Finally, Napoles et al. (2019) recently released GMEG, a corpus for the evaluation of GEC metrics across domains.

Grammatical error correction corpora for languages other than English are less common and, if available, usually limited in size and domain: German Falko-MERLIN (Boyd, 2018), Russian RULEC-GEC (Rozovskaya and Roth, 2019),
Spanish COWS-L2H (Davidson et al., 2020), Ukrainian UA-GEC (Syvokon and Nahorna, 2021), and Romanian RONACC (Cotet et al., 2020).

To better account for multiple correction options, datasets often contain several reference sentences for each original noisy sentence in the test set, proposed by multiple annotators. As we can see in Table 1, the number of annotations typically ranges between 1 and 5 with an exception of the CoNLL14 test set, which, on top of the official 2 reference corrections, later received 10 annotations from Bryant and Ng (2015) and 8 alternative annotations from Sakaguchi et al. (2016).

2.2 Czech Learner Corpora

By the early 2010s, Czech was one of a few languages other than English to boast a series of learner corpora, compiled under the umbrella project AKCES, evoking the concept of acquisition corpora (Šebesta, 2010).

The native section includes transcripts of hand-written essays (SKRIPT 2012) and classroom conversation (SCHOLA 2010) from elementary and secondary schools. Both have their counterparts documenting the Roma ethnolect of Czech:¹ essays (ROMi 2013) and recordings and transcripts of dialogues (ROMi 1.0).²

The non-native section goes by the name of CzeSL, the acronym of Czech as the Second Language. CzeSL consists of transcripts of short hand-written essays collected from non-native learners with various levels of proficiency and native languages, mostly students attending Czech language courses before or during their studies at a Czech university. There are several releases of CzeSL, which differ mainly to what extent and how the texts are annotated (Rosen et al., 2020).³

More recently, hand-written essays have been transcribed and annotated in TEITOK (Janssen, 2016),⁴ a tool combining a number of corpus compilation, annotation and exploitation functionalities.

Learner Czech is also represented in MERLIN, a multilingual (German, Italian, and Czech) corpus built in 2012–2014 from texts submitted as a part of tests for language proficiency levels (Boyd et al., 2014).⁵

Finally, AKCES-GEC (Náplava and Straka, 2019) is a GEC corpus for Czech created from the subset of the above mentioned AKCES resources (Šebesta, 2010): the CzeSL-man corpus (non-native Czech learners with manual annotation) and a part of the ROMi corpus (speakers of the Romani ethnolect).

Compared to the AKCES-GEC, the new GECCC corpus contains much more data (47371 sentences vs. 83058 sentences, respectively), by extending data in the existing domains and also adding two new domains: essays written by native learners and website texts, making it the largest non-English GEC corpus and one of the largest GEC corpora overall.

¹ The Romani ethnolect of Czech is the result of contact with Romani as the linguistic substrate. To a lesser (and weakening) extent the ethnolect shows some influence of Slovak or even Hungarian, because most of its speakers have roots in Slovakia. The ethnolect can exhibit various specifics across all linguistic levels. However, nearly all of them are complementary with their colloquial or standard Czech counterparts. A short written text, devoid of phonological properties, may be hard to distinguish from texts written by learners without the Romani background. The only striking exception are misspellings in contexts where the latter benefit from more exposure to written Czech. The typical example is the omission of word boundaries within phonological words, e.g., between a clitic and its host. In other respects, the pattern of error distribution in texts produced by ethnolect speakers is closer to native rather than foreign learners (Bořkovcová, 2007, 2017).
² A more recent release SKRIPT 2015 includes a balanced mix of essays from SKRIPT 2012 and ROMi 2013. For more details and links see http://utkl.ff.cuni.cz/akces/.
³ For a list of CzeSL corpora with their sizes and annotation details see http://utkl.ff.cuni.cz/learncorp/.
⁴ http://www.teitok.org.
⁵ https://www.merlin-platform.eu.

3 Annotation

3.1 Data Selection

We draw the original uncorrected data from the following Czech learner corpora or Czech websites:
• Native Formal – essays written by native students of elementary and secondary schools from the SKRIPT 2012 learner corpus, compiled in the AKCES project
• Native Web Informal – newly annotated informal website discussions from Czech Facebook Dataset (Habernal et al., 2013a,b) and Czech news site novinky.cz.
• Romani – essays written by children and teenagers of the Romani ethnic minority from the ROMi corpus of the AKCES project and the ROMi section of the AKCES-GEC corpus
• Second Learners – essays written by non-native learners, from the Foreigners section of the AKCES-GEC corpus, and the MERLIN corpus

Since we draw our data from several Czech corpora originally created in different tools with different annotation schemes and instructions, we re-annotated the errors in a unified manner for the entire development and test set and partially also for the training set.

The data split was carefully designed to maintain representativeness, coverage and backwards compatibility. Specifically, (i) test and development data contain roughly the same amount of annotated data from all domains, (ii) original AKCES-GEC dataset splits remain unchanged, and (iii) additional available detailed annotations such as user proficiency level in MERLIN were leveraged to support the split balance. Overall, the main objective was to achieve a representative cover over development and testing data. Table 2 presents the sizes of data resources in the number of documents. The first column (Documents) shows the number of all available documents collected in an initial scan. The second column (Selected) is a selected subset from the available documents, due to budgetary constraints and to achieve a representative sample over all domains and data portions. The relatively higher number of documents selected for the Native Web Informal domain is due to its substantially shorter texts, yielding fewer sentences; also, we needed to populate this part of the corpus as a completely new domain with no previously annotated data.

To achieve more fine-grained balancing of the splits, we used additional metadata where available: users' proficiency levels and origin language from MERLIN and the age group from AKCES.

Dataset          Documents  Selected
AKCES-GEC-test   188        188
AKCES-GEC-dev    195        195
MERLIN           441        385
Novinky.cz       -          2695
Facebook         10000      3850
SKRIPT 2012      394        167
ROMi             1529       218
Table 2: Data resources for the new Czech GEC corpus. The second column (Selected) shows the size of the selected subset from all available documents (first column, Documents).

3.2 Preprocessing

De/tokenization is an important part of data preprocessing in grammar error correction. Some formats, such as the M² format (Dahlmeier and Ng, 2012), require tokenized formats to track and evaluate correction edits. On the other hand, detokenized text in its natural form is required for other applications. We therefore release our corpus in two formats: a tokenized M² format and detokenized format aligned at sentence, paragraph, and document level. As part of our data is drawn from earlier, tokenized GEC corpora AKCES-GEC and MERLIN, this data had to be detokenized. A slightly modified Moses detokenizer⁶ is attached to the corpus. To tokenize the data for the M² format, we use the UDPipe tokenizer (Straka et al., 2016).

3.3 Annotation

The test and development sets in all domains were annotated from scratch by five in-house expert annotators,⁷ including re-annotations of the development and test data of the earlier GEC corpora to achieve a unified annotation style. All the test sentences were annotated by two annotators; one half of the development sentences received two annotations and the second half one annotation. The annotation process took about 350 hours in total.

The annotation instructions were unified across all domains: The corrected text must not contain any grammatical or spelling errors and should sound fluent. Fluency edits are allowed if the original is incoherent. The entire document was given as a context for the annotation. Annotators were instructed to remove documents that were too incomprehensible or those containing private information.

To keep the annotation process simple for the annotators, the sentences were annotated (corrected) in a text editor and postprocessed automatically to retrieve and categorize the GEC edits

⁶ https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl.
⁷ Our annotators are senior undergraduate students of humanities, regularly employed for various annotation efforts at our institute.
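
As a rough illustration of the tokenized M2 format mentioned in Section 3.2, the following Python sketch parses M2-style blocks and applies one annotator's edits to the source tokens. The sample sentence, the error-type labels, and the helper names (Edit, parse_m2, apply_edits) are assumptions made for this example only; they are not taken from the GECCC release, whose labels and tooling may differ.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    start: int        # index of the first source token covered by the edit
    end: int          # index one past the last covered source token
    error_type: str   # error label exactly as written in the file
    correction: str   # replacement text; an empty string means deletion
    annotator: int    # id of the annotator who proposed the edit


def parse_m2(text):
    """Parse M2-style text into a list of (source_tokens, edits) pairs."""
    entries = []
    for block in text.strip().split("\n\n"):
        tokens, edits = [], []
        for line in block.splitlines():
            if line.startswith("S "):
                # Source line: a tokenized sentence, tokens separated by spaces.
                tokens = line[2:].split()
            elif line.startswith("A "):
                # Edit line: "start end|||type|||correction|||...|||annotator".
                fields = line[2:].split("|||")
                start, end = map(int, fields[0].split())
                edits.append(Edit(start, end, fields[1], fields[2], int(fields[-1])))
        entries.append((tokens, edits))
    return entries


def apply_edits(tokens, edits, annotator=0):
    """Return the corrected token list according to one annotator."""
    corrected = list(tokens)
    # Skip "noop" placeholders, which mark sentences left unchanged.
    chosen = [e for e in edits
              if e.annotator == annotator and e.error_type != "noop" and e.start >= 0]
    # Apply from right to left so that earlier token offsets stay valid.
    for e in sorted(chosen, key=lambda e: (e.start, e.end), reverse=True):
        corrected[e.start:e.end] = e.correction.split()
    return corrected


# Hypothetical English sample with two annotators (not from the corpus).
SAMPLE = """S This are a sentence with error .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
A 5 5|||M:DET|||an|||REQUIRED|||-NONE-|||0
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||1
"""

if __name__ == "__main__":
    for tokens, edits in parse_m2(SAMPLE):
        print(" ".join(apply_edits(tokens, edits, annotator=0)))
        # prints: This is a sentence with an error .
```

Because edit offsets refer to token positions, the source must be tokenized consistently, which is one reason a tokenized release is needed alongside the detokenized text. Keeping the annotator id on each edit lets the same source sentence carry several reference corrections.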