267x Filetype PDF File size 0.70 MB Source: aclanthology.lst.uni-saarland.de
Point to the Expression: Solving Algebraic Word Problems using the
Expression-Pointer Transformer Model
BugeunKim KyungSeoKi DonggeonLee GahgeneGweon
Department of Transdisciplinary Studies,
Seoul National University,
Seoul, Republic of Korea
{cd4209,kskee88,lsw5835,ggweon}@snu.ac.kr
Abstract Problem One number is eight more than
Solving algebraic word problems has recently twice another and their sum is 20.
emerged as an important natural language pro- Whataretheir numbers?
cessing task. To solve algebraic word prob- Numbers 1(‘one’), 8(‘eight’), 2(‘twice’), 20.
lems, recent studies suggested neural mod- Equations x −2x =8,x +x =20
0 1 0 1
els that generate solution equations by using Answers (16,4)
‘Op (operator/operand)’ tokens as a unit of
input/output. However, such a neural model Table 1: A sample algebraic word problem
suffered two issues: expression fragmentation
and operand-context separation. To address
each of these two issues, we propose a pure et al., 2018; Amini et al., 2019; Chiang and Chen,
neuralmodel,Expression-PointerTransformer 2019; Wang et al., 2019). However, suggested
(EPT), which uses (1) ‘Expression’ token and neural models showed a fairly large performance
(2) operand-context pointers when generat-
ing solution equations. The performance of gap compared to existing state-of-the-art models
the EPT model is tested on three datasets: based on hand-crafted features in popular algebraic
ALG514, DRAW-1K, and MAWPS. Com- wordproblemdatasets,suchasALG514(44.5%for
pared to the state-of-the-art (SoTA) models, pureneuralmodelvs. 83.0%forusinghand-crafted
the EPT model achieved a comparable perfor- features) (Huangetal.,2018;UpadhyayandChang,
mance accuracy in each of the three datasets; 2016). To address the large performance gap in
81.3% on ALG514, 59.5% on DRAW-1K, this study, we propose a larger unit of input/output
and 84.5% on MAWPS. The contribution of
this paper is two-fold; (1) We propose a (I/O) token called “Expressions” for a pure neural
pure neural model, EPT, which can address model. Figure1illustratesconventionallyused“Op
the expression fragmentation and the operand- (operator/operands)” versus our newly proposed
context separation. (2) The fully automatic “Expression” token.
EPT model, which does not use hand-crafted To improve the performance of pure neural
features, yields comparable performance to models that can solve algebraic word problems, we
existing models using hand-crafted features,
and achieves better performance than existing identified two issues that can be addressed using
pure neural models by at most 40%. Expression tokens, which are shown in Figure
1 Introduction 1: (1) expression fragmentation and (2) operand-
context separation. First, the expression fragmen-
Solving algebraic word problems has recently tation issue is a segmentation of an expression
become an important research task in that auto- tree, which represents a computational structure
matically generating solution equations requires of equations that are used to generate a solution.
understanding natural language. Table 1 shows This issue arises when Op, rather than the whole
a sample algebraic word problem, along with expression tree, is used as an input/output unit
corresponding solution equations that are used of a problem-solving model. For example, as
to generate answers for the problem. To solve shown in Figure 1 (a), using Op tokens as an
such problems with deep learning technology, input to a problem-solving model disassembles a
researchers recently suggested neural models that tree structure into operators (“×”) and operands
generate solution equations automatically (Huang (“x1” and “2”). Meanwhile, we propose using the
3768
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 3768–3779,
c
November16–20,2020.
2020Association for Computational Linguistics
Figure 1: Illustration using the word problem in Table 1 for the (a) expression fragmentation issue, (b) operand-
context separation issue, and (c) our solution for these two issues.
“Expression”(×(x1,2))token,whichcanexplicitly two studies are presented. Section 5.1 presents a
capture a tree structure as a whole, as shown in performancecomparisonbetweenEPTandexisting
Figure 1 (c). SoTA models. Section 5.2 presents an ablation
Thesecondissue of operand-context separation study examining the effects of Expression tokens
is the disconnection between an operand and and applying operand-context pointers. Finally, in
a number that is associated with the operand. Section 6, a conclusion is presented with possible
This issue arises when a problem-solving model future directions for our work.
substitutes a number stated in an algebraic word
problem into an abstract symbol for generalization. 2 Related work
AsshowninFigure1(b),whenusinganOptoken,
the number 8 is changed into an abstract symbol Our goal is to design a pure neural model that
‘N1’. Meanwhile, when using an Expression token, generates equations using ‘Expression’ tokens to
the number 8 is not transformed into a symbol. solve algebraic word problems. Early attempts
Rather a pointer is made to the location where the for solving algebraic word problems noted the
number8occurredinanalgebraic word problem. importance of Expressions in building models with
Therefore, using such an “operand-context pointer” hand-crafted features (Kushman et al., 2014; Roy
enables a model to access contextual information et al., 2015; Roy and Roth, 2015; Zhou et al., 2015;
about the number directly, as shown in Figure 1 (c); Upadhyay et al., 2016). However, recent neural
thus, the operand-context separation issue can be models have only utilized ‘Op (operator/operand)’
addressed. tokens (Wang et al., 2017; Amini et al., 2019;
In this paper, we propose a pure neural model Chiang and Chen, 2019; Huang et al., 2018; Wang
called Expression-Pointer Transformer (EPT) to et al., 2019), resulting in two issues: (1) the
address the two issues above. The contribution of expression fragmentation issue and (2) the operand-
this paper is two-fold; context separation issue. In the remaining section,
wepresent existing methods for tackling each of
1. Wepropose a pure neural model, Expression- these two issues.
PointerTransformer(EPT),whichcanaddress To address the expression fragmentation issue,
the expression fragmentation and operand- researchers tried to reflect relational information
context separation issues. between operators and operands either by using a
2. The EPT model is the first pure neural model two-step procedure or a single step with sequence-
that showed comparable accuracy to the exist- to-sequence models. Earlier attempts predicted
ing state-of-the-art models, which used hand- operators and their operands by using a two-step
crafted features. Compared to the state-of- procedure. Such early models selected operators
the-art pure neural models, the EPT achieves first by classifying a predefined template (Kushman
better performance by about 40%. et al., 2014; Zhou et al., 2015; Upadhyay et al.,
2016), then in the second step, operands were ap-
In the rest of the paper, we introduce existing plied to the template selected in the first step. Other
approaches to solve algebraic word problems models selected operands first before constructing
in Section 2. Next, Section 3 introduces our expression trees with operators in the second step
proposed model, EPT, and Section 4 reports the (Royet al., 2015; Roy and Roth, 2015). However,
experimental settings. Then in Section 5, results of such two-step procedures in these early attempts
3769
Input Output Expression token Meaning
position index Operator (f ) Operand 0 (a ) Operand 1 (a )
i i0 i1
0 - BEGIN (Start an equation)
1 R VAR (Generate variable x )
0 0
2 R VAR (Generate variable x )
1 1
3 R × 2 R 2x
2 1 1
4 R − R R x −2x
3 0 2 0 1
5 R = R 8 x −2x =8
4 3 0 1
6 R + R R x +x
5 0 1 0 1
7 R = R 20 x +x =20
6 5 0 1
- R7 END (Gather all equations)
Table 2: The Expression token sequence for x − 2x = 8 and x +x = 20
0 1 0 1
can be performed via a single-step procedure with Amini et al., 2019). For example, Huang et al.
neural models. Specifically, recent attempts have (2018) used a pointer-generator network that can
utilized sequence-to-sequence (seq2seq) models as point to the context of a number in a given math
a single-step procedure to learn the implicit rela- problem. Although Huang’s model can address the
tionship between operators and operands (Amini operand-context separation issue using pointers,
et al., 2019; Chiang and Chen, 2019; Wang et al., their pure neural model did not yield a comparable
2019). For example, to capture the operator- performance to the state-of-the-art model using
operand relationship, Chiang and Chen (2019) hand-crafted features (44.5% vs. 83.0%). In this
constructed a seq2seq model that used push/pop paper, we propose that by including additional
actions on a stack for generating operator/operand pointers that utilize the contextual information
tokens. Similarly, Amini et al. (2019) built a of operands and neighboring Expression tokens,
seq2seq model to generate an operator token right performance of pure neural models can improve.
after producing required operand tokens. However, 3 EPT:Expression-Pointer Transformer
although these seq2seq approaches consider rela-
tional information of operands when generating Figure 2 shows the proposed Expression-Pointer
operators, the approach still does not address Transformer (EPT)1 model, which adopts the
the problem of lacking relation information of encoder-decoder architecture of a Transformer
operators when generating operands. On the other model (Vaswani et al., 2017). The EPT utilizes
hand, by using Expression token, our model can the ALBERTmodel(Lanetal.,2019),apretrained
consider relational information when generating language model, as the encoder. The encoder input
both operator and operands. is tokenized words of the given word problem, and
Secondly, there were efforts to address the encoderoutputistheencoder’shidden-statevectors
operand-context separation issue. To utilize contex- that denote numeric contexts of the given problem.
tual information of an operand token, researchers After obtaining the encoder’s hidden-state vec-
built hand-craftedfeaturesthatcapturethesemantic tors from the ALBERT encoder, the transformer
content of a word, such as the unit of a given num- decoder generates ‘Expression’ tokens. The two
ber (Roy and Roth, 2015; Koncel-Kedziorski et al., decoder inputs are Expression tokens and the
2015; Zhou et al., 2015; Upadhyay et al., 2016; ALBERT encoder’s hidden-state vectors, which
Roy and Roth, 2017) or dependency relationship are used as memories. For the given example
between numbers (Kushman et al., 2014; Zhou problem, the input is a list of 8 Expression tokens
et al., 2015; Upadhyay et al., 2016). However, shown in Table 2. We included three special
devising hand-crafted input features was time- commands in the list: VAR (generate a variable),
consuming and required domain expertise. There- BEGIN (start an equation), and END (gather all
fore, recent approaches have employed distributed equations). Following the order specified in the list
representations and neural models to learn numeric of Table 2, the EPT receives one input Expression
context of operands automatically (Wang et al., 1The code is available on https://github.com/
2017; Huang et al., 2018; Chiang and Chen, 2019; snucclab/ept.
3770
Figure 2: The architecture of Expression-Pointer Transformer (EPT) where two ideas applied: (1) Expression
token and (2) operand-context pointer.
at a time. For the ith Expression input, the model The embedding vector a , which represents
ij
computes an input vector vi. The EPT’s decoder the jth operand of ith Expression, is calculated
then transforms this input vector to a decoder’s differently according to the operand aij’s source.
hidden-state vector d . Finally, the EPT predicts Toreflectcontextualinformationofoperands, three
i
the next Expression token by generating the next possible sources are utilized: problem-dependent
operator and operands simultaneously. numbers, problem-independent constants, and the
To produce ‘Expression’ tokens, two compo- result of prior Expression tokens. First, problem-
nents are modified from the vanilla Transformer: dependent numbers are numbers provided in an
input vector and output layer. In the following algebraic problem (e.g., ‘20’ in Table 1). To
subsections, we explain the two components. compute a of a number, we reuse the encoder’s
ij
hidden-state vectors corresponding to such number
3.1 Input vector of EPT’s decoder tokens as follows:
The input vector vi of ith Expression token is a =LN c u +e , (3)
obtained by combining operator embedding f and ij a a num aij
i
operand embedding a as follows: where u denotes a vector representing the source,
ij and e ∗is the encoder’s hidden-state vector cor-
aij
v =FF (Concat(f ,a ,a ,···,a )), (1) 2
i in i i1 i2 ip responding to the number a . Second, problem-
ij
independent constants are predefined numbers that
where FF indicates a feed-forward linear layer, are not stated in the problem (e.g., 100 is often used
∗
and Concat(·) means concatenation of all vectors for percentiles). To compute aij of a constant, we
inside the parentheses. All the vectors, including use a look-up table Ec as follows:
v , f , and a , have the same dimension D. For-
i i ij a =LN (c u +E(a )). (4)
mulae for computing the two types of embedding ij a a const c ij
vectors, f and a are stated in the next paragraph.
i ij Note that LN , c are shared across different
For the operator token f of ith Expression, the a a
i sources. Third, the result of the prior Expression
EPTcomputestheoperator embedding vector fi as token is an Expression generated before the ith
in Vaswani et al. (2017)’s setting: Expression (e.g., R ). To compute a of a result,
0 ij
3
weutilize the positional encoding as follows :
f =LN (c E (f )+PE(i)), (2)
i f f f i a =LN (c u +PE(k)), (5)
ij a a expr
where E∗(·) indicates a look-up table for embed- 2Whentwoormoretokensformanumberintheproblem,
ding vectors, c denotes a scalar parameter, and weaveraged all related hidden-state vectors.
∗ 3Since we want to sustain simultaneous decoding, which
LN (·) and PE(·) represent layer normalization
∗ is one of the strengths in the Transformer, we use PE(k) for
(Ba et al., 2016) and positional encoding (Vaswani the kth prior Expression, although it is possible to use decoder
et al., 2017), respectively. hidden state dk.
3771
no reviews yet
Please Login to review.