We suggest two ideas for constituency parsing: a score-based parser and a transformer-based architecture.
We attempt to learn a score function that induces a recursive split of a given sentence into its phrase structure.
Our score function depends on the sentence, spans in the sentence (tuples of word indices), and the production rules.
A span is specified by a tuple of word indices into the sentence (in the dynamic program below, a 4-tuple $(i, j, k, l)$).
Note: We assume that each phrase splits into at most 2 phrases. Thus, we use the Chomsky Normal Form of the phrase-structure trees in the Penn Treebank.
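As an illustration, this binarisation can be done with NLTK's tree transforms (a minimal sketch; the toy tree and the use of `nltk` are our assumptions about the preprocessing, not part of the model itself):

```python
from nltk import Tree

# A toy phrase-structure tree with a ternary production (VP -> V NP PP).
t = Tree.fromstring(
    "(S (NP (D the) (N dog)) "
    "(VP (V put) (NP (D the) (N bone)) (PP (P on) (NP (D the) (N mat)))))"
)

# Binarise in place so that every phrase splits into at most two phrases.
t.chomsky_normal_form()
t.pretty_print()
```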
\[\mathrm{score}(w, s, r; W_1, W_2) = g(W_1\, w[s])^{T}\, W_2\, r\]
where the sentence representation $w$ contains a vector for each word, $w[s]$ picks out the words indexed by the span $s$, $g$ is a non-linearity, $r$ is the encoded rule vector described below, and $W_1$, $W_2$ are the weight matrices to be learned.
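A minimal sketch of this score function (the choice of $g = \tanh$, mean-pooling the word vectors picked out by the span, and the dimensions are our assumptions, not fixed by the description above):

```python
import numpy as np

def score(w, s, r, W1, W2, g=np.tanh):
    """Score a (span, rule) pair for one sentence.

    w  : (n, d_w) array of word representations for the sentence
    s  : tuple of word indices defining the span
    r  : (d_r,) encoded rule vector
    W1 : (d_h, d_w) weight matrix, W2 : (d_h, d_r) weight matrix
    """
    w_s = w[list(s)].mean(axis=0)      # assumed pooling for w[s]
    return g(W1 @ w_s) @ (W2 @ r)      # g(W_1 w[s])^T W_2 r
```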
For all labels (non-terminals) in the dataset, we one-hot encode each label as a vector.
For a production rule, we build an encoded vector $r$ of the same length as the label vectors by placing values at the positions of the labels that participate in the rule.
For example, for a rule $A \to B\,C$, the value $a$ is placed at the position of $A$, $b$ at the position of $B$, and $c$ at the position of $C$.
For a given encoded vector $r$, the possible cases are:
- $A \to t$: $t$ is a terminal and thus $r = (0 \ldots a \ldots 0)^T$
- $A \to A$: $r = (0 \ldots a+b \ldots 0)^T$
- $A \to B$: $r = (0 \ldots a \ldots b \ldots 0)^T$ or $r = (0 \ldots b \ldots a \ldots 0)^T$
- $A \to A\,A$: $r = (0 \ldots a+b+c \ldots 0)^T$
- $A \to B\,B$: $r = (0 \ldots a \ldots b+c \ldots 0)^T$ or $r = (0 \ldots b+c \ldots a \ldots 0)^T$
- $A \to B\,C$: $r = (0 \ldots a \ldots b \ldots c \ldots 0)^T$, and all possible permutations of the positions of $a$, $b$ and $c$
We choose distinct values for $a$, $b$ and $c$ so that different rules map to different encoded vectors.
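As an illustration, a small sketch of this encoding (the label set and the concrete values $a = 1$, $b = 2$, $c = 3$ are placeholder assumptions):

```python
import numpy as np

LABELS = ["S", "NP", "VP", "PP", "N", "V", "D", "P"]   # illustrative label set
IDX = {lab: i for i, lab in enumerate(LABELS)}
A_VAL, B_VAL, C_VAL = 1.0, 2.0, 3.0                    # assumed distinct values for a, b, c

def encode_rule(lhs, rhs):
    """Encode a (binarised) production rule as a vector over the label set.

    lhs : left-hand-side label, e.g. "S"
    rhs : right-hand-side labels (length 1 or 2); terminals contribute nothing.
    """
    r = np.zeros(len(LABELS))
    r[IDX[lhs]] += A_VAL
    for val, lab in zip((B_VAL, C_VAL), rhs):
        if lab in IDX:                                 # only non-terminal children
            r[IDX[lab]] += val
    return r

print(encode_rule("S", ["NP", "VP"]))    # a at S, b at NP, c at VP
print(encode_rule("NP", ["NP", "NP"]))   # a + b + c summed at the NP position
```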
For a given sentence $w_d$ with gold parse tree $T_d$, the score of the tree decomposes into a sum over the (rule, span) pairs that make up the tree.
Thus, we can learn the weights using the log-likelihood of the conditional distribution over trees given the sentence:
\[L(W_1, W_2) = \sum_{d=1}^{D} \sum_{(r,s) \in T_d} \mathrm{score}(w_d, s, r; W_1, W_2)\]
where $D$ is the number of training sentences and $T_d$ is the set of (rule, span) pairs in the gold tree of sentence $w_d$.
We model the weight matrices $W_1$ and $W_2$ as the learnable parameters of the model.
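A sketch of how the weights could be trained on this objective (the use of PyTorch, the dimensions, and the pooling inside `score` are our assumptions; the actual implementation is in the repository referenced below):

```python
import torch

d_w, d_r, d_h = 100, 8, 64                 # assumed dimensions
W1 = torch.randn(d_h, d_w, requires_grad=True)
W2 = torch.randn(d_h, d_r, requires_grad=True)

def score(w, s, r):
    w_s = w[list(s)].mean(dim=0)           # same pooling assumption as above
    return torch.tanh(W1 @ w_s) @ (W2 @ r) # g(W_1 w[s])^T W_2 r

def neg_objective(dataset):
    """dataset: list of (w_d, T_d), with w_d an (n, d_w) tensor and
    T_d an iterable of (r, s) pairs from the gold tree."""
    total = sum(score(w_d, s, r) for w_d, T_d in dataset for r, s in T_d)
    return -total                          # minimise the negative summed score

opt = torch.optim.Adam([W1, W2], lr=1e-3)
# typical step:  opt.zero_grad(); neg_objective(batch).backward(); opt.step()
```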
Given a new sentence, we compute the parse tree as follows:
At each step, we compute the best span and rule as follows: \[s^*, r^* = \arg\max_{s, r}\ \mathrm{score}(w, s, r; W_1, W_2)\]
To compute the above arg max, we use dynamic programming to compute the best scores and infer the tree.
- To start with, we define $s_{best}(i, i+1, j, j+1)$ for all $i < j - 1$ as \[s_{best}(i, i+1, j, j+1) = \max_r\ \mathrm{score}(w, (i, i+1, j, j+1), r; W_1, W_2)\] The rule to choose is the argument $r$ for which the above is maximised.
- With our DP we wish to compute $s_{best}(0, n-3, n-2, n-1)$, where $n$ is the size of the input sentence.
- We reduce the above problem into smaller problems by allowing the indices $(i, j, k, l)$ to move right, left, left and right respectively. Thus, there are $2^4 = 16$ possible sub-problems to maximise over. Let the set of $s_{best}$ scores for all these smaller problems be $S_{best}$.
- The recursion expressing this dynamic programming algorithm (sketched in code below) is: \[s_{best}(i, j, k, l) = \max\left\{ \max_r\ \mathrm{score}(w, (i, j, k, l), r; W_1, W_2),\ \max S_{best} \right\}\]
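A memoised sketch of this recursion (the validity condition on the index moves, i.e. staying inside the sentence while keeping $i < j \le k < l$, is our interpretation of the scheme above; the repository contains the actual implementation):

```python
from functools import lru_cache
from itertools import product

def best_score(w, rules, score_fn, n):
    """rules: list of encoded rule vectors; score_fn(w, span, r) -> float."""

    @lru_cache(maxsize=None)
    def sbest(i, j, k, l):
        # Best score for the current index tuple itself, maximised over rules.
        best = max(score_fn(w, (i, j, k, l), r) for r in rules)
        # The 15 non-trivial combinations of moving i right, j left, k left, l right.
        for di, dj, dk, dl in product((0, 1), repeat=4):
            if (di, dj, dk, dl) == (0, 0, 0, 0):
                continue
            ni, nj, nk, nl = i + di, j - dj, k - dk, l + dl
            if 0 <= ni < nj <= nk < nl <= n - 1:       # assumed validity condition
                best = max(best, sbest(ni, nj, nk, nl))
        return best

    return sbest(0, n - 3, n - 2, n - 1)
```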
We build the tree bottom-up, using the best scores to split the sentence into a tree.
The code and the results for this model are at this repository.
We also have another idea, which we could not implement or train, but which we wish to present nevertheless.
We extend the architecture of the Tree Transformer proposed by Yau-Shian Wang et al., which introduces an additional “Constituent Attention” module that implements self-attention between adjacent words to induce tree structures on sequential language data. We define an additional label prior on top of the constituent prior defined in the paper.
The constituent attention module lets each word attend to its neighboring words to define the “score” or “probability” of forming a constituent with those words. The information missing here is the label information, which we induce using a label attention module.
Consider a query vector for each label: the label attention module scores candidate spans against these label queries to decide which label a span receives.
We train this model using the same training objective as the Tree-Transformer and perform parsing
as described in the paper. Additionally, to infer the label, we use the span representation $s_{ij}$
for positions $i$ through $j$ and score it against the label query vectors.
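A minimal sketch of this label inference step (the per-label query vectors, the dot-product scoring, and the dimensions are our assumptions; the Tree-Transformer paper itself does not define a label attention module):

```python
import torch
import torch.nn as nn

class LabelAttention(nn.Module):
    """Scores a span representation against one learned query vector per label."""

    def __init__(self, num_labels, d_model):
        super().__init__()
        self.label_queries = nn.Parameter(torch.randn(num_labels, d_model))

    def forward(self, span_rep):
        # span_rep: (d_model,) representation s_ij of the span (i, j)
        logits = self.label_queries @ span_rep    # one score per label
        return torch.softmax(logits, dim=-1)      # distribution over labels

# usage: label_id = LabelAttention(num_labels=26, d_model=512)(s_ij).argmax()
```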