LIN3022 Natural Language Processing
Lecture 9
Albert Gatt

In this lecture
We continue with our discussion of parsing algorithms.
We introduce dynamic programming approaches.
We then look at probabilistic context-free grammars.
  Statistical parsers

Part 1
Dynamic programming approaches

Top-down vs bottom-up search
Top-down:
  Never considers derivations that do not end up at root S.
  Wastes a lot of time with trees that are inconsistent with the input.
Bottom-up:
  Generates many subtrees that will never lead to an S.
  Only considers trees that cover some part of the input.
NB: With both top-down and bottom-up approaches, we view parsing as a search problem.

Beyond top-down and bottom-up

One of the problems we identified with top-down and bottom-up search is that they are wasteful.
These algorithms proceed by searching through all possible alternatives at every stage of processing.
Wherever there is local ambiguity, these possible alternatives multiply.
There is lots of repeated work.
  Both S → NP VP and S → VP involve a VP.
  The VP rule is therefore applied twice!
Ideally, we want to break up the parsing problem into sub-problems and avoid doing all this extra work.

Extra effort in top-down parsing
Input: a flight from Indianapolis to Houston.
  NP → Det Nominal rule. (Dead end)
  NP → Det Nominal PP + Nominal → Noun PP (Dead end)
  NP → Det Nominal + Nominal → Nominal PP + Nominal → Nominal PP

Dynamic programming
In essence, dynamic programming involves solving a task by breaking it up into smaller sub-tasks.
In general, this is carried out by:
1. Breaking up a problem into sub-problems.
2. Creating a table which will contain solutions to each sub-problem.
3. Resolving each sub-problem and populating the table.
4. Reading off the complete solution from the table, by combining the solutions to the sub-problems.

Dynamic programming for parsing
Suppose we need to parse: Book that flight.
We can split the parsing problem into sub-problems as follows:
  Store sub-trees for each constituent in the table.
  This means we only parse each part of the input once.
  In case of ambiguity, we can store multiple possible sub-trees for each piece of input.

Part 2
The CKY Algorithm and Chomsky Normal Form

CKY parsing

Classic, bottom-up dynamic programming algorithm (Cocke-Kasami-Younger; also abbreviated CYK).
Requires an input grammar in Chomsky Normal Form (CNF).
A CNF grammar is a Context-Free Grammar in which:
  Every rule LHS is a non-terminal.
  Every rule RHS consists of either a single terminal or two non-terminals.
Examples:
  A → B C
  NP → Nominal PP
  A → a
  Noun → man
But not:
  NP → the Nominal
  S → VP

Chomsky Normal Form
Any CFG can be re-written in CNF, without any loss of expressiveness.
That is, for any CFG, there is a corresponding CNF grammar which accepts exactly the same set of strings as the original CFG.

Converting a CFG to CNF
To convert a CFG to CNF, we need to deal with three issues:
1. Rules that mix terminals and non-terminals on the RHS
   E.g. NP → the Nominal
2. Rules with a single non-terminal on the RHS (called unit productions)
   E.g. NP → Nominal
3. Rules which have more than two items on the RHS
   E.g. NP → Det Noun PP

Converting a CFG to CNF
1. Rules that mix terminals and non-terminals on the RHS
   E.g. NP → the Nominal
   Solution:
     Introduce a dummy non-terminal to cover the original terminal
       E.g. Det → the
     Re-write the original rule:
       NP → Det Nominal
       Det → the

Converting a CFG to CNF
2. Rules with a single non-terminal on the RHS (called unit productions)
   E.g. NP → Nominal
   Solution:
     Find all rules that have the form Nominal → ...
       Nominal → Noun PP
       Nominal → Det Noun
     Re-write the above rule several times to eliminate the intermediate non-terminal:
       NP → Noun PP
       NP → Det Noun
     Note that this makes our grammar flatter.

Converting a CFG to CNF
3. Rules which have more than two items on the RHS
   E.g. NP → Det Noun PP
   Solution:
     Introduce new non-terminals to spread the sequence on the RHS over more than 1 rule.
       Nominal → Noun PP
       NP → Det Nominal

The outcome
If we parse a sentence with a CNF grammar, we know that:
  Every phrase-level non-terminal (above the part of speech level) will have exactly 2 daughters.
    NP → Det N
  Every part-of-speech level non-terminal will have exactly 1 daughter, and that daughter is a terminal:
    N → lady

Part 3
Recognising strings with CKY

Recognising strings with CKY
Example input: The flight includes a meal.
The CKY algorithm proceeds by:
1. Splitting the input into words and indexing each position.
   (0) the (1) flight (2) includes (3) a (4) meal (5)

2. Setting up a table. For a sentence of length n, we need (n+1) rows and (n+1) columns.
3. Traversing the input sentence left-to-right.
4. Using the table to store constituents and their span.

The table
Rule: Det → the
[0,1] for "the"
Chart so far: [0,1] Det

The table
Rule 1: Det → the
Rule 2: N → flight
[0,1] for "the"
[1,2] for "flight"
Chart so far: [0,1] Det; [1,2] N

The table
Rule 1: Det → the
Rule 2: N → flight
Rule 3: NP → Det N
[0,1] for "the"
[1,2] for "flight"
[0,2] for "the flight"
Chart so far: [0,1] Det; [1,2] N; [0,2] NP

A CNF CFG for CKY
S → NP VP
NP → Det N
VP → V NP
V → includes
Det → the
Det → a
N → meal
N → flight

CKY algorithm: two components
Lexical step:
  for j from 1 to length(string) do:
    let w be the word in position j
    find all rules ending in w of the form X → w
    put X in table[j-1,j]
Syntactic step (for each j):
  for i = j-2 down to 0 do:
    for k = i+1 to j-1 do:
      for each rule of the form A → B C do:
        if B is in table[i,k] & C is in table[k,j] then
          add A to table[i,j]

CKY algorithm: two components
We actually interleave the lexical and syntactic steps:
  for j from 1 to length(string) do:
    let w be the word in position j
    find all rules ending in w of the form X → w
    put X in table[j-1,j]
    for i = j-2 down to 0 do:
      for k = i+1 to j-1 do:
        for each rule of the form A → B C do:
          if B is in table[i,k] & C is in table[k,j] then
            add A to table[i,j]
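The interleaved procedure above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the names `LEXICAL`, `BINARY` and `cky_recognise` are made up for this example, the grammar is the toy CNF grammar from this lecture, and the input is assumed to be lower-cased and already split into words.

```python
from collections import defaultdict

# Toy CNF grammar from this lecture, split into lexical rules (X -> w)
# and binary rules (A -> B C). The data layout is an assumption of this sketch.
LEXICAL = {
    "the": {"Det"}, "a": {"Det"},
    "flight": {"N"}, "meal": {"N"},
    "includes": {"V"},
}
BINARY = [("S", "NP", "VP"), ("NP", "Det", "N"), ("VP", "V", "NP")]


def cky_recognise(words, lexical=LEXICAL, binary=BINARY, start="S"):
    """Return (accepted, table); table[(i, j)] holds the non-terminals
    that span words[i:j]."""
    n = len(words)
    table = defaultdict(set)
    for j in range(1, n + 1):
        # Lexical step: every X with X -> w goes into cell [j-1, j].
        table[(j - 1, j)] |= lexical.get(words[j - 1], set())
        # Syntactic step: build wider spans ending at j, shortest first.
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                for a, b, c in binary:
                    if b in table[(i, k)] and c in table[(k, j)]:
                        table[(i, j)].add(a)
    return start in table[(0, n)], table


accepted, table = cky_recognise("the flight includes a meal".split())
print(accepted, sorted(table[(0, 2)]), sorted(table[(2, 5)]))
```

Keying the table on `(i, j)` pairs means only the cells of the upper triangle are ever filled, which matches the chart used in the walkthrough.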

CKY: lexical step (j = 1)
The flight includes a meal.
Lexical lookup
  Matches Det → the
Chart so far: [0,1] Det

CKY: lexical step (j = 2)
Lexical lookup
  Matches N → flight
Chart so far: [0,1] Det; [1,2] N

CKY: syntactic step (j = 2)
Syntactic lookup: look backwards and see if there is any rule that will cover what we've done so far.
Chart so far: [0,1] Det; [1,2] N; [0,2] NP

CKY: lexical step (j = 3)
Lexical lookup
  Matches V → includes
Chart so far: [0,1] Det; [1,2] N; [0,2] NP; [2,3] V

CKY: syntactic step (j = 3)
Syntactic lookup
  There are no rules in our grammar that will cover Det, NP, V.
Chart so far: unchanged

CKY: lexical step (j = 4)
Lexical lookup
  Matches Det → a
Chart so far: [0,1] Det; [1,2] N; [0,2] NP; [2,3] V; [3,4] Det

CKY: lexical step (j = 5)
Lexical lookup
  Matches N → meal
Chart so far: [0,1] Det; [1,2] N; [0,2] NP; [2,3] V; [3,4] Det; [4,5] N

CKY: syntactic step (j = 5)
Syntactic lookup
  We find that we have NP → Det N
Chart adds: [3,5] NP

CKY: syntactic step (j = 5)
Syntactic lookup
  We find that we have VP → V NP
Chart adds: [2,5] VP

CKY: syntactic step (j = 5)
Syntactic lookup
  We find that we have S → NP VP
Chart adds: [0,5] S
Final chart: [0,1] Det; [1,2] N; [0,2] NP; [2,3] V; [3,4] Det; [4,5] N; [3,5] NP; [2,5] VP; [0,5] S

From recognition to parsing
The procedure so far will recognise a string as a legal sentence in English.
But we'd like to get a parse tree back!
Solution:
  We can work our way back through the table and collect all the partial solutions into one parse tree.
  Cells will need to be augmented with backpointers, i.e. with a pointer to the cells that the current cell covers.

From recognition to parsing
[Figure: the completed chart, as above.]
NB: This algorithm always fills the top triangle of the table!

What about ambiguity?
The algorithm does not assume that there is only one parse tree for a sentence.

(Our simple grammar did not admit of any ambiguity, but this isn't realistic of course.)
There is nothing to stop it returning several parse trees.
If there are multiple local solutions, then more than one non-terminal will be stored in a cell of the table.

Part 4
Probabilistic Context Free Grammars

CFG definition (reminder)
A CFG is a 4-tuple (N, Σ, P, S):
  N = a set of non-terminal symbols (e.g. NP, VP)
  Σ = a set of terminals (e.g. words)
    N and Σ are disjoint (no element of N is also an element of Σ)
  P = a set of productions of the form A → β where:
    A is a non-terminal (a member of N)
    β is any string of terminals and non-terminals
  S = a designated start symbol

CFG Example
S → NP VP
S → Aux NP VP
NP → Det Nom
NP → Proper-Noun
Det → that | the | a

Probabilistic CFGs
A CFG where each production has an associated probability
PCFG is a 5-tuple (N, Σ, P, S, D):
  D is a function assigning each rule in P a probability
  usually, probabilities are obtained from a corpus
  most widely used corpus is the Penn Treebank

Example tree
[Figure: parse tree for "Mr Vinken is chairman of Elsevier".]

Building a tree: rules
S → NP VP
NP → NNP NNP
NNP → Mr
NNP → Vinken
[Figure: the partial tree built by these rules, continuing with is (VBZ), chairman (NN), of (IN) and Elsevier (NNP).]

Characteristics of PCFGs

In a PCFG, the probability P(A → β) expresses the likelihood that the non-terminal A will expand as β.
  e.g. the likelihood that S → NP VP (as opposed to S → VP, or S → NP VP PP, or ...)
It can be interpreted as a conditional probability:
  probability of the expansion, given the LHS non-terminal
  P(A → β) = P(A → β | A)
Therefore, for any non-terminal A, the probabilities of all rules of the form A → β must sum to 1.
  A grammar with this property is called proper; a PCFG is consistent when the probabilities of all the sentences in its language also sum to 1.

Uses of probabilities in parsing
Disambiguation: given n legal parses of a string, which is the most likely?
  e.g. PP-attachment ambiguity can be resolved this way
Speed: we've defined parsing as a search problem
  search through space of possible applicable derivations
  search space can be pruned by focusing on the most likely sub-parses of a parse
A parser can also be used as a model to determine the probability of a sentence, given a parse
  typical use in speech recognition, where input utterance can be "heard" as several possible sentences

Using PCFG probabilities
A PCFG assigns a probability to every parse tree t of a string W
  e.g. every possible parse (derivation) of a sentence recognised by the grammar
Notation:
  G = a PCFG
  s = a sentence
  t = a particular tree under our grammar
    t consists of several nodes n
    each node is generated by applying some rule r

Probability of a tree vs. a sentence
We work out the probability of a parse tree t by multiplying the probability of every rule (node) that gives rise to t (i.e. the derivation of t).
Note that:
  A tree can have multiple derivations (different sequences of rule applications could give rise to the same tree)
  But the probability of the tree remains the same (it's the same probabilities being multiplied)
  We usually speak as if a tree has only one derivation, called the canonical derivation

Picking the best parse in a PCFG
A sentence will usually have several parses
  we usually want them ranked, or only want the n best parses
  we need to focus on P(t|s,G)
    probability of a parse, given our sentence and our grammar
  definition of the best parse for s: the tree for which P(t|s,G) is highest

Probability of a sentence
Given a probabilistic context-free grammar G, we can compute the probability of a sentence (as opposed to a tree).
Observe that:
  As far as our grammar is concerned, a sentence is only a sentence if it can be recognised by the grammar (it is "legal")
  There can be multiple parse trees for a sentence.

  Many trees whose "yield" is the sentence
The probability of the sentence is the sum of all the probabilities of the various trees that yield the sentence.

Flaws I: Structural independence
Probability of a rule r expanding node n depends only on n.
Independent of other non-terminals
Example:
  P(NP → Pro) is independent of where the NP is in the sentence
  but we know that NP → Pro is much more likely in subject position
  Francis et al (1999) using the Switchboard corpus: 91% of subjects are pronouns; only 34% of objects are pronouns

Flaws II: lexical independence
vanilla PCFGs ignore lexical material
  e.g. P(VP → V NP PP) independent of the head of NP or PP or lexical head V
Examples:
  prepositional phrase attachment preferences depend on lexical items; cf:
    dump [sacks into a bin]
    dump [sacks] [into a bin] (preferred parse)
  coordination ambiguity:
    [dogs in houses] and [cats]
    [dogs] [in houses and cats]

Lexicalised PCFGs
Attempt to weaken the lexical independence assumption.
Most common technique:
  mark each phrasal head (N, V, etc) with the lexical material
  this is based on the idea that the most crucial lexical dependencies are between head and dependent
  E.g.: Charniak 1997, Collins 1999

Lexicalised PCFGs: Matt walks
Makes probabilities partly dependent on lexical content.
P(VP → VBD | VP) becomes: P(VP → VBD | VP, h(VP) = walks)
NB: normally, we can't assume that all heads of a phrase of category C are equally probable.
Tree:
  S(walks)
    NP(Matt)
      NNP(Matt)
        Matt
    VP(walks)
      VBD(walks)
        walks

Practical problems for lexicalised PCFGs
data sparseness: we don't necessarily see all heads of all phrasal categories often enough in the training data
flawed assumptions: lexical dependencies occur elsewhere, not just between head and complement
  "I got the easier problem of the two to solve"
  "of the two" and "to solve" are very likely because of the prehead modifier "easier"

Structural context

The simple way: calculate P(t|s,G) based on rules in the canonical derivation d of t
  assumes that P(t) is independent of the derivation
  could condition on more structural context
  but then, P(t) could really depend on the derivation!

Part 5
Parsing with a PCFG

Using CKY to parse with a PCFG
The basic CKY algorithm remains unchanged.
However, rather than only keeping partial solutions in our table cells (i.e. the rules that match some input), we also keep their probabilities.
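As a rough sketch of what keeping probabilities in the cells could look like, the Python below stores, for each span, the best probability found so far for each non-terminal. The names `LEXICAL`, `BINARY` and `pcky` are hypothetical, the rule probabilities are those of this lecture's example PCFG, and keeping only the maximum per non-terminal (a Viterbi-style choice) is an assumption of the sketch. Because the code uses exact arithmetic, the NP over "a meal" comes out as .0012, which the worked example shows rounded to .001.

```python
# Example PCFG from this lecture; the data layout is an assumption of this sketch.
LEXICAL = {
    "the": [("Det", 0.4)], "a": [("Det", 0.4)],
    "flight": [("N", 0.02)], "meal": [("N", 0.01)],
    "includes": [("V", 0.05)],
}
BINARY = [
    ("S", "NP", "VP", 0.80),
    ("NP", "Det", "N", 0.30),
    ("VP", "V", "NP", 0.20),
]


def pcky(words, lexical=LEXICAL, binary=BINARY):
    """Return table[(i, j)]: a dict from non-terminal to the highest
    probability of any sub-tree with that label over words[i:j]."""
    n = len(words)
    table = {}
    for j in range(1, n + 1):
        # Lexical step: store P(X -> w) alongside X.
        table[(j - 1, j)] = dict(lexical.get(words[j - 1], []))
        # Syntactic step: rule probability times the two daughters' probabilities.
        for i in range(j - 2, -1, -1):
            cell = table.setdefault((i, j), {})
            for k in range(i + 1, j):
                left = table.get((i, k), {})
                right = table.get((k, j), {})
                for a, b, c, p in binary:
                    if b in left and c in right:
                        prob = p * left[b] * right[c]
                        if prob > cell.get(a, 0.0):
                            cell[a] = prob
    return table


table = pcky("the flight includes a meal".split())
print(table[(0, 2)], table[(0, 5)])
```

Replacing the `max`-style update with a sum over all splits and rules would instead give the total probability of the span, which is what the probability of a sentence requires.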

Probabilistic CKY: example PCFG
S → NP VP [.80]
NP → Det N [.30]
VP → V NP [.20]
V → includes [.05]
Det → the [.4]
Det → a [.4]
N → meal [.01]
N → flight [.02]

Probabilistic CKY: initialisation
The flight includes a meal.
Chart so far: empty

Probabilistic CKY: lexical step
Chart so far: [0,1] Det (.4)

Probabilistic CKY: lexical step
Chart so far: [0,1] Det (.4); [1,2] N (.02)

Probabilistic CKY: syntactic step
Chart adds: [0,2] NP (.0024)
Note: probability of NP in [0,2] = P(Det → the) * P(N → flight) * P(NP → Det N) = .4 * .02 * .30 = .0024

Probabilistic CKY: lexical step
Chart adds: [2,3] V (.05)

Probabilistic CKY: lexical step
Chart adds: [3,4] Det (.4)

Probabilistic CKY: lexical step
Chart adds: [4,5] N (.01)

Probabilistic CKY: syntactic step
Chart adds: [3,5] NP (.001)
Note: .4 * .01 * .30 = .0012, shown rounded to .001

Probabilistic CKY: syntactic step
Chart adds: [2,5] VP (.00001)
Note: .05 * .001 * .20 = .00001

Probabilistic CKY: syntactic step
Chart adds: [0,5] S (.0000000192)
Note: .80 * .0024 * .00001 = .0000000192
Final chart: [0,1] Det (.4); [1,2] N (.02); [0,2] NP (.0024); [2,3] V (.05); [3,4] Det (.4); [4,5] N (.01); [3,5] NP (.001); [2,5] VP (.00001); [0,5] S (.0000000192)
These figures carry the rounded NP value forward; without rounding, VP = .000012 and S = .00000002304.

Probabilistic CKY: summary
Cells in chart hold probabilities.
Bottom-up procedure computes probability of a parse incrementally.
To obtain parse trees, we traverse the table backwards as before.
  Cells need to be augmented with backpointers.
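One possible way to add backpointers is sketched below in Python. It is an illustration under stated assumptions, not the lecture's own implementation: the names `pcky_with_backpointers` and `build_tree` are hypothetical, the grammar is the example PCFG used above, and each chart entry records the single best split only, so exactly one tree is returned.

```python
# Example PCFG from this lecture; names and data layout are assumptions of this sketch.
LEXICAL = {
    "the": [("Det", 0.4)], "a": [("Det", 0.4)],
    "flight": [("N", 0.02)], "meal": [("N", 0.01)],
    "includes": [("V", 0.05)],
}
BINARY = [
    ("S", "NP", "VP", 0.80),
    ("NP", "Det", "N", 0.30),
    ("VP", "V", "NP", 0.20),
]


def pcky_with_backpointers(words):
    """Fill the chart and remember how each entry was built."""
    best = {}  # best[(i, j, A)] = highest probability of A over words[i:j]
    back = {}  # back[(i, j, A)] = the word (lexical entry) or (k, B, C) (split point and daughters)
    for j in range(1, len(words) + 1):
        w = words[j - 1]
        for a, p in LEXICAL.get(w, []):
            best[(j - 1, j, a)] = p
            back[(j - 1, j, a)] = w
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                for a, b, c, p in BINARY:
                    if (i, k, b) in best and (k, j, c) in best:
                        prob = p * best[(i, k, b)] * best[(k, j, c)]
                        if prob > best.get((i, j, a), 0.0):
                            best[(i, j, a)] = prob
                            back[(i, j, a)] = (k, b, c)
    return best, back


def build_tree(back, i, j, label):
    """Traverse the table backwards from (i, j, label) to a nested tuple."""
    entry = back[(i, j, label)]
    if isinstance(entry, str):  # lexical entry: the daughter is the word itself
        return (label, entry)
    k, b, c = entry
    return (label, build_tree(back, i, k, b), build_tree(back, k, j, c))


words = "the flight includes a meal".split()
best, back = pcky_with_backpointers(words)
print(build_tree(back, 0, len(words), "S"))
```

To return several parses for an ambiguous sentence, each entry would need to keep a list of backpointers rather than only the best one.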