Dictionaries and Grammar Questions to Address Do we

Dictionaries and Grammar Questions to Address  Do we

Dictionaries and Grammar Questions to Address Do we include all forms of a particular word, or do we include only the base word and derive its forms? How are the grammatical rules of a language represented? How do we represent the parts of speech that go with particular grammatical rules? Morphology Definitions Morphology The study of the patterns used to form words E.g. inflection, derivation, and compounds Morpheme - Minimal meaning-bearing unit

Could be a stem or an affix Stem {unthinkable realization distrust} The part of a word that contains the root meaning (E.g. cat) Affixes {-s, un-, de-, -en, -able, -ize, -hood} a linguistic element added to a word modify the meaning E.g.: prefix (unbuckle), suffix (buckled), infix (absobloodylutely), and circumfix (gesagt in German for said). Affixes can attach to other affixes (boyishness) Knowing Words When we know a word, we know its 1. 2. 3.

4. Phonological sound sequences Semantic meanings Morphological relationships Syntactic categories and proper structure of a sentence Morphological relationships adjust word meanings

Person Jill waits. Number Jill carried two buckets. Case The chairs leg is broken. Tense Jill is waiting there now. Degree Jill ran faster than Jack. Gender Jill is female Part of Speech Jill is a proper noun These are the kind of things we want our computers to figure out Units of Meaning

How many morphemes do each of the following sentences have? I have two cats She wants to leave soon He walked across the room Her behavior was unbelievable Free Morphemes {eye, think, run, apple} Bound Morphemes {-able, un-, -s, -tion, -ly} Affix Examples Prefixes from Karuk, a Hokan language of California [pasip]

[nipasip] [/upasip] Shoot! I shoot She/he shoots Suffixes from Mende spoken in Liberia and Sierra Leone [pElE] [pElEi] [mEmE] [mEmEi]

house the house glass the glass Infixes from Bontoc spoken in the Phillipines [fikas] [fumikas] [fusul] [fumusal] strong she is becoming strong

enemy she is becoming an enemy Turkish Morpology Uygarlastiramadiklarimizdanmissinizcasina Meaning: `behaving as if you are among those whom we could not civilize Uygar `civilized + las `become + tir `cause + ama `not able + dik `past + lar plural+ imiz p1pl + dan abl + mis past + siniz 2pl + casina as if How does the Mind Store Meanings?

Hypotheses Full listing: We store all words individually Minimum redundancy: We store morphemes and how they relate Analysis Determine if people understand new words based on root meanings Observe whether children have difficulty learning exceptions Regular form: government/govern, Irregular form: department/depart Evidence suggests The mind represents words and affix meanings separately Linguists observe that affixes were originally separate words that speakers slur together over time General Observations about Lexicons Meanings are continually changing Roots and Morphemes do not have to occur in a fixed position

in relation to other elements. How many words do people know? Shakespeare uses 15,000 words A typical high school student knows 60,000 (learning 10 words a day from 12 months to 18 years) How many English words are there? Over 300,000 words without Morphemes in 1988 Computational Morphology Speech recognition requires a language dictionary How many words would it contain? Consider all of the morphemes of the word true true, truer, truest, truly, untrue, truth, truthful, truthfully, untruthfully, untruthfulness Untruthfulness = un- + true + -th + -ful + -ness Productive morphemes An affix that at a point in time spread rapidly through the language

Consider goose and geese versus cat and cats The former was an older way to indicate plurals The latter is a more recent way that spread throughout If we store morpheme rules, not all words, we can Reduce storage requirements and simplify creating entire dictionaries More closely mimic how the mind does it Be able to automatically understand newly encountered word forms Morphology Rules There are rules used to form complex words from their roots re- only precedes verbs (rerun, release, return) -s indicates plurals -ed indicates past tense Affix Rules Regular: follow productive affix rules Irregular: dont follow productive affix rules Nouns

Regular: (cat, thrush), (cats, thrushes), (cats thrushes) Irregular: (mouse, ox), (mice, oxen) Observation: More frequent words resist changes that result from productive affixes and take irregular forms (E.g. am, is, are). Exceptions: A singer sings, and a writer writes. Why doesnt a whisker whisk, a spider spid, or a finger fing? Parsing Identify components and underlying structure Morphological parsing Identifies stem and affixes and how they relate Example: fish fish + Noun + Singular or goose + Verb fish fish +Noun +Plural fish fish +Verb +Singular

Bracketing: indecipherable [in [[de [cipher]] able]] Why do we parse? spell-checking: Is muncheble a real word? Identify a words part-of-speech (pos) Sentence parsing and machine translation Identify word stems for data mining search operations Speech recognition and text to speech Parsing Applications Lexicon Create a word list

Include both stems and affixes (with the part of speech) Morphotactics Models how morphemes can be affixed to a stem. E.g., plural morpheme follows noun in English Orthographic rules Defines spelling modifications during affixation E.g. true tru in context of true truthfully Grammatical Morphemes New forms are rarely added to closed morpheme classes Examples prepositions at, for, by articles a, the

conjunctions and, but, or Morphological Parsing (stemming) Goal: Break the surface input into morphemes foxes Fox is a noun stem It has -es as a plural suffix rewrites Write is the verb stem It has re- as a prefix meaning to do again It has a s suffix indicating a continuing activity Inflectional Morphology Does not change the grammatical category Nouns plural marker: -s (dog + s = dogs) possessive marker: -s (dog + s = dogs)

Verbs 3rd person present singular: -s (walk + s = walks) past tense: -ed (walk + ed = walked) progressive: -ing (walk + ing = walking) past participle: -en or -ed (eat + en = eaten) Adjectives comparative: -er (fast + er = faster) superlative: -est (fast + est = fastest) In English Meaning transformations are predictable

All inflectional affixes are suffices Inflectional affixes are attached after any derivational (next slide) affixes E.g. modern + ize + s = modernizes; not modern + s + ize Concatenative and Non-concatenative Concatenative morphology combines by concatentation prefixes and suffixes Non-concatentative morphology combines in complex ways circumfixes and infixes templatic morphology words change by internal changes to the root E.g. (Arabic, Hebrew) ktb (write), kuttib (will have been written) ktb C V C C V C kuttib ui Templative Example

Verbal Inflective Morphology Verbal inflection Main verbs (sleep, like, fear) are relatively regular Standard morphemes: -s, ing, ed These morphemes are productive: Emails, Emailing, Emailed Combination with nouns for syntactical agreement I am, we are, they were There are exceptions Eat (will eat, eats, eating, ate) Catch (will catch, catches, catching, caught)

Be (will be, is, being, was) Have (will have, has, having, had) General Observations about English There are approximately 250 Irregular verbs that occur Other languages have more complex verbal inflection rules Nominal Inflective Morphology Plural forms (s or es) Possessives (cats or cats) Regular Nouns Singular (cat, bush) Plural (cats, bushes) Possessive (cats bushes) Irregular Nouns Singular (mouse, ox) Plural (mice, oxen)

Derivational Morphology Word stem combines with grammatical morpheme Usually produces word of different class Complex rules that are less productive with many exceptions Sometimes meanings of derived terms are hard to predict (E.g. hapless) Examples: verbs to nouns generalize, realize generalization, realization Murder, spell murderer, speller Examples: verbs and nouns to adjectives embrace, pity embraceable, pitiable care, wit careless, witless Example: adjectives adverbs happy happily

More complicated to model than inflection Less productive: science-less, concern-less, go-able, sleep-able Derivational Morphology Examples Level 1 Examples: ize, ization, ity, ic, al, ity, ion, y, ate, ous, ive, ation Observations Can attach to non-words (e.g. fratern-al, paternal) Often changes stems stress and vowel quality Level 2

Examples: hood, ness, ly, s, ing, ish, ful, ly, less, y (adj.) Observations Never precede Level 1 suffixes Never change stress or vowel quality Almost always attach to words that exists Level 1 + Level 1: histor-ic-al, illumina-at-tion, indetermin-aty; Level 1 + Level 2: fratern-al-ly, transform-ate-ion-less; Level 2 + Level 2: weight-less-ness Big one: antidisestablishmenterrianism (if I spelled it right) Adjective Morphology Standard Forms

Big, bigger, biggest Cool, cooler, coolest, cooly Red, redder, reddest Clear, clearer, clearest, clearly, unclear, unclearly Happy, happier, happiest, happily Unhappy, unhappier, unhappiest, unhappily Real, unreal, really Exceptions: unbig, redly, realest

Identify and Classify Morphemes In each group Two words have a different morphological structure One word has a different type of suffix One word has no suffix at all Perform the following tasks 1.Isolate the suffix that two of the words share. 2.Identify whether it is (i) free or bound; (ii) prefix, infix, suffix; (iii) inflectional or derivational. 3.Give its function/meaning. 4.Identify the word that has no suffix 5.Identify the word that has a suffix which is different from the others in each group. a. rider colder silver actor

b. tresses melodies Besss guess c. d. running foundling handling fling tables lens witches calculates

Computational Techniques Regular Grammars Finite State Automata Finite State Transducer Parsing Top down and bottom up Regular Grammars Grammar: Rules that define legal characters strings A regular grammar accepts regular expressions A regular expression must satisfy the following: The grammar with no strings is regular The grammar that accepts the empty string is regular A single character is a regular grammar If r1 and r2 are regular grammars, then r1 union r2, and r1 concatenated with r2 are regular grammars If r is a regular grammar, then r* ( where * means zero or more occurrences) is regular

Notations to Express Regular Expressions Conjunction: abc Disjunction: [a-zA-Z], gupp(y|ies) Counters: a*, a+, ?, a{5}, a{5,8}, a{5,} Any character: a.b

Not: [^0-9] Anchors: /^The dog\.$/ Note: the backslash before the period is an escape character Other escape characters include \*, \?, \n, \t, \\, \[, \], etc. Operators \d equivalent to [0-9], \D equivalent to [^0-9] \w equivalent to [a-zA-z0-9 ], \W equivalent to [^\w] \s equivalent to [ \r\t\n\f], \S equivalent to [^s] Substitute one regular expression for another: s/regExp1/regExp2/ Examples of Regular Expressions All strings ending with two zeroes All strings containing three consecutive zeroes All strings that every block of five consecutive symbols have at least two zeroes All strings that the tenth symbol from the right is a

one The set of all modular five numbers Finite State Automata (FSA) FSAs recognize grammars that are regular Definition: A FSA consists of 1. a set of states ()) 2. a starting state (q0) 3. a set of final or accepting states (F Q) 4. a finite set of symbols (Q) 5. a transition function ((q,i) ) that maps Qx) to Q. It switches from a from-state to a to-state, based on one of the valid symbols Synonyms: Finite Automata, Finite State Machine Recognition Determine if the machine accepts a particular string

i.e. Is a string in the language? Traditionally, Turing used a tape reader to depict a FSA Algorithm Begin in the start state Examine the current input character Consult the table Go to a new state and update the tape pointer. Until you run out of tape. The machine accepts the string processing stops in a final state Graphs and State Transition Tables

What can we can say about this machine? It has 5 states At least b,a, and ! are in its alphabet q0 is the start state q4 is an accept state It has 5 transitions Questions Which strings does it accept? baaaa, aaabaaa, ba Is this the only FSA that can accept this language? State Transition Table

Annotated Directed Graph An FSA only can accept regular strings. Question: Can you think of a string that is not regular? Recognizer Implementation index = beginning of tape state = start state DO IF transition[index, tape[index]] is empty RETURN false state = transition[index, tape[index]] index = index + 1 UNTIL end of tap is reached IF state is a final state RETURN true ELSE RETURN false

Key Points Regarding FSAs This algorithm is a state-space search algorithm Implementation uses simple table lookups Success occurs when at the end of a string, we reach a final state The results are always deterministic There is one unique choice at each step The algorithm recognizes all regular languages Perl, Java, etc. use a regular expression algorithm Create a state transition table from the expression pass the table to the FSA interpreter FSA algorithms Recognizer: determines if a string is in the language Generator: Generates all strings in the language

Non-Deterministic FSA Deterministic: Given a state and symbol, only one transition is possible Nondeterministic: Given a state and a symbol, multiple transitions are possible Epsilon transitions: those which DO NOT examine or advance the tape The Nondeterministic FSA recognizes a string if: At least one transition sequence ends at a final state Note: all sequences DO NOT have to end at a final state Note: String rejection occurs only when NO sequence ends at a final state Examples Concatenation Closure Closure

Union Using NFSAs Input State 0 1 2 3 4 b 1 0 0 0 0

a 0 2 2,3 0 0 ! 0 0 0 4 0 e 0 0

0 0 0 NFSA Recognition of baaa! Breadth-first Recognition of baaa! Nondeterministic FSA Example b a q0 q1 a a q2

q2 ! q3 \ q4 Other FSA Examples Dollars and Cents Exercise: Create a FSA for the following regular expressions (0|1)* [a-f1-9] abc{5} Non Deterministic FSA Recognizer

Recognizer (index, state) LOOP IF end of tape THEN IF state is final RETURN true ELSE RETURN false IF no possible transitions RETURN false IF there is only one transition state = transition[index, tape[index]] IF not an epsilon transition THEN index++ ELSE FOR each possible transition not considered result = CALL recognizer(nextState,nextIndex) IF result = true RETURN true END LOOP RETURN false FSAs and Morphology Apply an FSA to each word in the dictionary to capture the

morphological forms. Groups of words with common morphology can share FSAs Building a Lexicon with a FSA Derivational Rules Simple Morphology Example unq0 q1 e adj-root q2 er

(un) adj roots ( est ) ly Stop states: q2 and q3 -er est -ly q3 From To Output 0

1 un 0 1 NULL 1 2 adj-root-list 2

3 er;est;ly An Extended Example unq0 e adj-root-1 -er est -ly q1 q2 q5 adj-root-1 q3 q4 -er est

adj-root-2 Adj-root1: clear, happy, real Adj-root2: big, red er (un) adj roots1 ( est ) ly From To Output 0

1 un 0 3 NULL 1 2 adj-root-list-1 2

5 er;est;ly 3 2 adj-root-list-1 3 4 adj-root-list-2 4

5 er;est er adj roots 2 ( ) est Representing Derivational Rules nouns noun ity, ness ize ation

er verbs verb ative ive able adjectives ful adj ly ly adverbs adverb Finite State Transducer (FST)

Definition: A FST is a 5-tuple consisting of Q: set of states {q0,q1,q2,q3,q4} : an alphabet of complex symbols Each complex symbol contains two simple symbols The first symbol is from an input alphabet i I The second symbol is from an output alphabet o O is in I x O, is the null character q0: a start state F: a set of final states in Q {q4} (q,i:o): a transition function mapping Q x to Q Concept: Translates and writes to a second tape a:o

b:m q0 a:o q1 a:o q2 !:? q3 Example: baaaamoooo q4 Transition Example c:c

a:a t:t +N: +PL:s c:c means read a c on one tape and write a c on the other +N: means read a +N symbol on one tape and write nothing on the other +PL:s means read +PL and write an s On-line demos Finite state automata demos http://www.xrce.xerox.com/competencies/content

-analysis/fsCompiler/fsinput.html Finite state morphology http://www.xrce.xerox.com/competencies/content -analysis/demos/english Some other downloadable FSA tools: http://www.research.att.com/sw/tools/fsm/ Lexicon for L0 Rule based languages Top Down Parsing Driven by the grammar, working down S NP VP

NP Nom Pro Verb Det Noun Noun I prefer a

morning flight S NP VP, NPPro, ProI, VPV NP, Vprefer, NPDet Nom, Deta, NomNoun Nom, Nounmorning, Nounflight [S [NP [Pro I]] [VP [V prefer] [NP [Det a] [Nom [N morning] [N flight]]]]] Bottom Up Parsing Driven by the words, working up The Bottom Up Parse The Grammar 0) S E $ 1)E E + T | E - T | T 2)T T * F | T / F | F

3) F num | id 1)id - num * id 2)F - num * id 3)T - num * id 4)E - num * id 5)E - F * id 6)E - T * id 7)E - T * F 8)E - T 9)E 10)S correct sentence Note: If there is no rule that applies, backtracking is necessary Top-Down and Bottom-Up Top-down Advantage: Searches only trees that are legal

Disadvantage: Tries trees that dont match the words Bottom-up Advantage: Only forms trees matching the words Disadvantage: Tries trees that make no sense globally Efficient combined algorithms Link top-down expectations with bottom-up data Example: Top-down parsing with bottom-up filtering Stochastic Language Models A probabilistic view of language modeling Problems A Language model cannot cover all grammatical rules Spoken language is often ungrammatical Solution

Constrain search space emphasizing likely word sequences Enhance the grammar to recognize intended sentences even when the sequence doesn't satisfy the rules Probabilistic Context-Free Grammars (PCFG) Goal: Assist in discriminating among competing choices Definition: G = (VN, VT, S, P, p); VN = non-terminal set of symbols VT = terminal set of symbols S = start symbol p = set of rule probabilities R = set of rules P(S ->W |G): S is the start symbol, W = expression in grammar G Training the Grammar: Count rule occurrences in a training corpus P(R | G) = Count(R) / C(R) PFSA (Probabilistic Finite State

Automata) A PFSA is a type of Probabilistic Context Free Grammar The states are the non-terminals in a production rule The output symbols are the observed outputs The arcs represent a context-free rule The path through the automata represent a parse tree A PCFG considers state transitions and the transition path S1 a a S1

S2 b S2 S3 b S3 Probabilistic Finite State Machines Probabilistic models determine weights of the transitions The sum of weights leaving a

state total to unity Operations Consider the weights to compute the probability of a given string or most likely path. The machine can learn the weights over time .001 .01 Companion Canine .0035 Tooth

Another Example Pronunciation decoding [n iy] Merging the machines together [n iy] Another Example

Recently Viewed Presentations

  • nameher e namehere Month Year 121 Month Year

    nameher e namehere Month Year 121 Month Year

    Paste Picture Here. message goes here. This should be a sentence or two and possibly include a # that is relevant. message goes here. This should be a sentence or two and possibly include a # that is relevant. message...
  • How ICT is used - Weebly

    How ICT is used - Weebly

    In groups of three compete the charts listing advantages and disadvantages for each. How ICT is used by individuals and businesses. Input Devices. Today. Learning Objectives. To understand how and why different input devices are used. Learning Outcomes.
  • Growing IIT-M - Deakin University Partnership

    Growing IIT-M - Deakin University Partnership

    Expand IITM-Deakin Collaborations, with the industry. Prioritise on high impact verticals for our industries. Sustainable manufacture ( mining to finished product) - Wear, refurbishment, efficient manufacture. pollution control. Light-weighting - through design and materials / manufacturing. Alloy and Coatings design...
  • Document Security - DISA

    Document Security - DISA

    Nick Barron, DISA IT [email protected]+44 7720 508085. About me. Day job. Security controller, sysadmin, software developer. Medium size List-X contractor. DISA IT advisor. After hours. 44CON security conference. SC Magazine. Way too many computers at home.
  • Leadership Development Programmes

    Leadership Development Programmes

    Outcomes of professional learning which are in line with key Welsh Government policies such as Successful Futures, Qualified for Life and Rewriting the Future; Motivated, enthusiastic staff who are engaged with leading edge research and practice and who want to...
  • Biblestudyresourcecenter. com biblestudyresourcecenter.com Terumah Terumah Heave offering The

    Biblestudyresourcecenter. com biblestudyresourcecenter.com Terumah Terumah Heave offering The

    The Torah is called the light of the world,23 as it says in Proverbs 6:23, "For the commandment is a lamp and the teaching is light." Messiah applies the same term to himself. He says, "While I am in the...
  • Project Mangement - BC Open Textbooks

    Project Mangement - BC Open Textbooks

    The Project Management Institute (PMI) Project Management Knowledge Areas. ... Scrum development. The Project Management Office. Project Management as a Profession. Body of knowledge. Standards. Professional organizations. Currently, anyone can call him or herself a project manager.
  • Boiler Safety

    Boiler Safety

    Sodium sulfite. is an oxygen scavenger that is commonly used to treat boiler water. Sodium sulfite combines with oxygen to form sodium sulfate, which accumulates at the bottom of the boiler. The sodium sulfate is then discharged from the boiler...