Concordia University Department of Computer Science

Concordia University Department of Computer Science

Lecture 2 Lexical Analysis Joey Paquet, 2000, 2002 1 Part I Building a Lexical Analyzer Joey Paquet, 2000, 2002 2 Roles of the Scanner Removal of comments Comments are not part of the programs meaning Multiple-line comments? Nested comments? Case conversion Is the lexical definition case sensitive? For identifiers For keywords Joey Paquet, 2000, 2002 3 Roles of the Scanner Removal of white spaces Blanks, tabulars, carriage returns and line feeds Is it possible to identify tokens in a program without spaces? Interpretation of compiler directives #include, #ifdef, #ifndef and #define

are directives to redirect the input of the compiler May be done by a precompiler Joey Paquet, 2000, 2002 4 Roles of the Scanner Communication with the symbol table A symbol table entry is created when an identifier is encountered The lexical analyzer cannot create the whole entries Preparation of the output listing Output the analyzed code Output error messages and warnings Output a table of summary data Joey Paquet, 2000, 2002 5 Tokens and Lexemes Token: An element of the lexical definition of the language. Lexeme: A sequence of characters identified as a token. id relop openpar if then assignop semi distance,rate,time,a,x >=,<,== )

if then = ; Joey Paquet, 2000, 2002 6 Design of a Lexical Analyzer Steps 1- Construct a set of regular expressions (REs) that define the form of all valid token 2- Derive an NDFA from the REs 3- Derive a DFA from the NDFA 4- Translate to a state transition table 5- Implement the table 6- Implement the algorithm to interpret the table Joey Paquet, 2000, 2002 7 Regular Expressions :{} s : {s | s in s^} a : {a} r | s : {r | r in r^} or {s | s in s^} s* : {sn | s in s^ and n>=0} s+ : {sn | s in s^ and n>=1} id ::= letter(letter|digit)* Joey Paquet, 2000, 2002

8 Derive NDFA from REs Could derive DFA from REs but: Much easier to do NDFA, then derive DFA No standard way of deriving DFAs from Res Use Thompsons construction (Loudens) letter letter digit Joey Paquet, 2000, 2002 9 Derive DFA from NDFA Use subset construction (Loudens) May be optimized Easier to implement: No edges Determinist (no backtracking)

letter letter digit letter letter er tt le letter [other] di gi t digit Joey Paquet, 2000, 2002 digit 10 Generate State Transition Table letter 0 letter 1 [other]

2 digit 0 1 2 letter 1 1 digit other 1 2 Joey Paquet, 2000, 2002 final N N Y 11 Implementation Concerns Backtracking Principle : A token is normally recognized only when the next character is read. Problem : Maybe this character is part of the next token. Example : x<1. < is recognized only when 1 is read. In this case, we have to backtrack on character to continue token

recognition. Can include the occurrence of these cases in the state transition table. Joey Paquet, 2000, 2002 12 Implementation Concerns Ambiguity Problem : Some tokens lexemes are subsets of other tokens. Example : n-1. Is it <-><1> or <-1>? Solutions : Postpone the decision to the syntactic analyzer Do not allow sign prefix to numbers in the lexical specification Interact with the syntactic analyzer to find a solution. (Induces coupling) Joey Paquet, 2000, 2002 13 Example Alphabet : {:, *, =, (, ), <, >, {, }, [a..z], [0..9]} Simple tokens : {(, ), {, }, :, <, >} Composite tokens : {:=, >=, <=, <>, (*, *)} Words : id ::= letter(letter | digit)* num ::= digit* Joey Paquet, 2000, 2002 14

Example Ambiguity problems: Character Possible tokens : :, := > >, >= < <, <=, <> ( (, (* * *, *) Backtracking: Must back up a character when we read a character that is part of the next token. Occurences are coded in the table Joey Paquet, 2000, 2002 15 ld l 3 4 5 d d { sp

2 Final state with backtracking Final state } 6 7 1 19 20 ( 8 : < > * 9 * 10 ) 11 12

= 13 14 = > 15 17 = Joey Paquet, 2000, 2002 16 18 16 Table-driven Scanner (Table) l d { } ( *

) : = < > sp backup 1 2 4 6 19 8 19 19 12 19 14 17 1

2 2 2 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1

1 1 1 1 1 1 1 4 5 4 5 5 5 5 5 5 5 5 5 5

5 1 1 1 1 1 1 1 1 1 1 1 1 6 6 6 6 7 6

6 6 6 6 6 6 6 7 1 1 1 1 1 1 1 1 1 1 1 1

8 20 20 20 20 20 9 20 20 20 20 20 20 9 9 9 9 9 9

10 9 9 9 9 9 9 10 9 9 9 9 9 9 11 9 9 9 9 9

11 1 1 1 1 1 1 1 1 1 1 1 1 12 20 20 20 20 20

20 20 20 13 20 20 20 13 1 1 1 1 1 1 1 1 1 1 1 1

14 20 20 20 20 20 20 20 20 15 20 16 20 15 1 1 1 1 1

1 1 1 1 1 1 1 no [<=] 16 1 1 1 1 1 1 1 1 1 1 1

1 no [<>] 17 20 20 20 20 20 20 20 20 18 20 20 20 18 1 1 1

1 1 1 1 1 1 1 1 1 no [>=] 19 1 1 1 1 1 1 1 1 1

1 1 1 no 20 1 1 1 1 1 1 1 1 1 1 1 1 yes [various] Joey Paquet, 2000, 2002

yes [id] yes [num] no [{}] no [(**)] no [:=] 17 Table-driven Scanner (Algorithm) nextToken() state = 0 token = null do lookup = nextChar() state = Table(state, lookup) if (isFinalState(state)) token = createToken() if (Table(state, backup) == yes) backupChar() until (token != null) return (token) Joey Paquet, 2000, 2002 18 Table-driven Scanner nextToken() Extract the next token in the program (called by syntactic analyzer) nextChar() Read the next character (except spaces) in the input program backupChar() Backs up one character in the input file

Joey Paquet, 2000, 2002 19 Table-driven Scanner isFinalState(state) Returns TRUE if state is a final state table(state, column) Returns the value corresponding to [state, column] in the state transition table. createToken() Creates and returns a structure that contains the token type, its location in the source code, and its value (for literals). Joey Paquet, 2000, 2002 20 Hand-written Scanner nextToken() c = nextChar() case (c) of "[a..z],[A..Z]": c = nextChar() while (c in {[a..z],[A..Z],[0..9]}) do s = makeUpString() c = nextChar() if ( isReservedWord(s) )then token = createToken(RESWORD,null) else token = createToken(ID,s) backupChar() "[0..9]": c = nextChar()

while (c in [0..9]) do v = makeUpValue() c = nextChar() token = createToken(NUM,v) backupChar() Joey Paquet, 2000, 2002 21 Hand-written Scanner "{": c = nextChar() while ( c != "}" ) do c = nextChar() token = createToken(LBRACK,null) "(": c = nextChar() if ( c == "*" ) then c = nextChar() repeat while ( c != "*" ) do c = nextChar() c = nextChar() until ( c != ")" ) return else token = createToken(LPAR,null) ":": c = nextChar() if ( c == "=" ) then token = createToken(ASSIGNOP,null) else token = createToken(COLON,null) backupChar() Joey Paquet, 2000, 2002

22 Hand-written Scanner "<": c = nextChar() if ( c == "=" ) then token = createToken(LEQ,null) else if ( c == ">" ) then token = createToken(NEQ,null) else token = createToken(LT,null) backupChar() ">": c = nextChar() if ( c == "=" ) then token = createToken(GEQ,null) else token = createToken(GT,null) backupChar() ")": token = createToken(RPAR,null) "}": token = createToken(RBRACK,null) "*": token = createToken(STAR,null) "=": token = createToken(EQ,null) end case return token2000, 2002 Joey Paquet, 23 Part II Error recovery in Lexical Analysis

Joey Paquet, 2000, 2002 24 Possible Lexical Errors Depends on the accepted conventions: Invalid character letter not allowed to terminate a number numerical overflow identifier too long end of line before end of string Are these lexical errors? Joey Paquet, 2000, 2002 25 Accepted or Not? 123a 123456789012345678901234567 related to machines limitations Hello world Either is skipped or ThisIsAVeryLongVariableName = 1 Limit identifier length? Joey Paquet, 2000, 2002

26 Error Recovery Techniques Finding only the first error is not acceptable Panic Mode: Skip characters until a valid character is read Guess Mode: do pattern matching between erroneous strings and valid strings Example: (beggin vs. begin) Rarely implemented Joey Paquet, 2000, 2002 27 Conclusions Joey Paquet, 2000, 2002 28 Possible Implementations Lexical Analyzer Generator (e.g. Lex) + safe, quick Must learn software, unable to handle unusual situations Table-Driven Lexical Analyzer + general and adaptable method, same function can be used for all table-driven lexical analyzers Building transition table can be tedious and error-prone Joey Paquet, 2000, 2002

29 Possible Implementations Hand-written + Can be optimized, can handle any unusual situation, easy to build for most languages Error-prone, not adaptable or maintainable Joey Paquet, 2000, 2002 30 Lexical Analyzers Modularity Why should the Lexical Analyzer and the Syntactic Analyzer be separated? Modularity/Maintainability : system is more modular, thus more maintainable Efficiency : modularity = task specialization = easier optimization Reusability : can change to whole lexical analyzer without changing other parts Joey Paquet, 2000, 2002 31

Recently Viewed Presentations

  • Mini EE: Notecards

    Mini EE: Notecards

    Card Topic. The card topic is the title for the kind of information on the card.The card topic is a name that you make up yourself.This is essential for organizing your research paper.. After writing down the information from your...
  • Cognitive behavioral principles and the treatment

    Cognitive behavioral principles and the treatment

    "F.E.A.R" plan takes children through elements of CBT including techniques to calm their bodies, cognitive restructuring, exposure, and rewards Just print for notes then delete CBT, Meds, and combination: what's best?
  • Laboratoř kardiovaskulární biomechaniky

    Laboratoř kardiovaskulární biomechaniky

    Biomechanika oběhové soustavy Biomechanika a biomateriály 2009 Lukáš Horný Laboratoř biomechaniky člověka Ústavu mechaniky, biomechaniky a mechatroniky
  • John Wayne Gacy - Central Bucks School District

    John Wayne Gacy - Central Bucks School District

    The Businessman. In 1972, Gacy quit his job as a cook and started his own construction business, PDM Contractors ('Painting, Decorating and Maintenance') Gacy met and was photographed with then First Lady Rosalynn Carter on May 6, 1978. Relationships.
  • Neuroscience quizz 1 - WordPress.com

    Neuroscience quizz 1 - WordPress.com

    Neuroscience quiz 1. Domina Petric, MD. What kind of neuronal. circuits are present. i. ... Allows one neuron to relay information to its neighbor. Long chains of these. feedforward excitatory connections. propagate information through the nervous system.
  • SAFETY TALK Bat Removal During warm summer months,

    SAFETY TALK Bat Removal During warm summer months,

    SAFETY TALK. Bat Removal. Assess the Situation: If the bat is located in an area that is not typically accessible to faculty, staff, and students, such as an attic, it should be left alone until it can be removed by...
  • Linear Plot - Weebly

    Linear Plot - Weebly

    Linear Plot. Every short story or novel follows a linear plot, a straightforward storyline that contains all of the plot elements in chronological order.. Plot: the important series of events or actions that happen in a story.
  • Texture, Contours and Regions: Cue Integration in Image ...

    Texture, Contours and Regions: Cue Integration in Image ...

    Principles of Autonomy and Decision Making Brian Williams and Nicholas Roy 16.410/16.413 September 8th, 2004***