handouts/ho05-bak.tex
author Christian Urban <christian.urban@kcl.ac.uk>
Tue, 13 Oct 2020 14:29:10 +0100
changeset 779 5385c8342f02
parent 358 b3129cff41e9
child 871 94b84d880c2b
permissions -rw-r--r--
updated
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
151
df229ec49b22 added ho
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     1
\documentclass{article}
292
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
     2
\usepackage{../style}
217
cd6066f1056a updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 173
diff changeset
     3
\usepackage{../langs}
292
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
     4
\usepackage{../graphics}
151
df229ec49b22 added ho
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     5
df229ec49b22 added ho
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     6
\begin{document}
df229ec49b22 added ho
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     7
292
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
     8
\section*{Handout 5 (Lexing)}
151
df229ec49b22 added ho
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     9
292
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
    10
Whenever you want to design a new programming language or
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
    11
implement a compiler for an existing language, the first task
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
    12
is to fix the basic ``words'' of the language. For example
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
    13
what are the keywords, or reserved words, of the language,
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
    14
what are permitted identifiers, numbers, expressions and so
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
    15
on. One convenient way to do this is, of course, by using
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
    16
regular expressions. 
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    17
292
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
    18
In this course we want to take a closer look at the WHILE
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
    19
programming language. This is a simple imperative programming
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
    20
language consisting of arithmetic expressions, assignments,
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
    21
if-statements and loops. For example the Fibonacci program can
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
    22
be written in this language as follows:
151
df229ec49b22 added ho
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
    23
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    24
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    25
\mbox{\lstinputlisting[language=while]{../progs/fib.while}}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    26
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    27
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    28
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    29
The keywords in this language will be
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    30
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    31
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    32
\texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    33
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    34
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    35
\noindent
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    36
In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 158
diff changeset
    37
and strings (which we however ignore for the moment). We also need to specify what the ``whitespace''
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    38
is in our programming language and what comments should look like.
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 158
diff changeset
    39
As a first try, we might specify the regular expressions for our language roughly as follows
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    40
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    41
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    42
\begin{tabular}{lcl}
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 158
diff changeset
    43
$\textit{LETTER}$ & $:=$ & $\texttt{a} + \texttt{A} + \texttt{b} + \texttt{B} + \ldots$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 158
diff changeset
    44
$\textit{DIGIT}$ & $:=$ & $\texttt{0} + \texttt{1} + \texttt{2} + \ldots$\\
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    45
$\textit{KEYWORD}$ & $:=$ & $\texttt{while}  + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    46
$\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    47
$\textit{OP}$      &  $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    48
$\textit{NUM}$   &  $:=$ & $\textit{DIGIT}^+$\\
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 158
diff changeset
    49
$\textit{WHITESPACE}$ & $:=$ & $("\hspace{2mm}" + \backslash\texttt{n})^+$
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    50
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    51
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    52
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    53
Having the regular expressions in place, the problem we have to solve is: 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    54
given a string of our programming language, which regular expression 
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 158
diff changeset
    55
matches which part of the string. By solving this problem, we can split up a string 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 158
diff changeset
    56
of our language into components. 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 158
diff changeset
    57
For example given the input string
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    58
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    59
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    60
\texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    61
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    62
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    63
\noindent
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
    64
we expect it is split up as follows
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    65
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    66
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    67
\tt
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    68
\Grid{if}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    69
\Grid{\VS}\, 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    70
\Grid{true}\, 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    71
\Grid{\VS}\, 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    72
\Grid{then}\, 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    73
\Grid{\VS}\, 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    74
\Grid{x}\, 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    75
\Grid{+}\, 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    76
\Grid{2}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    77
\Grid{\VS}\, 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    78
\Grid{else}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    79
\Grid{\VS}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    80
\Grid{x}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    81
\Grid{+}\,
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    82
\Grid{3}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    83
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    84
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    85
\noindent
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
    86
This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}.
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    87
It is usually the first phase of a compiler. Note that the separation into words cannot, in general, 
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
    88
be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
    89
in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
    90
not separated by any whitespace. Another reason for recognising whitespaces explicitly is
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
    91
that in some languages, for example Python, whitespaces matters, that is carry meaning. However in 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
    92
our small language we will eventually just filter out all whitespaces and also all comments.
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 152
diff changeset
    93
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
    94
Lexing not just separates a string into its components, but also classifies the components, meaning it explicitly records that \texttt{if} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
    95
For the moment, though, we will only focus on the simpler problem of just splitting up 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
    96
a string into components.
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
    97
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 158
diff changeset
    98
There are a few subtleties  we need to consider first. For example, say the string is
154
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 153
diff changeset
    99
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 153
diff changeset
   100
\begin{center}
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   101
\texttt{\Grid{iffoo\VS}}\;\ldots
154
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 153
diff changeset
   102
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 153
diff changeset
   103
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 153
diff changeset
   104
\noindent
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 158
diff changeset
   105
then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
154
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 153
diff changeset
   106
by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a 
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
   107
single identifier. The choice that is often made in lexers is to look for the longest possible match.
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   108
This leaves  \texttt{iffoo} as the only match in this case (since it is longer than \texttt{if}).
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
   109
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   110
Unfortunately, the convention about the longest match does not yet make the process 
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 158
diff changeset
   111
of lexing completely deterministic. Consider the string
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
   112
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
   113
\begin{center}
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   114
\texttt{\Grid{then\VS}}\;\dots
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
   115
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
   116
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
   117
\noindent
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 158
diff changeset
   118
Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our 
156
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 155
diff changeset
   119
regular expressions. In our running example we just use the ranking
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 155
diff changeset
   120
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 155
diff changeset
   121
\[
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 155
diff changeset
   122
\textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 155
diff changeset
   123
\]
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
   124
156
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 155
diff changeset
   125
\noindent
160
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 159
diff changeset
   126
So even if both regular expressions match in the example above,
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 159
diff changeset
   127
we give preference to the regular expression for keywords.
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 154
diff changeset
   128
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   129
Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   130
and $der$, it will  use the function \emph{zeroable} defined as follows:
157
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   131
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   132
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   133
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   134
$zeroable(\varnothing)$      & $\dn$ & $true$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   135
$zeroable(\epsilon)$           & $\dn$ &  $f\!alse$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   136
$zeroable (c)$                    & $\dn$ &  $f\!alse$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   137
$zeroable (r_1 + r_2)$        & $\dn$ &  $zeroable(r_1) \wedge zeroable(r_2)$ \\ 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   138
$zeroable (r_1 \cdot r_2)$  & $\dn$ &  $zeroable(r_1) \vee zeroable(r_2)$ \\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   139
$zeroable (r^*)$                  & $\dn$ & $f\!alse$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   140
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   141
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   142
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   143
\noindent
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   144
Recall that the function $nullable(r)$ tests whether a regular expression
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   145
can match the empty string. The function $zeroable$, on the other hand, tests whether a regular
158
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 157
diff changeset
   146
expression cannot match anything at all. The mathematical way of stating this
160
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 159
diff changeset
   147
property is
157
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   148
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   149
\begin{center}
160
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 159
diff changeset
   150
$zeroable(r)$ if and only if $L(r) = \varnothing$
157
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   151
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   152
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   153
\noindent
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   154
For what follows let us fix a set of regular expressions $rs$ as being 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   155
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   156
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   157
\textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   158
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   159
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   160
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   161
specifying the ``words'' 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   162
of our programming language. The algorithm takes as input the $rs$ and a string, say
158
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 157
diff changeset
   163
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 157
diff changeset
   164
\begin{center}
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   165
\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   166
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   167
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   168
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   169
and tries to chop off one word from the beginning of the string. If none of the
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   170
regular expression in $rs$ matches, we will just return
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   171
the empty string.
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   172
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   173
The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   174
to the first character $c_1$. Then we take the results and continue with 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   175
building the derivatives with respect to $c_2$ until we have either exhausted our 
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   176
input string or all of the regular expressions are ``zeroable''.  Suppose the input string is 
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   177
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   178
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   179
\texttt{\Grid{if2\VS}}\;\dots
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   180
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   181
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   182
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   183
then building the derivatives with respect to \texttt{i} gives
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   184
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   185
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   186
\begin{tabular}{l|c}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   187
 & $zeroable$\\\hline
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   188
 $der\;\texttt{i}\;(\textit{KEYWORD})$      & no\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   189
 $der\;\texttt{i}\;(\textit{IDENT})$              & no\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   190
 $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   191
\end{tabular}
158
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 157
diff changeset
   192
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 157
diff changeset
   193
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 157
diff changeset
   194
\noindent
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   195
We can eliminate \textit{WHITESPACE} as a potential candidate, because
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   196
no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   197
two regular expressions as potential candidate and we have to consider the
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   198
next character, \texttt{f}, from the input string
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   199
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   200
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   201
\begin{tabular}{l|c}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   202
 & $zeroable$\\\hline
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   203
 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$      & no\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   204
 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$              & no\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   205
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   206
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   207
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   208
\noindent
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   209
Since both are `no', we have to continue with \texttt{2} from the input string
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   210
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   211
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   212
\begin{tabular}{l|c}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   213
 & $zeroable$\\\hline
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   214
 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$      & yes\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   215
 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$              & no\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   216
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   217
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   218
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   219
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   220
Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   221
know how much of the input string should be considered as an \textit{IDENT}. So we
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   222
still have to continue and consider the next derivative.
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   223
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   224
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   225
\begin{tabular}{l|c}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   226
 & $zeroable$\\\hline
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   227
 $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$              & yes\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   228
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   229
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   230
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   231
\noindent
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   232
Since the answer is now `yes' also in this case, we can stop: once all derivatives are
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   233
zeroable, we know the regular expressions cannot match any more letters from
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   234
the input. In this case we only have to go back to the derivative that is nullable.  In this case it
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   235
is
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   236
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   237
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   238
$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   239
\end{center} 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   240
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   241
\noindent
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   242
which means we recognised an identifier. In case where there is a choice of more than one
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   243
regular expressions that are nullable, then we choose the one with the highest precedence. 
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   244
You can try out such a case with the input string
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   245
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   246
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   247
\texttt{\Grid{then\VS}}\;\dots
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   248
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   249
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   250
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   251
which can both be recognised as a keyword, but also an identifier. 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   252
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   253
While in the example above the last nullable derivative is the one directly before
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   254
the derivative turns zeroable, this is not always the case. Imagine, identifiers can be 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   255
letters, as permuted by the regular expression \textit{IDENT}, but must end with an 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   256
undercore.
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   257
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   258
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   259
\begin{tabular}{lcl}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   260
$\textit{NEWIDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^* \cdot \_$\\ 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   261
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   262
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   263
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   264
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   265
If we use \textit{NEWIDENT} with the input string
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   266
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   267
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   268
\texttt{\Grid{iffoo\VS}}\;\ldots
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   269
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   270
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   271
\noindent
163
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 162
diff changeset
   272
then it will only become $zeroable$ after the $\VS$ has been analysed. In this case we have to go back to the
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   273
first \texttt{f} because only 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   274
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   275
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   276
 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$  
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   277
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   278
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   279
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   280
is nullable. As a result we recognise successfully the keyword \texttt{if} and the remaining
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 161
diff changeset
   281
string needs to be consumed by other regular expressions or lead to a lexing error.
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   282
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 160
diff changeset
   283
157
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 156
diff changeset
   284
151
df229ec49b22 added ho
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   285
\end{document}
df229ec49b22 added ho
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   286
df229ec49b22 added ho
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   287
%%% Local Variables: 
df229ec49b22 added ho
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   288
%%% mode: latex
df229ec49b22 added ho
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   289
%%% TeX-master: t
df229ec49b22 added ho
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   290
%%% End: