| author | Christian Urban <urbanc@in.tum.de> | 
| Mon, 27 Jan 2020 10:11:44 +0000 | |
| changeset 708 | 1b5dc2468ce3 | 
| parent 358 | b3129cff41e9 | 
| child 871 | 358a72d7bf71 | 
| permissions | -rw-r--r-- | 
| 
151
 
df229ec49b22
added ho
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
1  | 
\documentclass{article}
 | 
| 
292
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
2  | 
\usepackage{../style}
 | 
| 
217
 
cd6066f1056a
updated handouts
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
3  | 
\usepackage{../langs}
 | 
| 
292
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
4  | 
\usepackage{../graphics}
 | 
| 
151
 
df229ec49b22
added ho
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
5  | 
|
| 
 
df229ec49b22
added ho
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
6  | 
\begin{document}
 | 
| 
 
df229ec49b22
added ho
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
7  | 
|
| 
292
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
8  | 
\section*{Handout 5 (Lexing)}
 | 
| 
151
 
df229ec49b22
added ho
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
9  | 
|
| 
292
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
10  | 
Whenever you want to design a new programming language or  | 
| 
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
11  | 
implement a compiler for an existing language, the first task  | 
| 
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
12  | 
is to fix the basic ``words'' of the language. For example  | 
| 
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
13  | 
what are the keywords, or reserved words, of the language,  | 
| 
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
14  | 
what are permitted identifiers, numbers, expressions and so  | 
| 
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
15  | 
on. One convenient way to do this is, of course, by using  | 
| 
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
16  | 
regular expressions.  | 
| 
155
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
17  | 
|
| 
292
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
18  | 
In this course we want to take a closer look at the WHILE  | 
| 
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
19  | 
programming language. This is a simple imperative programming  | 
| 
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
20  | 
language consisting of arithmetic expressions, assignments,  | 
| 
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
21  | 
if-statements and loops. For example the Fibonacci program can  | 
| 
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
22  | 
be written in this language as follows:  | 
| 
151
 
df229ec49b22
added ho
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
23  | 
|
| 
153
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
24  | 
\begin{center}
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
25  | 
\mbox{\lstinputlisting[language=while]{../progs/fib.while}}
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
26  | 
\end{center}
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
27  | 
|
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
28  | 
\noindent  | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
29  | 
The keywords in this language will be  | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
30  | 
|
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
31  | 
\begin{center}
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
32  | 
\texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
33  | 
\end{center}
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
34  | 
|
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
35  | 
\noindent  | 
| 
155
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
36  | 
In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers
 | 
| 
159
 
ae5ceef5355e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
158 
diff
changeset
 | 
37  | 
and strings (which we however ignore for the moment). We also need to specify what the ``whitespace''  | 
| 
155
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
38  | 
is in our programming language and what comments should look like.  | 
| 
159
 
ae5ceef5355e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
158 
diff
changeset
 | 
39  | 
As a first try, we might specify the regular expressions for our language roughly as follows  | 
| 
153
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
40  | 
|
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
41  | 
\begin{center}
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
42  | 
\begin{tabular}{lcl}
 | 
| 
159
 
ae5ceef5355e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
158 
diff
changeset
 | 
43  | 
$\textit{LETTER}$ & $:=$ & $\texttt{a} + \texttt{A} + \texttt{b} + \texttt{B} + \ldots$\\
 | 
| 
 
ae5ceef5355e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
158 
diff
changeset
 | 
44  | 
$\textit{DIGIT}$ & $:=$ & $\texttt{0} + \texttt{1} + \texttt{2} + \ldots$\\
 | 
| 
153
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
45  | 
$\textit{KEYWORD}$ & $:=$ & $\texttt{while}  + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
46  | 
$\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ 
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
47  | 
$\textit{OP}$      &  $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
48  | 
$\textit{NUM}$   &  $:=$ & $\textit{DIGIT}^+$\\
 | 
| 
159
 
ae5ceef5355e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
158 
diff
changeset
 | 
49  | 
$\textit{WHITESPACE}$ & $:=$ & $("\hspace{2mm}" + \backslash\texttt{n})^+$
 | 
| 
153
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
50  | 
\end{tabular}
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
51  | 
\end{center}
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
52  | 
|
| 
155
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
53  | 
Having the regular expressions in place, the problem we have to solve is:  | 
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
54  | 
given a string of our programming language, which regular expression  | 
| 
159
 
ae5ceef5355e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
158 
diff
changeset
 | 
55  | 
matches which part of the string. By solving this problem, we can split up a string  | 
| 
 
ae5ceef5355e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
158 
diff
changeset
 | 
56  | 
of our language into components.  | 
| 
 
ae5ceef5355e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
158 
diff
changeset
 | 
57  | 
For example given the input string  | 
| 
153
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
58  | 
|
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
59  | 
\begin{center}
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
60  | 
\texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
61  | 
\end{center}
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
62  | 
|
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
63  | 
\noindent  | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
64  | 
we expect it is split up as follows  | 
| 
153
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
65  | 
|
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
66  | 
\begin{center}
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
67  | 
\tt  | 
| 
155
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
68  | 
\Grid{if}\,
 | 
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
69  | 
\Grid{\VS}\, 
 | 
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
70  | 
\Grid{true}\, 
 | 
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
71  | 
\Grid{\VS}\, 
 | 
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
72  | 
\Grid{then}\, 
 | 
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
73  | 
\Grid{\VS}\, 
 | 
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
74  | 
\Grid{x}\, 
 | 
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
75  | 
\Grid{+}\, 
 | 
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
76  | 
\Grid{2}\,
 | 
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
77  | 
\Grid{\VS}\, 
 | 
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
78  | 
\Grid{else}\,
 | 
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
79  | 
\Grid{\VS}\,
 | 
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
80  | 
\Grid{x}\,
 | 
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
81  | 
\Grid{+}\,
 | 
| 
153
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
82  | 
\Grid{3}
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
83  | 
\end{center}
 | 
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
84  | 
|
| 
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
85  | 
\noindent  | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
86  | 
This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}.
 | 
| 
153
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
87  | 
It is usually the first phase of a compiler. Note that the separation into words cannot, in general,  | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
88  | 
be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
89  | 
in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are 
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
90  | 
not separated by any whitespace. Another reason for recognising whitespaces explicitly is  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
91  | 
that in some languages, for example Python, whitespaces matters, that is carry meaning. However in  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
92  | 
our small language we will eventually just filter out all whitespaces and also all comments.  | 
| 
153
 
70ab41cb610e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
152 
diff
changeset
 | 
93  | 
|
| 
162
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
94  | 
Lexing not just separates a string into its components, but also classifies the components, meaning it explicitly records that \texttt{if} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
 | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
95  | 
For the moment, though, we will only focus on the simpler problem of just splitting up  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
96  | 
a string into components.  | 
| 
155
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
97  | 
|
| 
159
 
ae5ceef5355e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
158 
diff
changeset
 | 
98  | 
There are a few subtleties we need to consider first. For example, say the string is  | 
| 
154
 
51d6b8b828c4
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
153 
diff
changeset
 | 
99  | 
|
| 
 
51d6b8b828c4
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
153 
diff
changeset
 | 
100  | 
\begin{center}
 | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
101  | 
\texttt{\Grid{iffoo\VS}}\;\ldots
 | 
| 
154
 
51d6b8b828c4
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
153 
diff
changeset
 | 
102  | 
\end{center}
 | 
| 
 
51d6b8b828c4
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
153 
diff
changeset
 | 
103  | 
|
| 
 
51d6b8b828c4
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
153 
diff
changeset
 | 
104  | 
\noindent  | 
| 
159
 
ae5ceef5355e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
158 
diff
changeset
 | 
105  | 
then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
 | 
| 
154
 
51d6b8b828c4
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
153 
diff
changeset
 | 
106  | 
by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a 
 | 
| 
155
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
107  | 
single identifier. The choice that is often made in lexers is to look for the longest possible match.  | 
| 
162
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
108  | 
This leaves  \texttt{iffoo} as the only match in this case (since it is longer than \texttt{if}).
 | 
| 
155
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
109  | 
|
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
110  | 
Unfortunately, the convention about the longest match does not yet make the process  | 
| 
159
 
ae5ceef5355e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
158 
diff
changeset
 | 
111  | 
of lexing completely deterministic. Consider the string  | 
| 
155
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
112  | 
|
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
113  | 
\begin{center}
 | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
114  | 
\texttt{\Grid{then\VS}}\;\dots
 | 
| 
155
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
115  | 
\end{center}
 | 
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
116  | 
|
| 
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
117  | 
\noindent  | 
| 
159
 
ae5ceef5355e
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
158 
diff
changeset
 | 
118  | 
Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our 
 | 
| 
156
 
95eaee695636
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
155 
diff
changeset
 | 
119  | 
regular expressions. In our running example we just use the ranking  | 
| 
 
95eaee695636
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
155 
diff
changeset
 | 
120  | 
|
| 
 
95eaee695636
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
155 
diff
changeset
 | 
121  | 
\[  | 
| 
 
95eaee695636
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
155 
diff
changeset
 | 
122  | 
\textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots
 | 
| 
 
95eaee695636
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
155 
diff
changeset
 | 
123  | 
\]  | 
| 
155
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
124  | 
|
| 
156
 
95eaee695636
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
155 
diff
changeset
 | 
125  | 
\noindent  | 
| 
160
 
8134c3b981e0
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
159 
diff
changeset
 | 
126  | 
So even if both regular expressions match in the example above,  | 
| 
 
8134c3b981e0
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
159 
diff
changeset
 | 
127  | 
we give preference to the regular expression for keywords.  | 
| 
155
 
9b2d128765e1
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
154 
diff
changeset
 | 
128  | 
|
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
129  | 
Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
130  | 
and $der$, it will  use the function \emph{zeroable} defined as follows:
 | 
| 
157
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
131  | 
|
| 
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
132  | 
\begin{center}
 | 
| 
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
133  | 
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
 | 
| 
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
134  | 
$zeroable(\varnothing)$ & $\dn$ & $true$\\  | 
| 
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
135  | 
$zeroable(\epsilon)$ & $\dn$ & $f\!alse$\\  | 
| 
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
136  | 
$zeroable (c)$ & $\dn$ & $f\!alse$\\  | 
| 
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
137  | 
$zeroable (r_1 + r_2)$ & $\dn$ & $zeroable(r_1) \wedge zeroable(r_2)$ \\  | 
| 
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
138  | 
$zeroable (r_1 \cdot r_2)$ & $\dn$ & $zeroable(r_1) \vee zeroable(r_2)$ \\  | 
| 
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
139  | 
$zeroable (r^*)$ & $\dn$ & $f\!alse$\\  | 
| 
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
140  | 
\end{tabular}
 | 
| 
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
141  | 
\end{center}
 | 
| 
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
142  | 
|
| 
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
143  | 
\noindent  | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
144  | 
Recall that the function $nullable(r)$ tests whether a regular expression  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
145  | 
can match the empty string. The function $zeroable$, on the other hand, tests whether a regular  | 
| 
158
 
77e8397ec2ec
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
157 
diff
changeset
 | 
146  | 
expression cannot match anything at all. The mathematical way of stating this  | 
| 
160
 
8134c3b981e0
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
159 
diff
changeset
 | 
147  | 
property is  | 
| 
157
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
148  | 
|
| 
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
149  | 
\begin{center}
 | 
| 
160
 
8134c3b981e0
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
159 
diff
changeset
 | 
150  | 
$zeroable(r)$ if and only if $L(r) = \varnothing$  | 
| 
157
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
151  | 
\end{center}
 | 
| 
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
152  | 
|
| 
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
153  | 
\noindent  | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
154  | 
For what follows let us fix a set of regular expressions $rs$ as being  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
155  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
156  | 
\begin{center}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
157  | 
\textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
158  | 
\end{center}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
159  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
160  | 
\noindent  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
161  | 
specifying the ``words''  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
162  | 
of our programming language. The algorithm takes as input the $rs$ and a string, say  | 
| 
158
 
77e8397ec2ec
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
157 
diff
changeset
 | 
163  | 
|
| 
 
77e8397ec2ec
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
157 
diff
changeset
 | 
164  | 
\begin{center}
 | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
165  | 
\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
166  | 
\end{center}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
167  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
168  | 
\noindent  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
169  | 
and tries to chop off one word from the beginning of the string. If none of the  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
170  | 
regular expression in $rs$ matches, we will just return  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
171  | 
the empty string.  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
172  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
173  | 
The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
174  | 
to the first character $c_1$. Then we take the results and continue with  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
175  | 
building the derivatives with respect to $c_2$ until we have either exhausted our  | 
| 
162
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
176  | 
input string or all of the regular expressions are ``zeroable''. Suppose the input string is  | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
177  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
178  | 
\begin{center}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
179  | 
\texttt{\Grid{if2\VS}}\;\dots
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
180  | 
\end{center}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
181  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
182  | 
\noindent  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
183  | 
then building the derivatives with respect to \texttt{i} gives
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
184  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
185  | 
\begin{center}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
186  | 
\begin{tabular}{l|c}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
187  | 
& $zeroable$\\\hline  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
188  | 
 $der\;\texttt{i}\;(\textit{KEYWORD})$      & no\\
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
189  | 
 $der\;\texttt{i}\;(\textit{IDENT})$              & no\\
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
190  | 
 $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
191  | 
\end{tabular}
 | 
| 
158
 
77e8397ec2ec
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
157 
diff
changeset
 | 
192  | 
\end{center}
 | 
| 
 
77e8397ec2ec
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
157 
diff
changeset
 | 
193  | 
|
| 
 
77e8397ec2ec
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
157 
diff
changeset
 | 
194  | 
\noindent  | 
| 
162
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
195  | 
We can eliminate \textit{WHITESPACE} as a potential candidate, because
 | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
196  | 
no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
197  | 
two regular expressions as potential candidate and we have to consider the  | 
| 
162
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
198  | 
next character, \texttt{f}, from the input string
 | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
199  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
200  | 
\begin{center}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
201  | 
\begin{tabular}{l|c}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
202  | 
& $zeroable$\\\hline  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
203  | 
 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$      & no\\
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
204  | 
 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$              & no\\
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
205  | 
\end{tabular}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
206  | 
\end{center}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
207  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
208  | 
\noindent  | 
| 
162
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
209  | 
Since both are `no', we have to continue with \texttt{2} from the input string
 | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
210  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
211  | 
\begin{center}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
212  | 
\begin{tabular}{l|c}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
213  | 
& $zeroable$\\\hline  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
214  | 
 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$      & yes\\
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
215  | 
 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$              & no\\
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
216  | 
\end{tabular}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
217  | 
\end{center}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
218  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
219  | 
\noindent  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
220  | 
Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
221  | 
know how much of the input string should be considered as an \textit{IDENT}. So we
 | 
| 
162
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
222  | 
still have to continue and consider the next derivative.  | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
223  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
224  | 
\begin{center}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
225  | 
\begin{tabular}{l|c}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
226  | 
& $zeroable$\\\hline  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
227  | 
 $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$              & yes\\
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
228  | 
\end{tabular}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
229  | 
\end{center}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
230  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
231  | 
\noindent  | 
| 
162
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
232  | 
Since the answer is now `yes' also in this case, we can stop: once all derivatives are  | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
233  | 
zeroable, we know the regular expressions cannot match any more letters from  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
234  | 
the input. In this case we only have to go back to the derivative that is nullable. In this case it  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
235  | 
is  | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
236  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
237  | 
\begin{center}
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
238  | 
$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ 
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
239  | 
\end{center} 
 | 
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
240  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
241  | 
\noindent  | 
| 
162
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
242  | 
which means we recognised an identifier. In case where there is a choice of more than one  | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
243  | 
regular expressions that are nullable, then we choose the one with the highest precedence.  | 
| 
162
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
244  | 
You can try out such a case with the input string  | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
245  | 
|
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
246  | 
\begin{center}
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
247  | 
\texttt{\Grid{then\VS}}\;\dots
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
248  | 
\end{center}
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
249  | 
|
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
250  | 
\noindent  | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
251  | 
which can both be recognised as a keyword, but also an identifier.  | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
252  | 
|
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
253  | 
While in the example above the last nullable derivative is the one directly before  | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
254  | 
the derivative turns zeroable, this is not always the case. Imagine, identifiers can be  | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
255  | 
letters, as permuted by the regular expression \textit{IDENT}, but must end with an 
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
256  | 
undercore.  | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
257  | 
|
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
258  | 
\begin{center}
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
259  | 
\begin{tabular}{lcl}
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
260  | 
$\textit{NEWIDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^* \cdot \_$\\ 
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
261  | 
\end{tabular}
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
262  | 
\end{center}
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
263  | 
|
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
264  | 
\noindent  | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
265  | 
If we use \textit{NEWIDENT} with the input string
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
266  | 
|
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
267  | 
\begin{center}
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
268  | 
\texttt{\Grid{iffoo\VS}}\;\ldots
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
269  | 
\end{center}
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
270  | 
|
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
271  | 
\noindent  | 
| 
163
 
89d6d89d9844
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
162 
diff
changeset
 | 
272  | 
then it will only become $zeroable$ after the $\VS$ has been analysed. In this case we have to go back to the  | 
| 
162
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
273  | 
first \texttt{f} because only 
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
274  | 
|
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
275  | 
\begin{center}
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
276  | 
 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$  
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
277  | 
\end{center}
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
278  | 
|
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
279  | 
\noindent  | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
280  | 
is nullable. As a result we recognise successfully the keyword \texttt{if} and the remaining
 | 
| 
 
edcd84c7b491
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
161 
diff
changeset
 | 
281  | 
string needs to be consumed by other regular expressions or lead to a lexing error.  | 
| 
161
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
282  | 
|
| 
 
bfcf838dda4d
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
160 
diff
changeset
 | 
283  | 
|
| 
157
 
b6eee9571a63
added
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
156 
diff
changeset
 | 
284  | 
|
| 
151
 
df229ec49b22
added ho
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
285  | 
\end{document}
 | 
| 
 
df229ec49b22
added ho
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
286  | 
|
| 
 
df229ec49b22
added ho
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
287  | 
%%% Local Variables:  | 
| 
 
df229ec49b22
added ho
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
288  | 
%%% mode: latex  | 
| 
 
df229ec49b22
added ho
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
289  | 
%%% TeX-master: t  | 
| 
 
df229ec49b22
added ho
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
290  | 
%%% End:  |