|
1 \documentclass{article} |
|
2 \usepackage{charter} |
|
3 \usepackage{hyperref} |
|
4 \usepackage{amssymb} |
|
5 \usepackage{amsmath} |
|
6 \usepackage[T1]{fontenc} |
|
7 \usepackage{listings} |
|
8 \usepackage{xcolor} |
|
9 |
|
10 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}% |
|
11 |
|
12 \definecolor{javared}{rgb}{0.6,0,0} % for strings |
|
13 \definecolor{javagreen}{rgb}{0.25,0.5,0.35} % comments |
|
14 \definecolor{javapurple}{rgb}{0.5,0,0.35} % keywords |
|
15 \definecolor{javadocblue}{rgb}{0.25,0.35,0.75} % javadoc |
|
16 |
|
17 \lstdefinelanguage{scala}{ |
|
18 morekeywords={abstract,case,catch,class,def,% |
|
19 do,else,extends,false,final,finally,% |
|
20 for,if,implicit,import,match,mixin,% |
|
21 new,null,object,override,package,% |
|
22 private,protected,requires,return,sealed,% |
|
23 super,this,throw,trait,true,try,% |
|
24 type,val,var,while,with,yield}, |
|
25 otherkeywords={=>,<-,<\%,<:,>:,\#,@}, |
|
26 sensitive=true, |
|
27 morecomment=[l]{//}, |
|
28 morecomment=[n]{/*}{*/}, |
|
29 morestring=[b]", |
|
30 morestring=[b]', |
|
31 morestring=[b]""" |
|
32 } |
|
33 |
|
34 \lstset{language=Scala, |
|
35 basicstyle=\ttfamily, |
|
36 keywordstyle=\color{javapurple}\bfseries, |
|
37 stringstyle=\color{javagreen}, |
|
38 commentstyle=\color{javagreen}, |
|
39 morecomment=[s][\color{javadocblue}]{/**}{*/}, |
|
40 numbers=left, |
|
41 numberstyle=\tiny\color{black}, |
|
42 stepnumber=1, |
|
43 numbersep=10pt, |
|
44 tabsize=2, |
|
45 showspaces=false, |
|
46 showstringspaces=false} |
|
47 |
|
48 \begin{document} |
|
49 |
|
50 \section*{Handout 2} |
|
51 |
|
52 Having specified what problem our matching algorithm, $match$, is supposed to solve, namely |
|
53 for a given regular expression $r$ and string $s$ answer $true$ if and only if |
|
54 |
|
55 \[ |
|
56 s \in L(r) |
|
57 \] |
|
58 |
|
59 \noindent |
|
60 Clearly we cannot use the function $L$ directly in order to solve this problem, because in general |
|
61 the set of strings $L$ returns is infinite (recall what $L(a^*)$ is). In such cases there is no algorithm |
|
62 then can test exhaustively, whether a string is member of this set. |
|
63 |
|
64 The algorithm we define below consists of two parts. One is the function $nullable$ which takes a |
|
65 regular expression as argument and decides whether it can match the empty string (this means it returns a |
|
66 boolean). This can be easily defined recursively as follows: |
|
67 |
|
68 \begin{center} |
|
69 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} |
|
70 $nullable(\varnothing)$ & $\dn$ & $f\!\/alse$\\ |
|
71 $nullable(\epsilon)$ & $\dn$ & $true$\\ |
|
72 $nullable (c)$ & $\dn$ & $f\!alse$\\ |
|
73 $nullable (r_1 + r_2)$ & $\dn$ & $nullable(r_1) \vee nullable(r_2)$\\ |
|
74 $nullable (r_1 \cdot r_2)$ & $\dn$ & $nullable(r_1) \wedge nullable(r_2)$\\ |
|
75 $nullable (r^*)$ & $\dn$ & $true$ \\ |
|
76 \end{tabular} |
|
77 \end{center} |
|
78 |
|
79 \noindent |
|
80 The idea behind this function is that the following property holds: |
|
81 |
|
82 \[ |
|
83 nullable(r) \;\;\text{if and only if}\;\; ""\in L(r) |
|
84 \] |
|
85 |
|
86 \noindent |
|
87 On the left-hand side we have a function we can implement; on the right we have its specification. |
|
88 |
|
89 The other function is calculating a \emph{derivative} of a regular expression. This is a function |
|
90 which will take a regular expression, say $r$, and a character, say $c$, as argument and return |
|
91 a new regular expression. Beware that the intuition behind this function is not so easy to grasp on first |
|
92 reading. Essentially this function solves the following problem: if $r$ can match a string of the form |
|
93 $c\!::\!s$, what does the regular expression look like that can match just $s$. The definition of this |
|
94 function is as follows: |
|
95 |
|
96 \begin{center} |
|
97 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}} |
|
98 $der\, c\, (\varnothing)$ & $\dn$ & $\varnothing$ & \\ |
|
99 $der\, c\, (\epsilon)$ & $\dn$ & $\varnothing$ & \\ |
|
100 $der\, c\, (d)$ & $\dn$ & if $c = d$ then $\epsilon$ else $\varnothing$ & \\ |
|
101 $der\, c\, (r_1 + r_2)$ & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$ & \\ |
|
102 $der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $nullable (r_1)$\\ |
|
103 & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\ |
|
104 & & else $(der\, c\, r_1) \cdot r_2$\\ |
|
105 $der\, c\, (r^*)$ & $\dn$ & $(der\,c\,r) \cdot (r^*)$ & |
|
106 \end{tabular} |
|
107 \end{center} |
|
108 |
|
109 \noindent |
|
110 The first two clauses can be rationalised as follows: recall that $der$ should calculate a regular |
|
111 expression, if the ``input'' regular expression can match a string of the form $c\!::\!s$. Since neither |
|
112 $\varnothing$ nor $\epsilon$ can match such a string we return $\varnothing$. In the third case |
|
113 we have to make a case-distinction: In case the regular expression is $c$, then clearly it can recognise |
|
114 a string of the form $c\!::\!s$, just that $s$ is the empty string. Therefore we return the $\epsilon$-regular |
|
115 expression. In the other case we again return $\varnothing$ since no string of the $c\!::\!s$ can be matched. |
|
116 The $+$-case is relatively straightforward: all strings of the form $c\!::\!s$ are either matched by the |
|
117 regular expression $r_1$ or $r_2$. So we just have to recursively call $der$ with these two regular |
|
118 expressions and compose the results again with $+$. The $\cdot$-case is more complicated: |
|
119 if $r_1\cdot r_2$ matches a string of the form $c\!::\!s$, then the first part must be matched by $r_1$. |
|
120 Consequently, it makes sense to construct the regular expression for $s$ by calling $der$ with $r_1$ and |
|
121 ``appending'' $r_2$. There is however one exception to this simple rule: if $r_1$ can match the empty |
|
122 string, then all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is nullable (that is can match the |
|
123 empty string) we have to allow the choice $der\,c\,r_2$ for calculating the regular expression that can match |
|
124 $s$. The $*$-case is again simple: if $r^*$ matches a string of the form $c\!::\!s$, then the first part must be |
|
125 ``matched'' by a single copy of $r$. Therefore we call recursively $der\,c\,r$ and ``append'' $r^*$ in order to |
|
126 match the rest of $s$. |
|
127 |
|
128 Another way to rationalise the definition of $der$ is to consider the following operation on sets: |
|
129 |
|
130 \[ |
|
131 Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\} |
|
132 \] |
|
133 |
|
134 \noindent |
|
135 which essentially transforms a set of strings $A$ by filtering out all strings that do not start with $c$ and then |
|
136 strip off the $c$ from all the remaining strings. For example suppose $A = \{"f\!oo", "bar", "f\!rak"\}$ then |
|
137 \[ |
|
138 Der\,f\,A = \{"oo", "rak"\}\quad,\quad |
|
139 Der\,b\,A = \{"ar"\} \quad \text{and} \quad |
|
140 Der\,a\,A = \varnothing |
|
141 \] |
|
142 |
|
143 \noindent |
|
144 Note that in the last case $Der$ is empty, because no string in $A$ starts with $a$. With this operation we can |
|
145 state the following property about $der$: |
|
146 |
|
147 \[ |
|
148 L(der\,c\,r) = Der\,c\,(L(r)) |
|
149 \] |
|
150 |
|
151 \noindent |
|
152 This property clarifies what regular expression $der$ calculates, namely take the set of strings |
|
153 that $r$ can match ($L(r)$), filter out all strings not starting with $c$ and strip off the $c$ from the |
|
154 remaining strings---this is exactly the language that $der\,c\,r$ can match. |
|
155 |
|
156 For our matching algorithm we need to lift the notion of derivatives from characters to strings. This can be |
|
157 done using the following function, taking a string and regular expression as input and a regular expression |
|
158 as output. |
|
159 |
|
160 \begin{center} |
|
161 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}} |
|
162 $der\!s\, []\, r$ & $\dn$ & $r$ & \\ |
|
163 $der\!s\, (c\!::\!s)\, r$ & $\dn$ & $der\!s\,s\,(der\,c\,r)$ & \\ |
|
164 \end{tabular} |
|
165 \end{center} |
|
166 |
|
167 \noindent |
|
168 Having $ders$ in place, we can finally define our matching algorithm: |
|
169 |
|
170 \[ |
|
171 match\,s\,r = nullable(ders\,s\,r) |
|
172 \] |
|
173 |
|
174 \noindent |
|
175 We claim that |
|
176 |
|
177 \[ |
|
178 match\,s\,r\quad\text{if and only if}\quad s\in L(r) |
|
179 \] |
|
180 |
|
181 \noindent |
|
182 holds, which means our algorithm satisfies the specification. This algorithm was introduced by |
|
183 Janus Brzozowski in 1964. Its main attractions are simplicity and being fast, as well as |
|
184 being easily extendable for other regular expressions such as $r^{\{n\}}$, $r^?$, $\sim{}r$ and so on. |
|
185 |
|
186 \begin{figure}[p] |
|
187 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/app5.scala}}} |
|
188 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/app6.scala}}} |
|
189 \caption{Scala implementation of the nullable and derivatives functions.} |
|
190 \end{figure} |
|
191 |
|
192 \end{document} |
|
193 |
|
194 %%% Local Variables: |
|
195 %%% mode: latex |
|
196 %%% TeX-master: t |
|
197 %%% End: |