96 formally established their correctness. We have also not yet looked |
99 formally established their correctness. We have also not yet looked |
97 at extended regular expressions, such as bounded repetitions, |
100 at extended regular expressions, such as bounded repetitions, |
98 negation and back-references. |
101 negation and back-references. |
99 \end{abstract} |
102 \end{abstract} |
100 |
103 |
|
104 \section{Recapitulation of Concepts From the Last Report} |
|
105 \subsection{The Algorithm by Brzozowski based on Derivatives of Regular |
|
106 Expressions} |
|
107 |
|
108 Suppose (basic) regular expressions are given by the following grammar: |
|
109 \[ r ::= \ZERO \mid \ONE |
|
110 \mid c |
|
111 \mid r_1 \cdot r_2 |
|
112 \mid r_1 + r_2 |
|
113 \mid r^* |
|
114 \] |
|
115 |
|
116 \noindent |
|
117 |
|
118 The ingenious contribution by Brzozowski is the notion of |
|
119 \emph{derivatives} of regular expressions. |
|
120 \begin{center} |
|
121 \begin{tabular}{lcl} |
|
122 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
|
123 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
|
124 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
|
125 $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
|
126 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
|
127 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
|
128 \end{tabular} |
|
129 \end{center} |
|
130 |
|
131 \begin{center} |
|
132 \begin{tabular}{lcl} |
|
133 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
|
134 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
|
135 $d \backslash c$ & $\dn$ & |
|
136 $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
|
137 $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
|
138 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\ |
|
139 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
|
140 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
|
141 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
|
142 \end{tabular} |
|
143 \end{center} |
|
144 |
|
145 %Assuming the classic notion of a |
|
146 %\emph{language} of a regular expression, written $L(\_)$, t |
|
147 |
|
148 \noindent |
|
149 The main property of the derivative operation is that |
|
150 |
|
151 \begin{center} |
|
152 $c\!::\!s \in L(r)$ holds |
|
153 if and only if $s \in L(r\backslash c)$. |
|
154 \end{center} |
|
155 |
|
156 \noindent |
|
157 |
|
158 |
|
159 Now we can generalise the derivative operation to strings like this: |
|
160 |
|
161 \begin{center} |
|
162 \begin{tabular}{lcl} |
|
163 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ |
|
164 $r \backslash [\,] $ & $\dn$ & $r$ |
|
165 \end{tabular} |
|
166 \end{center} |
|
167 |
|
168 \noindent |
|
169 and then define as regular-expression matching algorithm: |
|
170 \[ |
|
171 match\;s\;r \;\dn\; nullable(r\backslash s) |
|
172 \] |
|
173 |
|
174 \noindent |
|
175 This algorithm looks graphically as follows: |
|
176 \begin{equation}\label{graph:*} |
|
177 \begin{tikzcd} |
|
178 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO} |
|
179 \end{tikzcd} |
|
180 \end{equation} |
|
181 |
|
182 \noindent |
|
183 where we start with a regular expression $r_0$, build successive |
|
184 derivatives until we exhaust the string and then use \textit{nullable} |
|
185 to test whether the result can match the empty string. It can be |
|
186 relatively easily shown that this matcher is correct (that is given |
|
187 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$). |
|
188 |
|
189 |
|
190 \subsection{Values and the Algorithm by Sulzmann and Lu} |
|
191 |
|
192 One limitation of Brzozowski's algorithm is that it only produces a |
|
193 YES/NO answer for whether a string is being matched by a regular |
|
194 expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm |
|
195 to allow generation of an actual matching, called a \emph{value} or |
|
196 sometimes also \emph{lexical value}. These values and regular |
|
197 expressions correspond to each other as illustrated in the following |
|
198 table: |
|
199 |
|
200 |
|
201 \begin{center} |
|
202 \begin{tabular}{c@{\hspace{20mm}}c} |
|
203 \begin{tabular}{@{}rrl@{}} |
|
204 \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\ |
|
205 $r$ & $::=$ & $\ZERO$\\ |
|
206 & $\mid$ & $\ONE$ \\ |
|
207 & $\mid$ & $c$ \\ |
|
208 & $\mid$ & $r_1 \cdot r_2$\\ |
|
209 & $\mid$ & $r_1 + r_2$ \\ |
|
210 \\ |
|
211 & $\mid$ & $r^*$ \\ |
|
212 \end{tabular} |
|
213 & |
|
214 \begin{tabular}{@{\hspace{0mm}}rrl@{}} |
|
215 \multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\ |
|
216 $v$ & $::=$ & \\ |
|
217 & & $\Empty$ \\ |
|
218 & $\mid$ & $\Char(c)$ \\ |
|
219 & $\mid$ & $\Seq\,v_1\, v_2$\\ |
|
220 & $\mid$ & $\Left(v)$ \\ |
|
221 & $\mid$ & $\Right(v)$ \\ |
|
222 & $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\ |
|
223 \end{tabular} |
|
224 \end{tabular} |
|
225 \end{center} |
|
226 |
|
227 \noindent |
|
228 |
|
229 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
|
230 algorithm by a second phase (the first phase being building successive |
|
231 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value |
|
232 is generated in case the regular expression matches the string. |
|
233 Pictorially, the Sulzmann and Lu algorithm is as follows: |
|
234 |
|
235 \begin{ceqn} |
|
236 \begin{equation}\label{graph:2} |
|
237 \begin{tikzcd} |
|
238 r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
|
239 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] |
|
240 \end{tikzcd} |
|
241 \end{equation} |
|
242 \end{ceqn} |
|
243 |
|
244 \noindent |
|
245 For convenience, we shall employ the following notations: the regular |
|
246 expression we start with is $r_0$, and the given string $s$ is composed |
|
247 of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the |
|
248 left to right, we build the derivatives $r_1$, $r_2$, \ldots according |
|
249 to the characters $c_0$, $c_1$ until we exhaust the string and obtain |
|
250 the derivative $r_n$. We test whether this derivative is |
|
251 $\textit{nullable}$ or not. If not, we know the string does not match |
|
252 $r$ and no value needs to be generated. If yes, we start building the |
|
253 values incrementally by \emph{injecting} back the characters into the |
|
254 earlier values $v_n, \ldots, v_0$. This is the second phase of the |
|
255 algorithm from the right to left. For the first value $v_n$, we call the |
|
256 function $\textit{mkeps}$, which builds the lexical value |
|
257 for how the empty string has been matched by the (nullable) regular |
|
258 expression $r_n$. This function is defined as |
|
259 |
|
260 \begin{center} |
|
261 \begin{tabular}{lcl} |
|
262 $\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\ |
|
263 $\mkeps(r_{1}+r_{2})$ & $\dn$ |
|
264 & \textit{if} $\nullable(r_{1})$\\ |
|
265 & & \textit{then} $\Left(\mkeps(r_{1}))$\\ |
|
266 & & \textit{else} $\Right(\mkeps(r_{2}))$\\ |
|
267 $\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\ |
|
268 $mkeps(r^*)$ & $\dn$ & $\Stars\,[]$ |
|
269 \end{tabular} |
|
270 \end{center} |
|
271 |
|
272 |
|
273 \noindent |
|
274 After the $\mkeps$-call, we inject back the characters one by one in order to build |
|
275 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ |
|
276 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
|
277 After injecting back $n$ characters, we get the lexical value for how $r_0$ |
|
278 matches $s$. For this Sulzmann and Lu defined a function that reverses |
|
279 the ``chopping off'' of characters during the derivative phase. The |
|
280 corresponding function is called \emph{injection}, written |
|
281 $\textit{inj}$; it takes three arguments: the first one is a regular |
|
282 expression ${r_{i-1}}$, before the character is chopped off, the second |
|
283 is a character ${c_{i-1}}$, the character we want to inject and the |
|
284 third argument is the value ${v_i}$, into which one wants to inject the |
|
285 character (it corresponds to the regular expression after the character |
|
286 has been chopped off). The result of this function is a new value. The |
|
287 definition of $\textit{inj}$ is as follows: |
|
288 |
|
289 \begin{center} |
|
290 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
|
291 $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
|
292 $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ |
|
293 $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\ |
|
294 $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ |
|
295 $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ |
|
296 $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\ |
|
297 $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\ |
|
298 \end{tabular} |
|
299 \end{center} |
|
300 |
|
301 \noindent This definition is by recursion on the ``shape'' of regular |
|
302 expressions and values. |
|
303 |
|
304 |
|
305 \subsection{Simplification of Regular Expressions} |
|
306 |
|
307 The main drawback of building successive derivatives according |
|
308 to Brzozowski's definition is that they can grow very quickly in size. |
|
309 This is mainly due to the fact that the derivative operation generates |
|
310 often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, if |
|
311 implemented naively both algorithms by Brzozowski and by Sulzmann and Lu |
|
312 are excruciatingly slow. For example when starting with the regular |
|
313 expression $(a + aa)^*$ and building 12 successive derivatives |
|
314 w.r.t.~the character $a$, one obtains a derivative regular expression |
|
315 with more than 8000 nodes (when viewed as a tree). Operations like |
|
316 $\textit{der}$ and $\nullable$ need to traverse such trees and |
|
317 consequently the bigger the size of the derivative the slower the |
|
318 algorithm. |
|
319 |
|
320 Fortunately, one can simplify regular expressions after each derivative |
|
321 step. Various simplifications of regular expressions are possible, such |
|
322 as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r |
|
323 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not |
|
324 affect the answer for whether a regular expression matches a string or |
|
325 not, but fortunately also do not affect the POSIX strategy of how |
|
326 regular expressions match strings---although the latter is much harder |
|
327 to establish. Some initial results in this regard have been |
|
328 obtained in \cite{AusafDyckhoffUrban2016}. |
|
329 |
|
330 If we want the size of derivatives in Sulzmann and Lu's algorithm to |
|
331 stay below this bound, we would need more aggressive simplifications. |
|
332 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as |
|
333 deleting duplicates whenever possible. For example, the parentheses in |
|
334 $(a+b) \cdot c + bc$ can be opened up to get $a\cdot c + b \cdot c + b |
|
335 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another |
|
336 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just |
|
337 $a^*+a+\ONE$. Adding these more aggressive simplification rules helps us |
|
338 to achieve the same size bound as that of the partial derivatives. |
|
339 |
|
340 In order to implement the idea of ``spilling out alternatives'' and to |
|
341 make them compatible with the $\text{inj}$-mechanism, we use |
|
342 \emph{bitcodes}. Bits and bitcodes (lists of bits) are just: |
|
343 |
|
344 |
|
345 \begin{center} |
|
346 $b ::= S \mid Z \qquad |
|
347 bs ::= [] \mid b:bs |
|
348 $ |
|
349 \end{center} |
|
350 |
|
351 \noindent |
|
352 The $S$ and $Z$ are arbitrary names for the bits in order to avoid |
|
353 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or |
|
354 bit-lists) can be used to encode values (or incomplete values) in a |
|
355 compact form. This can be straightforwardly seen in the following |
|
356 coding function from values to bitcodes: |
|
357 |
|
358 \begin{center} |
|
359 \begin{tabular}{lcl} |
|
360 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
|
361 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
|
362 $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\ |
|
363 $\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\ |
|
364 $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\ |
|
365 $\textit{code}(\Stars\,[])$ & $\dn$ & $[\Z]$\\ |
|
366 $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\S :: code(v) \;@\; |
|
367 code(\Stars\,vs)$ |
|
368 \end{tabular} |
|
369 \end{center} |
|
370 |
|
371 \noindent |
|
372 Here $\textit{code}$ encodes a value into a bitcodes by converting |
|
373 $\Left$ into $\Z$, $\Right$ into $\S$, the start point of a non-empty |
|
374 star iteration into $\S$, and the border where a local star terminates |
|
375 into $\Z$. This coding is lossy, as it throws away the information about |
|
376 characters, and also does not encode the ``boundary'' between two |
|
377 sequence values. Moreover, with only the bitcode we cannot even tell |
|
378 whether the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The |
|
379 reason for choosing this compact way of storing information is that the |
|
380 relatively small size of bits can be easily manipulated and ``moved |
|
381 around'' in a regular expression. In order to recover values, we will |
|
382 need the corresponding regular expression as an extra information. This |
|
383 means the decoding function is defined as: |
|
384 |
|
385 |
|
386 %\begin{definition}[Bitdecoding of Values]\mbox{} |
|
387 \begin{center} |
|
388 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} |
|
389 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ |
|
390 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ |
|
391 $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
|
392 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; |
|
393 (\Left\,v, bs_1)$\\ |
|
394 $\textit{decode}'\,(\S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
|
395 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; |
|
396 (\Right\,v, bs_1)$\\ |
|
397 $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & |
|
398 $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\ |
|
399 & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\ |
|
400 & & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\ |
|
401 $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ |
|
402 $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & |
|
403 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
|
404 & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\ |
|
405 & & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\ |
|
406 |
|
407 $\textit{decode}\,bs\,r$ & $\dn$ & |
|
408 $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
|
409 & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; |
|
410 \textit{else}\;\textit{None}$ |
|
411 \end{tabular} |
|
412 \end{center} |
|
413 %\end{definition} |
|
414 |
|
415 Sulzmann and Lu's integrated the bitcodes into regular expressions to |
|
416 create annotated regular expressions \cite{Sulzmann2014}. |
|
417 \emph{Annotated regular expressions} are defined by the following |
|
418 grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$} |
|
419 |
|
420 \begin{center} |
|
421 \begin{tabular}{lcl} |
|
422 $\textit{a}$ & $::=$ & $\textit{ZERO}$\\ |
|
423 & $\mid$ & $\textit{ONE}\;\;bs$\\ |
|
424 & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\ |
|
425 & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\ |
|
426 & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\ |
|
427 & $\mid$ & $\textit{STAR}\;\;bs\,a$ |
|
428 \end{tabular} |
|
429 \end{center} |
|
430 %(in \textit{ALTS}) |
|
431 |
|
432 \noindent |
|
433 where $bs$ stands for bitcodes, $a$ for $\bold{a}$nnotated regular |
|
434 expressions and $as$ for a list of annotated regular expressions. |
|
435 The alternative constructor($\textit{ALTS}$) has been generalized to |
|
436 accept a list of annotated regular expressions rather than just 2. |
|
437 We will show that these bitcodes encode information about |
|
438 the (POSIX) value that should be generated by the Sulzmann and Lu |
|
439 algorithm. |
|
440 |
|
441 |
|
442 To do lexing using annotated regular expressions, we shall first |
|
443 transform the usual (un-annotated) regular expressions into annotated |
|
444 regular expressions. This operation is called \emph{internalisation} and |
|
445 defined as follows: |
|
446 |
|
447 %\begin{definition} |
|
448 \begin{center} |
|
449 \begin{tabular}{lcl} |
|
450 $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\ |
|
451 $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\ |
|
452 $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\ |
|
453 $(r_1 + r_2)^\uparrow$ & $\dn$ & |
|
454 $\textit{ALTS}\;[]\,List((\textit{fuse}\,[\Z]\,r_1^\uparrow),\, |
|
455 (\textit{fuse}\,[\S]\,r_2^\uparrow))$\\ |
|
456 $(r_1\cdot r_2)^\uparrow$ & $\dn$ & |
|
457 $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\ |
|
458 $(r^*)^\uparrow$ & $\dn$ & |
|
459 $\textit{STAR}\;[]\,r^\uparrow$\\ |
|
460 \end{tabular} |
|
461 \end{center} |
|
462 %\end{definition} |
|
463 |
|
464 \noindent |
|
465 We use up arrows here to indicate that the basic un-annotated regular |
|
466 expressions are ``lifted up'' into something slightly more complex. In the |
|
467 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to |
|
468 attach bits to the front of an annotated regular expression. Its |
|
469 definition is as follows: |
|
470 |
|
471 \begin{center} |
|
472 \begin{tabular}{lcl} |
|
473 $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\ |
|
474 $\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ & |
|
475 $\textit{ONE}\,(bs\,@\,bs')$\\ |
|
476 $\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ & |
|
477 $\textit{CHAR}\,(bs\,@\,bs')\,c$\\ |
|
478 $\textit{fuse}\;bs\,(\textit{ALTS}\,bs'\,as)$ & $\dn$ & |
|
479 $\textit{ALTS}\,(bs\,@\,bs')\,as$\\ |
|
480 $\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ & |
|
481 $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\ |
|
482 $\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ & |
|
483 $\textit{STAR}\,(bs\,@\,bs')\,a$ |
|
484 \end{tabular} |
|
485 \end{center} |
|
486 |
|
487 \noindent |
|
488 After internalising the regular expression, we perform successive |
|
489 derivative operations on the annotated regular expressions. This |
|
490 derivative operation is the same as what we had previously for the |
|
491 basic regular expressions, except that we beed to take care of |
|
492 the bitcodes: |
|
493 |
|
494 %\begin{definition}{bder} |
|
495 \begin{center} |
|
496 \begin{tabular}{@{}lcl@{}} |
|
497 $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
|
498 $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
|
499 $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ & |
|
500 $\textit{if}\;c=d\; \;\textit{then}\; |
|
501 \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ |
|
502 $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ & |
|
503 $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\ |
|
504 $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & |
|
505 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
|
506 & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\ |
|
507 & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ |
|
508 & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\ |
|
509 $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ & |
|
510 $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\, |
|
511 (\textit{STAR}\,[]\,r)$ |
|
512 \end{tabular} |
|
513 \end{center} |
|
514 %\end{definition} |
|
515 |
|
516 \noindent |
|
517 For instance, when we unfold $\textit{STAR} \; bs \; a$ into a sequence, |
|
518 we need to attach an additional bit $Z$ to the front of $r \backslash c$ |
|
519 to indicate that there is one more star iteration. Also the $SEQ$ clause |
|
520 is more subtle---when $a_1$ is $\textit{bnullable}$ (here |
|
521 \textit{bnullable} is exactly the same as $\textit{nullable}$, except |
|
522 that it is for annotated regular expressions, therefore we omit the |
|
523 definition). Assume that $bmkeps$ correctly extracts the bitcode for how |
|
524 $a_1$ matches the string prior to character $c$ (more on this later), |
|
525 then the right branch of $ALTS$, which is $fuse \; bmkeps \; a_1 (a_2 |
|
526 \backslash c)$ will collapse the regular expression $a_1$(as it has |
|
527 already been fully matched) and store the parsing information at the |
|
528 head of the regular expression $a_2 \backslash c$ by fusing to it. The |
|
529 bitsequence $bs$, which was initially attached to the head of $SEQ$, has |
|
530 now been elevated to the top-level of $ALTS$, as this information will be |
|
531 needed whichever way the $SEQ$ is matched---no matter whether $c$ belongs |
|
532 to $a_1$ or $ a_2$. After building these derivatives and maintaining all |
|
533 the lexing information, we complete the lexing by collecting the |
|
534 bitcodes using a generalised version of the $\textit{mkeps}$ function |
|
535 for annotated regular expressions, called $\textit{bmkeps}$: |
|
536 |
|
537 |
|
538 %\begin{definition}[\textit{bmkeps}]\mbox{} |
|
539 \begin{center} |
|
540 \begin{tabular}{lcl} |
|
541 $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\ |
|
542 $\textit{bmkeps}\,(\textit{ALTS}\;bs\,a::as)$ & $\dn$ & |
|
543 $\textit{if}\;\textit{bnullable}\,a$\\ |
|
544 & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\ |
|
545 & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(\textit{ALTS}\;bs\,as)$\\ |
|
546 $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & |
|
547 $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ |
|
548 $\textit{bmkeps}\,(\textit{STAR}\;bs\,a)$ & $\dn$ & |
|
549 $bs \,@\, [\S]$ |
|
550 \end{tabular} |
|
551 \end{center} |
|
552 %\end{definition} |
|
553 |
|
554 \noindent |
|
555 This function completes the value information by travelling along the |
|
556 path of the regular expression that corresponds to a POSIX value and |
|
557 collecting all the bitcodes, and using $S$ to indicate the end of star |
|
558 iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and |
|
559 decode them, we get the value we expect. The corresponding lexing |
|
560 algorithm looks as follows: |
|
561 |
|
562 \begin{center} |
|
563 \begin{tabular}{lcl} |
|
564 $\textit{blexer}\;r\,s$ & $\dn$ & |
|
565 $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\ |
|
566 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
567 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
568 & & $\;\;\textit{else}\;\textit{None}$ |
|
569 \end{tabular} |
|
570 \end{center} |
|
571 |
|
572 \noindent |
|
573 In this definition $\_\backslash s$ is the generalisation of the derivative |
|
574 operation from characters to strings (just like the derivatives for un-annotated |
|
575 regular expressions). |
|
576 |
|
577 The main point of the bitcodes and annotated regular expressions is that |
|
578 we can apply rather aggressive (in terms of size) simplification rules |
|
579 in order to keep derivatives small. We have developed such |
|
580 ``aggressive'' simplification rules and generated test data that show |
|
581 that the expected bound can be achieved. Obviously we could only |
|
582 partially cover the search space as there are infinitely many regular |
|
583 expressions and strings. |
|
584 |
|
585 One modification we introduced is to allow a list of annotated regular |
|
586 expressions in the \textit{ALTS} constructor. This allows us to not just |
|
587 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but |
|
588 also unnecessary ``copies'' of regular expressions (very similar to |
|
589 simplifying $r + r$ to just $r$, but in a more general setting). Another |
|
590 modification is that we use simplification rules inspired by Antimirov's |
|
591 work on partial derivatives. They maintain the idea that only the first |
|
592 ``copy'' of a regular expression in an alternative contributes to the |
|
593 calculation of a POSIX value. All subsequent copies can be pruned away from |
|
594 the regular expression. A recursive definition of our simplification function |
|
595 that looks somewhat similar to our Scala code is given below: |
|
596 %\comment{Use $\ZERO$, $\ONE$ and so on. |
|
597 %Is it $ALTS$ or $ALTS$?}\\ |
|
598 |
|
599 \begin{center} |
|
600 \begin{tabular}{@{}lcl@{}} |
|
601 |
|
602 $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ |
|
603 &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ |
|
604 &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ |
|
605 &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ |
|
606 &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ |
|
607 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\ |
|
608 |
|
609 $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\ |
|
610 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ |
|
611 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
|
612 &&$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$\\ |
|
613 |
|
614 $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ |
|
615 \end{tabular} |
|
616 \end{center} |
|
617 |
|
618 \noindent |
|
619 The simplification does a pattern matching on the regular expression. |
|
620 When it detected that the regular expression is an alternative or |
|
621 sequence, it will try to simplify its children regular expressions |
|
622 recursively and then see if one of the children turn into $\ZERO$ or |
|
623 $\ONE$, which might trigger further simplification at the current level. |
|
624 The most involved part is the $\textit{ALTS}$ clause, where we use two |
|
625 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested |
|
626 $\textit{ALTS}$ and reduce as many duplicates as possible. Function |
|
627 $\textit{distinct}$ keeps the first occurring copy only and remove all later ones |
|
628 when detected duplicates. Function $\textit{flatten}$ opens up nested \textit{ALTS}. |
|
629 Its recursive definition is given below: |
|
630 |
|
631 \begin{center} |
|
632 \begin{tabular}{@{}lcl@{}} |
|
633 $\textit{flatten} \; (\textit{ALTS}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \; |
|
634 (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ |
|
635 $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \; as' $ \\ |
|
636 $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise) |
|
637 \end{tabular} |
|
638 \end{center} |
|
639 |
|
640 \noindent |
|
641 Here $\textit{flatten}$ behaves like the traditional functional programming flatten |
|
642 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it |
|
643 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. |
|
644 |
|
645 Suppose we apply simplification after each derivative step, and view |
|
646 these two operations as an atomic one: $a \backslash_{simp}\,c \dn |
|
647 \textit{simp}(a \backslash c)$. Then we can use the previous natural |
|
648 extension from derivative w.r.t.~character to derivative |
|
649 w.r.t.~string:%\comment{simp in the [] case?} |
|
650 |
|
651 \begin{center} |
|
652 \begin{tabular}{lcl} |
|
653 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ |
|
654 $r \backslash_{simp} [\,] $ & $\dn$ & $r$ |
|
655 \end{tabular} |
|
656 \end{center} |
|
657 |
|
658 \noindent |
|
659 we obtain an optimised version of the algorithm: |
|
660 |
|
661 \begin{center} |
|
662 \begin{tabular}{lcl} |
|
663 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
|
664 $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ |
|
665 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
666 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
667 & & $\;\;\textit{else}\;\textit{None}$ |
|
668 \end{tabular} |
|
669 \end{center} |
|
670 |
|
671 \noindent |
|
672 This algorithm keeps the regular expression size small, for example, |
|
673 with this simplification our previous $(a + aa)^*$ example's 8000 nodes |
|
674 will be reduced to just 6 and stays constant, no matter how long the |
|
675 input string is. |
|
676 |
|
677 |
|
678 |
101 \section{Introduction} |
679 \section{Introduction} |
102 |
680 |
103 While we believe derivatives of regular expressions, written |
681 While we believe derivatives of regular expressions, written |
104 $r\backslash s$, are a beautiful concept (in terms of ease of |
682 $r\backslash s$, are a beautiful concept (in terms of ease of |
105 implementing them in functional programming languages and in terms of |
683 implementing them in functional programming languages and in terms of |
443 $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ |
1087 $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ |
444 \end{tabular} |
1088 \end{tabular} |
445 \end{center} |
1089 \end{center} |
446 |
1090 |
447 \noindent |
1091 \noindent |
448 If we call $\simp$ on $\rup$, just as $\backslash_{simp}$ |
1092 |
|
1093 and the difinition of $\flatten$: |
|
1094 \begin{center} |
|
1095 \begin{tabular}{c c c} |
|
1096 $\flatten \; []$ & $\dn$ & $[]$\\ |
|
1097 $\flatten \; \ZERO::rs$ & $\dn$ & $rs$\\ |
|
1098 $\flatten \;(_{\textit{bs}_1}\oplus \textit{rs}_1 ::rs)$ & $\dn$ & $(\map \, (\fuse \, \textit{bs}_1) \,\textit{rs}_1) ::: \flatten(rs)$\\ |
|
1099 $\flatten \; r :: rs$ & $\dn$ & $r::\flatten(rs)$ |
|
1100 \end{tabular} |
|
1101 \end{center} |
|
1102 |
|
1103 \noindent |
|
1104 If we call $\simp$ on $\rup\backslash a$, just as $\backslash_{simp}$ |
449 requires, then we would go throught the third clause of |
1105 requires, then we would go throught the third clause of |
450 the sequence case:$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$. |
1106 the sequence case:$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$. |
451 The $\ZERO$ of $(_0\ONE + \ZERO)$ is simplified away and |
1107 The $\ZERO$ of $(_0\ONE + \ZERO)$ is thrown away |
452 $_0\ONE$ merged into $_0a + _1a^*$ by simply |
1108 by $\flatten$ and |
|
1109 $_0\ONE$ merged into $(_0a + _1a^*)$ by simply |
453 putting its bits($_0$) to the front of the second component: |
1110 putting its bits($_0$) to the front of the second component: |
454 ${\bf_0}(_0a + _1a^*)$. |
1111 ${\bf_0}(_0a + _1a^*)$. |
455 After a second derivative operation, |
1112 After a second derivative operation, |
456 namely, $(_0(_0a + _1a^*))\backslash a$, we get |
1113 namely, $(_0(_0a + _1a^*))\backslash a$, we get |
457 $ |
1114 $ |
458 _0(_0 \ONE + _1(_1\ONE \cdot a^*)) |
1115 _0(_0 \ONE + _1(_1\ONE \cdot a^*)) |
459 $, and this simplifies to $_0(_0 \ONE + _{11} a^*)$ |
1116 $, and this simplifies to $_0(_0 \ONE + _{11} a^*)$ |
460 by the third clause of the alternative case: |
1117 by the third clause of the alternative case: |
461 $\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$. |
1118 $\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$. |
462 The outmost bit $_0$ remains unchanged and stays with |
1119 The outmost bit $_0$ stays with |
463 the outmost regular expression. However, things are a bit |
1120 the outmost regular expression, rather than being fused to |
464 different when it comes to $\simp(\rup\backslash \, s)$, because |
1121 its child regular expressions, as what we will later see happens |
465 without simplification, first term of the sequence |
1122 to $\simp(\rup\backslash \, s)$. |
466 $\rup\backslash a=(_0\ONE + \ZERO)(_0a + _1a^*)$ |
1123 If we choose to not simplify in the midst of derivative operations, |
467 is not merged into the second component |
1124 but only do it at the end after the string has been exhausted, |
468 and is nullable. |
1125 namely, $\simp(\rup\backslash \, s)=\simp((\rup\backslash a)\backslash a)$, |
|
1126 then at the {\bf second} derivative of |
|
1127 $(\rup\backslash a)\bf{\backslash a}$ |
|
1128 we will go throught the clause of $\backslash$: |
|
1129 \begin{center} |
|
1130 \begin{tabular}{lcl} |
|
1131 $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & |
|
1132 $(when \; \textit{bnullable}\,a_1)$\\ |
|
1133 & &$\textit{ALTS}\,bs\,List(\;\;(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\ |
|
1134 & &$(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))\;\;)$\\ |
|
1135 \end{tabular} |
|
1136 \end{center} |
|
1137 |
|
1138 because |
|
1139 $\rup\backslash a = (_0\ONE + \ZERO)(_0a + _1a^*)$ |
|
1140 is a sequence |
|
1141 with the first component being nullable |
|
1142 (unsimplified, unlike the first round of running$\backslash_{simp}$). |
469 Therefore $((_0\ONE + \ZERO)(_0a + _1a^*))\backslash a$ splits into |
1143 Therefore $((_0\ONE + \ZERO)(_0a + _1a^*))\backslash a$ splits into |
470 $([(\ZERO + \ZERO)\cdot(_0a + _1a^*)] + _0( _0\ONE + _1[_1\ONE \cdot a^*]))$. |
1144 $([(\ZERO + \ZERO)\cdot(_0a + _1a^*)] + _0( _0\ONE + _1[_1\ONE \cdot a^*]))$. |
471 After these two successive derivatives without simplification, |
1145 After these two successive derivatives without simplification, |
472 we apply $\simp$ to this regular expression, which goes through |
1146 we apply $\simp$ to this regular expression, which goes through |
473 the alternative clause, and each component of $([(\ZERO + \ZERO)\cdot(_0a + _1a^*)] + _0( _0\ONE + _1[_1\ONE \cdot a^*]))$ will be simplified, giving us the list:$[\ZERO, _0(_0\ONE + _{11}a^*]$,this |
1147 the alternative clause, and each component of |
474 list is then flattened--for |
1148 $([(\ZERO + \ZERO)\cdot(_0a + _1a^*)] + _0( _0\ONE + _1[_1\ONE \cdot a^*]))$ |
475 $([(\ZERO + \ZERO)\cdot(_0a + _1a^*)]$ it will be simplified into $\ZERO$ |
1149 will be simplified, giving us the list:$[\ZERO, _0(_0\ONE + _{11}a^*)]$ |
476 and then thrown away by $\textit{flatten}$; $ _0( _0\ONE + _1[_1\ONE \cdot a^*]))$ |
1150 This list is then "flattened"--$\ZERO$ will be |
477 becomes $ _{00}\ONE + _{011}a^*]))$ because flatten opens up the alternative |
1151 thrown away by $\textit{flatten}$; $ _0(_0\ONE + _{11}a^*)$ |
478 $\ONE + a^*$ and fuses the front bit(s) $_0$ to the front of $_0\ONE $ and $_{11}a^*$ |
1152 is opened up to make the list consisting of two separate elements |
479 and get $_{00}$ and $_{011}$. |
1153 $_{00}\ONE$ and $_{011}a^*$, note that $flatten$ |
480 %CONSTRUCTION SITE |
1154 $\fuse$s the bit(s) $_0$ to the front of $_0\ONE $ and $_{11}a^*$. |
|
1155 In a nutshell, the order of simplification causes |
|
1156 the bits to be moved differently. |
|
1157 |
|
1158 \subsubsection{A Failed Attempt To Remedy the Problem Above} |
|
1159 A simple class of regular expressions and strings |
|
1160 pairs can be deduced |
|
1161 from the above example which |
|
1162 trigger the difference between |
|
1163 $\rup\backslash_{simp} \, s$ |
|
1164 and $\simp(\rup\backslash s)$: |
|
1165 \[ |
|
1166 D = (c_1c_2, \{r_1\cdot r_2 \mid \text{$ c_1\in L(r_1), r_1 \; not \; \nullable, c_2 \in L(r_2), |
|
1167 r_2 \backslash c_2$ is of shape alternative and $c_1c_2 \notin L(r_1)$}\}) |
|
1168 \] |
481 and we can use this to construct a set of examples based |
1169 and we can use this to construct a set of examples based |
482 on this type of behaviour of two operations: |
1170 on this type of behaviour of two operations: |
483 $r_1$ |
1171 $r_1$ |
484 that is to say, despite the bits are being moved around on the regular expression |
1172 that is to say, despite the bits are being moved around on the regular expression |
485 (difference in bits), the structure of the (unannotated)regular expression |
1173 (difference in bits), the structure of the (unannotated)regular expression |
486 after one simplification is exactly the same after the |
1174 after one simplification is exactly the same after the |
487 same sequence of derivative operations |
1175 same sequence of derivative operations |
488 regardless of whether we did simplification |
1176 regardless of whether we did simplification |
489 along the way. |
1177 along the way. |
|
1178 One way would be to give a function that calls |
|
1179 %CONSTRUCTION SITE |
|
1180 fuse is the culprit: it causes the order in which alternatives |
|
1181 are opened up to be remembered and finally the difference |
|
1182 appear in $\simp(\rup \backslash s)$ and $\rup \backslash{simp} \,s$. |
|
1183 but we have to use them as they are essential in the simplification: |
|
1184 flatten needs them. |
|
1185 |
|
1186 |
|
1187 |
490 However, without erase the above equality does not hold: |
1188 However, without erase the above equality does not hold: |
491 for the regular expression |
1189 for the regular expression |
492 $(a+b)(a+a*)$, |
1190 $(a+b)(a+a*)$, |
493 if we do derivative with respect to string $aa$, |
1191 if we do derivative with respect to string $aa$, |
494 we get |
1192 we get |