213 ``fire''---so is it an identifier or a keyword? While in applications |
213 ``fire''---so is it an identifier or a keyword? While in applications |
214 there is a well-known strategy to decide these questions, called POSIX |
214 there is a well-known strategy to decide these questions, called POSIX |
215 matching, only relatively recently precise definitions of what POSIX |
215 matching, only relatively recently precise definitions of what POSIX |
216 matching actually means have been formalised |
216 matching actually means have been formalised |
217 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly, |
217 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly, |
218 POSIX matching means matching the longest initial substring and |
218 POSIX matching means matching the longest initial substring. |
219 in the case of a tie, the initial submatch is chosen according to some priorities attached to the |
219 In the case of a tie, the initial submatch is chosen according to some priorities attached to the |
220 regular expressions (e.g.~keywords have a higher priority than |
220 regular expressions (e.g.~keywords have a higher priority than |
221 identifiers). This sounds rather simple, but according to Grathwohl et |
221 identifiers). This sounds rather simple, but according to Grathwohl et |
222 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote: |
222 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote: |
223 |
223 |
224 \begin{quote} |
224 \begin{quote} |
289 \end{tabular} |
289 \end{tabular} |
290 \end{center} |
290 \end{center} |
291 |
291 |
292 \noindent |
292 \noindent |
293 |
293 |
294 |
|
295 |
|
296 %Assuming the classic notion of a |
|
297 %\emph{language} of a regular expression, written $L(\_)$, t |
|
298 The main |
|
299 property of the derivative operation is that |
|
300 |
|
301 \begin{center} |
|
302 $c\!::\!s \in L(r)$ holds |
|
303 if and only if $s \in L(r\backslash c)$. |
|
304 \end{center} |
|
305 |
|
306 \noindent |
|
307 So if we want to find out whether a string $s$ |
|
308 matches with a regular expression $r$, build the derivatives of $r$ |
|
309 w.r.t.\ (in succession) all the characters of the string $s$. Finally, |
|
310 test whether the resulting regular expression can match the empty |
|
311 string. If yes, then $r$ matches $s$, and no in the negative |
|
312 case.\\ |
|
313 We can generalise the derivative operation for strings like this: |
|
314 \begin{center} |
|
315 \begin{tabular}{lcl} |
|
316 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ |
|
317 $r \backslash \epsilon $ & $\dn$ & $r$ |
|
318 \end{tabular} |
|
319 \end{center} |
|
320 For us the main advantage is that derivatives can be |
294 For us the main advantage is that derivatives can be |
321 straightforwardly implemented in any functional programming language, |
295 straightforwardly implemented in any functional programming language, |
322 and are easily definable and reasoned about in theorem provers---the |
296 and are easily definable and reasoned about in theorem provers---the |
323 definitions just consist of inductive datatypes and simple recursive |
297 definitions just consist of inductive datatypes and simple recursive |
324 functions. Moreover, the notion of derivatives can be easily |
298 functions. Moreover, the notion of derivatives can be easily |
326 the not-regular expression, written $\neg\,r$, or bounded repetitions |
300 the not-regular expression, written $\neg\,r$, or bounded repetitions |
327 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so |
301 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so |
328 straightforwardly realised within the classic automata approach. |
302 straightforwardly realised within the classic automata approach. |
329 |
303 |
330 |
304 |
331 We obtain a simple and elegant regular |
305 |
|
306 %Assuming the classic notion of a |
|
307 %\emph{language} of a regular expression, written $L(\_)$, t |
|
308 The main |
|
309 property of the derivative operation is that |
|
310 |
|
311 \begin{center} |
|
312 $c\!::\!s \in L(r)$ holds |
|
313 if and only if $s \in L(r\backslash c)$. |
|
314 \end{center} |
|
315 |
|
316 \noindent |
|
317 So if we want to find out whether a string $s$ |
|
318 matches with a regular expression $r$, build the derivatives of $r$ |
|
319 w.r.t.\ (in succession) all the characters of the string $s$. Finally, |
|
320 test whether the resulting regular expression can match the empty |
|
321 string. If yes, then $r$ matches $s$, and no in the negative |
|
322 case. |
|
323 |
|
324 We can generalise the derivative operation for strings like this: |
|
325 \begin{center} |
|
326 \begin{tabular}{lcl} |
|
327 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ |
|
328 $r \backslash \epsilon $ & $\dn$ & $r$ |
|
329 \end{tabular} |
|
330 \end{center} |
|
331 \noindent |
|
332 Using the above definition we obtain a simple and elegant regular |
332 expression matching algorithm: |
333 expression matching algorithm: |
333 \begin{definition}{matcher} |
|
334 \[ |
334 \[ |
335 match\;s\;r \;\dn\; nullable(r\backslash s) |
335 match\;s\;r \;\dn\; nullable(r\backslash s) |
336 \] |
336 \] |
337 \end{definition} |
|
338 |
|
339 |
|
340 |
|
341 One limitation, however, of Brzozowski's algorithm is that it only |
337 One limitation, however, of Brzozowski's algorithm is that it only |
342 produces a YES/NO answer for whether a string is being matched by a |
338 produces a YES/NO answer for whether a string is being matched by a |
343 regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this |
339 regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this |
344 algorithm to allow generation of an actual matching, called a |
340 algorithm to allow generation of an actual matching, called a |
345 \emph{value}. |
341 \emph{value}. |
370 \end{tabular} |
366 \end{tabular} |
371 \end{center} |
367 \end{center} |
372 |
368 |
373 \noindent |
369 \noindent |
374 Here we put the regular expression and values of the same shape on the same level to illustrate the corresponding relation between them. |
370 Here we put the regular expression and values of the same shape on the same level to illustrate the corresponding relation between them. |
|
371 |
375 The flatten notation $| v |$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$. We omit this straightforward definition. |
372 The flatten notation $| v |$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$. We omit this straightforward definition. |
376 Values are a way of expressing parse trees(the tree structure that tells how a sub-regex matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. |
373 Values are a way of expressing parse trees(the tree structure that tells how a sub-regex matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. |
377 |
374 |
378 To give a concrete example of how value works, consider the string $xy$ and the |
375 To give a concrete example of how value works, consider the string $xy$ and the |
379 regular expression $(x + (y + xy))^*$. We can view this regular |
376 regular expression $(x + (y + xy))^*$. We can view this regular |
412 \end{tikzcd} |
409 \end{tikzcd} |
413 \begin{tikzcd} |
410 \begin{tikzcd} |
414 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n |
411 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n |
415 \end{tikzcd} |
412 \end{tikzcd} |
416 |
413 |
|
414 |
417 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$. |
415 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$. |
418 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ... until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for the empty word epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string: |
416 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ... until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for how the empty word matched the empty regular expression epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string: |
419 |
417 |
420 After this, we inject back the characters one by one, in reverse order as they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$($s_i$ means the string s with the first $i$ characters being chopped off) from the previous parse tree. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted. |
418 After this, we inject back the characters one by one, in reverse order as they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$($s_i$ means the string s with the first $i$ characters being chopped off) from the previous parse tree. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted. |
421 An inductive proof can be routinely established. |
419 An inductive proof can be routinely established. |
422 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. Rather, we shall focus next on the |
420 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. Rather, we shall focus next on the |
423 process of simplification of regular expressions, which is needed in |
421 process of simplification of regular expressions, which is needed in |
498 \end{tabular} |
496 \end{tabular} |
499 \end{center} |
497 \end{center} |
500 where $\Z$ and $\S$ are arbitrary names for the bits in the |
498 where $\Z$ and $\S$ are arbitrary names for the bits in the |
501 bitsequences. |
499 bitsequences. |
502 Here code encodes a value into a bitsequence by converting Left into $\Z$, Right into $\S$, the start point of a non-empty star iteration into $\S$, and the border where a local star terminates into $\Z$. This conversion is apparently lossy, as it throws away the character information, and does not decode the boundary between the two operands of the sequence constructor. Moreover, with only the bitcode we cannot even tell whether the $\S$s and $\Z$s are for $Left/Right$ or $Stars$. The reason for choosing this compact way of storing information is that the relatively small size of bits can be easily moved around during the lexing process. In order to recover the bitcode back into values, we will need the regular expression as the extra information and decode them back into value:\\ |
500 Here code encodes a value into a bitsequence by converting Left into $\Z$, Right into $\S$, the start point of a non-empty star iteration into $\S$, and the border where a local star terminates into $\Z$. This conversion is apparently lossy, as it throws away the character information, and does not decode the boundary between the two operands of the sequence constructor. Moreover, with only the bitcode we cannot even tell whether the $\S$s and $\Z$s are for $Left/Right$ or $Stars$. The reason for choosing this compact way of storing information is that the relatively small size of bits can be easily moved around during the lexing process. In order to recover the bitcode back into values, we will need the regular expression as the extra information and decode them back into value:\\ |
503 \begin{definition}[Bitdecoding of Values]\mbox{} |
501 %\begin{definition}[Bitdecoding of Values]\mbox{} |
504 \begin{center} |
502 \begin{center} |
505 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} |
503 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} |
506 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ |
504 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ |
507 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ |
505 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ |
508 $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
506 $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
525 $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
523 $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
526 & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; |
524 & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; |
527 \textit{else}\;\textit{None}$ |
525 \textit{else}\;\textit{None}$ |
528 \end{tabular} |
526 \end{tabular} |
529 \end{center} |
527 \end{center} |
530 \end{definition} |
528 %\end{definition} |
531 |
529 |
532 To do lexing using annotated regular expressions, we shall first transform the |
530 To do lexing using annotated regular expressions, we shall first transform the |
533 usual (un-annotated) regular expressions into annotated regular |
531 usual (un-annotated) regular expressions into annotated regular |
534 expressions:\\ |
532 expressions:\\ |
535 \begin{definition} |
533 %\begin{definition} |
536 \begin{center} |
534 \begin{center} |
537 \begin{tabular}{lcl} |
535 \begin{tabular}{lcl} |
538 $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\ |
536 $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\ |
539 $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\ |
537 $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\ |
540 $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\ |
538 $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\ |
545 $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\ |
543 $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\ |
546 $(r^*)^\uparrow$ & $\dn$ & |
544 $(r^*)^\uparrow$ & $\dn$ & |
547 $\textit{STAR}\;[]\,r^\uparrow$\\ |
545 $\textit{STAR}\;[]\,r^\uparrow$\\ |
548 \end{tabular} |
546 \end{tabular} |
549 \end{center} |
547 \end{center} |
550 \end{definition} |
548 %\end{definition} |
551 Then we do successive derivative operations on the annotated regular expression. This derivative operation is the same as what we previously have for the simple regular expressions, except that we take special care of the bits to store the parse tree information:\\ |
549 Then we do successive derivative operations on the annotated regular expression. This derivative operation is the same as what we previously have for the simple regular expressions, except that we take special care of the bits to store the parse tree information:\\ |
552 \begin{definition}{bder} |
550 %\begin{definition}{bder} |
553 \begin{center} |
551 \begin{center} |
554 \begin{tabular}{@{}lcl@{}} |
552 \begin{tabular}{@{}lcl@{}} |
555 $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
553 $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
556 $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
554 $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
557 $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ & |
555 $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ & |
567 $(\textit{STAR}\,bs\,a)\backslash c$ & $\dn$ & |
565 $(\textit{STAR}\,bs\,a)\backslash c$ & $\dn$ & |
568 $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\backslash c))\, |
566 $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\backslash c))\, |
569 (\textit{STAR}\,[]\,r)$ |
567 (\textit{STAR}\,[]\,r)$ |
570 \end{tabular} |
568 \end{tabular} |
571 \end{center} |
569 \end{center} |
572 \end{definition} |
570 %\end{definition} |
573 This way, we do not have to use an injection function and a second phase, but instead only need to collect the bits while running $mkeps$: |
571 This way, we do not have to use an injection function and a second phase, but instead only need to collect the bits while running $mkeps$: |
574 \begin{definition}[\textit{bmkeps}]\mbox{} |
572 %\begin{definition}[\textit{bmkeps}]\mbox{} |
575 \begin{center} |
573 \begin{center} |
576 \begin{tabular}{lcl} |
574 \begin{tabular}{lcl} |
577 $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\ |
575 $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\ |
578 $\textit{bmkeps}\,(\textit{ALT}\,bs\,a_1\,a_2)$ & $\dn$ & |
576 $\textit{bmkeps}\,(\textit{ALT}\,bs\,a_1\,a_2)$ & $\dn$ & |
579 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
577 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
583 $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ |
581 $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ |
584 $\textit{bmkeps}\,(\textit{STAR}\,bs\,a)$ & $\dn$ & |
582 $\textit{bmkeps}\,(\textit{STAR}\,bs\,a)$ & $\dn$ & |
585 $bs \,@\, [\S]$ |
583 $bs \,@\, [\S]$ |
586 \end{tabular} |
584 \end{tabular} |
587 \end{center} |
585 \end{center} |
588 \end{definition} |
586 %\end{definition} |
589 and then decode the bits using the regular expression. The whole process looks like this:\\ |
587 and then decode the bits using the regular expression. After putting these pieces together, the whole process looks like this:\\ |
590 r |
588 \begin{center} |
591 \\ |
589 \begin{tabular}{lcl} |
|
590 $\textit{blexer}\;r\,s$ & $\dn$ & |
|
591 $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\ |
|
592 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
593 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
594 & & $\;\;\textit{else}\;\textit{None}$ |
|
595 \end{tabular} |
|
596 \end{center} |
592 |
597 |
593 The main point of the bitsequences and annotated regular expressions |
598 The main point of the bitsequences and annotated regular expressions |
594 is that we can apply rather aggressive (in terms of size) |
599 is that we can apply rather aggressive (in terms of size) |
595 simplification rules in order to keep derivatives small. |
600 simplification rules in order to keep derivatives small. |
596 |
601 |