11 In this chapter, we define the basic notions |
11 In this chapter, we define the basic notions |
12 for regular languages and regular expressions. |
12 for regular languages and regular expressions. |
13 This is essentially a description in ``English'' |
13 This is essentially a description in ``English'' |
14 of our formalisation in Isabelle/HOL. |
14 of our formalisation in Isabelle/HOL. |
15 We also give the definition of what $\POSIX$ lexing means, |
15 We also give the definition of what $\POSIX$ lexing means, |
16 followed by a lexing algorithm by Sulzmanna and Lu\parencite{Sulzmann2014} |
16 followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} |
17 that produces the output conforming |
17 that produces the output conforming |
18 to the $\POSIX$ standard. |
18 to the $\POSIX$ standard\footnote{In what follows |
19 \footnote{In what follows |
|
20 we choose to use the Isabelle-style notation |
19 we choose to use the Isabelle-style notation |
21 for function applications, where |
20 for function applications, where |
22 the parameters of a function are not enclosed |
21 the parameters of a function are not enclosed |
23 inside a pair of parentheses (e.g. $f \;x \;y$ |
22 inside a pair of parentheses (e.g. $f \;x \;y$ |
24 instead of $f(x,\;y)$). This is mainly |
23 instead of $f(x,\;y)$). This is mainly |
25 to make the text visually more concise.} |
24 to make the text visually more concise.}. |
26 |
25 |
27 \section{Basic Concepts} |
26 \section{Basic Concepts} |
28 Usually, formal language theory starts with an alphabet |
27 Formal language theory usually starts with an alphabet |
29 denoting a set of characters. |
28 denoting a set of characters. |
30 Here we just use the datatype of characters from Isabelle, |
29 Here we just use the datatype of characters from Isabelle, |
31 which roughly corresponds to the ASCII characters. |
30 which roughly corresponds to the ASCII characters. |
32 In what follows, we shall leave the information about the alphabet |
31 In what follows, we shall leave the information about the alphabet |
33 implicit. |
32 implicit. |
34 Then using the usual bracket notation for lists, |
33 Then using the usual bracket notation for lists, |
35 we can define strings made up of characters: |
34 we can define strings made up of characters as: |
36 \begin{center} |
35 \begin{center} |
37 \begin{tabular}{lcl} |
36 \begin{tabular}{lcl} |
38 $\textit{s}$ & $\dn$ & $[] \; |\; c :: s$ |
37 $\textit{s}$ & $\dn$ & $[] \; |\; c :: s$ |
39 \end{tabular} |
38 \end{tabular} |
40 \end{center} |
39 \end{center} |
41 where $c$ is a variable ranging over characters. |
40 where $c$ is a variable ranging over characters. |
|
41 The $::$ stands for list cons and $[]$ for the empty |
|
42 list. |
|
43 A singleton list is sometimes written as $[c]$ for brevity. |
42 Strings can be concatenated to form longer strings in the same |
44 Strings can be concatenated to form longer strings in the same |
43 way as we concatenate two lists, which we shall write as $s_1 @ s_2$. |
45 way as we concatenate two lists, which we shall write as $s_1 @ s_2$. |
44 We omit the precise |
46 We omit the precise |
45 recursive definition here. |
47 recursive definition here. |
46 We overload this concatenation operator for two sets of strings: |
48 We overload this concatenation operator for two sets of strings: |
96 $\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\ |
98 $\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\ |
97 \end{tabular} |
99 \end{tabular} |
98 \end{center} |
100 \end{center} |
99 \noindent |
101 \noindent |
100 which is essentially the left quotient $A \backslash L$ of $A$ against |
102 which is essentially the left quotient $A \backslash L$ of $A$ against |
101 the singleton language with $L = \{s\}$ |
103 the singleton language with $L = \{s\}$. |
102 in formal language theory. |
104 However, for our purposes here, the $\textit{Ders}$ definition with |
103 However, for the purposes here, the $\textit{Ders}$ definition with |
|
104 a single string is sufficient. |
105 a single string is sufficient. |
105 |
106 |
106 The reason for defining derivatives |
107 The reason for defining derivatives |
107 is that it provides a different approach |
108 is that they provide another approach |
108 to test membership of a string in |
109 to test membership of a string in |
109 a set of strings. |
110 a set of strings. |
110 For example, to test whether the string |
111 For example, to test whether the string |
111 $bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with |
112 $bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with |
112 respect to the string $bar$: |
113 respect to the string $bar$: |
113 \begin{center} |
114 \begin{center} |
114 \begin{tabular}{lclll} |
115 \begin{tabular}{lll} |
115 $S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & |
116 $S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & |
116 $\{ar, rak\}$ & |
117 $\{ar, rak\}$ \\ |
117 $\stackrel{\backslash a}{\rightarrow}$ & |
118 & $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\ |
118 \\ |
119 & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\ |
119 $\{r \}$ & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$ & |
120 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\ |
120 $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\ |
|
121 \end{tabular} |
121 \end{tabular} |
122 \end{center} |
122 \end{center} |
123 \noindent |
123 \noindent |
124 and in the end test whether the set |
124 and in the end test whether the set |
125 has the empty string \footnote{ we use the infix notation $A\backslash c$ |
125 has the empty string.\footnote{We use the infix notation $A\backslash c$ |
126 instead of $\Der \; c \; A$ for brevity, as it is clear we are operating |
126 instead of $\Der \; c \; A$ for brevity, as it is clear we are operating |
127 on languages rather than regular expressions }. |
127 on languages rather than regular expressions.} |
128 In general, if we have a language $S_{start}$, |
128 |
129 then we can test whether $s$ is in $S_{start}$ |
129 In general, if we have a language $S$, |
|
130 then we can test whether $s$ is in $S$ |
130 by testing whether $[] \in S \backslash s$. |
131 by testing whether $[] \in S \backslash s$. |
131 |
|
132 With the sequencing, Kleene star, and $\textit{Der}$ operator on languages, |
132 With the sequencing, Kleene star, and $\textit{Der}$ operator on languages, |
133 we have a few properties of how the language derivative can be defined using |
133 we have a few properties of how the language derivative can be defined using |
134 sub-languages. |
134 sub-languages. |
135 For example, for the sequence operator, we have |
135 For example, for the sequence operator, we have |
136 something similar to the ``chain rule'' of the calculus derivative: |
136 something similar to a ``chain rule'': |
137 \begin{lemma} |
137 \begin{lemma} |
138 \[ |
138 \[ |
139 \Der \; c \; (A @ B) = |
139 \Der \; c \; (A @ B) = |
140 \begin{cases} |
140 \begin{cases} |
141 ((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , & \text{if} \; [] \in A \\ |
141 ((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , & \text{if} \; [] \in A \\ |
211 \begin{center} |
213 \begin{center} |
212 \begin{tabular}{lcl} |
214 \begin{tabular}{lcl} |
213 $L \; \ZERO$ & $\dn$ & $\phi$\\ |
215 $L \; \ZERO$ & $\dn$ & $\phi$\\ |
214 $L \; \ONE$ & $\dn$ & $\{[]\}$\\ |
216 $L \; \ONE$ & $\dn$ & $\{[]\}$\\ |
215 $L \; c$ & $\dn$ & $\{[c]\}$\\ |
217 $L \; c$ & $\dn$ & $\{[c]\}$\\ |
216 $L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\ |
218 $L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\ |
217 $L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\ |
219 $L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\ |
218 $L \; r^*$ & $\dn$ & $ (L\;r)*$ |
220 $L \; (r^*)$ & $\dn$ & $ (L\;r)*$ |
219 \end{tabular} |
221 \end{tabular} |
220 \end{center} |
222 \end{center} |
221 \noindent |
223 \noindent |
222 Now with language derivatives of a language and regular expressions and |
224 %Now with language derivatives of a language and regular expressions and |
223 their language interpretations in place, we are ready to define derivatives on regular expressions. |
225 %their language interpretations in place, we are ready to define derivatives on regular expressions. |
|
226 With $L$ we are ready to introduce Brzozowski derivatives on regular expressions. |
|
227 We do so by first introducing what properties it should satisfy. |
|
228 |
224 \subsection{Brzozowski Derivatives and a Regular Expression Matcher} |
229 \subsection{Brzozowski Derivatives and a Regular Expression Matcher} |
225 %Recall, the language derivative acts on a set of strings |
230 %Recall, the language derivative acts on a set of strings |
226 %and essentially chops off a particular character from |
231 %and essentially chops off a particular character from |
227 %all strings in that set, Brzozowski defined a derivative operation on regular expressions |
232 %all strings in that set, Brzozowski defined a derivative operation on regular expressions |
228 %so that after derivative $L(r\backslash c)$ |
233 %so that after derivative $L(r\backslash c)$ |
229 %will look as if it was obtained by doing a language derivative on $L(r)$: |
234 %will look as if it was obtained by doing a language derivative on $L(r)$: |
230 Recall that the language derivative acts on a |
235 %Recall that the language derivative acts on a |
231 language (set of strings). |
236 %language (set of strings). |
232 One can decide whether a string $s$ belongs |
237 %One can decide whether a string $s$ belongs |
233 to a language $S$ by taking derivative with respect to |
238 %to a language $S$ by taking derivative with respect to |
234 that string and then checking whether the empty |
239 %that string and then checking whether the empty |
235 string is in the derivative: |
240 %string is in the derivative: |
236 \begin{center} |
241 %\begin{center} |
237 \parskip \baselineskip |
242 %\parskip \baselineskip |
238 \def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}} |
243 %\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}} |
239 \def\rlwd{.5pt} |
244 %\def\rlwd{.5pt} |
240 \newcommand\notate[3]{% |
245 %\newcommand\notate[3]{% |
241 \unskip\def\useanchorwidth{T}% |
246 % \unskip\def\useanchorwidth{T}% |
242 \setbox0=\hbox{#1}% |
247 % \setbox0=\hbox{#1}% |
243 \def\stackalignment{c}\stackunder[-6pt]{% |
248 % \def\stackalignment{c}\stackunder[-6pt]{% |
244 \def\stackalignment{c}\stackunder[-1.5pt]{% |
249 % \def\stackalignment{c}\stackunder[-1.5pt]{% |
245 \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{% |
250 % \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{% |
246 \rule{\rlwd}{#2\baselineskip}}}{% |
251 % \rule{\rlwd}{#2\baselineskip}}}{% |
247 \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces% |
252 % \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces% |
248 } |
253 %} |
249 \Longstack{ |
254 %\Longstack{ |
250 \notate{$\{ \ldots ,\;$ |
255 %\notate{$\{ \ldots ,\;$ |
251 \notate{s}{1}{$(c_1 :: s_1)$} |
256 % \notate{s}{1}{$(c_1 :: s_1)$} |
252 $, \; \ldots \}$ |
257 % $, \; \ldots \}$ |
253 }{1}{$S_{start}$} |
258 %}{1}{$S_{start}$} |
254 } |
259 %} |
255 \Longstack{ |
260 %\Longstack{ |
256 $\stackrel{\backslash c_1}{\longrightarrow}$ |
261 % $\stackrel{\backslash c_1}{\longrightarrow}$ |
257 } |
262 %} |
258 \Longstack{ |
263 %\Longstack{ |
259 $\{ \ldots,\;$ \notate{$s_1$}{1}{$(c_2::s_2)$} |
264 % $\{ \ldots,\;$ \notate{$s_1$}{1}{$(c_2::s_2)$} |
260 $,\; \ldots \}$ |
265 % $,\; \ldots \}$ |
261 } |
266 %} |
262 \Longstack{ |
267 %\Longstack{ |
263 $\stackrel{\backslash c_2}{\longrightarrow}$ |
268 % $\stackrel{\backslash c_2}{\longrightarrow}$ |
264 } |
269 %} |
265 \Longstack{ |
270 %\Longstack{ |
266 $\{ \ldots,\; s_2 |
271 % $\{ \ldots,\; s_2 |
267 ,\; \ldots \}$ |
272 % ,\; \ldots \}$ |
268 } |
273 %} |
269 \Longstack{ |
274 %\Longstack{ |
270 $ \xdashrightarrow{\backslash c_3\ldots\ldots} $ |
275 % $ \xdashrightarrow{\backslash c_3\ldots\ldots} $ |
271 } |
276 %} |
272 \Longstack{ |
277 %\Longstack{ |
273 \notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} = |
278 % \notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} = |
274 S_{start}\backslash s$} |
279 % S_{start}\backslash s$} |
275 } |
280 %} |
276 \end{center} |
281 %\end{center} |
277 \begin{center} |
282 %\begin{center} |
278 $s \in S_{start} \iff [] \in S_{end}$ |
283 % $s \in S_{start} \iff [] \in S_{end}$ |
279 \end{center} |
284 %\end{center} |
280 \noindent |
285 %\noindent |
281 Brzozowski noticed that this operation |
286 Brzozowski noticed that $\Der$ |
282 can be ``mirrored'' on regular expressions which |
287 can be ``mirrored'' on regular expressions which |
283 he calls the derivative of a regular expression $r$ |
288 he calls the derivative of a regular expression $r$ |
284 with respect to a character $c$, written |
289 with respect to a character $c$, written |
285 $r \backslash c$. This infix operator |
290 $r \backslash c$. This infix operator |
286 takes an original regular expression $r$ as input |
291 takes an original regular expression $r$ as input |
558 If we implement the above algorithm naively, however, |
561 If we implement the above algorithm naively, however, |
559 the algorithm can be excruciatingly slow, as shown in |
562 the algorithm can be excruciatingly slow, as shown in |
560 \ref{NaiveMatcher}. |
563 \ref{NaiveMatcher}. |
561 Note that both axes are in logarithmic scale. |
564 Note that both axes are in logarithmic scale. |
562 Around two dozens characters |
565 Around two dozens characters |
563 would already explode the matcher on the regular expression |
566 would already ``explode'' the matcher with the regular expression |
564 $(a^*)^*b$. |
567 $(a^*)^*b$. |
565 Too improve this situation, we need to introduce simplification |
568 To improve this situation, we need to introduce simplification |
566 rewrite rules for the intermediate results, |
569 rules for the intermediate results, |
567 such as $r + r \rightarrow r$, |
570 such as $r + r \rightarrow r$, |
568 and make sure those rules do not change the |
571 and make sure those rules do not change the |
569 language of the regular expression. |
572 language of the regular expression. |
570 One simpled-minded simplification function |
573 One simpled-minded simplification function |
571 that achieves these requirements is given below: |
574 that achieves these requirements |
|
575 is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}): |
572 \begin{center} |
576 \begin{center} |
573 \begin{tabular}{lcl} |
577 \begin{tabular}{lcl} |
574 $\simp \; r_1 \cdot r_2 $ & $ \dn$ & |
578 $\simp \; r_1 \cdot r_2 $ & $ \dn$ & |
575 $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\ |
579 $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\ |
576 & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\ |
580 & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\ |
618 The running time of $\textit{ders}\_\textit{simp}$ |
622 The running time of $\textit{ders}\_\textit{simp}$ |
619 on the same example of Figure \ref{NaiveMatcher} |
623 on the same example of Figure \ref{NaiveMatcher} |
620 is now ``tame'' in terms of the length of inputs, |
624 is now ``tame'' in terms of the length of inputs, |
621 as shown in Figure \ref{BetterMatcher}. |
625 as shown in Figure \ref{BetterMatcher}. |
622 |
626 |
623 So far the story is use Brzozowski derivatives, |
627 So far the story is use Brzozowski derivatives and |
624 simplifying where possible and at the end test |
628 simplify as much as possible and at the end test |
625 whether the empty string is recognised by the final derivative. |
629 whether the empty string is recognised |
|
630 by the final derivative. |
626 But what if we want to |
631 But what if we want to |
627 do lexing instead of just getting a true/false answer? |
632 do lexing instead of just getting a true/false answer? |
628 Sulzmanna and Lu \cite{Sulzmann2014} first came up with a nice and |
633 Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and |
629 elegant (arguably as beautiful as the definition of the original derivative) solution for this. |
634 elegant (arguably as beautiful as the definition of the |
|
635 Brzozowski derivative) solution for this. |
630 |
636 |
631 \section{Values and the Lexing Algorithm by Sulzmann and Lu} |
637 \section{Values and the Lexing Algorithm by Sulzmann and Lu} |
632 In this section, we present a two-phase regular expression lexing |
638 In this section, we present a two-phase regular expression lexing |
633 algorithm. |
639 algorithm. |
634 The first phase takes successive derivatives with |
640 The first phase takes successive derivatives with |
705 is to make sure that for a given pair of regular |
712 is to make sure that for a given pair of regular |
706 expression $r$ and string $s$, the number of values |
713 expression $r$ and string $s$, the number of values |
707 satisfying $|v| = s$ and $\vdash v:r$ is finite. |
714 satisfying $|v| = s$ and $\vdash v:r$ is finite. |
708 This additional condition was |
715 This additional condition was |
709 imposed by Ausaf and Urban to make their proofs easier. |
716 imposed by Ausaf and Urban to make their proofs easier. |
710 Given the same string and regular expression, there can be |
717 Given a string and a regular expression, there can be |
711 multiple values for it. For example, both |
718 multiple values for it. For example, both |
712 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and |
719 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and |
713 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold |
720 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold |
714 and the values both flatten to $abc$. |
721 and the values both flatten to $abc$. |
715 Lexers therefore have to disambiguate and choose only |
722 Lexers therefore have to disambiguate and choose only |
716 one of the values to output. $\POSIX$ is one of the |
723 one of the values be generated. $\POSIX$ is one of the |
717 disambiguation strategies that is widely adopted. |
724 disambiguation strategies that is widely adopted. |
718 |
725 |
719 Ausaf and Urban\parencite{AusafDyckhoffUrban2016} |
726 Ausaf et al.\parencite{AusafDyckhoffUrban2016} |
720 formalised the property |
727 formalised the property |
721 as a ternary relation. |
728 as a ternary relation. |
722 The $\POSIX$ value $v$ for a regular expression |
729 The $\POSIX$ value $v$ for a regular expression |
723 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified |
730 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified |
724 in the following set of rules\footnote{The names of the rules are used |
731 in the following rules\footnote{The names of the rules are used |
725 as they were originally given in \cite{AusafDyckhoffUrban2016} }: |
732 as they were originally given in \cite{AusafDyckhoffUrban2016}.}: |
726 \begin{figure}[H] |
733 \begin{figure}[p] |
727 \begin{mathpar} |
734 \begin{mathpar} |
728 \inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty} |
735 \inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty} |
729 |
736 |
730 \inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c} |
737 \inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c} |
731 |
738 |
743 \inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\ |
750 \inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\ |
744 |v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land |
751 |v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land |
745 s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \; |
752 s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \; |
746 (v::vs)} |
753 (v::vs)} |
747 \end{mathpar} |
754 \end{mathpar} |
748 \caption{The inductive POSIX Lexing Rules defined by Ausaf, Dyckhoff and Urban \cite{AusafDyckhoffUrban2016}. |
755 \caption{The inductive POSIX rules given by Ausaf et al. |
749 The ternary relation, written $(s, r) \rightarrow v$, formalises the POSIX constraints on the |
756 \cite{AusafDyckhoffUrban2016}. |
|
757 This ternary relation, written $(s, r) \rightarrow v$, |
|
758 formalises the POSIX constraints on the |
750 value $v$ given a string $s$ and |
759 value $v$ given a string $s$ and |
751 regular expression $r$. |
760 regular expression $r$. |
752 For example, this specification says that all matches for an alternative |
|
753 must always prefer a left value to a right one. |
|
754 } |
761 } |
755 \end{figure} |
762 \end{figure}\afterpage{\clearpage} |
756 \noindent |
763 \noindent |
757 |
764 |
758 \begin{figure} |
765 %\begin{figure} |
759 \begin{tikzpicture}[] |
766 %\begin{tikzpicture}[] |
760 \node [minimum width = 6cm, rectangle split, rectangle split horizontal, |
767 % \node [minimum width = 6cm, rectangle split, rectangle split horizontal, |
761 rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}] |
768 % rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}] |
762 (node1) |
769 % (node1) |
763 {$r_{token1}$ |
770 % {$r_{token1}$ |
764 \nodepart{two} $\;\;\; \quad r_{token2}\;\;\;\quad$ }; |
771 % \nodepart{two} $\;\;\; \quad r_{token2}\;\;\;\quad$ }; |
765 %\node [left = 6.0cm of node1] (start1) {hi}; |
772 % %\node [left = 6.0cm of node1] (start1) {hi}; |
766 \node [left = 0.2cm of node1] (middle) {$v.s.$}; |
773 % \node [left = 0.2cm of node1] (middle) {$v.s.$}; |
767 \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal, |
774 % \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal, |
768 rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}] |
775 % rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}] |
769 (node2) |
776 % (node2) |
770 {$\quad\;\;\;r_{token1}\quad\;\;\;$ |
777 % {$\quad\;\;\;r_{token1}\quad\;\;\;$ |
771 \nodepart{two} $r_{token2}$ }; |
778 % \nodepart{two} $r_{token2}$ }; |
772 \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX}; |
779 % \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX}; |
773 \node [above = 1.5cm of middle, minimum width = 6cm, |
780 % \node [above = 1.5cm of middle, minimum width = 6cm, |
774 rectangle, style={draw, rounded corners, inner sep=10pt}] |
781 % rectangle, style={draw, rounded corners, inner sep=10pt}] |
775 (topNode) {$s$}; |
782 % (topNode) {$s$}; |
776 \path[->,draw] |
783 % \path[->,draw] |
777 (topNode) edge node {split $A$} (node2) |
784 % (topNode) edge node {split $A$} (node2) |
778 (topNode) edge node {split $B$} (node1) |
785 % (topNode) edge node {split $B$} (node1) |
779 ; |
786 % ; |
780 |
787 % |
781 |
788 % |
782 \end{tikzpicture} |
789 %\end{tikzpicture} |
783 \caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch} |
790 %\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch} |
784 \end{figure} |
791 %\end{figure} |
785 The above $\POSIX$ rules follows the intuition described below: |
792 The above $\POSIX$ rules follows the intuition described below: |
786 \begin{itemize} |
793 \begin{itemize} |
787 \item (Left Priority)\\ |
794 \item (Left Priority)\\ |
788 Match the leftmost regular expression when multiple options of matching |
795 Match the leftmost regular expression when multiple options of matching |
789 are available. |
796 are available. See P+L and P+R where in P+R $s$ cannot |
|
797 be in the language of $L \; r_1$. |
790 \item (Maximum munch)\\ |
798 \item (Maximum munch)\\ |
791 Always match a subpart as much as possible before proceeding |
799 Always match a subpart as much as possible before proceeding |
792 to the next token. |
800 to the next part of the string. |
793 For example, when the string $s$ matches |
801 For example, when the string $s$ matches |
794 $r_{token1}\cdot r_{token2}$, and we have two ways $s$ can be split: |
802 $r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split: |
795 Then the split that matches a longer string for the first token |
803 Then the split that matches a longer string for the first part |
796 $r_{token1}$ is preferred by this maximum munch rule (See |
804 $r_{part1}$ is preferred by this maximum munch rule. |
797 \ref{munch} for an illustration). |
805 This is caused by the side-condition |
798 \noindent |
806 \begin{center} |
799 |
807 $\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land |
|
808 s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$ |
|
809 \end{center} |
|
810 in PS. |
|
811 %(See |
|
812 %\ref{munch} for an illustration). |
800 \end{itemize} |
813 \end{itemize} |
801 \noindent |
814 \noindent |
802 These disambiguation strategies can be |
815 These disambiguation strategies can be |
803 quite practical. |
816 quite practical. |
804 For instance, when lexing a code snippet |
817 For instance, when lexing a code snippet |
805 \[ |
818 \[ |
806 \textit{iffoo} = 3 |
819 \textit{iffoo} = 3 |
807 \] |
820 \] |
808 using the regular expression (with |
821 using a regular expression |
809 keyword and identifier having their |
822 for keywords and |
810 usualy definitions on any formal |
823 identifiers: |
811 language textbook, for instance |
824 %(for example, keyword is a nonempty string starting with letters |
812 keyword is a nonempty string starting with letters |
825 %followed by alphanumeric characters or underscores): |
813 followed by alphanumeric characters or underscores): |
826 \[ |
814 \[ |
827 r_{keyword} + r_{identifier}. |
815 \textit{keyword} + \textit{identifier}, |
828 \] |
816 \] |
829 If we want $\textit{iffoo}$ to be recognized |
817 we want $\textit{iffoo}$ to be recognized |
830 as an identifier |
818 as an identifier rather than a keyword (if) |
831 where identifiers are defined as usual (letters |
|
832 followed by letters, numbers or underscores), |
|
833 then a match with a keyword (if) |
819 followed by |
834 followed by |
820 an identifier (foo). |
835 an identifier (foo) would be incorrect. |
821 POSIX lexing achieves this. |
836 POSIX lexing would generate what we want. |
822 |
837 |
823 We know that a $\POSIX$ value is also a normal underlying |
838 \noindent |
824 value: |
839 We know that a POSIX |
|
840 value for regular expression $r$ is inhabited by $r$. |
825 \begin{lemma} |
841 \begin{lemma} |
826 $(r, s) \rightarrow v \implies \vdash v: r$ |
842 $(r, s) \rightarrow v \implies \vdash v: r$ |
827 \end{lemma} |
843 \end{lemma} |
828 \noindent |
844 \noindent |
829 The good property about a $\POSIX$ value is that |
845 The main property about a $\POSIX$ value is that |
830 given the same regular expression $r$ and string $s$, |
846 given the same regular expression $r$ and string $s$, |
831 one can always uniquely determine the $\POSIX$ value for it: |
847 one can always uniquely determine the $\POSIX$ value for it: |
832 \begin{lemma} |
848 \begin{lemma} |
833 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$ |
849 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$ |
834 \end{lemma} |
850 \end{lemma} |
875 then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, |
891 then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, |
876 which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$ |
892 which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$ |
877 is the same as $\Seq(v_1, v_2)$. |
893 is the same as $\Seq(v_1, v_2)$. |
878 \end{proof} |
894 \end{proof} |
879 \noindent |
895 \noindent |
880 Now we know what a $\POSIX$ value is and why it is unique; |
896 We have now defined what a POSIX value is and shown that it is unique, |
881 the problem is generating |
897 the problem is to generate |
882 such a value in a lexing algorithm using derivatives. |
898 such a value in a lexing algorithm using derivatives. |
883 |
899 |
884 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm} |
900 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm} |
885 |
901 |
886 Sulzmann and Lu extended Brzozowski's |
902 Sulzmann and Lu extended Brzozowski's |
887 derivative-based matching |
903 derivative-based matching |
888 to a lexing algorithm by a second pass |
904 to a lexing algorithm by a second phase |
889 after the initial phase of successive derivatives. |
905 after the initial phase of successive derivatives. |
890 This second phase generates a POSIX value |
906 This second phase generates a POSIX value |
891 if the regular expression matches the string. |
907 if the regular expression matches the string. |
892 Two functions are involved: $\inj$ and $\mkeps$. |
908 Two functions are involved: $\inj$ and $\mkeps$. |
893 The first one used is $\mkeps$, which constructs a POSIX value from the last |
909 The function $\mkeps$ constructs a POSIX value from the last |
894 derivative $r_n$: |
910 derivative $r_n$: |
895 \begin{ceqn} |
911 \begin{ceqn} |
896 \begin{equation}\label{graph:mkeps} |
912 \begin{equation}\label{graph:mkeps} |
897 \begin{tikzcd} |
913 \begin{tikzcd} |
898 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\ |
914 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\ |
924 \end{tabular} |
938 \end{tabular} |
925 \end{center} |
939 \end{center} |
926 |
940 |
927 |
941 |
928 \noindent |
942 \noindent |
929 We favour the left child $r_1$ of $r_1 + r_2$ |
943 The function prefers the left child $r_1$ of $r_1 + r_2$ |
930 to match an empty string if there is a choice. |
944 to match an empty string if there is a choice. |
931 When there is a star for us to match the empty string, |
945 When there is a star to match the empty string, |
932 we give the $\Stars$ constructor an empty list, meaning |
946 we give the $\Stars$ constructor an empty list, meaning |
933 no iteration is taken. |
947 no iteration is taken. |
934 The result of a call to $\mkeps$ on a $\nullable$ $r$ would |
948 The result of $\mkeps$ on a $\nullable$ $r$ |
935 be a $\POSIX$ value corresponding to $r$: |
949 is a POSIX value for $r$ and the empty string: |
936 \begin{lemma}\label{mePosix} |
950 \begin{lemma}\label{mePosix} |
937 $\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$ |
951 $\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$ |
938 \end{lemma} |
952 \end{lemma} |
939 \begin{proof} |
953 \begin{proof} |
940 By induction on the shape of $r$. |
954 By induction on $r$. |
941 \end{proof} |
955 \end{proof} |
942 \noindent |
956 \noindent |
943 After the $\mkeps$-call, we inject back the characters one by one |
957 After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one |
944 in reverse order as they were chopped off in the derivative phase. |
958 in reverse order as they were chopped off in the derivative phase. |
945 The fucntion for this is called $\inj$. $\inj$ and $\backslash$ |
959 The fucntion for this is called $\inj$. This function |
946 are not exactly reverse operations of one another, as $\inj$ |
960 operates on values, unlike $\backslash$ which operates on regular expressions. |
947 operates on values instead of regular |
|
948 expressions. |
|
949 In the diagram below, $v_i$ stands for the (POSIX) value |
961 In the diagram below, $v_i$ stands for the (POSIX) value |
950 for how the regular expression |
962 for how the regular expression |
951 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters |
963 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters |
952 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
964 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
953 After injecting back $n$ characters, we get the lexical value for how $r_0$ |
965 After injecting back $n$ characters, we get the lexical value for how $r_0$ |