%and notations we
%use for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant, and discuss
%why more aggressive simplifications are needed.


In this chapter, we define the basic notions
for regular languages and regular expressions.
We also give the definition of what $\POSIX$ lexing means.

\section{Basic Concepts}
Usually in formal language theory one assumes an alphabet, that is,
a set of characters.
Here we only use the datatype of characters from Isabelle,
which roughly corresponds to ASCII characters.
Then, using the usual $[]$ notation for lists,
we can define strings as lists of characters:
\begin{center}
\begin{tabular}{lcl}
$\textit{string}$ & $\dn$ & $[] | c :: cs$\\
& & $(c\; \text{has char type})$
\end{tabular}
\end{center}
Strings can be concatenated to form longer strings,
in the same way as we concatenate two lists;
we denote this operation by $@$ and omit the precise
recursive definition here.
We overload this concatenation operator for two sets of strings:
\begin{center}
\begin{tabular}{lcl}
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
\end{tabular}
\end{center}
We also call the above \emph{language concatenation}.
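To make the overloaded $@$ on sets of strings concrete, here is a small Scala
sketch (our own illustration and naming, not part of the formalisation), where
strings are lists of characters and a finite language is a set of such lists:
\begin{verbatim}
// strings as lists of characters; (finite) languages as sets of strings
type Str  = List[Char]
type Lang = Set[Str]

// language concatenation: A @ B = { sA @ sB | sA in A, sB in B }
def concatLang(a: Lang, b: Lang): Lang =
  for (sa <- a; sb <- b) yield sa ++ sb

// example: {ab, a} @ {c} = {abc, ac}
val ex: Lang = concatLang(Set("ab".toList, "a".toList), Set("c".toList))
\end{verbatim}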
The power of a language is defined recursively, using the
concatenation operator $@$:
\begin{center}
\begin{tabular}{lcl}
$A^0 $ & $\dn$ & $\{ [] \}$\\
$A^{n+1}$ & $\dn$ & $A^n @ A$
\end{tabular}
\end{center}
The union of all the natural number powers of a language
is defined as the Kleene star of that language:
\begin{center}
\begin{tabular}{lcl}
$A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
\end{tabular}
\end{center}
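As a small worked example (our own), take $A = \{a,\, bb\}$; then
\begin{center}
$A^0 = \{[]\}, \qquad A^2 = \{aa,\; abb,\; bba,\; bbbb\}, \qquad
 abba \,=\, a \,@\, bb \,@\, a \;\in\; A^3 \subseteq A*.$
\end{center}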

\noindent
However, to obtain a convenient induction principle
in Isabelle/HOL,
we instead define the Kleene star
as an inductive set:

\begin{center}
\begin{mathpar}
\inferrule{\mbox{}}{[] \in A*}

\inferrule{s_1 \in A \\ s_2 \in A*}{s_1 @ s_2 \in A*}
\end{mathpar}
\end{center}
We also define an operation of ``chopping off'' a character from
a language, which we call $\Der$, meaning ``Derivative of a language'':
\begin{center}
\begin{tabular}{lcl}
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
\end{tabular}
\end{center}
\noindent
This can be generalised to ``chopping off'' a string from all strings within set $A$,
with the help of the concatenation operator:
\begin{center}
\begin{tabular}{lcl}
$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
\end{tabular}
\end{center}

\noindent
which is essentially the left quotient $A \backslash L'$ of $A$ against
the singleton language $L' = \{w\}$
in formal language theory.
For this dissertation the $\textit{Ders}$ definition with
a single string suffices.
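For finite languages, $\textit{Der}$ and $\textit{Ders}$ can be sketched directly
in Scala (again our own illustration, reusing the hypothetical \texttt{Lang} and
\texttt{Str} types from the earlier sketch):
\begin{verbatim}
// Der c A = { s | c::s in A }      Ders w A = { s | w @ s in A }
def Der(c: Char, lang: Lang): Lang =
  lang.collect { case d :: s if d == c => s }

def Ders(w: Str, lang: Lang): Lang =
  lang.collect { case s if s.startsWith(w) => s.drop(w.length) }
\end{verbatim}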
sub-languages.
\begin{lemma}
\[
\Der \; c \; (A @ B) =
\begin{cases}
((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B), & \text{if} \; [] \in A \\
(\Der \; c \; A) \, @ \, B, & \text{otherwise}
\end{cases}
\]
\end{lemma}
\noindent
This lemma states that if $A$ contains the empty string, $\Der$ can ``pierce'' through it
and get to $B$.
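To see both cases of the lemma at work, consider the small example (our own)
$A = \{[],\, a\}$ and $B = \{b\}$, so that $A @ B = \{b,\, ab\}$:
\begin{center}
$\Der \; a \; (A @ B) \;=\; \{b\}
 \;=\; ((\Der \; a \; A)\,@\,B) \,\cup\, (\Der\; a\; B)
 \;=\; (\{[]\}\,@\,\{b\}) \,\cup\, \{\}$
\end{center}
The $\Der \; c\; B$ part matters when, for instance, $A = \{[]\}$ and $c = b$:
then $\Der\;b\;(A@B) = \{[]\}$ comes entirely from $B$.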
The derivative of the language $A*$ can be described using the language derivative
of $A$:
\begin{lemma}
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c\; A) @ (A*)$\\
\end{lemma}
\begin{proof}
\begin{itemize}
\item{$\subseteq$}
\noindent
The set
\[ \{s \mid c :: s \in A*\} \]
is enclosed in the set
\[ \{s_1 @ s_2 \mid s_1,\, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
because whenever a string in $A*$ starts with a character $c$,
then that character together with some sub-string
immediately after it forms the first iteration,
and the rest of the string is
still in $A*$.
\item{$\supseteq$}
Note that
\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \]
and
\[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
where the $\textit{RHS}$ of the above equation can be rewritten
as \[ (\Der \; c\; A) @ A* \cup A', \] where by the previous lemma $A'$ is either
the empty set or $\Der \; c \; (A*)$ itself; in both cases the inclusion follows.
\end{itemize}
\end{proof}


\noindent
Before we define the counterparts of $\textit{Der}$ and $\textit{Ders}$
on regular expressions, we first give the definition of regular expressions.

\subsection{Regular Expressions and Their Meaning}
The basic regular expressions are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
\mid c
\mid r_1 \cdot r_2
\mid r_1 + r_2
\mid r^*
\]
\noindent
We call them basic because we might introduce
more constructs later, such as negation
and bounded repetitions.
We define the regular expression that matches nothing as $\ZERO$;
note that some authors use $\phi$ for it.
Similarly, the regular expression denoting the
singleton set containing only $[]$ is sometimes
denoted by $\epsilon$, but we use $\ONE$ here.
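In an implementation this grammar corresponds directly to an algebraic datatype;
below is a Scala sketch (the constructor names \texttt{ZERO}, \texttt{ONE},
\texttt{CHAR}, \texttt{SEQ}, \texttt{ALT} and \texttt{STAR} are our own choice):
\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp                     // matches nothing
case object ONE extends Rexp                      // matches only []
case class CHAR(c: Char) extends Rexp             // matches the character c
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp   // r1 . r2
case class ALT(r1: Rexp, r2: Rexp) extends Rexp   // r1 + r2
case class STAR(r: Rexp) extends Rexp             // r*
\end{verbatim}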

The language, or set of strings, denoted
by a regular expression is defined as
%TODO: FILL in the other defs
\begin{center}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct, that is, given
a string $s = c_0...c_{n-1}$ and a regular expression $r_0$, it generates
YES if and only if $s \in L(r_0)$.

Building derivatives and then testing whether the empty string is in the
language of the resulting regular expression: so far, so good. But what if we want to
do lexing instead of just getting a YES/NO answer?
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the definition of the original derivative) solution for this.
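As a concrete sketch of this matcher in Scala (our own illustration, using the
hypothetical \texttt{Rexp} datatype from the earlier sketch and assuming
$\backslash$ is the standard Brzozowski derivative):
\begin{verbatim}
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case STAR(_)     => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}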

\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
Here we present the hybrid phases of the regular expression lexing
algorithm by Sulzmann and Lu, which uses the function $\inj$.
They first defined a datatype for storing the
lexing information, called a \emph{value} or
sometimes also a \emph{lexical value}. These values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$ & $\ZERO$\\
& $\mid$ & $\ONE$ \\
& $\mid$ & $c$ \\
& $\mid$ & $r_1 \cdot r_2$\\
& $\mid$ & $r_1 + r_2$ \\
\\
& $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$ & \\
& & $\Empty$ \\
& $\mid$ & $\Char(c)$ \\
& $\mid$ & $\Seq\,v_1\, v_2$\\
& $\mid$ & $\Left(v)$ \\
& $\mid$ & $\Right(v)$ \\
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}

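The values can likewise be represented as a Scala datatype; here is a sketch with
our own constructor names (\texttt{Chr} and \texttt{Sequ} avoid clashes with the
regular expression constructors):
\begin{verbatim}
abstract class Val
case object Empty extends Val                   // for ONE
case class Chr(c: Char) extends Val             // for CHAR(c)
case class Sequ(v1: Val, v2: Val) extends Val   // for a sequence
case class Left(v: Val) extends Val             // left alternative
case class Right(v: Val) extends Val            // right alternative
case class Stars(vs: List[Val]) extends Val     // star iterations
\end{verbatim}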
\noindent
There is a formal binary relation $\vdash v : r$ that tells us whether the structure
of a value agrees with the structure of a regular expression; a few of its rules are
shown below:
\begin{mathpar}
\inferrule{}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
\inferrule{}{\vdash \Empty : \ONE} \hspace{2em}
\inferrule{\vdash v_1 : r_1 \\ \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
\end{mathpar}

Building on top of Sulzmann and Lu's attempt to formalise the
notion of POSIX lexing rules \parencite{Sulzmann2014},
Ausaf and Urban~\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
The formal definition of a $\POSIX$ value $v$ for a regular expression
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
in the following set of rules:
\ChristianComment{Will complete later}
\newcommand*{\inference}[3][t]{%
\begingroup
\def\and{\\}%
\begin{tabular}[#1]{@{\enspace}c@{\enspace}}
#2 \\
\hline
#3
\end{tabular}%
\endgroup
}
\begin{center}
\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$ }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
\end{center}

\noindent
The above $\POSIX$ rules can be explained intuitively as follows:
\begin{itemize}
\item
match the leftmost regular expression when multiple options for matching
are available;
\item
always match a subpart as much as possible before proceeding
to the next token.
\end{itemize}

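As a small illustration of these two principles (our own example, using the full
set of rules given by Ausaf and Urban~\parencite{AusafDyckhoffUrban2016}), lexing
the string $ab$ with the regular expression $a + (a\cdot b)$ yields
\begin{center}
$(ab,\; a + (a\cdot b)) \rightarrow \Right(\Seq(\Char(a),\, \Char(b)))$
\end{center}
because the left alternative $a$ only matches a proper prefix of the string,
whereas the right alternative matches all of $ab$.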
The reason why we are interested in $\POSIX$ values is that they can
be practically used in the lexing phase of a compiler front end.
For instance, when lexing a code snippet
$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
as an identifier rather than a keyword.

A good property of $\POSIX$ values is that
for a given regular expression $r$ and string $s$,
the $\POSIX$ value is uniquely determined:
\begin{lemma}
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$
\end{lemma}
Now that we know what a $\POSIX$ value is, the question is how to compute
such a value in a lexing algorithm using derivatives.

\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value
is generated if the regular expression matches the string.
Two functions are involved: $\inj$ and $\mkeps$.
The function $\mkeps$ constructs a value from the last
one of all the successive derivatives:
\begin{ceqn}
\begin{equation}\label{graph:mkeps}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
& & & v_n
\end{tikzcd}
\end{equation}
\end{ceqn}

It tells us how an empty string can be matched by a
regular expression, in a $\POSIX$ way:

\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$ & $\dn$
& \textit{if} $\nullable(r_{1})$\\
& & \textit{then} $\Left(\mkeps(r_{1}))$\\
& & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}

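A Scala sketch of $\mkeps$, mirroring the clauses above and reusing the
hypothetical datatypes from the earlier sketches (it is only ever called on
nullable regular expressions, so the match is deliberately partial):
\begin{verbatim}
def mkeps(r: Rexp): Val = r match {
  case ONE         => Empty
  case ALT(r1, r2) => if (nullable(r1)) Left(mkeps(r1))
                      else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_)     => Stars(Nil)
}
\end{verbatim}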

\noindent
We favour the left to match an empty string if there is a choice.
When a star has to match the empty string,
we give the $\Stars$ constructor an empty list, meaning
no iterations are taken.
The result of a call to $\mkeps$ on a $\nullable$ regular expression $r$ is
a $\POSIX$ value corresponding to $r$:
\begin{lemma}\label{mePosix}
$\nullable(r) \implies ([], r) \rightarrow (\mkeps\; r)$
\end{lemma}

After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$.
To do this, Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject, and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value.
\begin{ceqn}
\begin{equation}\label{graph:inj}
\begin{tikzcd}
r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_1 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
The definition of $\textit{inj}$ is as follows:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

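The corresponding Scala sketch of $\inj$, following the clauses above one by one
(same hypothetical constructor names as before):
\begin{verbatim}
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(_), Empty)                 => Chr(c)
  case (ALT(r1, _), Left(v1))           => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2))          => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2))         => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs)))  => Stars(inj(r1, c, v1) :: vs)
}
\end{verbatim}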

\noindent
This definition is by recursion on the ``shape'' of regular
expressions and values.
Each clause does one thing: it identifies the ``hole'' in a
value into which the character is injected back.
For instance, consider the last clause, which injects back into a value
corresponding to a star.
We know the value must be a sequence value, whose first
component corresponds to the child regular expression of the star
with the first character chopped off, that is, the star iteration
that has just been unfolded. This component is followed by the already
matched star iterations collected before. So we inject the character
back into the first component and form a new value with this latest iteration
added to the front of the previous list of iterations, all under the $\Stars$
constructor.
The POSIX value is maintained throughout the process:
\begin{lemma}\label{injPosix}
$(s, r \backslash c) \rightarrow v \implies (c :: s, r) \rightarrow (\inj\; r \; c\; v)$
\end{lemma}

Putting the functions $\inj$, $\mkeps$ and $\backslash$ together,
and taking into account the possibility of a non-match,
we obtain a lexer with the following recursive definition:
\begin{center}
\begin{tabular}{lcr}
$\lexer \; r \; [] $ & $=$ & $\textit{if}\; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
$\lexer \; r \;(c::s)$ & $=$ & $\textit{case}\; (\lexer \;(r\backslash c)\; s)\; \textit{of} $\\
& & $\quad \None \implies \None$\\
& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
\end{tabular}
\end{center}

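A Scala sketch of this lexer, mirroring the definition above and using
\texttt{Option} for the possibility of a non-match:
\begin{verbatim}
def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs) match {
    case None    => None
    case Some(v) => Some(inj(r, c, v))
  }
}

// e.g. lexer(SEQ(CHAR('a'), STAR(CHAR('b'))), "abb".toList)
//      == Some(Sequ(Chr('a'), Stars(List(Chr('b'), Chr('b')))))
\end{verbatim}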
\noindent
The central property of $\lexer$ is that it gives the correct result
according to the $\POSIX$ rules:
\begin{lemma}
\begin{tabular}{l}
$s \in L(r) \Longleftrightarrow (\exists v. \; \lexer \; r \; s = \Some(v) \land (s, r) \rightarrow v)$\\
$s \notin L(r) \Longleftrightarrow (\lexer \; r\; s = \None)$
\end{tabular}
\end{lemma}

\begin{proof}
By induction on $s$, where $r$ is allowed to be an arbitrary regular expression.
The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case
by lemma \ref{injPosix}.
\end{proof}

For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase, from
left to right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$, \ldots until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r_0$, and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$.
Pictorially, the algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
This is the second phase of the
algorithm, from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$ (defined above), which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$.

We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well: by definition they reflect
the size of the regular expression.

One can introduce simplification on regular expressions and values, but one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regular expressions and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative step and applying these rectifications in
each run of the injection phase.
One can then prove that the POSIX value of how
regular expressions match strings is not affected---although this is much harder
to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.

%Brzozowski, after giving the derivatives and simplification,
%did not explore lexing with simplification, or he may well be
%stuck on an efficient simplification with proof.
%He went on to examine the use of derivatives together with
%automaton, and did not try lexing using products.

We want to get rid of the complex and fragile rectification of values.
Can we not create those intermediate values $v_1,\ldots, v_n$,
and get the lexing information that should already be there while
doing derivatives in one pass, without a second injection phase?
At the same time, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
introducing additional information to the
regular expressions called \emph{bitcodes}.

With the formally-specified rules for what a POSIX matching is,
they proved in Isabelle/HOL that the algorithm gives correct results.
But having a correct result is still not enough;
we also want at least some degree of $\mathbf{efficiency}$.

A pair of a regular expression and a string can have multiple lexical values.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
If we do not allow any empty iterations in its lexical values,
there will be $n - 1$ ``splitting points'' on $s$ where we can choose to
split or not, so that each sub-string
segmented by those chosen splitting points forms a different iteration:
\begin{center}
\begin{tabular}{lcr}
$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$\\
$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$\\
$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$\\
& $\textit{etc}.$ &
\end{tabular}
\end{center}

\noindent
And for each iteration, there are still multiple ways to split
between the two $a^*$s.
It is not surprising that there are exponentially many distinct lexical values
for the regular expression and string pair $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
A lexer that keeps all possible values will naturally
have an exponential runtime on ambiguous regular expressions.
With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all the different values
of a match. This means Sulzmann and Lu's injection-based algorithm
will be exponential by nature.
Somehow one has to decide which lexical value to keep and
output in a lexing algorithm.

For example, the above $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The output of the algorithm we want is such a POSIX matching
encoded as a value.


%kind of redundant material

example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. These more aggressive simplification rules are for
a very tight size bound, possibly as low
as that of the \emph{partial derivatives}\parencite{Antimirov1995}.
