ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 622 4b1149fb5aec
parent 608 37b6fd310a16
child 623 c0c1ebe09c7d
equal deleted inserted replaced
621:17c7611fb0a9 622:4b1149fb5aec
    11 In this chapter, we define the basic notions 
    11 In this chapter, we define the basic notions 
    12 for regular languages and regular expressions.
    12 for regular languages and regular expressions.
    13 This is essentially a description in ``English''
    13 This is essentially a description in ``English''
    14 of our formalisation in Isabelle/HOL.
    14 of our formalisation in Isabelle/HOL.
    15 We also give the definition of what $\POSIX$ lexing means, 
    15 We also give the definition of what $\POSIX$ lexing means, 
    16 followed by a lexing algorithm by Sulzmanna and Lu\parencite{Sulzmann2014} 
    16 followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} 
    17 that produces the output conforming
    17 that produces the output conforming
    18 to the $\POSIX$ standard.
    18 to the $\POSIX$ standard\footnote{In what follows 
    19 \footnote{In what follows 
       
    20 we choose to use the Isabelle-style notation
    19 we choose to use the Isabelle-style notation
    21 for function applications, where
    20 for function applications, where
    22 the parameters of a function are not enclosed
    21 the parameters of a function are not enclosed
    23 inside a pair of parentheses (e.g. $f \;x \;y$
    22 inside a pair of parentheses (e.g. $f \;x \;y$
    24 instead of $f(x,\;y)$). This is mainly
    23 instead of $f(x,\;y)$). This is mainly
    25 to make the text visually more concise.}
    24 to make the text visually more concise.}.
    26 
    25 
    27 \section{Basic Concepts}
    26 \section{Basic Concepts}
    28 Usually, formal language theory starts with an alphabet 
    27 Formal language theory usually starts with an alphabet 
    29 denoting a set of characters.
    28 denoting a set of characters.
    30 Here we just use the datatype of characters from Isabelle,
    29 Here we just use the datatype of characters from Isabelle,
    31 which roughly corresponds to the ASCII characters.
    30 which roughly corresponds to the ASCII characters.
    32 In what follows, we shall leave the information about the alphabet
    31 In what follows, we shall leave the information about the alphabet
    33 implicit.
    32 implicit.
    34 Then using the usual bracket notation for lists,
    33 Then using the usual bracket notation for lists,
    35 we can define strings made up of characters: 
    34 we can define strings made up of characters as: 
    36 \begin{center}
    35 \begin{center}
    37 \begin{tabular}{lcl}
    36 \begin{tabular}{lcl}
    38 $\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
    37 $\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
    39 \end{tabular}
    38 \end{tabular}
    40 \end{center}
    39 \end{center}
    41 where $c$ is a variable ranging over characters.
    40 where $c$ is a variable ranging over characters.
       
    41 The $::$ stands for list cons and $[]$ for the empty
       
    42 list.
       
    43 A singleton list is sometimes written as $[c]$ for brevity.
    42 Strings can be concatenated to form longer strings in the same
    44 Strings can be concatenated to form longer strings in the same
    43 way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
    45 way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
    44 We omit the precise 
    46 We omit the precise 
    45 recursive definition here.
    47 recursive definition here.
    46 We overload this concatenation operator for two sets of strings:
    48 We overload this concatenation operator for two sets of strings:
    96 $\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
    98 $\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
    97 \end{tabular}
    99 \end{tabular}
    98 \end{center}
   100 \end{center}
    99 \noindent
   101 \noindent
   100 which is essentially the left quotient $A \backslash L$ of $A$ against 
   102 which is essentially the left quotient $A \backslash L$ of $A$ against 
   101 the singleton language with $L = \{s\}$
   103 the singleton language with $L = \{s\}$.
   102 in formal language theory.
   104 However, for our purposes here, the $\textit{Ders}$ definition with 
   103 However, for the purposes here, the $\textit{Ders}$ definition with 
       
   104 a single string is sufficient.
   105 a single string is sufficient.
   105 
   106 
   106 The reason for defining derivatives
   107 The reason for defining derivatives
   107 is that it provides a different approach
   108 is that they provide another approach
   108 to test membership of a string in 
   109 to test membership of a string in 
   109 a set of strings. 
   110 a set of strings. 
   110 For example, to test whether the string
   111 For example, to test whether the string
   111 $bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with
   112 $bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with
   112 respect to the string $bar$:
   113 respect to the string $bar$:
   113 \begin{center}
   114 \begin{center}
   114 \begin{tabular}{lclll}
   115 \begin{tabular}{lll}
   115 	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & 
   116 	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & 
   116 	$\{ar, rak\}$ & 
   117 	$\{ar, rak\}$ \\
   117 	$\stackrel{\backslash a}{\rightarrow}$ &
   118 				 & $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\
   118 	\\
   119 				 & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
   119 	$\{r \}$ & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$ &
   120 				 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
   120 	$\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
       
   121 \end{tabular}	
   121 \end{tabular}	
   122 \end{center}
   122 \end{center}
   123 \noindent
   123 \noindent
   124 and in the end test whether the set
   124 and in the end test whether the set
   125 has the empty string \footnote{ we use the infix notation $A\backslash c$
   125 has the empty string.\footnote{We use the infix notation $A\backslash c$
   126 	instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
   126 	instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
   127 on languages rather than regular expressions }.
   127 on languages rather than regular expressions.}
   128 In general, if we have a language $S_{start}$,
   128 
   129 then we can test whether $s$ is in $S_{start}$
   129 In general, if we have a language $S$,
       
   130 then we can test whether $s$ is in $S$
   130 by testing whether $[] \in S \backslash s$.
   131 by testing whether $[] \in S \backslash s$.
   131 
       
   132 With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
   132 With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
   133 we have a  few properties of how the language derivative can be defined using 
   133 we have a  few properties of how the language derivative can be defined using 
   134 sub-languages.
   134 sub-languages.
   135 For example, for the sequence operator, we have
   135 For example, for the sequence operator, we have
   136 something similar to the ``chain rule'' of the calculus derivative:
   136 something similar to a ``chain rule'':
   137 \begin{lemma}
   137 \begin{lemma}
   138 \[
   138 \[
   139 	\Der \; c \; (A @ B) =
   139 	\Der \; c \; (A @ B) =
   140 	\begin{cases}
   140 	\begin{cases}
   141 	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , &  \text{if} \;  [] \in A  \\
   141 	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , &  \text{if} \;  [] \in A  \\
   176 which of course contains $\Der \; c \; A @ A*$.
   176 which of course contains $\Der \; c \; A @ A*$.
   177 \end{itemize}
   177 \end{itemize}
   178 \end{proof}
   178 \end{proof}
   179 
   179 
   180 \noindent
   180 \noindent
   181 Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
   181 The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$
   182 for regular languages, we need to first give definitions for regular expressions.
   182 for regular expressions.
       
   183 To introduce them, we need to first give definitions for regular expressions,
       
   184 which we shall do next.
   183 
   185 
   184 \subsection{Regular Expressions and Their Meaning}
   186 \subsection{Regular Expressions and Their Meaning}
   185 The \emph{basic regular expressions} are defined inductively
   187 The \emph{basic regular expressions} are defined inductively
   186  by the following grammar:
   188  by the following grammar:
   187 \[			r ::=   \ZERO \mid  \ONE
   189 \[			r ::=   \ZERO \mid  \ONE
   194 We call them basic because we will introduce
   196 We call them basic because we will introduce
   195 additional constructors in later chapters such as negation
   197 additional constructors in later chapters such as negation
   196 and bounded repetitions.
   198 and bounded repetitions.
   197 We use $\ZERO$ for the regular expression that
   199 We use $\ZERO$ for the regular expression that
   198 matches no string, and $\ONE$ for the regular
   200 matches no string, and $\ONE$ for the regular
   199 expression that matches only the empty string\footnote{
   201 expression that matches only the empty string.\footnote{
   200 some authors
   202 Some authors
   201 also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
   203 also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
   202 but we prefer our notation}. 
   204 but we prefer our notation.} 
   203 The sequence regular expression is written $r_1\cdot r_2$
   205 The sequence regular expression is written $r_1\cdot r_2$
   204 and sometimes we omit the dot if it is clear which
   206 and sometimes we omit the dot if it is clear which
   205 regular expression is meant; the alternative
   207 regular expression is meant; the alternative
   206 is written $r_1 + r_2$.
   208 is written $r_1 + r_2$.
   207 The \emph{language} or meaning of 
   209 The \emph{language} or meaning of 
   211 \begin{center}
   213 \begin{center}
   212 \begin{tabular}{lcl}
   214 \begin{tabular}{lcl}
   213 $L \; \ZERO$ & $\dn$ & $\phi$\\
   215 $L \; \ZERO$ & $\dn$ & $\phi$\\
   214 $L \; \ONE$ & $\dn$ & $\{[]\}$\\
   216 $L \; \ONE$ & $\dn$ & $\{[]\}$\\
   215 $L \; c$ & $\dn$ & $\{[c]\}$\\
   217 $L \; c$ & $\dn$ & $\{[c]\}$\\
   216 $L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
   218 $L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
   217 $L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
   219 $L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
   218 $L \; r^*$ & $\dn$ & $ (L\;r)*$
   220 $L \; (r^*)$ & $\dn$ & $ (L\;r)*$
   219 \end{tabular}
   221 \end{tabular}
   220 \end{center}
   222 \end{center}
   221 \noindent
   223 \noindent
   222 Now with language derivatives of a language and regular expressions and
   224 %Now with language derivatives of a language and regular expressions and
   223 their language interpretations in place, we are ready to define derivatives on regular expressions.
   225 %their language interpretations in place, we are ready to define derivatives on regular expressions.
       
   226 With $L$ we are ready to introduce Brzozowski derivatives on regular expressions.
       
   227 We do so by first introducing what properties it should satisfy.
       
   228 
   224 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
   229 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
   225 %Recall, the language derivative acts on a set of strings
   230 %Recall, the language derivative acts on a set of strings
   226 %and essentially chops off a particular character from
   231 %and essentially chops off a particular character from
   227 %all strings in that set, Brzozowski defined a derivative operation on regular expressions
   232 %all strings in that set, Brzozowski defined a derivative operation on regular expressions
   228 %so that after derivative $L(r\backslash c)$ 
   233 %so that after derivative $L(r\backslash c)$ 
   229 %will look as if it was obtained by doing a language derivative on $L(r)$:
   234 %will look as if it was obtained by doing a language derivative on $L(r)$:
   230 Recall that the language derivative acts on a 
   235 %Recall that the language derivative acts on a 
   231 language (set of strings).
   236 %language (set of strings).
   232 One can decide whether a string $s$ belongs
   237 %One can decide whether a string $s$ belongs
   233 to a language $S$ by taking derivative with respect to
   238 %to a language $S$ by taking derivative with respect to
   234 that string and then checking whether the empty 
   239 %that string and then checking whether the empty 
   235 string is in the derivative:
   240 %string is in the derivative:
   236 \begin{center}
   241 %\begin{center}
   237 \parskip \baselineskip
   242 %\parskip \baselineskip
   238 \def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
   243 %\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
   239 \def\rlwd{.5pt}
   244 %\def\rlwd{.5pt}
   240 \newcommand\notate[3]{%
   245 %\newcommand\notate[3]{%
   241   \unskip\def\useanchorwidth{T}%
   246 %  \unskip\def\useanchorwidth{T}%
   242   \setbox0=\hbox{#1}%
   247 %  \setbox0=\hbox{#1}%
   243   \def\stackalignment{c}\stackunder[-6pt]{%
   248 %  \def\stackalignment{c}\stackunder[-6pt]{%
   244     \def\stackalignment{c}\stackunder[-1.5pt]{%
   249 %    \def\stackalignment{c}\stackunder[-1.5pt]{%
   245       \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
   250 %      \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
   246     \rule{\rlwd}{#2\baselineskip}}}{%
   251 %    \rule{\rlwd}{#2\baselineskip}}}{%
   247   \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
   252 %  \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
   248 }
   253 %}
   249 \Longstack{
   254 %\Longstack{
   250 \notate{$\{ \ldots ,\;$ 
   255 %\notate{$\{ \ldots ,\;$ 
   251 	\notate{s}{1}{$(c_1 :: s_1)$} 
   256 %	\notate{s}{1}{$(c_1 :: s_1)$} 
   252 	$, \; \ldots \}$ 
   257 %	$, \; \ldots \}$ 
   253 }{1}{$S_{start}$} 
   258 %}{1}{$S_{start}$} 
   254 }
   259 %}
   255 \Longstack{
   260 %\Longstack{
   256 	$\stackrel{\backslash c_1}{\longrightarrow}$
   261 %	$\stackrel{\backslash c_1}{\longrightarrow}$
   257 }
   262 %}
   258 \Longstack{
   263 %\Longstack{
   259 	$\{ \ldots,\;$  \notate{$s_1$}{1}{$(c_2::s_2)$} 
   264 %	$\{ \ldots,\;$  \notate{$s_1$}{1}{$(c_2::s_2)$} 
   260 	$,\; \ldots \}$
   265 %	$,\; \ldots \}$
   261 }
   266 %}
   262 \Longstack{
   267 %\Longstack{
   263 	$\stackrel{\backslash c_2}{\longrightarrow}$ 
   268 %	$\stackrel{\backslash c_2}{\longrightarrow}$ 
   264 }
   269 %}
   265 \Longstack{
   270 %\Longstack{
   266 	$\{ \ldots,\;  s_2
   271 %	$\{ \ldots,\;  s_2
   267 	,\; \ldots \}$
   272 %	,\; \ldots \}$
   268 }
   273 %}
   269 \Longstack{
   274 %\Longstack{
   270 	$ \xdashrightarrow{\backslash c_3\ldots\ldots} $	
   275 %	$ \xdashrightarrow{\backslash c_3\ldots\ldots} $	
   271 }
   276 %}
   272 \Longstack{
   277 %\Longstack{
   273 	\notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} = 
   278 %	\notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} = 
   274 	S_{start}\backslash s$}
   279 %	S_{start}\backslash s$}
   275 }
   280 %}
   276 \end{center}
   281 %\end{center}
   277 \begin{center}
   282 %\begin{center}
   278 	$s \in S_{start} \iff [] \in S_{end}$
   283 %	$s \in S_{start} \iff [] \in S_{end}$
   279 \end{center}
   284 %\end{center}
   280 \noindent
   285 %\noindent
   281 Brzozowski noticed that this operation
   286 Brzozowski noticed that $\Der$
   282 can be ``mirrored'' on regular expressions which
   287 can be ``mirrored'' on regular expressions which
   283 he calls the derivative of a regular expression $r$
   288 he calls the derivative of a regular expression $r$
   284 with respect to a character $c$, written
   289 with respect to a character $c$, written
   285 $r \backslash c$. This infix operator
   290 $r \backslash c$. This infix operator
   286 takes an original regular expression $r$ as input
   291 takes an original regular expression $r$ as input
   296 \[
   301 \[
   297 	L \; (r \backslash c) = \Der \; c \; (L \; r)
   302 	L \; (r \backslash c) = \Der \; c \; (L \; r)
   298 \]
   303 \]
   299 \end{property}
   304 \end{property}
   300 \noindent
   305 \noindent
   301 Pictorially, this looks as follows:
   306 Pictorially, this looks as follows:\\
       
   307 \vspace{3mm}
   302 
   308 
   303 \parskip \baselineskip
   309 \parskip \baselineskip
   304 \def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
   310 \def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
   305 \def\rlwd{.5pt}
   311 \def\rlwd{.5pt}
   306 \newcommand\notate[3]{%
   312 \newcommand\notate[3]{%
   320 	$\stackrel{\backslash c}{\xrightarrow{\hspace*{8cm}}}$
   326 	$\stackrel{\backslash c}{\xrightarrow{\hspace*{8cm}}}$
   321 }
   327 }
   322 \Longstack{
   328 \Longstack{
   323 	\notate{$r\backslash c$}{1}{$L \; (r\backslash c)=
   329 	\notate{$r\backslash c$}{1}{$L \; (r\backslash c)=
   324 	\{\ldots,\;s_1,\;\ldots\}$}
   330 	\{\ldots,\;s_1,\;\ldots\}$}
   325 }
   331 }\\ 
   326 \\
   332 \vspace{ 3mm }
       
   333 
   327 The derivatives on regular expression can again be 
   334 The derivatives on regular expression can again be 
   328 generalised to a string.
   335 generalised to a string.
   329 One could compute $r\backslash s$  and test membership of $s$
   336 One could compute $r_{start} \backslash s$  and test membership of $s$
   330 in $L \; r$ by checking
   337 in $L \; r_{start}$ by checking
   331 whether the empty string is in the language of 
   338 whether the empty string is in the language of 
   332 $r\backslash s$.
   339 $r_{end}$ ($r_{start}\backslash s$).\\
   333 
   340 
   334 
   341 \vspace{2mm}
   335 \Longstack{
   342 \Longstack{
   336 	\notate{$r_{start}$}{4}{
   343 	\notate{$r_{start}$}{4}{
   337 		\Longstack{$L \; r_{start} = \{\ldots, \;$ 
   344 		\Longstack{$L \; r_{start} = \{\ldots, \;$ 
   338 			\notate{$s$}{1}{$c_1::s_1$} 
   345 			\notate{$s$}{1}{$c_1::s_1$} 
   339 		$, \ldots\} $
   346 		$, \ldots\} $
   371 
   378 
   372 \begin{property}
   379 \begin{property}
   373 	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
   380 	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
   374 \end{property}
   381 \end{property}
   375 \noindent
   382 \noindent
   376 Now we give the recursive definition of derivative on
   383 Next we give the recursive definition of derivative on
   377 regular expressions, so that it satisfies the properties above.
   384 regular expressions, so that it satisfies the properties above.
   378 The derivative function, written $r\backslash c$, 
   385 The derivative function, written $r\backslash c$, 
   379 takes a regular expression $r$ and character $c$, and
   386 takes a regular expression $r$ and character $c$, and
   380 returns a new regular expression representing
   387 returns a new regular expression representing
   381 the original regular expression's language $L \; r$
   388 the original regular expression's language $L \; r$
   483 \end{lemma}
   490 \end{lemma}
   484 \begin{proof}
   491 \begin{proof}
   485 	By induction on $r$.
   492 	By induction on $r$.
   486 \end{proof}
   493 \end{proof}
   487 \noindent
   494 \noindent
   488 The main property of the derivative operation
   495 which is the main property of derivatives
   489 (that enables us to reason about the correctness of
   496 that enables us to reason about the correctness of
   490 derivative-based matching)
   497 derivative-based matching.
   491 is 
       
   492 
       
   493 
       
   494 \noindent
       
   495 We can generalise the derivative operation shown above for single characters
   498 We can generalise the derivative operation shown above for single characters
   496 to strings as follows:
   499 to strings as follows:
   497 
   500 
   498 \begin{center}
   501 \begin{center}
   499 \begin{tabular}{lcl}
   502 \begin{tabular}{lcl}
   505 \noindent
   508 \noindent
   506 When there is no ambiguity, we will 
   509 When there is no ambiguity, we will 
   507 omit the subscript and use $\backslash$ instead
   510 omit the subscript and use $\backslash$ instead
   508 of $\backslash_s$ to denote
   511 of $\backslash_s$ to denote
   509 string derivatives for brevity.
   512 string derivatives for brevity.
   510 Brzozowski's  regular-expression matching algorithm can then be described as:
   513 Brzozowski's derivative-based
       
   514 regular-expression matching algorithm can then be described as:
   511 
   515 
   512 \begin{definition}
   516 \begin{definition}
   513 $\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
   517 $\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
   514 \end{definition}
   518 \end{definition}
   515 
   519 
   524 \;\textrm{true}/\textrm{false}
   528 \;\textrm{true}/\textrm{false}
   525 \end{tikzcd}
   529 \end{tikzcd}
   526 \end{equation}
   530 \end{equation}
   527 
   531 
   528 \noindent
   532 \noindent
   529  It can  be
   533  It is relatively
   530 relatively  easily shown that this matcher is correct:
   534 easy to show that this matcher is correct, namely
   531 \begin{lemma}
   535 \begin{lemma}
   532 	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
   536 	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
   533 \end{lemma}
   537 \end{lemma}
   534 \begin{proof}
   538 \begin{proof}
   535 	By induction on $s$ using property of derivatives:
   539 	By induction on $s$ using property of derivatives:
   536 	lemma \ref{derDer}.
   540 	lemma \ref{derDer}.
   537 \end{proof}
   541 \end{proof}
   538 \noindent
       
   539 \begin{figure}
   542 \begin{figure}
   540 \begin{center}
   543 \begin{center}
   541 \begin{tikzpicture}
   544 \begin{tikzpicture}
   542 \begin{axis}[
   545 \begin{axis}[
   543     xlabel={$n$},
   546     xlabel={$n$},
   558 If we implement the above algorithm naively, however,
   561 If we implement the above algorithm naively, however,
   559 the algorithm can be excruciatingly slow, as shown in 
   562 the algorithm can be excruciatingly slow, as shown in 
   560 \ref{NaiveMatcher}.
   563 \ref{NaiveMatcher}.
   561 Note that both axes are in logarithmic scale.
   564 Note that both axes are in logarithmic scale.
   562 Around two dozens characters
   565 Around two dozens characters
   563 would already explode the matcher on the regular expression 
   566 would already ``explode'' the matcher with the regular expression 
   564 $(a^*)^*b$.
   567 $(a^*)^*b$.
   565 Too improve this situation, we need to introduce simplification
   568 To improve this situation, we need to introduce simplification
   566 rewrite rules for the intermediate results,
   569 rules for the intermediate results,
   567 such as $r + r \rightarrow r$,
   570 such as $r + r \rightarrow r$,
   568 and make sure those rules do not change the 
   571 and make sure those rules do not change the 
   569 language of the regular expression.
   572 language of the regular expression.
   570 One simpled-minded simplification function
   573 One simpled-minded simplification function
   571 that achieves these requirements is given below:
   574 that achieves these requirements 
       
   575 is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
   572 \begin{center}
   576 \begin{center}
   573 	\begin{tabular}{lcl}
   577 	\begin{tabular}{lcl}
   574 		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
   578 		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
   575 		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
   579 		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
   576 					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
   580 					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
   618 The running time of $\textit{ders}\_\textit{simp}$
   622 The running time of $\textit{ders}\_\textit{simp}$
   619 on the same example of Figure \ref{NaiveMatcher}
   623 on the same example of Figure \ref{NaiveMatcher}
   620 is now ``tame''  in terms of the length of inputs,
   624 is now ``tame''  in terms of the length of inputs,
   621 as shown in Figure \ref{BetterMatcher}.
   625 as shown in Figure \ref{BetterMatcher}.
   622 
   626 
   623 So far the story is use Brzozowski derivatives,
   627 So far the story is use Brzozowski derivatives and
   624 simplifying where possible and at the end test
   628 simplify as much as possible and at the end test
   625 whether the empty string is recognised by the final derivative.
   629 whether the empty string is recognised 
       
   630 by the final derivative.
   626 But what if we want to 
   631 But what if we want to 
   627 do lexing instead of just getting a true/false answer?
   632 do lexing instead of just getting a true/false answer?
   628 Sulzmanna and Lu \cite{Sulzmann2014} first came up with a nice and 
   633 Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and 
   629 elegant (arguably as beautiful as the definition of the original derivative) solution for this.
   634 elegant (arguably as beautiful as the definition of the 
       
   635 Brzozowski derivative) solution for this.
   630 
   636 
   631 \section{Values and the Lexing Algorithm by Sulzmann and Lu}
   637 \section{Values and the Lexing Algorithm by Sulzmann and Lu}
   632 In this section, we present a two-phase regular expression lexing 
   638 In this section, we present a two-phase regular expression lexing 
   633 algorithm.
   639 algorithm.
   634 The first phase takes successive derivatives with 
   640 The first phase takes successive derivatives with 
   684 		$|\Stars \; v::vs|$ &  $\dn$ & $ |v| @ |\Stars(vs)|$
   690 		$|\Stars \; v::vs|$ &  $\dn$ & $ |v| @ |\Stars(vs)|$
   685 	\end{tabular}
   691 	\end{tabular}
   686 \end{center}
   692 \end{center}
   687 Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
   693 Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
   688 to indicate that a value $v$ could be generated from a lexing algorithm
   694 to indicate that a value $v$ could be generated from a lexing algorithm
   689 with input $r$. They call it the value inhabitation relation. 
   695 with input $r$. They call it the value inhabitation relation,
       
   696 defined by the rules.
   690 \begin{mathpar}
   697 \begin{mathpar}
   691 	\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em}
   698 	\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em}
   692 
   699 
   693 	\inferrule{\mbox{}}{\vdash \Empty :  \ONE} \hspace{2em}
   700 	\inferrule{\mbox{}}{\vdash \Empty :  \ONE} \hspace{2em}
   694 
   701 
   705 is to make sure that for a given pair of regular 
   712 is to make sure that for a given pair of regular 
   706 expression $r$ and string $s$, the number of values 
   713 expression $r$ and string $s$, the number of values 
   707 satisfying $|v| = s$ and $\vdash v:r$ is finite.
   714 satisfying $|v| = s$ and $\vdash v:r$ is finite.
   708 This additional condition was
   715 This additional condition was
   709 imposed by Ausaf and Urban to make their proofs easier.
   716 imposed by Ausaf and Urban to make their proofs easier.
   710 Given the same string and regular expression, there can be
   717 Given a string and a regular expression, there can be
   711 multiple values for it. For example, both
   718 multiple values for it. For example, both
   712 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
   719 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
   713 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
   720 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
   714 and the values both flatten to $abc$.
   721 and the values both flatten to $abc$.
   715 Lexers therefore have to disambiguate and choose only
   722 Lexers therefore have to disambiguate and choose only
   716 one of the values to output. $\POSIX$ is one of the
   723 one of the values be generated. $\POSIX$ is one of the
   717 disambiguation strategies that is widely adopted.
   724 disambiguation strategies that is widely adopted.
   718 
   725 
   719 Ausaf and Urban\parencite{AusafDyckhoffUrban2016} 
   726 Ausaf et al.\parencite{AusafDyckhoffUrban2016} 
   720 formalised the property 
   727 formalised the property 
   721 as a ternary relation.
   728 as a ternary relation.
   722 The $\POSIX$ value $v$ for a regular expression
   729 The $\POSIX$ value $v$ for a regular expression
   723 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
   730 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
   724 in the following set of rules\footnote{The names of the rules are used
   731 in the following rules\footnote{The names of the rules are used
   725 as they were originally given in \cite{AusafDyckhoffUrban2016} }:
   732 as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
   726 \begin{figure}[H]
   733 \begin{figure}[p]
   727 \begin{mathpar}
   734 \begin{mathpar}
   728 	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
   735 	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
   729 		
   736 		
   730 	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
   737 	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
   731 
   738 
   743 	\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
   750 	\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
   744 		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
   751 		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
   745 		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
   752 		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
   746 	(v::vs)}
   753 	(v::vs)}
   747 \end{mathpar}
   754 \end{mathpar}
   748 \caption{The inductive POSIX Lexing Rules defined by Ausaf, Dyckhoff and Urban \cite{AusafDyckhoffUrban2016}.
   755 \caption{The inductive POSIX rules given by Ausaf et al.
   749 The ternary relation, written $(s, r) \rightarrow v$, formalises the POSIX constraints on the
   756 	\cite{AusafDyckhoffUrban2016}.
       
   757 This ternary relation, written $(s, r) \rightarrow v$, 
       
   758 formalises the POSIX constraints on the
   750 value $v$ given a string $s$ and 
   759 value $v$ given a string $s$ and 
   751 regular expression $r$.
   760 regular expression $r$.
   752 For example, this specification says that all matches for an alternative 
       
   753 must always prefer a left value to a right one.
       
   754 }
   761 }
   755 \end{figure}
   762 \end{figure}\afterpage{\clearpage}
   756 \noindent
   763 \noindent
   757 
   764 
   758 \begin{figure}
   765 %\begin{figure}
   759 \begin{tikzpicture}[]
   766 %\begin{tikzpicture}[]
   760     \node [minimum width = 6cm, rectangle split, rectangle split horizontal, 
   767 %    \node [minimum width = 6cm, rectangle split, rectangle split horizontal, 
   761 	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
   768 %	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
   762 	    (node1)
   769 %	    (node1)
   763 	    {$r_{token1}$
   770 %	    {$r_{token1}$
   764 	    \nodepart{two}  $\;\;\; \quad r_{token2}\;\;\;\quad$ };
   771 %	    \nodepart{two}  $\;\;\; \quad r_{token2}\;\;\;\quad$ };
   765 	    %\node [left = 6.0cm of node1] (start1) {hi};
   772 %	    %\node [left = 6.0cm of node1] (start1) {hi};
   766 	    \node [left = 0.2cm of node1] (middle) {$v.s.$};
   773 %	    \node [left = 0.2cm of node1] (middle) {$v.s.$};
   767     \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal, 
   774 %    \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal, 
   768 	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
   775 %	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
   769 	    (node2)
   776 %	    (node2)
   770 	    {$\quad\;\;\;r_{token1}\quad\;\;\;$
   777 %	    {$\quad\;\;\;r_{token1}\quad\;\;\;$
   771 	    \nodepart{two}  $r_{token2}$ };
   778 %	    \nodepart{two}  $r_{token2}$ };
   772 	    \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
   779 %	    \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
   773 		\node [above = 1.5cm of middle, minimum width = 6cm, 
   780 %		\node [above = 1.5cm of middle, minimum width = 6cm, 
   774 			rectangle, style={draw, rounded corners, inner sep=10pt}] 
   781 %			rectangle, style={draw, rounded corners, inner sep=10pt}] 
   775 			(topNode) {$s$};
   782 %			(topNode) {$s$};
   776 	    \path[->,draw]
   783 %	    \path[->,draw]
   777 	        (topNode) edge node {split $A$} (node2)
   784 %	        (topNode) edge node {split $A$} (node2)
   778 	        (topNode) edge node {split $B$} (node1)
   785 %	        (topNode) edge node {split $B$} (node1)
   779 		;
   786 %		;
   780 			
   787 %			
   781 
   788 %
   782 \end{tikzpicture}
   789 %\end{tikzpicture}
   783 \caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
   790 %\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
   784 \end{figure}
   791 %\end{figure}
   785 The above $\POSIX$ rules follows the intuition described below: 
   792 The above $\POSIX$ rules follows the intuition described below: 
   786 \begin{itemize}
   793 \begin{itemize}
   787 	\item (Left Priority)\\
   794 	\item (Left Priority)\\
   788 		Match the leftmost regular expression when multiple options of matching
   795 		Match the leftmost regular expression when multiple options of matching
   789 		are available.
   796 		are available. See P+L and P+R where in P+R $s$ cannot
       
   797 		be in the language of $L \; r_1$.
   790 	\item (Maximum munch)\\
   798 	\item (Maximum munch)\\
   791 		Always match a subpart as much as possible before proceeding
   799 		Always match a subpart as much as possible before proceeding
   792 		to the next token.
   800 		to the next part of the string.
   793 		For example, when the string $s$ matches 
   801 		For example, when the string $s$ matches 
   794 		$r_{token1}\cdot r_{token2}$, and we have two ways $s$ can be split:
   802 		$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
   795 		Then the split that matches a longer string for the first token
   803 		Then the split that matches a longer string for the first part
   796 		$r_{token1}$ is preferred by this maximum munch rule (See
   804 		$r_{part1}$ is preferred by this maximum munch rule.
   797 		\ref{munch} for an illustration).
   805 		This is caused by the side-condition 
   798 \noindent
   806 		\begin{center}
   799 
   807 		$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
       
   808 		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
       
   809 		\end{center}
       
   810 		in PS.
       
   811 		%(See
       
   812 		%\ref{munch} for an illustration).
   800 \end{itemize}
   813 \end{itemize}
   801 \noindent
   814 \noindent
   802 These disambiguation strategies can be 
   815 These disambiguation strategies can be 
   803 quite practical.
   816 quite practical.
   804 For instance, when lexing a code snippet 
   817 For instance, when lexing a code snippet 
   805 \[ 
   818 \[ 
   806 	\textit{iffoo} = 3
   819 	\textit{iffoo} = 3
   807 \]
   820 \]
   808 using the regular expression (with 
   821 using a regular expression 
   809 keyword and identifier having their 
   822 for keywords and 
   810 usualy definitions on any formal
   823 identifiers:
   811 language textbook, for instance
   824 %(for example, keyword is a nonempty string starting with letters 
   812 keyword is a nonempty string starting with letters 
   825 %followed by alphanumeric characters or underscores):
   813 followed by alphanumeric characters or underscores):
   826 \[
   814 \[
   827 	r_{keyword} + r_{identifier}.
   815 	\textit{keyword} + \textit{identifier},
   828 \]
   816 \]
   829 If we want $\textit{iffoo}$ to be recognized
   817 we want $\textit{iffoo}$ to be recognized
   830 as an identifier
   818 as an identifier rather than a keyword (if)
   831 where identifiers are defined as usual (letters
       
   832 followed by letters, numbers or underscores),
       
   833 then a match with a keyword (if)
   819 followed by
   834 followed by
   820 an identifier (foo).
   835 an identifier (foo) would be incorrect.
   821 POSIX lexing achieves this.
   836 POSIX lexing would generate what we want.
   822 
   837 
   823 We know that a $\POSIX$ value is also a normal underlying
   838 \noindent
   824 value:
   839 We know that a POSIX 
       
   840 value for regular expression $r$ is inhabited by $r$.
   825 \begin{lemma}
   841 \begin{lemma}
   826 $(r, s) \rightarrow v \implies \vdash v: r$
   842 $(r, s) \rightarrow v \implies \vdash v: r$
   827 \end{lemma}
   843 \end{lemma}
   828 \noindent
   844 \noindent
   829 The good property about a $\POSIX$ value is that 
   845 The main property about a $\POSIX$ value is that 
   830 given the same regular expression $r$ and string $s$,
   846 given the same regular expression $r$ and string $s$,
   831 one can always uniquely determine the $\POSIX$ value for it:
   847 one can always uniquely determine the $\POSIX$ value for it:
   832 \begin{lemma}
   848 \begin{lemma}
   833 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
   849 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
   834 \end{lemma}
   850 \end{lemma}
   875 then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, 
   891 then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, 
   876 which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
   892 which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
   877 is the same as $\Seq(v_1, v_2)$. 
   893 is the same as $\Seq(v_1, v_2)$. 
   878 \end{proof}
   894 \end{proof}
   879 \noindent
   895 \noindent
   880 Now we know what a $\POSIX$ value is and why it is unique;
   896 We have now defined what a POSIX value is and shown that it is unique,
   881 the problem is generating 
   897 the problem is to generate
   882 such a value in a lexing algorithm using derivatives.
   898 such a value in a lexing algorithm using derivatives.
   883 
   899 
   884 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
   900 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
   885 
   901 
   886 Sulzmann and Lu extended Brzozowski's 
   902 Sulzmann and Lu extended Brzozowski's 
   887 derivative-based matching
   903 derivative-based matching
   888 to a lexing algorithm by a second pass
   904 to a lexing algorithm by a second phase 
   889 after the initial phase of successive derivatives.
   905 after the initial phase of successive derivatives.
   890 This second phase generates a POSIX value 
   906 This second phase generates a POSIX value 
   891 if the regular expression matches the string.
   907 if the regular expression matches the string.
   892 Two functions are involved: $\inj$ and $\mkeps$.
   908 Two functions are involved: $\inj$ and $\mkeps$.
   893 The first one used is $\mkeps$, which constructs a POSIX value from the last
   909 The function $\mkeps$ constructs a POSIX value from the last
   894 derivative $r_n$:
   910 derivative $r_n$:
   895 \begin{ceqn}
   911 \begin{ceqn}
   896 \begin{equation}\label{graph:mkeps}
   912 \begin{equation}\label{graph:mkeps}
   897 \begin{tikzcd}
   913 \begin{tikzcd}
   898 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
   914 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
   901 \end{equation}
   917 \end{equation}
   902 \end{ceqn}
   918 \end{ceqn}
   903 \noindent
   919 \noindent
   904 In the above diagram, again we assume that
   920 In the above diagram, again we assume that
   905 the input string $s$ is made of $n$ characters
   921 the input string $s$ is made of $n$ characters
   906 $c_0c_1 \ldots c_{n-1}$, and the input regular expression $r$
   922 $c_0c_1 \ldots c_{n-1}$ 
   907 is given label $0$ and after each character $c_i$ is taken off
   923 The last derivative operation 
   908 by the derivative operation the resulting derivative regular 
   924 $\backslash c_{n-1}$ generates the derivative $r_n$, for which
   909 expressioin is $r_{i+1}$.The last derivative operation 
   925 $\mkeps$ produces the value $v_n$. This value
   910 $\backslash c_{n-1}$ gives back $r_n$, which is transformed into 
   926 tells us how the empty string is matched by the (nullable)
   911 a value $v_n$ by $\mkeps$.
   927 regular expression $r_n$, in a POSIX way.
   912 $v_n$ tells us how an empty string is matched by the (nullable)
       
   913 regular expression $r_n$, in a $\POSIX$ way.
       
   914 The definition of $\mkeps$ is
   928 The definition of $\mkeps$ is
   915 	\begin{center}
   929 	\begin{center}
   916 		\begin{tabular}{lcl}
   930 		\begin{tabular}{lcl}
   917 			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
   931 			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
   918 			$\mkeps \; (r_{1}+r_{2})$	& $\dn$ 
   932 			$\mkeps \; (r_{1}+r_{2})$	& $\dn$ 
   924 		\end{tabular}
   938 		\end{tabular}
   925 	\end{center}
   939 	\end{center}
   926 
   940 
   927 
   941 
   928 \noindent 
   942 \noindent 
   929 We favour the left child $r_1$ of $r_1 + r_2$ 
   943 The function prefers the left child $r_1$ of $r_1 + r_2$ 
   930 to match an empty string if there is a choice.
   944 to match an empty string if there is a choice.
   931 When there is a star for us to match the empty string,
   945 When there is a star to match the empty string,
   932 we give the $\Stars$ constructor an empty list, meaning
   946 we give the $\Stars$ constructor an empty list, meaning
   933 no iteration is taken.
   947 no iteration is taken.
   934 The result of a call to $\mkeps$ on a $\nullable$ $r$ would
   948 The result of $\mkeps$ on a $\nullable$ $r$ 
   935 be a $\POSIX$ value corresponding to $r$:
   949 is a POSIX value for $r$ and the empty string:
   936 \begin{lemma}\label{mePosix}
   950 \begin{lemma}\label{mePosix}
   937 $\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$
   951 $\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$
   938 \end{lemma}
   952 \end{lemma}
   939 \begin{proof}
   953 \begin{proof}
   940 	By induction on the shape of $r$.
   954 	By induction on $r$.
   941 \end{proof}
   955 \end{proof}
   942 \noindent
   956 \noindent
   943 After the $\mkeps$-call, we inject back the characters one by one
   957 After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
   944 in reverse order as they were chopped off in the derivative phase.
   958 in reverse order as they were chopped off in the derivative phase.
   945 The fucntion for this is called $\inj$. $\inj$ and $\backslash$
   959 The fucntion for this is called $\inj$. This function 
   946 are not exactly reverse operations of one another, as $\inj$ 
   960 operates on values, unlike $\backslash$ which operates on regular expressions.
   947 operates on values instead of regular
       
   948 expressions.
       
   949 In the diagram below, $v_i$ stands for the (POSIX) value 
   961 In the diagram below, $v_i$ stands for the (POSIX) value 
   950 for how the regular expression 
   962 for how the regular expression 
   951 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
   963 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
   952 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
   964 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
   953 After injecting back $n$ characters, we get the lexical value for how $r_0$
   965 After injecting back $n$ characters, we get the lexical value for how $r_0$