ninems/ninems.tex
changeset 37 17d8e7599a01
parent 36 97c9ac95194d
child 38 b5363c0dcd99
equal deleted inserted replaced
36:97c9ac95194d 37:17d8e7599a01
   213 ``fire''---so is it an identifier or a keyword?  While in applications
   213 ``fire''---so is it an identifier or a keyword?  While in applications
   214 there is a well-known strategy to decide these questions, called POSIX
   214 there is a well-known strategy to decide these questions, called POSIX
   215 matching, only relatively recently precise definitions of what POSIX
   215 matching, only relatively recently precise definitions of what POSIX
   216 matching actually means have been formalised
   216 matching actually means have been formalised
   217 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
   217 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
   218 POSIX matching means matching the longest initial substring and
   218 POSIX matching means matching the longest initial substring.
   219 in the case of a tie, the initial submatch is chosen according to some priorities attached to the
   219 In the case of a tie, the initial submatch is chosen according to some priorities attached to the
   220 regular expressions (e.g.~keywords have a higher priority than
   220 regular expressions (e.g.~keywords have a higher priority than
   221 identifiers). This sounds rather simple, but according to Grathwohl et
   221 identifiers). This sounds rather simple, but according to Grathwohl et
   222 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
   222 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
   223 
   223 
   224 \begin{quote}
   224 \begin{quote}
   289 \end{tabular}
   289 \end{tabular}
   290 \end{center}
   290 \end{center}
   291 
   291 
   292 \noindent
   292 \noindent
   293 
   293 
   294 
       
   295 
       
   296  %Assuming the classic notion of a
       
   297 %\emph{language} of a regular expression, written $L(\_)$, t
       
   298 The main
       
   299 property of the derivative operation is that
       
   300 
       
   301 \begin{center}
       
   302 $c\!::\!s \in L(r)$ holds
       
   303 if and only if $s \in L(r\backslash c)$.
       
   304 \end{center}
       
   305 
       
   306 \noindent
       
   307 So if we want to find out whether a string $s$
       
   308 matches with a regular expression $r$, build the derivatives of $r$
       
   309 w.r.t.\ (in succession) all the characters of the string $s$. Finally,
       
   310 test whether the resulting regular expression can match the empty
       
   311 string.  If yes, then $r$ matches $s$, and no in the negative
       
   312 case.\\
       
   313 We can generalise the derivative operation for strings like this:
       
   314 \begin{center}
       
   315 \begin{tabular}{lcl}
       
   316 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
       
   317 $r \backslash \epsilon $ & $\dn$ & $r$
       
   318 \end{tabular}
       
   319 \end{center}
       
   320  For us the main advantage is that derivatives can be
   294  For us the main advantage is that derivatives can be
   321 straightforwardly implemented in any functional programming language,
   295 straightforwardly implemented in any functional programming language,
   322 and are easily definable and reasoned about in theorem provers---the
   296 and are easily definable and reasoned about in theorem provers---the
   323 definitions just consist of inductive datatypes and simple recursive
   297 definitions just consist of inductive datatypes and simple recursive
   324 functions. Moreover, the notion of derivatives can be easily
   298 functions. Moreover, the notion of derivatives can be easily
   326 the not-regular expression, written $\neg\,r$, or bounded repetitions
   300 the not-regular expression, written $\neg\,r$, or bounded repetitions
   327 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
   301 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
   328 straightforwardly realised within the classic automata approach.
   302 straightforwardly realised within the classic automata approach.
   329 
   303 
   330 
   304 
   331 We obtain a simple and elegant regular
   305 
       
   306  %Assuming the classic notion of a
       
   307 %\emph{language} of a regular expression, written $L(\_)$, t
       
   308 The main
       
   309 property of the derivative operation is that
       
   310 
       
   311 \begin{center}
       
   312 $c\!::\!s \in L(r)$ holds
       
   313 if and only if $s \in L(r\backslash c)$.
       
   314 \end{center}
       
   315 
       
   316 \noindent
       
   317 So if we want to find out whether a string $s$
       
   318 matches with a regular expression $r$, build the derivatives of $r$
       
   319 w.r.t.\ (in succession) all the characters of the string $s$. Finally,
       
   320 test whether the resulting regular expression can match the empty
       
   321 string.  If yes, then $r$ matches $s$, and no in the negative
       
   322 case.
       
   323 
       
   324 We can generalise the derivative operation for strings like this:
       
   325 \begin{center}
       
   326 \begin{tabular}{lcl}
       
   327 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
       
   328 $r \backslash \epsilon $ & $\dn$ & $r$
       
   329 \end{tabular}
       
   330 \end{center}
       
   331 \noindent
       
   332 Using the above definition we obtain a simple and elegant regular
   332 expression matching algorithm: 
   333 expression matching algorithm: 
   333 \begin{definition}{matcher}
       
   334 \[
   334 \[
   335 match\;s\;r \;\dn\; nullable(r\backslash s)
   335 match\;s\;r \;\dn\; nullable(r\backslash s)
   336 \]
   336 \]
   337 \end{definition}
       
   338 
       
   339 
       
   340 
       
   341 One limitation, however, of Brzozowski's algorithm is that it only
   337 One limitation, however, of Brzozowski's algorithm is that it only
   342 produces a YES/NO answer for whether a string is being matched by a
   338 produces a YES/NO answer for whether a string is being matched by a
   343 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
   339 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
   344 algorithm to allow generation of an actual matching, called a
   340 algorithm to allow generation of an actual matching, called a
   345 \emph{value}.  
   341 \emph{value}.  
   370 	\end{tabular}
   366 	\end{tabular}
   371 \end{center}
   367 \end{center}
   372 
   368 
   373 \noindent
   369 \noindent
   374  Here we put the regular expression and values of the same shape on the same level to illustrate the corresponding relation between them. 
   370  Here we put the regular expression and values of the same shape on the same level to illustrate the corresponding relation between them. 
       
   371  
   375  The flatten notation $| v |$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$. We omit this straightforward definition.
   372  The flatten notation $| v |$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$. We omit this straightforward definition.
   376  Values are a way of expressing parse trees(the tree structure that tells how a sub-regex matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. 
   373  Values are a way of expressing parse trees(the tree structure that tells how a sub-regex matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. 
   377 
   374 
   378  To give a concrete example of how value works, consider the string $xy$ and the
   375  To give a concrete example of how value works, consider the string $xy$ and the
   379 regular expression $(x + (y + xy))^*$. We can view this regular
   376 regular expression $(x + (y + xy))^*$. We can view this regular
   412 \end{tikzcd}
   409 \end{tikzcd}
   413 \begin{tikzcd}
   410 \begin{tikzcd}
   414 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n   
   411 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n   
   415 \end{tikzcd}
   412 \end{tikzcd}
   416 
   413 
       
   414 
   417 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$.
   415 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$.
   418 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ...  until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for the empty word epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string:
   416 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ...  until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for how the empty word matched the empty regular expression epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string:
   419 
   417 
   420  After this, we inject back the characters one by one, in reverse order as they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$($s_i$ means the string s with the first $i$ characters being chopped off) from the previous parse tree. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
   418  After this, we inject back the characters one by one, in reverse order as they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$($s_i$ means the string s with the first $i$ characters being chopped off) from the previous parse tree. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
   421 An inductive proof can be routinely established.
   419 An inductive proof can be routinely established.
   422 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. Rather, we shall focus next on the
   420 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. Rather, we shall focus next on the
   423 process of simplification of regular expressions, which is needed in
   421 process of simplification of regular expressions, which is needed in
   498 \end{tabular}    
   496 \end{tabular}    
   499 \end{center} 
   497 \end{center} 
   500 where $\Z$ and $\S$ are arbitrary names for the bits in the
   498 where $\Z$ and $\S$ are arbitrary names for the bits in the
   501 bitsequences. 
   499 bitsequences. 
   502 Here code encodes a value into a bitsequence by converting Left into $\Z$, Right into $\S$, the start point of a non-empty star iteration into $\S$, and the border where a local star terminates into $\Z$. This conversion is apparently lossy, as it throws away the character information, and does not decode the boundary between the two operands of the sequence constructor. Moreover, with only the bitcode we cannot even tell whether the $\S$s and $\Z$s are for $Left/Right$ or $Stars$. The reason for choosing this compact way of storing information is that the relatively small size of bits can be easily moved around during the lexing process. In order to recover the bitcode back into values, we will need the regular expression as the extra information and decode them back into value:\\
   500 Here code encodes a value into a bitsequence by converting Left into $\Z$, Right into $\S$, the start point of a non-empty star iteration into $\S$, and the border where a local star terminates into $\Z$. This conversion is apparently lossy, as it throws away the character information, and does not decode the boundary between the two operands of the sequence constructor. Moreover, with only the bitcode we cannot even tell whether the $\S$s and $\Z$s are for $Left/Right$ or $Stars$. The reason for choosing this compact way of storing information is that the relatively small size of bits can be easily moved around during the lexing process. In order to recover the bitcode back into values, we will need the regular expression as the extra information and decode them back into value:\\
   503 \begin{definition}[Bitdecoding of Values]\mbox{}
   501 %\begin{definition}[Bitdecoding of Values]\mbox{}
   504 \begin{center}
   502 \begin{center}
   505 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
   503 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
   506   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
   504   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
   507   $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
   505   $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
   508   $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
   506   $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
   525      $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
   523      $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
   526   & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
   524   & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
   527        \textit{else}\;\textit{None}$                       
   525        \textit{else}\;\textit{None}$                       
   528 \end{tabular}    
   526 \end{tabular}    
   529 \end{center}    
   527 \end{center}    
   530 \end{definition}
   528 %\end{definition}
   531 
   529 
   532 To do lexing using annotated regular expressions, we shall first transform the
   530 To do lexing using annotated regular expressions, we shall first transform the
   533 usual (un-annotated) regular expressions into annotated regular
   531 usual (un-annotated) regular expressions into annotated regular
   534 expressions:\\
   532 expressions:\\
   535 \begin{definition}
   533 %\begin{definition}
   536 \begin{center}
   534 \begin{center}
   537 \begin{tabular}{lcl}
   535 \begin{tabular}{lcl}
   538   $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
   536   $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
   539   $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
   537   $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
   540   $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
   538   $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
   545          $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
   543          $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
   546   $(r^*)^\uparrow$ & $\dn$ &
   544   $(r^*)^\uparrow$ & $\dn$ &
   547          $\textit{STAR}\;[]\,r^\uparrow$\\
   545          $\textit{STAR}\;[]\,r^\uparrow$\\
   548 \end{tabular}    
   546 \end{tabular}    
   549 \end{center}    
   547 \end{center}    
   550 \end{definition}
   548 %\end{definition}
   551 Then we do successive derivative operations on the annotated regular expression. This derivative operation is the same as what we previously have for the simple regular expressions, except that we take special care of the bits to store the parse tree information:\\
   549 Then we do successive derivative operations on the annotated regular expression. This derivative operation is the same as what we previously have for the simple regular expressions, except that we take special care of the bits to store the parse tree information:\\
   552 \begin{definition}{bder}
   550 %\begin{definition}{bder}
   553 \begin{center}
   551 \begin{center}
   554   \begin{tabular}{@{}lcl@{}}
   552   \begin{tabular}{@{}lcl@{}}
   555   $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   553   $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   556   $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   554   $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   557   $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ &
   555   $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ &
   567   $(\textit{STAR}\,bs\,a)\backslash c$ & $\dn$ &
   565   $(\textit{STAR}\,bs\,a)\backslash c$ & $\dn$ &
   568       $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\backslash c))\,
   566       $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\backslash c))\,
   569        (\textit{STAR}\,[]\,r)$
   567        (\textit{STAR}\,[]\,r)$
   570 \end{tabular}    
   568 \end{tabular}    
   571 \end{center}    
   569 \end{center}    
   572 \end{definition}
   570 %\end{definition}
   573 This way, we do not have to use an injection function and a second phase, but instead only need to collect the bits while running $mkeps$:
   571 This way, we do not have to use an injection function and a second phase, but instead only need to collect the bits while running $mkeps$:
   574 \begin{definition}[\textit{bmkeps}]\mbox{}
   572 %\begin{definition}[\textit{bmkeps}]\mbox{}
   575 \begin{center}
   573 \begin{center}
   576 \begin{tabular}{lcl}
   574 \begin{tabular}{lcl}
   577   $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\
   575   $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\
   578   $\textit{bmkeps}\,(\textit{ALT}\,bs\,a_1\,a_2)$ & $\dn$ &
   576   $\textit{bmkeps}\,(\textit{ALT}\,bs\,a_1\,a_2)$ & $\dn$ &
   579      $\textit{if}\;\textit{bnullable}\,a_1$\\
   577      $\textit{if}\;\textit{bnullable}\,a_1$\\
   583      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
   581      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
   584   $\textit{bmkeps}\,(\textit{STAR}\,bs\,a)$ & $\dn$ &
   582   $\textit{bmkeps}\,(\textit{STAR}\,bs\,a)$ & $\dn$ &
   585      $bs \,@\, [\S]$
   583      $bs \,@\, [\S]$
   586 \end{tabular}    
   584 \end{tabular}    
   587 \end{center}    
   585 \end{center}    
   588 \end{definition}
   586 %\end{definition}
   589 and then decode the bits using the regular expression. The whole process looks like this:\\
   587 and then decode the bits using the regular expression. After putting these pieces together, the whole process looks like this:\\
   590 r
   588 \begin{center}
   591 \\
   589 \begin{tabular}{lcl}
       
   590   $\textit{blexer}\;r\,s$ & $\dn$ &
       
   591       $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
       
   592   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   593   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   594   & & $\;\;\textit{else}\;\textit{None}$
       
   595 \end{tabular}
       
   596 \end{center}
   592 
   597 
   593 The main point of the bitsequences and annotated regular expressions
   598 The main point of the bitsequences and annotated regular expressions
   594 is that we can apply rather aggressive (in terms of size)
   599 is that we can apply rather aggressive (in terms of size)
   595 simplification rules in order to keep derivatives small.  
   600 simplification rules in order to keep derivatives small.  
   596 
   601