ninems/ninems.tex
changeset 38 b5363c0dcd99
parent 37 17d8e7599a01
child 39 7d18745dd7c9
equal deleted inserted replaced
37:17d8e7599a01 38:b5363c0dcd99
    68   algorithm to not just give a YES/NO answer for whether or not a regular
    68   algorithm to not just give a YES/NO answer for whether or not a regular
    69   expression matches a string, but in case it matches also \emph{how}
    69   expression matches a string, but in case it matches also \emph{how}
    70   it matches the string.  This is important for applications such as
    70   it matches the string.  This is important for applications such as
    71   lexing (tokenising a string). The problem is to make the algorithm
    71   lexing (tokenising a string). The problem is to make the algorithm
    72   by Sulzmann and Lu fast on all inputs without breaking its
    72   by Sulzmann and Lu fast on all inputs without breaking its
    73   correctness.
    73   correctness. We have already developed some simplification rules, but have not shown that they 
       
    74   preserve the correctness. We also have not yet looked at extended regular expressions.
    74 \end{abstract}
    75 \end{abstract}
    75 
    76 
    76 \section{Introduction}
    77 \section{Introduction}
    77 
    78 
    78 This PhD-project is about regular expression matching and
    79 This PhD-project is about regular expression matching and
   239 regular expressions \cite{Brzozowski1964}. We shall briefly explain
   240 regular expressions \cite{Brzozowski1964}. We shall briefly explain
   240 the algorithms next.
   241 the algorithms next.
   241 
   242 
   242 \section{The Algorithms by  Brzozowski, and Sulzmann and Lu}
   243 \section{The Algorithms by  Brzozowski, and Sulzmann and Lu}
   243 
   244 
   244 Suppose regular expressions are given by the following grammar:\\
   245 Suppose basic regular expressions are given by the following grammar:\\
   245 			$r$  $::=$   $\ZERO$ $\mid$  $\ONE$   
   246 \[			r ::=   \ZERO \mid  \ONE
   246 			 $\mid$  $c$          
   247 			 \mid  c  
   247 			 $\mid$  $r_1 \cdot r_2$
   248 			 \mid  r_1 \cdot r_2
   248 			 $\mid$  $r_1 + r_2$   
   249 			 \mid  r_1 + r_2   
   249 			 $\mid$  $r^*$         
   250 			 \mid r^*         
   250 
   251 \]
   251 
       
   252 
   252 
   253 \noindent
   253 \noindent
   254 The intended meaning of the regular expressions is as usual: $\ZERO$
   254 The intended meaning of the regular expressions is as usual: $\ZERO$
   255 cannot match any string, $\ONE$ can match the empty string, the
   255 cannot match any string, $\ONE$ can match the empty string, the
   256 character regular expression $c$ can match the character $c$, and so
   256 character regular expression $c$ can match the character $c$, and so
   268 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
   268 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
   269 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
   269 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
   270 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
   270 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
   271 		\end{tabular}
   271 		\end{tabular}
   272 	\end{center}
   272 	\end{center}
   273 The function tests whether the empty string is in $L(r)$.
   273 This function simply tests whether the empty string is in $L(r)$.
   274 He then defined
   274 He then defined
   275 the following operation on regular expressions, written
   275 the following operation on regular expressions, written
   276 $r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):
   276 $r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):
   277 
   277 
   278 \begin{center}
   278 \begin{center}
   289 \end{tabular}
   289 \end{tabular}
   290 \end{center}
   290 \end{center}
   291 
   291 
   292 \noindent
   292 \noindent
   293 
   293 
       
   294 
       
   295 
       
   296  %Assuming the classic notion of a
       
   297 %\emph{language} of a regular expression, written $L(\_)$, t
       
   298 The main
       
   299 property of the derivative operation is that
       
   300 
       
   301 \begin{center}
       
   302 $c\!::\!s \in L(r)$ holds
       
   303 if and only if $s \in L(r\backslash c)$.
       
   304 \end{center}
       
   305 
       
   306 \noindent
   294  For us the main advantage is that derivatives can be
   307  For us the main advantage is that derivatives can be
   295 straightforwardly implemented in any functional programming language,
   308 straightforwardly implemented in any functional programming language,
   296 and are easily definable and reasoned about in theorem provers---the
   309 and are easily definable and reasoned about in theorem provers---the
   297 definitions just consist of inductive datatypes and simple recursive
   310 definitions just consist of inductive datatypes and simple recursive
   298 functions. Moreover, the notion of derivatives can be easily
   311 functions. Moreover, the notion of derivatives can be easily
   299 generalised to cover extended regular expression constructors such as
   312 generalised to cover extended regular expression constructors such as
   300 the not-regular expression, written $\neg\,r$, or bounded repetitions
   313 the not-regular expression, written $\neg\,r$, or bounded repetitions
   301 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
   314 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
   302 straightforwardly realised within the classic automata approach.
   315 straightforwardly realised within the classic automata approach.
   303 
   316 For the moment however, we focus only on the usual basic regular expressions.
   304 
   317 
   305 
   318 
   306  %Assuming the classic notion of a
   319 Now if we want to find out whether a string $s$
   307 %\emph{language} of a regular expression, written $L(\_)$, t
       
   308 The main
       
   309 property of the derivative operation is that
       
   310 
       
   311 \begin{center}
       
   312 $c\!::\!s \in L(r)$ holds
       
   313 if and only if $s \in L(r\backslash c)$.
       
   314 \end{center}
       
   315 
       
   316 \noindent
       
   317 So if we want to find out whether a string $s$
       
   318 matches with a regular expression $r$, build the derivatives of $r$
   320 matches with a regular expression $r$, build the derivatives of $r$
   319 w.r.t.\ (in succession) all the characters of the string $s$. Finally,
   321 w.r.t.\ (in succession) all the characters of the string $s$. Finally,
   320 test whether the resulting regular expression can match the empty
   322 test whether the resulting regular expression can match the empty
   321 string.  If yes, then $r$ matches $s$, and no in the negative
   323 string.  If yes, then $r$ matches $s$, and no in the negative
   322 case.
   324 case.
   323 
   325 
   324 We can generalise the derivative operation for strings like this:
   326 For this we can generalise the derivative operation for strings like this:
   325 \begin{center}
   327 \begin{center}
   326 \begin{tabular}{lcl}
   328 \begin{tabular}{lcl}
   327 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
   329 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
   328 $r \backslash \epsilon $ & $\dn$ & $r$
   330 $r \backslash \epsilon $ & $\dn$ & $r$
   329 \end{tabular}
   331 \end{tabular}
   332 Using the above definition we obtain a simple and elegant regular
   334 Using the above definition we obtain a simple and elegant regular
   333 expression matching algorithm: 
   335 expression matching algorithm: 
   334 \[
   336 \[
   335 match\;s\;r \;\dn\; nullable(r\backslash s)
   337 match\;s\;r \;\dn\; nullable(r\backslash s)
   336 \]
   338 \]
       
   339 This algorithm can be illustrated as follows:
       
   340 \begin{tikzcd}\label{graph:*} 
       
   341 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  
       
   342 \end{tikzcd}
       
   343 
       
   344 
   337 One limitation, however, of Brzozowski's algorithm is that it only
   345 One limitation, however, of Brzozowski's algorithm is that it only
   338 produces a YES/NO answer for whether a string is being matched by a
   346 produces a YES/NO answer for whether a string is being matched by a
   339 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
   347 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
   340 algorithm to allow generation of an actual matching, called a
   348 algorithm to allow generation of an actual matching, called a
   341 \emph{value}.  
   349 \emph{value}.  
   400 expression.
   408 expression.
   401 
   409 
   402 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   410 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   403 algorithm by a second phase (the first phase being building successive
   411 algorithm by a second phase (the first phase being building successive
   404 derivatives). In this second phase, for every successful match the
   412 derivatives). In this second phase, for every successful match the
   405 corresponding POSIX value is computed. The whole process looks like the following diagram(the working flow of the simple matching algorithm that just gives a $YES/NO$ answer is provided on the right for the purpose of comparison):\\
   413 corresponding POSIX value is computed. The whole process looks like the following diagram(the working flow of the simple matching algorithm that just gives a $YES/NO$ answer is given before \ref{graph:*}):\\
   406 \begin{tikzcd}
   414 \begin{tikzcd}
   407 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
   415 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
   408 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
   416 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
   409 \end{tikzcd}
   417 \end{tikzcd}
   410 \begin{tikzcd}
   418 \begin{tikzcd}