ninems/ninems.tex
changeset 40 8063792920ef
parent 39 7d18745dd7c9
child 41 a1f90febbc7f
equal deleted inserted replaced
39:7d18745dd7c9 40:8063792920ef
    63 
    63 
    64 \begin{abstract}
    64 \begin{abstract}
    65   Brzozowski introduced in 1964 a beautifully simple algorithm for
    65   Brzozowski introduced in 1964 a beautifully simple algorithm for
    66   regular expression matching based on the notion of derivatives of
    66   regular expression matching based on the notion of derivatives of
    67   regular expressions. In 2014, Sulzmann and Lu extended this
    67   regular expressions. In 2014, Sulzmann and Lu extended this
    68   algorithm to not just give a YES/NO answer for whether or not a regular
    68   algorithm to not just give a YES/NO answer for whether or not a
    69   expression matches a string, but in case it matches also \emph{how}
    69   regular expression matches a string, but in case it matches also
    70   it matches the string.  This is important for applications such as
    70   \emph{how} it matches the string.  This is important for
    71   lexing (tokenising a string). The problem is to make the algorithm
    71   applications such as lexing (tokenising a string). The problem is to
    72   by Sulzmann and Lu fast on all inputs without breaking its
    72   make the algorithm by Sulzmann and Lu fast on all inputs without
    73   correctness. We have already developed some simplification rules, but have not shown that they 
    73   breaking its correctness. We have already developed some
    74   preserve the correctness. We also have not yet looked at extended regular expressions.
    74   simplification rules for this, but have not proved that they
       
    75   preserve the correctness of the algorithm. We also have not yet
       
    76   looked at extended regular expressions, such as bounded repetitions,
       
    77   negation and back-references.
    75 \end{abstract}
    78 \end{abstract}
    76 
    79 
    77 \section{Introduction}
    80 \section{Introduction}
    78 
    81 
    79 This PhD-project is about regular expression matching and
    82 This PhD-project is about regular expression matching and
   240 regular expressions \cite{Brzozowski1964}. We shall briefly explain
   243 regular expressions \cite{Brzozowski1964}. We shall briefly explain
   241 the algorithms next.
   244 the algorithms next.
   242 
   245 
   243 \section{The Algorithms by  Brzozowski, and Sulzmann and Lu}
   246 \section{The Algorithms by  Brzozowski, and Sulzmann and Lu}
   244 
   247 
   245 Suppose basic regular expressions are given by the following grammar:\\
   248 Suppose (basic) regular expressions are given by the following grammar:
   246 \[			r ::=   \ZERO \mid  \ONE
   249 \[			r ::=   \ZERO \mid  \ONE
   247 			 \mid  c  
   250 			 \mid  c  
   248 			 \mid  r_1 \cdot r_2
   251 			 \mid  r_1 \cdot r_2
   249 			 \mid  r_1 + r_2   
   252 			 \mid  r_1 + r_2   
   250 			 \mid r^*         
   253 			 \mid r^*         
   251 \]
   254 \]
   252 
   255 
   253 \noindent
   256 \noindent
   254 The intended meaning of the regular expressions is as usual: $\ZERO$
   257 The intended meaning of the constructors is as usual: $\ZERO$
   255 cannot match any string, $\ONE$ can match the empty string, the
   258 cannot match any string, $\ONE$ can match the empty string, the
   256 character regular expression $c$ can match the character $c$, and so
   259 character regular expression $c$ can match the character $c$, and so
   257 on. The brilliant contribution by Brzozowski is the notion of
   260 on.
       
   261 
       
   262 The brilliant contribution by Brzozowski is the notion of
   258 \emph{derivatives} of regular expressions.  The idea behind this
   263 \emph{derivatives} of regular expressions.  The idea behind this
   259 notion is as follows: suppose a regular expression $r$ can match a
   264 notion is as follows: suppose a regular expression $r$ can match a
   260 string of the form $c\!::\! s$ (that is a list of characters starting
   265 string of the form $c\!::\! s$ (that is a list of characters starting
   261 with $c$), what does the regular expression look like that can match
   266 with $c$), what does the regular expression look like that can match
   262 just $s$? Brzozowski gave a neat answer to this question. He started with the definition of $nullable$:
   267 just $s$? Brzozowski gave a neat answer to this question. He started
       
   268 with the definition of $nullable$:
   263 \begin{center}
   269 \begin{center}
   264 		\begin{tabular}{lcl}
   270 		\begin{tabular}{lcl}
   265 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
   271 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
   266 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
   272 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
   267 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
   273 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
   293 
   299 
   294 
   300 
   295 
   301 
   296  %Assuming the classic notion of a
   302  %Assuming the classic notion of a
   297 %\emph{language} of a regular expression, written $L(\_)$, t
   303 %\emph{language} of a regular expression, written $L(\_)$, t
   298 The main
   304 \noindent
   299 property of the derivative operation is that
   305 The main property of the derivative operation is that
   300 
   306 
   301 \begin{center}
   307 \begin{center}
   302 $c\!::\!s \in L(r)$ holds
   308 $c\!::\!s \in L(r)$ holds
   303 if and only if $s \in L(r\backslash c)$.
   309 if and only if $s \in L(r\backslash c)$.
   304 \end{center}
   310 \end{center}
   314 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
   320 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
   315 straightforwardly realised within the classic automata approach.
   321 straightforwardly realised within the classic automata approach.
   316 For the moment however, we focus only on the usual basic regular expressions.
   322 For the moment however, we focus only on the usual basic regular expressions.
   317 
   323 
   318 
   324 
   319 Now if we want to find out whether a string $s$
   325 Now if we want to find out whether a string $s$ matches with a regular
   320 matches with a regular expression $r$, build the derivatives of $r$
   326 expression $r$, build the derivatives of $r$ w.r.t.\ (in succession)
   321 w.r.t.\ (in succession) all the characters of the string $s$. Finally,
   327 all the characters of the string $s$. Finally, test whether the
   322 test whether the resulting regular expression can match the empty
   328 resulting regular expression can match the empty string.  If yes, then
   323 string.  If yes, then $r$ matches $s$, and no in the negative
   329 $r$ matches $s$, and no in the negative case. To implement this idea
   324 case.
   330 we can generalise the derivative operation to strings like this:
   325 
       
   326 For this we can generalise the derivative operation for strings like this:
       
   327 \begin{center}
   331 \begin{center}
   328 \begin{tabular}{lcl}
   332 \begin{tabular}{lcl}
   329 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
   333 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
   330 $r \backslash \epsilon $ & $\dn$ & $r$
   334 $r \backslash [\,] $ & $\dn$ & $r$
   331 \end{tabular}
   335 \end{tabular}
   332 \end{center}
   336 \end{center}
   333 \noindent
   337 
   334 Using the above definition we obtain a simple and elegant regular
   338 \noindent
       
   339 Using this definition we obtain a simple and elegant regular
   335 expression matching algorithm: 
   340 expression matching algorithm: 
   336 \[
   341 \[
   337 match\;s\;r \;\dn\; nullable(r\backslash s)
   342 match\;s\;r \;\dn\; nullable(r\backslash s)
   338 \]
   343 \]
   339 This algorithm can be illustrated as follows:
   344 
       
   345 \noindent
       
   346 Pictorially this algorithm can be illustrated as follows:
       
   347 
       
   348 \begin{center}
   340 \begin{tikzcd}\label{graph:*} 
   349 \begin{tikzcd}\label{graph:*} 
   341 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"nullable?"] & Yes/No
   350 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"nullable?"] & Yes/No
   342 \end{tikzcd}
   351 \end{tikzcd}
   343 where we bnuild the successive derivative until we exhaust the string.
   352 \end{center}
       
   353 
       
   354 \noindent
       
   355 where we start with  a regular expression  $r_0$, build successive derivatives
       
   356 until we exhaust the string and then use \textit{nullable} to test whether the
       
   357 result can match the empty string. It can  be relatively  easily shown that this
       
   358 matcher is correct.
   344 
   359 
   345 One limitation, however, of Brzozowski's algorithm is that it only
   360 One limitation, however, of Brzozowski's algorithm is that it only
   346 produces a YES/NO answer for whether a string is being matched by a
   361 produces a YES/NO answer for whether a string is being matched by a
   347 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
   362 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
   348 algorithm to allow generation of an actual matching, called a
   363 algorithm to allow generation of an actual matching, called a
   349 \emph{value}.  
   364 \emph{value}.
   350 
   365 
   351 \begin{center}
   366 \begin{center}
   352 	\begin{tabular}{c@{\hspace{20mm}}c}
   367 	\begin{tabular}{c@{\hspace{20mm}}c}
   353 		\begin{tabular}{@{}rrl@{}}
   368 		\begin{tabular}{@{}rrl@{}}
   354 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
   369 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\