ninems/ninems.tex
changeset 35 f70e9ab4e680
parent 34 96bba9b493e9
child 36 97c9ac95194d
equal deleted inserted replaced
34:96bba9b493e9 35:f70e9ab4e680
     1  \documentclass[a4paper,UKenglish]{lipics}
     1  \documentclass[a4paper,UKenglish]{lipics}
     2 \usepackage{graphic}
     2 \usepackage{graphic}
     3 \usepackage{data}
     3 \usepackage{data}
     4 \usepackage{tikz-cd}
     4 \usepackage{tikz-cd}
       
     5 \usepackage{algorithm}
       
     6 \usepackage{amsmath}
       
     7 \usepackage[noend]{algpseudocode}
     5 % \documentclass{article}
     8 % \documentclass{article}
     6 %\usepackage[utf8]{inputenc}
     9 %\usepackage[utf8]{inputenc}
     7 %\usepackage[english]{babel}
    10 %\usepackage[english]{babel}
     8 %\usepackage{listings}
    11 %\usepackage{listings}
     9 % \usepackage{amsthm}
    12 % \usepackage{amsthm}
    41 %\theoremstyle{lemma}
    44 %\theoremstyle{lemma}
    42 %\newtheorem{lemma}{Lemma}
    45 %\newtheorem{lemma}{Lemma}
    43 %\newcommand{\lemmaautorefname}{Lemma}
    46 %\newcommand{\lemmaautorefname}{Lemma}
    44 %\theoremstyle{definition}
    47 %\theoremstyle{definition}
    45 %\newtheorem{definition}{Definition}
    48 %\newtheorem{definition}{Definition}
       
    49 \algnewcommand\algorithmicswitch{\textbf{switch}}
       
    50 \algnewcommand\algorithmiccase{\textbf{case}}
       
    51 \algnewcommand\algorithmicassert{\texttt{assert}}
       
    52 \algnewcommand\Assert[1]{\State \algorithmicassert(#1)}%
       
    53 % New "environments"
       
    54 \algdef{SE}[SWITCH]{Switch}{EndSwitch}[1]{\algorithmicswitch\ #1\ \algorithmicdo}{\algorithmicend\ \algorithmicswitch}%
       
    55 \algdef{SE}[CASE]{Case}{EndCase}[1]{\algorithmiccase\ #1}{\algorithmicend\ \algorithmiccase}%
       
    56 \algtext*{EndSwitch}%
       
    57 \algtext*{EndCase}%
    46 
    58 
    47 
    59 
    48 \begin{document}
    60 \begin{document}
    49 
    61 
    50 \maketitle
    62 \maketitle
   231 
   243 
   232 Suppose regular expressions are given by the following grammar:
   244 Suppose regular expressions are given by the following grammar:
   233 
   245 
   234 
   246 
   235 \begin{center}
   247 \begin{center}
   236 
   248 %TODO
   237 		\begin{tabular}{@{}rrl@{}}
   249 		\begin{tabular}{@{}rrl@{}}
   238 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
   250 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
   239 			$r$ & $::=$  & $\ZERO$\\
   251 			$r$ & $::=$  & $\ZERO$\\
   240 			& $\mid$ & $\ONE$   \\
   252 			& $\mid$ & $\ONE$   \\
   241 			& $\mid$ & $c$          \\
   253 			& $\mid$ & $c$          \\
   242 			& $\mid$ & $r_1 \cdot r_2$\\
   254 			& $\mid$ & $r_1 \cdot r_2$\\
   243 			& $\mid$ & $r_1 + r_2$   \\
   255 			& $\mid$ & $r_1 + r_2$   \\
   244 			\\
   256 			
   245 			& $\mid$ & $r^*$         \\
   257 			& $\mid$ & $r^*$         
   246 		\end{tabular}
   258 		\end{tabular}
   247 
   259 
   248 \end{center}
   260 \end{center}
   249 
   261 
   250 \noindent
   262 \noindent
   313 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
   325 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
   314 $r \backslash \epsilon $ & $\dn$ & $r$
   326 $r \backslash \epsilon $ & $\dn$ & $r$
   315 \end{tabular}
   327 \end{tabular}
   316 \end{center}
   328 \end{center}
   317 
   329 
   318 
       
   319 We obtain a simple and elegant regular
       
   320 expression matching algorithm: 
       
   321 \begin{definition}{matcher}
       
   322 \[
       
   323 match\;s\;r \;\dn\; nullable(r\backslash s)
       
   324 \]
       
   325 \end{definition}
       
   326 
       
   327  For us the main advantage is that derivatives can be
   330  For us the main advantage is that derivatives can be
   328 straightforwardly implemented in any functional programming language,
   331 straightforwardly implemented in any functional programming language,
   329 and are easily definable and reasoned about in theorem provers---the
   332 and are easily definable and reasoned about in theorem provers---the
   330 definitions just consist of inductive datatypes and simple recursive
   333 definitions just consist of inductive datatypes and simple recursive
   331 functions. Moreover, the notion of derivatives can be easily
   334 functions. Moreover, the notion of derivatives can be easily
   332 generalised to cover extended regular expression constructors such as
   335 generalised to cover extended regular expression constructors such as
   333 the not-regular expression, written $\neg\,r$, or bounded repetitions
   336 the not-regular expression, written $\neg\,r$, or bounded repetitions
   334 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
   337 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
   335 straightforwardly realised within the classic automata approach.
   338 straightforwardly realised within the classic automata approach.
       
   339 
       
   340 
       
   341 We obtain a simple and elegant regular
       
   342 expression matching algorithm: 
       
   343 \begin{definition}{matcher}
       
   344 \[
       
   345 match\;s\;r \;\dn\; nullable(r\backslash s)
       
   346 \]
       
   347 \end{definition}
       
   348 
   336 
   349 
   337 
   350 
   338 One limitation, however, of Brzozowski's algorithm is that it only
   351 One limitation, however, of Brzozowski's algorithm is that it only
   339 produces a YES/NO answer for whether a string is being matched by a
   352 produces a YES/NO answer for whether a string is being matched by a
   340 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
   353 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
   366 		\end{tabular}
   379 		\end{tabular}
   367 	\end{tabular}
   380 	\end{tabular}
   368 \end{center}
   381 \end{center}
   369 
   382 
   370 \noindent
   383 \noindent
   371  Here we put the regular expression and values of the same shape on the same level to illustrate the corresponding relation between them. \\
   384  Here we put the regular expression and values of the same shape on the same level to illustrate the corresponding relation between them. 
   372  Values are a way of expressing parse trees(the tree structure that tells how a sub-regex matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. The flatten notation $| v |$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$.\\
   385  
       
   386  Values are a way of expressing parse trees(the tree structure that tells how a sub-regex matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. The flatten notation $| v |$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$.
   373 
   387 
   374  To give a concrete example of how value works, consider the string $xy$ and the
   388  To give a concrete example of how value works, consider the string $xy$ and the
   375 regular expression $(x + (y + xy))^*$. We can view this regular
   389 regular expression $(x + (y + xy))^*$. We can view this regular
   376 expression as a tree and if the string $xy$ is matched by two Star
   390 expression as a tree and if the string $xy$ is matched by two Star
   377 ``iterations'', then the $x$ is matched by the left-most alternative
   391 ``iterations'', then the $x$ is matched by the left-most alternative
   399 expression.
   413 expression.
   400 
   414 
   401 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   415 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   402 algorithm by a second phase (the first phase being building successive
   416 algorithm by a second phase (the first phase being building successive
   403 derivatives). In this second phase, for every successful match the
   417 derivatives). In this second phase, for every successful match the
   404 corresponding POSIX value is computed. The whole process looks like this diagram:\\
   418 corresponding POSIX value is computed. The whole process looks like the following diagram(the working flow of the simple matching algorithm that just gives a $YES/NO$ answer is provided on the right for the purpose of comparison):\\
   405 \begin{tikzcd}
   419 \begin{tikzcd}
   406 r_0 \arrow[r, "c_0"]  \arrow[d] & r_1 \arrow[r, "c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
   420 r_0 \arrow[r, "c_0"]  \arrow[d] & r_1 \arrow[r, "c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
   407 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
   421 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
   408 \end{tikzcd}
   422 \end{tikzcd}
       
   423 
   409 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$.
   424 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$.
   410 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ...  until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for the empty word epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string:
   425 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ...  until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for the empty word epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string:
   411 
   426 
   412  After this, we inject back the characters one by one, in reverse order as they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$($s_i$ means the string s with the first $i$ characters being chopped off) from the previous parse tree. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
   427  After this, we inject back the characters one by one, in reverse order as they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$($s_i$ means the string s with the first $i$ characters being chopped off) from the previous parse tree. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
   413 An inductive proof can be routinely established.
   428 An inductive proof can be routinely established.
   442 initial results in this regard have been obtained in
   457 initial results in this regard have been obtained in
   443 \cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet
   458 \cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet
   444 is a very tight bound for the size. Such a tight bound is suggested by
   459 is a very tight bound for the size. Such a tight bound is suggested by
   445 work of Antimirov who proved that (partial) derivatives can be bound
   460 work of Antimirov who proved that (partial) derivatives can be bound
   446 by the number of characters contained in the initial regular
   461 by the number of characters contained in the initial regular
   447 expression \cite{Antimirov95}. We believe, and have generated test
   462 expression \cite{Antimirov95}.
   448 data, that a similar bound can be obtained for the derivatives in
   463 
   449 Sulzmann and Lu's algorithm. Let us give some details about this next.
   464 Antimirov defined the "partial derivatives" of regular expressions to be this:
       
   465 %TODO definition of partial derivatives
       
   466 
       
   467 it is essentially a set of regular expressions that come from the sub-structure of the original regular expression. 
       
   468 Antimirov has proved a nice size bound of the size of partial derivatives. Roughly speaking the size will not exceed the fourth power of the number of nodes in that regular expression.  Interestingly, we observed from experiment that after the simplification step, our regular expression has the same size or is smaller than the partial derivatives. This allows us to prove a tight bound on the size of regular expression during the running time of the algorithm if we can establish the connection between our simplification rules and partial derivatives.
       
   469 
       
   470  %We believe, and have generated test
       
   471 %data, that a similar bound can be obtained for the derivatives in
       
   472 %Sulzmann and Lu's algorithm. Let us give some details about this next.
   450 
   473 
   451 We first followed Sulzmann and Lu's idea of introducing
   474 We first followed Sulzmann and Lu's idea of introducing
   452 \emph{annotated regular expressions}~\cite{Sulzmann2014}. They are
   475 \emph{annotated regular expressions}~\cite{Sulzmann2014}. They are
   453 defined by the following grammar:
   476 defined by the following grammar:
   454 
   477 
   514 is to allow a list of annotated regular expressions in the
   537 is to allow a list of annotated regular expressions in the
   515 \textit{ALTS} constructor. This allows us to not just delete
   538 \textit{ALTS} constructor. This allows us to not just delete
   516 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
   539 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
   517 unnecessary ``copies'' of regular expressions (very similar to
   540 unnecessary ``copies'' of regular expressions (very similar to
   518 simplifying $r + r$ to just $r$, but in a more general
   541 simplifying $r + r$ to just $r$, but in a more general
   519 setting). Another modification is that we use simplification rules
   542 setting). 
       
   543 A psuedocode version of our algorithm is given below:\\
       
   544 
       
   545 \begin{algorithm}
       
   546 \caption{simplification of annotated regular expression}\label{euclid}
       
   547 \begin{algorithmic}[1]
       
   548 \Procedure{$Simp$}{$areg$}
       
   549 \Switch{$areg$}
       
   550 	\Case{$ALTS(bs, rs)$}
       
   551 		\For{\textit{rs[i] in array rs}}
       
   552         			\State $\textit{rs[i]} \gets$ \textit{Simp(rs[i])}
       
   553       		\EndFor
       
   554 		\For{\textit{rs[i] in array rs}}
       
   555         			\If{$rs[i] == ALTS(bs', rs')$}
       
   556 				\State $rs'' \gets$ attach bits $bs'$ to all elements in $rs'$
       
   557 				\State Insert $rs''$ into $rs$ at position $i$ ($rs[i]$ is destroyed, replaced by its list of children regular expressions)
       
   558 			\EndIf
       
   559       		\EndFor
       
   560 		\State Remove all duplicates in $rs$, only keeping the first copy for multiple occurrences of the same regular expression
       
   561 		\State Remove all $0$s in $rs$
       
   562 		\If{$ rs.length == 0$} \Return $ZERO$ \EndIf
       
   563 		\If {$ rs.length == 1$} \Return$ rs[0] $\EndIf
       
   564 	\EndCase
       
   565 	\Case{$SEQ(bs, r_1, r_2)$}
       
   566 		\If{$ r_1$ or $r_2$ is $ZERO$} \Return ZERO \EndIf
       
   567 		\State update $r_1$ and $r_2$ by attaching $bs$ to their front
       
   568 		\If {$r_1$ or $r_2$ is $ONE(bs')$} \Return $r_2$ or $r_1$ \EndIf
       
   569 	\EndCase
       
   570 	\Case{$Others$}
       
   571 		\Return $areg$ as it is
       
   572 	\EndCase
       
   573 \EndSwitch
       
   574 \EndProcedure
       
   575 \end{algorithmic}
       
   576 \end{algorithm}
       
   577 
       
   578 
       
   579 Another modification is that we use simplification rules
   520 inspired by Antimirov's work on partial derivatives. They maintain the
   580 inspired by Antimirov's work on partial derivatives. They maintain the
   521 idea that only the first ``copy'' of a regular expression in an
   581 idea that only the first ``copy'' of a regular expression in an
   522 alternative contributes to the calculation of a POSIX value. All
   582 alternative contributes to the calculation of a POSIX value. All
   523 subsequent copies can be pruned from the regular expression.
   583 subsequent copies can be pruned from the regular expression.
       
   584 
   524 
   585 
   525 We are currently engaged with proving that our simplification rules
   586 We are currently engaged with proving that our simplification rules
   526 actually do not affect the POSIX value that should be generated by the
   587 actually do not affect the POSIX value that should be generated by the
   527 algorithm according to the specification of a POSIX value and
   588 algorithm according to the specification of a POSIX value and
   528 furthermore that our derivatives stay small for all derivatives. For
   589 furthermore that our derivatives stay small for all derivatives. For