ninems/ninems.tex
changeset 35 f70e9ab4e680
parent 34 96bba9b493e9
child 36 97c9ac95194d
--- a/ninems/ninems.tex	Sun Jun 30 22:23:52 2019 +0100
+++ b/ninems/ninems.tex	Mon Jul 01 23:42:39 2019 +0100
@@ -2,6 +2,9 @@
 \usepackage{graphic}
 \usepackage{data}
 \usepackage{tikz-cd}
+\usepackage{algorithm}
+\usepackage{amsmath}
+\usepackage[noend]{algpseudocode}
 % \documentclass{article}
 %\usepackage[utf8]{inputenc}
 %\usepackage[english]{babel}
@@ -43,6 +46,15 @@
 %\newcommand{\lemmaautorefname}{Lemma}
 %\theoremstyle{definition}
 %\newtheorem{definition}{Definition}
+\algnewcommand\algorithmicswitch{\textbf{switch}}
+\algnewcommand\algorithmiccase{\textbf{case}}
+\algnewcommand\algorithmicassert{\texttt{assert}}
+\algnewcommand\Assert[1]{\State \algorithmicassert(#1)}%
+% New "environments"
+\algdef{SE}[SWITCH]{Switch}{EndSwitch}[1]{\algorithmicswitch\ #1\ \algorithmicdo}{\algorithmicend\ \algorithmicswitch}%
+\algdef{SE}[CASE]{Case}{EndCase}[1]{\algorithmiccase\ #1}{\algorithmicend\ \algorithmiccase}%
+\algtext*{EndSwitch}%
+\algtext*{EndCase}%
 
 
 \begin{document}
@@ -233,7 +245,7 @@
 
 
 \begin{center}
-
+%TODO
 		\begin{tabular}{@{}rrl@{}}
 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
 			$r$ & $::=$  & $\ZERO$\\
@@ -241,8 +253,8 @@
 			& $\mid$ & $c$          \\
 			& $\mid$ & $r_1 \cdot r_2$\\
 			& $\mid$ & $r_1 + r_2$   \\
-			\\
-			& $\mid$ & $r^*$         \\
+			
+			& $\mid$ & $r^*$         
 		\end{tabular}
 
 \end{center}
@@ -315,6 +327,16 @@
 \end{tabular}
 \end{center}
 
+ For us the main advantage is that derivatives can be
+straightforwardly implemented in any functional programming language,
+and are easily definable and reasoned about in theorem provers---the
+definitions just consist of inductive datatypes and simple recursive
+functions. Moreover, the notion of derivatives can be easily
+generalised to cover extended regular expression constructors such as
+the not-regular expression, written $\neg\,r$, or bounded repetitions
+(for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
+straightforwardly realised within the classic automata approach.
+
 
 We obtain a simple and elegant regular
 expression matching algorithm: 
@@ -324,15 +346,6 @@
 \]
 \end{definition}
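This matcher can be sketched directly in any functional style. Below is a minimal Python version; the tuple encoding of regular expressions is our own choice for illustration, not the paper's actual development:

```python
# Brzozowski's matcher. Regular expressions are encoded as tuples:
# ("zero",), ("one",), ("char", c), ("alt", r1, r2),
# ("seq", r1, r2), ("star", r).

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag in ("one", "star"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    return False  # zero, char

def der(c, r):
    """The derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ("zero", "one"):
        return ("zero",)
    if tag == "char":
        return ("one",) if r[1] == c else ("zero",)
    if tag == "alt":
        return ("alt", der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        left = ("seq", der(c, r[1]), r[2])
        return ("alt", left, der(c, r[2])) if nullable(r[1]) else left
    if tag == "star":
        return ("seq", der(c, r[1]), r)

def matches(s, r):
    """Derive r by each character of s, then test nullability."""
    for c in s:
        r = der(c, r)
    return nullable(r)
```

For example, with $(x + (y + xy))^*$ encoded as nested tuples, `matches("xy", r)` returns `True` while `matches("xz", r)` returns `False`.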
 
- For us the main advantage is that derivatives can be
-straightforwardly implemented in any functional programming language,
-and are easily definable and reasoned about in theorem provers---the
-definitions just consist of inductive datatypes and simple recursive
-functions. Moreover, the notion of derivatives can be easily
-generalised to cover extended regular expression constructors such as
-the not-regular expression, written $\neg\,r$, or bounded repetitions
-(for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
-straightforwardly realised within the classic automata approach.
 
 
 One limitation, however, of Brzozowski's algorithm is that it only
@@ -368,8 +381,9 @@
 \end{center}
 
 \noindent
- Here we put the regular expression and values of the same shape on the same level to illustrate the corresponding relation between them. \\
- Values are a way of expressing parse trees(the tree structure that tells how a sub-regex matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. The flatten notation $| v |$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$.\\
+ Here we place regular expressions and values of the same shape on the same level to illustrate the correspondence between them. 
+ 
+ Values are a way of expressing parse trees (the tree structure that records how a sub-regular-expression matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regular expression $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched is recorded in the sub-structure of $v_1$ and $v_2$. The flattening notation $| v |$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, \mathit{Char}(c) \, \mathit{Char}(d)| = cd$.
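The flattening operation can be sketched as a simple recursive function; the tuple encoding of values below is assumed for illustration only:

```python
# Values mirror the shape of regular expressions and are encoded as:
# ("empty",) for Empty, ("chr", c) for Char(c), ("seq", v1, v2),
# ("left", v), ("right", v), and ("stars", [v1, ..., vn]).

def flatten(v):
    """The flattening |v|: extract the matched string from a value."""
    tag = v[0]
    if tag == "empty":
        return ""
    if tag == "chr":
        return v[1]
    if tag == "seq":
        return flatten(v[1]) + flatten(v[2])
    if tag in ("left", "right"):
        return flatten(v[1])
    if tag == "stars":
        return "".join(flatten(u) for u in v[1])
```

So `flatten(("seq", ("chr", "c"), ("chr", "d")))` yields `"cd"`, matching the example $|\mathit{Seq} \, \mathit{Char}(c) \, \mathit{Char}(d)| = cd$.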
 
 To give a concrete example of how values work, consider the string $xy$ and the
 regular expression $(x + (y + xy))^*$. We can view this regular
@@ -401,11 +415,12 @@
 The contribution of Sulzmann and Lu is an extension of Brzozowski's
 algorithm by a second phase (the first phase being building successive
 derivatives). In this second phase, for every successful match the
-corresponding POSIX value is computed. The whole process looks like this diagram:\\
+corresponding POSIX value is computed. The whole process is depicted in the following diagram (the workflow of the simple matching algorithm, which gives just a $\mathit{yes}/\mathit{no}$ answer, is included alongside for comparison):\\
 \begin{tikzcd}
 r_0 \arrow[r, "c_0"]  \arrow[d] & r_1 \arrow[r, "c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
 \end{tikzcd}
+
We shall briefly explain this process.\\ For convenience we use the following notation: the regular expression $r$ used for matching is also called $r_0$, and the string $s$ is composed of $n$ characters $c_0 c_1 \ldots c_{n-1}$.
First, we build the successive derivatives $r_1$, $r_2$, \ldots\ by deriving $r_0$ with respect to the characters $c_0$, $c_1$, \ldots\ until we obtain the final derivative $r_n$. We then test whether $r_n$ is $nullable$. If it is not, we know immediately that the string does not match the regular expression. If it is, we start building the parse tree incrementally. We first call $mkeps$ (which stands for ``make epsilon''---it makes the parse tree for the empty word epsilon) to construct the parse tree $v_n$ recording how the final derivative $r_n$ matches the empty string:
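The two phases can be sketched as follows in Python (the names $mkeps$ and $inj$ follow the text; the tuple encodings of regular expressions and values are assumptions made for this sketch, not the authors' actual formalisation):

```python
# Regular expressions: ("zero",), ("one",), ("char", c),
# ("alt", r1, r2), ("seq", r1, r2), ("star", r).
# Values: ("empty",), ("chr", c), ("left", v), ("right", v),
# ("seq", v1, v2), ("stars", [v, ...]).

def nullable(r):
    tag = r[0]
    if tag in ("one", "star"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    return False  # zero, char

def der(c, r):
    tag = r[0]
    if tag in ("zero", "one"):
        return ("zero",)
    if tag == "char":
        return ("one",) if r[1] == c else ("zero",)
    if tag == "alt":
        return ("alt", der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        left = ("seq", der(c, r[1]), r[2])
        return ("alt", left, der(c, r[2])) if nullable(r[1]) else left
    if tag == "star":
        return ("seq", der(c, r[1]), r)

def mkeps(r):
    """Parse tree for how a nullable r matches the empty string."""
    tag = r[0]
    if tag == "one":
        return ("empty",)
    if tag == "alt":
        return ("left", mkeps(r[1])) if nullable(r[1]) else ("right", mkeps(r[2]))
    if tag == "seq":
        return ("seq", mkeps(r[1]), mkeps(r[2]))
    if tag == "star":
        return ("stars", [])

def inj(r, c, v):
    """Transform a value for der(c, r) back into a value for r."""
    tag = r[0]
    if tag == "char":
        return ("chr", c)
    if tag == "alt":
        branch = r[1] if v[0] == "left" else r[2]
        return (v[0], inj(branch, c, v[1]))
    if tag == "seq":
        if v[0] == "seq":
            return ("seq", inj(r[1], c, v[1]), v[2])
        if v[0] == "left":  # v = Left(Seq(v1, v2)): c went into r1
            return ("seq", inj(r[1], c, v[1][1]), v[1][2])
        # v = Right(v2): r1 matched the empty string, c went into r2
        return ("seq", mkeps(r[1]), inj(r[2], c, v[1]))
    if tag == "star":  # v = Seq(v1, Stars vs)
        return ("stars", [inj(r[1], c, v[1])] + v[2][1])

def lexer(r, s):
    """Phase 1: derivatives; phase 2: mkeps, then inj back step by step."""
    if not s:
        return mkeps(r) if nullable(r) else None
    v = lexer(der(s[0], r), s[1:])
    return inj(r, s[0], v) if v is not None else None
```

On the running example $(x + (y + xy))^*$ and the string $xy$, this produces the value $\Stars\,[\Right(\Right(\Seq(\Char(x), \Char(y))))]$, i.e. the whole string is matched in one iteration by the $xy$ alternative.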
 
@@ -444,9 +459,17 @@
 is a very tight bound for the size. Such a tight bound is suggested by
 work of Antimirov who proved that (partial) derivatives can be bound
 by the number of characters contained in the initial regular
-expression \cite{Antimirov95}. We believe, and have generated test
-data, that a similar bound can be obtained for the derivatives in
-Sulzmann and Lu's algorithm. Let us give some details about this next.
+expression \cite{Antimirov95}.
+
+Antimirov defined the ``partial derivatives'' of regular expressions as follows:
+%TODO definition of partial derivatives
+
+Intuitively, a partial derivative is a set of regular expressions built from the sub-structure of the original regular expression.
+Antimirov proved a bound on the size of partial derivatives: roughly speaking, their size will not exceed the fourth power of the number of nodes in the original regular expression. Interestingly, we observed in experiments that after the simplification step our regular expressions are of the same size as, or smaller than, the partial derivatives. This would allow us to prove a tight bound on the size of the regular expressions during the running of the algorithm, provided we can establish a connection between our simplification rules and partial derivatives.
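A sketch of Antimirov's partial derivatives, which return a *set* of regular expressions rather than a single one (the tuple encoding is assumed for illustration):

```python
# Partial derivatives a la Antimirov. Regular expressions are tuples:
# ("zero",), ("one",), ("char", c), ("alt", r1, r2),
# ("seq", r1, r2), ("star", r). Tuples are hashable, so sets work.

def nullable(r):
    tag = r[0]
    if tag in ("one", "star"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    return False  # zero, char

def pder(c, r):
    """The set of partial derivatives of r with respect to c."""
    tag = r[0]
    if tag in ("zero", "one"):
        return set()
    if tag == "char":
        return {("one",)} if r[1] == c else set()
    if tag == "alt":
        return pder(c, r[1]) | pder(c, r[2])
    if tag == "seq":
        res = {("seq", r1p, r[2]) for r1p in pder(c, r[1])}
        return res | pder(c, r[2]) if nullable(r[1]) else res
    if tag == "star":
        return {("seq", rp, r) for rp in pder(c, r[1])}
```

Note how an alternative dissolves into a set union, so no $\mathit{ALT}$ nodes accumulate; this is the structural reason the set of partial derivatives stays small, and it mirrors what our simplification rules achieve on annotated regular expressions.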
+
+ %We believe, and have generated test
+%data, that a similar bound can be obtained for the derivatives in
+%Sulzmann and Lu's algorithm. Let us give some details about this next.
 
 We first followed Sulzmann and Lu's idea of introducing
 \emph{annotated regular expressions}~\cite{Sulzmann2014}. They are
@@ -516,12 +539,50 @@
 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
 unnecessary ``copies'' of regular expressions (very similar to
 simplifying $r + r$ to just $r$, but in a more general
-setting). Another modification is that we use simplification rules
+setting). 
+A pseudocode version of our simplification algorithm is given below:\\
+
+\begin{algorithm}
+\caption{Simplification of annotated regular expressions}\label{simp}
+\begin{algorithmic}[1]
+\Procedure{$Simp$}{$areg$}
+\Switch{$areg$}
+	\Case{$ALTS(bs, rs)$}
+		\For{\textit{rs[i] in array rs}}
+        			\State $\textit{rs[i]} \gets$ \textit{Simp(rs[i])}
+      		\EndFor
+		\For{\textit{rs[i] in array rs}}
+        			\If{$rs[i] == ALTS(bs', rs')$}
+				\State $rs'' \gets$ attach bits $bs'$ to all elements in $rs'$
+				\State Insert $rs''$ into $rs$ at position $i$ ($rs[i]$ is destroyed, replaced by its list of children regular expressions)
+			\EndIf
+      		\EndFor
+		\State Remove all duplicates in $rs$, only keeping the first copy for multiple occurrences of the same regular expression
+		\State Remove all $ZERO$s in $rs$
+		\If{$rs.length == 0$} \Return $ZERO$ \EndIf
+		\If{$rs.length == 1$} \Return $rs[0]$ \EndIf
+	\EndCase
+	\Case{$SEQ(bs, r_1, r_2)$}
+		\If{$r_1$ or $r_2$ is $ZERO$} \Return $ZERO$ \EndIf
+		\State update $r_1$ and $r_2$ by attaching $bs$ to their front
+		\If{$r_1$ or $r_2$ is $ONE(bs')$} \Return the other one, with $bs'$ attached to its front \EndIf
+	\EndCase
+	\Case{$Others$}
+		\Return $areg$ as it is
+	\EndCase
+\EndSwitch
+\EndProcedure
+\end{algorithmic}
+\end{algorithm}
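The structure of this simplification can be sketched in Python as follows; the bit-annotations are omitted here, and $n$-ary alternatives are encoded as a tuple of children so that regular expressions stay hashable (both choices are simplifications for this sketch):

```python
# Simplification without bitcodes. Encoding:
# ("zero",), ("one",), ("char", c), ("seq", r1, r2), ("star", r),
# and n-ary alternatives ("alts", (r1, ..., rn)).

def simp(r):
    tag = r[0]
    if tag == "alts":
        rs = []
        for child in map(simp, r[1]):
            if child[0] == "alts":       # splice nested alternatives in
                rs.extend(child[1])
            elif child[0] != "zero":     # drop ZEROs
                rs.append(child)
        seen, uniq = set(), []
        for x in rs:                     # keep only the first copy
            if x not in seen:
                seen.add(x)
                uniq.append(x)
        if not uniq:
            return ("zero",)
        if len(uniq) == 1:
            return uniq[0]
        return ("alts", tuple(uniq))
    if tag == "seq":
        r1, r2 = simp(r[1]), simp(r[2])
        if r1[0] == "zero" or r2[0] == "zero":
            return ("zero",)             # ZERO annihilates a sequence
        if r1[0] == "one":
            return r2                    # ONE is the unit of sequence
        if r2[0] == "one":
            return r1
        return ("seq", r1, r2)
    return r                             # all other cases unchanged
```

For instance, `simp` collapses `("alts", (("alts", (a, ("zero",))), a))` all the way down to `a`: the nested alternative is flattened, the $ZERO$ is removed, and the duplicate copy of `a` is pruned, keeping only the first.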
+
+
+Another modification is that we use simplification rules
 inspired by Antimirov's work on partial derivatives. They maintain the
 idea that only the first ``copy'' of a regular expression in an
 alternative contributes to the calculation of a POSIX value. All
 subsequent copies can be pruned from the regular expression.
 
+
 We are currently engaged with proving that our simplification rules
 actually do not affect the POSIX value that should be generated by the
 algorithm according to the specification of a POSIX value and