cst_tests: comparison ninems/ninems.tex

equal deleted inserted replaced

-:96bba9b493e9
+:f70e9ab4e680
 \documentclass[a4paper,UKenglish]{lipics}
 \usepackage{graphic}
 \usepackage{data}
 \usepackage{tikz-cd}
+\usepackage{algorithm}
+\usepackage{amsmath}
+\usepackage[noend]{algpseudocode}
 % \documentclass{article}
 %\usepackage[utf8]{inputenc}
 %\usepackage[english]{babel}
 %\usepackage{listings}
 % \usepackage{amsthm}
 %\theoremstyle{lemma}
 %\newtheorem{lemma}{Lemma}
 %\newcommand{\lemmaautorefname}{Lemma}
 %\theoremstyle{definition}
 %\newtheorem{definition}{Definition}
+\algnewcommand\algorithmicswitch{\textbf{switch}}
+\algnewcommand\algorithmiccase{\textbf{case}}
+\algnewcommand\algorithmicassert{\texttt{assert}}
+\algnewcommand\Assert[1]{\State \algorithmicassert(#1)}%
+% New "environments"
+\algdef{SE}[SWITCH]{Switch}{EndSwitch}[1]{\algorithmicswitch\ #1\ \algorithmicdo}{\algorithmicend\ \algorithmicswitch}%
+\algdef{SE}[CASE]{Case}{EndCase}[1]{\algorithmiccase\ #1}{\algorithmicend\ \algorithmiccase}%
+\algtext*{EndSwitch}%
+\algtext*{EndCase}%
 \begin{document}
 \maketitle
 Suppose regular expressions are given by the following grammar:
 \begin{center}
+%TODO
 		\begin{tabular}{@{}rrl@{}}
 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
 			$r$ & $::=$  & $\ZERO$\\
 			& $\mid$ & $\ONE$   \\
 			& $\mid$ & $c$          \\
 			& $\mid$ & $r_1 \cdot r_2$\\
 			& $\mid$ & $r_1 + r_2$   \\
-			\\
-			& $\mid$ & $r^*$         \\
+			& $\mid$ & $r^*$
 		\end{tabular}
 \end{center}
 \noindent
 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
 $r \backslash \epsilon $ & $\dn$ & $r$
 \end{tabular}
 \end{center}
-We obtain a simple and elegant regular
-expression matching algorithm:
-\begin{definition}{matcher}
-\[
-match\;s\;r \;\dn\; nullable(r\backslash s)
-\]
-\end{definition}
 For us the main advantage is that derivatives can be
 straightforwardly implemented in any functional programming language,
 and are easily definable and reasoned about in theorem provers---the
 definitions just consist of inductive datatypes and simple recursive
 functions. Moreover, the notion of derivatives can be easily
 generalised to cover extended regular expression constructors such as
 the not-regular expression, written $\neg\,r$, or bounded repetitions
 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
 straightforwardly realised within the classic automata approach.
+We obtain a simple and elegant regular
+expression matching algorithm:
+\begin{definition}{matcher}
+\[
+match\;s\;r \;\dn\; nullable(r\backslash s)
+\]
+\end{definition}
 One limitation, however, of Brzozowski's algorithm is that it only
 produces a YES/NO answer for whether a string is being matched by a
 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
 		\end{tabular}
 	\end{tabular}
 \end{center}
 \noindent
-Here we put the regular expression and values of the same shape on the same level to illustrate the corresponding relation between them. \\
+Here we put the regular expression and values of the same shape on the same level to illustrate the corresponding relation between them.
-Values are a way of expressing parse trees(the tree structure that tells how a sub-regex matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. The flatten notation $| v |$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$.\\
+Values are a way of expressing parse trees(the tree structure that tells how a sub-regex matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$. The flatten notation $| v |$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$.
 To give a concrete example of how value works, consider the string $xy$ and the
 regular expression $(x + (y + xy))^*$. We can view this regular
 expression as a tree and if the string $xy$ is matched by two Star
 ``iterations'', then the $x$ is matched by the left-most alternative
 expression.
 The contribution of Sulzmann and Lu is an extension of Brzozowski's
 algorithm by a second phase (the first phase being building successive
 derivatives). In this second phase, for every successful match the
-corresponding POSIX value is computed. The whole process looks like this diagram:\\
+corresponding POSIX value is computed. The whole process looks like the following diagram(the working flow of the simple matching algorithm that just gives a $YES/NO$ answer is provided on the right for the purpose of comparison):\\
 \begin{tikzcd}
 r_0 \arrow[r, "c_0"]  \arrow[d] & r_1 \arrow[r, "c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
 \end{tikzcd}
 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$.
 First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ...  until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for the empty word epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string:
 After this, we inject back the characters one by one, in reverse order as they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$($s_i$ means the string s with the first $i$ characters being chopped off) from the previous parse tree. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
 An inductive proof can be routinely established.
 initial results in this regard have been obtained in
 \cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet
 is a very tight bound for the size. Such a tight bound is suggested by
 work of Antimirov who proved that (partial) derivatives can be bound
 by the number of characters contained in the initial regular
-expression \cite{Antimirov95}. We believe, and have generated test
+expression \cite{Antimirov95}.
-data, that a similar bound can be obtained for the derivatives in
-Sulzmann and Lu's algorithm. Let us give some details about this next.
+Antimirov defined the "partial derivatives" of regular expressions to be this:
+%TODO definition of partial derivatives
+it is essentially a set of regular expressions that come from the sub-structure of the original regular expression.
+Antimirov has proved a nice size bound of the size of partial derivatives. Roughly speaking the size will not exceed the fourth power of the number of nodes in that regular expression.  Interestingly, we observed from experiment that after the simplification step, our regular expression has the same size or is smaller than the partial derivatives. This allows us to prove a tight bound on the size of regular expression during the running time of the algorithm if we can establish the connection between our simplification rules and partial derivatives.
+%We believe, and have generated test
+%data, that a similar bound can be obtained for the derivatives in
+%Sulzmann and Lu's algorithm. Let us give some details about this next.
 We first followed Sulzmann and Lu's idea of introducing
 \emph{annotated regular expressions}~\cite{Sulzmann2014}. They are
 defined by the following grammar:
 is to allow a list of annotated regular expressions in the
 \textit{ALTS} constructor. This allows us to not just delete
 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
 unnecessary ``copies'' of regular expressions (very similar to
 simplifying $r + r$ to just $r$, but in a more general
-setting). Another modification is that we use simplification rules
+setting).
+A psuedocode version of our algorithm is given below:\\
+\begin{algorithm}
+\caption{simplification of annotated regular expression}\label{euclid}
+\begin{algorithmic}[1]
+\Procedure{$Simp$}{$areg$}
+\Switch{$areg$}
+	\Case{$ALTS(bs, rs)$}
+		\For{\textit{rs[i] in array rs}}
+			\State $\textit{rs[i]} \gets$ \textit{Simp(rs[i])}
+		\EndFor
+		\For{\textit{rs[i] in array rs}}
+			\If{$rs[i] == ALTS(bs', rs')$}
+				\State $rs'' \gets$ attach bits $bs'$ to all elements in $rs'$
+				\State Insert $rs''$ into $rs$ at position $i$ ($rs[i]$ is destroyed, replaced by its list of children regular expressions)
+			\EndIf
+		\EndFor
+		\State Remove all duplicates in $rs$, only keeping the first copy for multiple occurrences of the same regular expression
+		\State Remove all $0$s in $rs$
+		\If{$ rs.length == 0$} \Return $ZERO$ \EndIf
+		\If {$ rs.length == 1$} \Return$ rs[0] $\EndIf
+	\EndCase
+	\Case{$SEQ(bs, r_1, r_2)$}
+		\If{$ r_1$ or $r_2$ is $ZERO$} \Return ZERO \EndIf
+		\State update $r_1$ and $r_2$ by attaching $bs$ to their front
+		\If {$r_1$ or $r_2$ is $ONE(bs')$} \Return $r_2$ or $r_1$ \EndIf
+	\EndCase
+	\Case{$Others$}
+		\Return $areg$ as it is
+	\EndCase
+\EndSwitch
+\EndProcedure
+\end{algorithmic}
+\end{algorithm}
+Another modification is that we use simplification rules
 inspired by Antimirov's work on partial derivatives. They maintain the
 idea that only the first ``copy'' of a regular expression in an
 alternative contributes to the calculation of a POSIX value. All
 subsequent copies can be pruned from the regular expression.
 We are currently engaged with proving that our simplification rules
 actually do not affect the POSIX value that should be generated by the
 algorithm according to the specification of a POSIX value and
 furthermore that our derivatives stay small for all derivatives. For

changeset 35	f70e9ab4e680
parent 34	96bba9b493e9
child 36	97c9ac95194d