diff -r 4d6f54c478b5 -r b6984212cf87 etnms/etnms.tex --- a/etnms/etnms.tex Sat Feb 08 22:01:40 2020 +0000 +++ b/etnms/etnms.tex Sat Feb 08 22:24:21 2020 +0000 @@ -361,57 +361,31 @@ algorithm. Fortunately, one can simplify regular expressions after each derivative -step. Various simplifications of regular expressions are possible, such +step. +Various simplifications of regular expressions are possible, such as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r -\cdot \ONE$, and $r + r$ to just $r$. These simplifications do not -affect the answer for whether a regular expression matches a string or -not, but fortunately also do not affect the POSIX strategy of how -regular expressions match strings---although the latter is much harder +\cdot \ONE$, and $r + r$ to just $r$. +Suppose we apply simplification after each derivative step, and compose +these two operations together as an atomic one: $a \backslash_{simp}\,c \dn +\textit{simp}(a \backslash c)$. Then we can build values without having +a cumbersome regular expression, and fortunately if we are careful +enough in making some extra rectifications, the POSIX value of how +regular expressions match strings will not be affected---although is much harder to establish. Some initial results in this regard have been obtained in \cite{AusafDyckhoffUrban2016}. + If we want the size of derivatives in Sulzmann and Lu's algorithm to -stay below this bound, we would need more aggressive simplifications. +stay even lower, we would need more aggressive simplifications. Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as deleting duplicates whenever possible. For example, the parentheses in $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just $a^*+a+\ONE$. Adding these more aggressive simplification rules help us -to achieve the same size bound as that of the partial derivatives. - - -Suppose we apply simplification after each derivative step, and view -these two operations as an atomic one: $a \backslash_{simp}\,c \dn -\textit{simp}(a \backslash c)$. Then we can use the previous natural -extension from derivative w.r.t.~character to derivative -w.r.t.~string:%\comment{simp in the [] case?} - -\begin{center} -\begin{tabular}{lcl} -$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ -$r \backslash_{simp} [\,] $ & $\dn$ & $r$ -\end{tabular} -\end{center} - -\noindent -we obtain an optimised version of the algorithm: - - \begin{center} -\begin{tabular}{lcl} - $\textit{blexer\_simp}\;r\,s$ & $\dn$ & - $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ - & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ - & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ - & & $\;\;\textit{else}\;\textit{None}$ -\end{tabular} -\end{center} - -\noindent -This algorithm keeps the regular expression size small, for example, -with this simplification our previous $(a + aa)^*$ example's 8000 nodes -will be reduced to just 6 and stays constant, no matter how long the -input string is. +to achieve a very tight size bound, namely, + the same size bound as that of the \emph{partial derivatives}. +And we want to get rid of complex and fragile rectification of values. In order to implement the idea of ``spilling out alternatives'' and to @@ -769,6 +743,36 @@ function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. +Having defined the $\simp$ function, +we can use the previous notation of natural +extension from derivative w.r.t.~character to derivative +w.r.t.~string:%\comment{simp in the [] case?} + +\begin{center} +\begin{tabular}{lcl} +$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ +$r \backslash_{simp} [\,] $ & $\dn$ & $r$ +\end{tabular} +\end{center} + +\noindent +to obtain an optimised version of the algorithm: + + \begin{center} +\begin{tabular}{lcl} + $\textit{blexer\_simp}\;r\,s$ & $\dn$ & + $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ + & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ + & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ + & & $\;\;\textit{else}\;\textit{None}$ +\end{tabular} +\end{center} + +\noindent +This algorithm keeps the regular expression size small, for example, +with this simplification our previous $(a + aa)^*$ example's 8000 nodes +will be reduced to just 6 and stays constant, no matter how long the +input string is. @@ -1150,7 +1154,7 @@ &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\ - $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\ + $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\ &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ &&$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$\\