diff -r cab5eab1f6f1 -r 0573615e41a3 ninems/ninems.tex
--- a/ninems/ninems.tex	Mon Jul 08 11:10:32 2019 +0100
+++ b/ninems/ninems.tex	Mon Jul 08 21:21:54 2019 +0100
@@ -552,7 +552,8 @@
 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
 Other clauses can be understood in a similar way.

-The following example gives a \comment{Other word: insight?}taste of $\textit{inj}$'s effect
+%\comment{Other word: insight?}
+The following example gives an insight into $\textit{inj}$'s effect
 and how Sulzmann and Lu's algorithm works as a whole.
 Suppose we have a regular expression $((((a+b)+ab)+c)+abc)^*$,
 and want to match it against
@@ -674,7 +675,7 @@
 when finished with derivatives and call $mkeps$ for deciding which
 branch is POSIX, we can generate the parse tree in one pass, instead of
 doing the rest $n$ injections. This leads to Sulzmann and Lu's novel
-idea of using bit-codes in derivatives.
+idea of using bitcodes in derivatives.

 In the next section, we shall focus on the bit-coded algorithm and the
 process of simplification of regular expressions. This is needed in
@@ -712,11 +713,7 @@
 affect the answer for whether a regular expression matches a string or
 not, but fortunately also do not affect the POSIX strategy of how
 regular expressions match strings---although the latter is much harder
-to establish. \comment{Does not make sense.} The argument for
-complicating the data structures from basic regular expressions to those
-with bitcodes is that we can introduce simplification without making the
-algorithm crash or overly complex to reason about. The latter is crucial
-for a correctness proof. Some initial results in this regard have been
+to establish. Some initial results in this regard have been
 obtained in \cite{AusafDyckhoffUrban2016}.

 Unfortunately, the simplification rules outlined above are not
@@ -745,28 +742,41 @@
 A partial derivative of a regular expression $r$ is essentially a set of
 regular expressions that are either $r$'s children expressions or a
 concatenation of them. Antimirov has proved a tight bound of the size of
-partial derivatives. \comment{That looks too preliminary to me.} Roughly
-speaking the size will not exceed the fourth power of the number of
-nodes in that regular expression. \comment{Improve: which
-simplifications?}Interestingly, we observed from experiment that after
-the simplification step, our regular expression has the same size or is
-smaller than the partial derivatives. This allows us to prove a tight
-bound on the size of regular expression during the running time of the
-algorithm if we can establish the connection between our simplification
-rules and partial derivatives.
+partial derivatives. Roughly
+speaking, the size will not exceed $t^2$, where $t$ is the number of
+characters appearing in that regular expression. If we want the size of
+derivatives to stay below this bound, we need more aggressive
+simplifications, such as opening up alternatives to achieve the maximum
+level of cancellation of duplicates.
+For example, the parentheses in $(a+b) \cdot c + b \cdot c$ can be
+opened up to get $a \cdot c + b \cdot c + b \cdot c$, which is then
+simplified to $a \cdot c + b \cdot c$. Another example is the
+simplification of $(a^*+a) + (a^*+ 1) + (a +1)$ to $a^*+a+1$.
+Adding these more aggressive simplification rules
+helped us to achieve the same size bound as that of the partial derivatives.
+To introduce these ``opening up alternatives'' simplifications
+and make the correctness proof easier,
+we use bitcodes.
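+
+To make these simplifications concrete, here is a small hypothetical
+Scala sketch (our own illustration, not Sulzmann and Lu's code, and all
+names are ours): nested alternatives are flattened into a single list
+and duplicates are cancelled, so that, for instance,
+$(a^*+a) + (a^*+ 1) + (a +1)$ is rewritten to $a^*+a+1$.
+
+\begin{verbatim}
+sealed trait Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHR(c: Char) extends Rexp
+case class ALTS(rs: List[Rexp]) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+// flatten nested alternatives into one list, dropping ZEROs
+def flts(rs: List[Rexp]): List[Rexp] = rs match {
+  case Nil => Nil
+  case ZERO :: tl => flts(tl)
+  case ALTS(rs1) :: tl => rs1 ::: flts(tl)
+  case r :: tl => r :: flts(tl)
+}
+
+// simplify: open up alternatives and cancel duplicates;
+// ZERO annihilates a sequence, ONE is its unit
+def simp(r: Rexp): Rexp = r match {
+  case ALTS(rs) => flts(rs.map(simp)).distinct match {
+    case Nil => ZERO
+    case List(r1) => r1
+    case rs1 => ALTS(rs1)
+  }
+  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, _) => ZERO
+    case (_, ZERO) => ZERO
+    case (ONE, s2) => s2
+    case (s1, ONE) => s1
+    case (s1, s2) => SEQ(s1, s2)
+  }
+  case _ => r
+}
+\end{verbatim}
+
+\noindent
+In this sketch $(a^*+a) + (a^*+ 1) + (a +1)$ is represented as nested
+\texttt{ALTS}; \texttt{simp} first simplifies the children, then
+splices them into one list and removes duplicates with
+\texttt{distinct}, yielding the three alternatives $a^*$, $a$ and $1$.
+Note that this sketch does not distribute sequences over alternatives,
+so the $(a+b) \cdot c$ example above would need an additional rule.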
+
+%This allows us to prove a tight
+%bound on the size of regular expression during the running time of the
+%algorithm if we can establish the connection between our simplification
+%rules and partial derivatives.

 %We believe, and have generated test
 %data, that a similar bound can be obtained for the derivatives in
 %Sulzmann and Lu's algorithm. Let us give some details about this next.

-Bit-codes look like this:
+Bitcodes look like this:
 \begin{center}
 $b ::=   S \mid  Z \; \;\; bs ::= [] \mid b:bs $
 \end{center}
-They are just a string of bits, the names $S$ and $Z$ here are quite arbitrary, we can use 0 and 1 or any other set of binary symbols to substitute them. Bit-codes(or bit-sequences) are a compact form of parse trees.
-Here is how values and bit-codes are related:
+They are just a string of bits;
+the names $S$ and $Z$ here are quite arbitrary and we could just as well
+use 0 and 1 or any other pair of binary symbols.
+Bitcodes (or bit-sequences) are a compact form of parse trees.
 Bitcodes are essentially incomplete values.
 This can be straightforwardly seen in the following transformation:
 \begin{center}
@@ -825,10 +835,11 @@
 \end{center}
 %\end{definition}

-
-Sulzmann and Lu's integrated the bitcodes into annotated regular
-expressions by attaching them to the head of every substructure of a
-regular expression\cite{Sulzmann2014}. They are defined by the following
+Sulzmann and Lu integrated the bitcodes into regular
+expressions to create annotated regular expressions.
+This is done by attaching them to the head of every substructure of a
+regular expression~\cite{Sulzmann2014}. Annotated regular expressions
+are defined by the following
 grammar:

@@ -836,15 +847,15 @@
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
 & $\mid$ & $\textit{ONE}\;\;bs$\\
 & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
-& $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
+& $\mid$ & $\textit{ALT}\;\;bs\,as$\\
 & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
 & $\mid$ & $\textit{STAR}\;\;bs\,a$
 \end{tabular}
 \end{center}
 \noindent
-where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
-list of annotated regular expressions. These bitsequences encode
+where $bs$ stands for bit-sequences, and $as$ (in \textit{ALT}) for a
+list of annotated regular expressions. These bit-sequences encode
 information about the (POSIX) value that should be generated by the
 Sulzmann and Lu algorithm.

@@ -957,7 +968,7 @@
 Here $(r^\uparrow)\backslash s$ is similar to what we have previously
 defined for $r\backslash s$.

-The main point of the bitsequences and annotated regular expressions
+The main point of the bit-sequences and annotated regular expressions
 is that we can apply rather aggressive (in terms of size)
 simplification rules in order to keep derivatives small.

@@ -1014,8 +1025,8 @@
 \end{center}

 \noindent
-\comment{No: functional flatten does not remove ZEROs}Here flatten behaves like the traditional functional programming flatten function,
-what it does is basically removing parentheses like changing $a+(b+c)$ into $a+b+c$.
+Here flatten behaves like the traditional functional programming
+flatten function, except that it also removes $\ZERO$s. What it does is
+basically removing parentheses, for example changing $a+(b+c)$ into $a+b+c$.

 Suppose we apply simplification after each derivative step, and view
 these two operations as an atomic one: $a \backslash_{simp} c \dn
 \textit{simp}(a \backslash c)$.
@@ -1074,16 +1085,31 @@
 \noindent
 We would like to settle the correctness claim.
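+
+Before turning to that claim, the annotated regular expressions and the
+flatten operation above can be made concrete with another hypothetical
+Scala sketch (names are ours; the \texttt{fuse} helper, which attaches
+bits to the head of an expression, is our assumption about how the bits
+of a flattened alternative are preserved):
+
+\begin{verbatim}
+sealed trait Bit
+case object S extends Bit
+case object Z extends Bit
+type Bits = List[Bit]
+
+sealed trait ARexp
+case object AZERO extends ARexp
+case class AONE(bs: Bits) extends ARexp
+case class ACHAR(bs: Bits, c: Char) extends ARexp
+case class AALT(bs: Bits, as: List[ARexp]) extends ARexp
+case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
+case class ASTAR(bs: Bits, a: ARexp) extends ARexp
+
+// fuse attaches bits to the head of an annotated regular expression
+def fuse(bs: Bits, a: ARexp): ARexp = a match {
+  case AZERO => AZERO
+  case AONE(bs1) => AONE(bs ::: bs1)
+  case ACHAR(bs1, c) => ACHAR(bs ::: bs1, c)
+  case AALT(bs1, as) => AALT(bs ::: bs1, as)
+  case ASEQ(bs1, a1, a2) => ASEQ(bs ::: bs1, a1, a2)
+  case ASTAR(bs1, a1) => ASTAR(bs ::: bs1, a1)
+}
+
+// flatten: like the usual functional flatten, except that it also
+// removes AZEROs; the bits at the head of a nested alternative are
+// fused onto its children so no parsing information is lost
+def flatten(as: List[ARexp]): List[ARexp] = as match {
+  case Nil => Nil
+  case AZERO :: tl => flatten(tl)
+  case AALT(bs, as1) :: tl => as1.map(fuse(bs, _)) ::: flatten(tl)
+  case a :: tl => a :: flatten(tl)
+}
+\end{verbatim}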
-It is relatively straightforward to establish that after one simplification step, the part of derivative that corresponds to a POSIX value remains intact and can still be collected, in other words,
-\comment{that only holds when r is nullable}bmkeps r = bmkeps simp r
-as this basically comes down to proving actions like removing the additional $r$ in $r+r$ does not delete important POSIX information in a regular expression.
+It is relatively straightforward to establish that after one simplification step,
+the part of a nullable derivative that corresponds to a POSIX value
+remains intact and can still be collected,
+in other words,
+\begin{center}
+$\textit{bmkeps} \; r = \textit{bmkeps} \; (\textit{simp} \; r)$
+\quad provided $r$ is nullable,
+\end{center}
+as this basically comes down to proving that actions like removing
+the additional $r$ in $r+r$ do not delete important POSIX information
+in a regular expression.
 The core of this problem is to prove that
-bmkeps bders r = bmkeps bders simp r
-That is, if we do derivative on regular expression r and the simplified version for, they can still prove the same POSIX value if there is one . This is not as straightforward as the previous proposition, as the two regular expression r and simp r might become very different regular expressions after repeated application ofd simp and derivative.
-The crucial point is to find the \comment{What?}"gene" of a regular expression and how it is kept intact during simplification.
-To aid this, we are use the helping function retrieve described by Sulzmann and Lu:
+\begin{center}
+$\textit{bmkeps} \; (r\backslash s) = \textit{bmkeps} \; (r\backslash_{simp} s)$
+\end{center}
+That is, if we take the derivative of a regular expression $r$ with and
+without simplification, we can still extract from it the same POSIX
+value, if there is one.
+This is not as straightforward as the previous proposition, as the two
+regular expressions $r$ and $\textit{simp}\; r$ might become very
+different regular expressions after repeated applications of
+$\textit{simp}$ and the derivative operation.
+The crucial point is to identify the indispensable information of
+a regular expression and show that it is kept intact during
+simplification, so that a simplified regular expression performs as well
+in the subsequent derivative operations as one that has not been
+simplified.
+To aid this, we use the helper function $\textit{retrieve}$ described by
+Sulzmann and Lu:
 \\definition of retrieve\\
-	This function assembled the bitcode that corresponds to a parse tree for how the current derivative matches the suffix of the string(the characters that have not yet appeared, but is stored in the value).
+	This function assembles the bitcode that corresponds to a parse tree
+	for how the current derivative matches the suffix of the string (the
+	characters that have not yet appeared, but are stored in the value).
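+
+To make this more concrete, the following hypothetical Scala sketch
+(building on the annotated regular expressions sketched above; the
+value datatype and the exact bit conventions are our assumptions, since
+$S$ and $Z$ are arbitrary) shows $\textit{bmkeps}$, which collects the
+bits along the left-most nullable branch, and a version of
+$\textit{retrieve}$, where an $n$-ary alternative is treated as
+right-nested $\textit{Left}$/$\textit{Right}$ choices:
+
+\begin{verbatim}
+sealed trait Val
+case object Empty extends Val
+case class Chr(c: Char) extends Val
+case class Sequ(v1: Val, v2: Val) extends Val
+case class Left(v: Val) extends Val
+case class Right(v: Val) extends Val
+case class Stars(vs: List[Val]) extends Val
+
+def bnullable(a: ARexp): Boolean = a match {
+  case AZERO => false
+  case AONE(_) => true
+  case ACHAR(_, _) => false
+  case AALT(_, as) => as.exists(bnullable)
+  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
+  case ASTAR(_, _) => true
+}
+
+// collect the bits along the left-most nullable alternative;
+// only defined on nullable annotated regular expressions
+def bmkeps(a: ARexp): Bits = a match {
+  case AONE(bs) => bs
+  case AALT(bs, as) => bs ::: bmkeps(as.find(bnullable).get)
+  case ASEQ(bs, a1, a2) => bs ::: bmkeps(a1) ::: bmkeps(a2)
+  case ASTAR(bs, _) => bs ::: List(S) // S arbitrarily ends the iterations
+  case _ => sys.error("bmkeps: not nullable")
+}
+
+// assemble the bitcode that a associates with the value v
+def retrieve(a: ARexp, v: Val): Bits = (a, v) match {
+  case (AONE(bs), Empty) => bs
+  case (ACHAR(bs, _), Chr(_)) => bs
+  case (AALT(bs, a1 :: _), Left(v1)) => bs ::: retrieve(a1, v1)
+  case (AALT(bs, _ :: as), Right(v2)) => bs ::: retrieve(AALT(Nil, as), v2)
+  case (ASEQ(bs, a1, a2), Sequ(v1, v2)) =>
+    bs ::: retrieve(a1, v1) ::: retrieve(a2, v2)
+  case (ASTAR(bs, _), Stars(Nil)) => bs ::: List(S)
+  case (ASTAR(bs, a1), Stars(v1 :: vs)) =>
+    bs ::: List(Z) ::: retrieve(a1, v1) :::
+    retrieve(ASTAR(Nil, a1), Stars(vs))
+  case _ => sys.error("retrieve: value does not fit the expression")
+}
+\end{verbatim}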
 Sulzmann and Lu used this to connect the bit-coded algorithm to the
 older algorithm by the following equation:\\
 $\textit{inj} \;r\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) \; v)$\\
 A little fact that needs to be stated to help comprehension:\\
@@ -1092,7 +1118,7 @@
 Our purpose of using this, however, is to try to establish\\
 $ \textit{retrieve} \; a \; v \;=\; \textit{retrieve} \; \textit{simp}(a) \; v'.$\\
 The idea is that using $v'$,
-	a simplified version of $v$ that possibly had gone through the same simplification step as $\textit{simp}(a)$ we are still able to extract the bitsequence that gives the same parsing information as the unsimplified one.
+	a simplified version of $v$ that has possibly gone through the same
+	simplification steps as $\textit{simp}(a)$, we are still able to
+	extract the bit-sequence that gives the same parsing information as
+	the unsimplified one.
 After establishing this, we might be able to finally bridge the gap and prove\\
 $\textit{retrieve} \; (r \backslash s) \; v = \;\textit{retrieve} \; (\textit{simp}(r) \backslash s) \; v'$\\
 and subsequently\\