fast?

Unfortunately these preconceptions are not supported by evidence: Take
for example the regular expression $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8 with strings like 28
$a$'s, one discovers that this decision takes around 30 seconds and
takes considerably longer when adding a few more $a$'s, as the graphs
frequent enough to have a separate name created for
them---\emph{evil regular expressions}. In empirical work, Davis et
al.\ report that they have found thousands of such evil regular
expressions in the JavaScript and Python ecosystems \cite{Davis18}.

\comment{Needs to be consistent: either exponential blowup; or quadratic blowup. Maybe you use the terminology superlinear, like in  the Davis et al paper}
This exponential blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
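This catastrophic behaviour is easy to reproduce. A minimal Python
sketch using the built-in \texttt{re} module (kept to a deliberately
short string so that it terminates promptly):

```python
import re

# A backtracking matcher explores exponentially many ways of splitting
# the a's between the two stars before it concludes that no b follows.
evil = re.compile(r'(a*)*b')

# 15 a's already forces tens of thousands of backtracking steps;
# with around 28 a's the search takes tens of seconds.
print(evil.match('a' * 15))   # None -- the expected b is missing
```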
The underlying problem is that many ``real life'' regular expression
matching engines do not use DFAs for matching. This is because they
support regular expressions that are not covered by the classical
automata theory, and in this more general setting there are quite a few
research questions still unanswered and fast algorithms still need to be
developed (for example how to treat bounded repetitions, negation and
back-references efficiently).
%question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
%how do they avoid dfas exponential states if they use them for fast matching?

There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular
expression matches a string, but also in \emph{how}.  Consider for
example a regular expression $r_{key}$ for recognising keywords such as
\textit{if}, \textit{then} and so on; and a regular expression $r_{id}$
for recognising identifiers (say, a single character followed by
characters or numbers). One can then form the compound regular
expression $(r_{key} + r_{id})^*$ and use it to tokenise strings.  But
then how should the string \textit{iffoo} be tokenised?  It could be
tokenised as a keyword followed by an identifier, or the entire string
as a single identifier.  Similarly, how should the string \textit{if} be
tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
``fire''---so is it an identifier or a keyword?  While in applications
there is a well-known strategy to decide these questions, called POSIX
matching, precise definitions of what POSIX matching actually means have
been formalised only relatively recently
\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Such a
definition has also been given by Sulzmann and Lu \cite{Sulzmann2014},
but the corresponding correctness proof turned out to be faulty
\cite{AusafDyckhoffUrban2016}. Roughly, POSIX matching means matching
the longest initial substring. In the case of a tie, the initial
sub-match is chosen according to some priorities attached to the regular
expressions (e.g.~keywords have a higher priority than identifiers).
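To make these two rules concrete, the following Python sketch implements
a POSIX-style tokeniser for the $r_{key}$/$r_{id}$ example; the concrete
regular expressions, names and token labels are our illustrative
choices. At each position it takes the longest initial match, and on a
tie the higher-priority keyword wins.

```python
import re

# Illustrative instances of r_key and r_id from the text.
R_KEY = re.compile(r'if|then|else')
R_ID  = re.compile(r'[a-z][a-z0-9]*')

def tokenise(s):
    """POSIX-style lexing: longest initial match; on a tie the
    keyword has priority over the identifier."""
    tokens, i = [], 0
    while i < len(s):
        mk = R_KEY.match(s, i)
        mi = R_ID.match(s, i)
        lk = len(mk.group()) if mk else -1
        li = len(mi.group()) if mi else -1
        if lk == -1 and li == -1:
            raise ValueError(f'lexing error at position {i}')
        if lk >= li:                     # tie broken towards the keyword
            tokens.append(('key', mk.group())); i += lk
        else:
            tokens.append(('id', mi.group())); i += li
    return tokens

print(tokenise('iffoo'))  # [('id', 'iffoo')] -- one long identifier
print(tokenise('if'))     # [('key', 'if')]   -- keyword wins the tie
```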
This sounds rather simple, but according to Grathwohl et al \cite[Page
36]{CrashCourse2014} this is not the case. They wrote:

\begin{quote}
\it{}``The POSIX strategy is more complicated than the greedy because of 
the dependence on information about the length of matched strings in the 
various subexpressions.''
straightforwardly realised within the classic automata approach.
For the moment however, we focus only on the usual basic regular expressions.


Now if we want to find out whether a string $s$ matches a regular
expression $r$, we can build the derivatives of $r$ w.r.t.\ (in
succession) all the characters of the string $s$. Finally, we test
whether the resulting regular expression can match the empty string. If
yes, then $r$ matches $s$; if no, it does not. To implement this idea we
can generalise the derivative operation to strings like this:

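As an illustration, the derivative operation and the matcher it yields
can be transcribed into Python roughly as follows (a sketch over a
simple tuple encoding of our own, not the implementation discussed in
the text):

```python
# Regular expressions as tagged tuples (an encoding of our own):
# ('0',) ZERO, ('1',) ONE, ('c', ch), ('+', r1, r2), ('.', r1, r2), ('*', r)

def nullable(r):
    """Can r match the empty string?"""
    tag = r[0]
    return (tag in ('1', '*')
            or (tag == '+' and (nullable(r[1]) or nullable(r[2])))
            or (tag == '.' and nullable(r[1]) and nullable(r[2])))

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ('0', '1'): return ('0',)
    if tag == 'c': return ('1',) if r[1] == c else ('0',)
    if tag == '+': return ('+', der(c, r[1]), der(c, r[2]))
    if tag == '.':
        if nullable(r[1]):
            return ('+', ('.', der(c, r[1]), r[2]), der(c, r[2]))
        return ('.', der(c, r[1]), r[2])
    if tag == '*': return ('.', der(c, r[1]), r)

def ders(s, r):
    """Derivative w.r.t. a whole string, one character at a time."""
    for c in s:
        r = der(c, r)
    return r

def matches(r, s):
    return nullable(ders(s, r))

# (a*)* . b decides short non-matching strings without backtracking
# (without simplification the derivatives themselves can grow quickly,
# which is addressed later in the text).
r = ('.', ('*', ('*', ('c', 'a'))), ('c', 'b'))
print(matches(r, 'aaa'))    # False
print(matches(r, 'aaab'))   # True
```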

One limitation of Brzozowski's algorithm is that it only produces a
YES/NO answer for whether a string is being matched by a regular
expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
to allow generation of an actual matching, called a \emph{value} or
sometimes also \emph{lexical value}.  These values and regular
expressions correspond to each other as illustrated in the following
table:


\begin{center}
		\end{tabular}
	\end{tabular}
\end{center}

\noindent
No value corresponds to $\ZERO$; $\Empty$ corresponds to $\ONE$;
$\Char$ to the character regular expression; $\Seq$ to the sequence
regular expression and so on. The idea of values is to encode a kind of
lexical value for how the sub-parts of a regular expression match the
sub-parts of a string. To see this, suppose a \emph{flatten} operation,
written $|v|$ for values. We can use this function to extract the
underlying string of a value $v$. For example, $|\mathit{Seq} \,
(\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$.  Using
flatten, we can describe how values encode \comment{Avoid the notion
parse trees! Also later!!}parse trees: $\Seq\,v_1\, v_2$ encodes a tree
with two child nodes that tells how the string $|v_1| @ |v_2|$ matches
the regex $r_1 \cdot r_2$ whereby $r_1$ matches the substring $|v_1|$
and, respectively, $r_2$ matches the substring $|v_2|$. Exactly how
these two are matched is contained in the child nodes $v_1$ and $v_2$
of the parent $\textit{Seq}$. 
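These values and the flatten operation can be rendered as a small Python
sketch (the constructor names follow the table above; the encoding
itself is our own):

```python
from dataclasses import dataclass

@dataclass
class Empty: pass                       # matches the empty string
@dataclass
class Char:  c: str                     # matches a single character
@dataclass
class Seq:   v1: object; v2: object     # how r1 . r2 matched
@dataclass
class Left:  v: object                  # the left alternative was taken
@dataclass
class Right: v: object                  # the right alternative was taken
@dataclass
class Stars: vs: list                   # one value per star iteration

def flatten(v):
    """The |v| operation: extract the underlying string of a value."""
    if isinstance(v, Empty): return ''
    if isinstance(v, Char):  return v.c
    if isinstance(v, Seq):   return flatten(v.v1) + flatten(v.v2)
    if isinstance(v, (Left, Right)): return flatten(v.v)
    if isinstance(v, Stars): return ''.join(flatten(w) for w in v.vs)

# |Seq (Char x) (Char y)| is the string xy:
print(flatten(Seq(Char('x'), Char('y'))))   # xy
```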

To give a concrete example of how values work, consider the string $xy$
and the regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two Star
``iterations'', then the $x$ is matched by the left-most alternative in
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm, working from right to left. For the first value $v_n$, we
call the function $\textit{mkeps}$, which builds the
\comment{Avoid}parse tree for how the empty string has been matched by
the (nullable) regular expression $r_n$. This function is defined as

	\begin{center}
		\begin{tabular}{lcl}
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
			$\mkeps(r_{1}+r_{2})$	& $\dn$ 

\noindent There are no cases for $\ZERO$ and $c$, since these regular
expressions cannot match the empty string. Note also that in case of
alternatives we give preference to the regular expression on the
left-hand side. This will become important later on for the value that
is calculated.
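As a sketch, $\textit{mkeps}$ can be rendered in Python over a simple
tuple encoding of regular expressions and values (the encoding is our
own; the clauses follow the definition above):

```python
# Regexes: ('0',) ZERO, ('1',) ONE, ('c', ch), ('+', r1, r2),
#          ('.', r1, r2), ('*', r); values are tagged tuples too.

def nullable(r):
    """Can r match the empty string?"""
    tag = r[0]
    return (tag in ('1', '*')
            or (tag == '+' and (nullable(r[1]) or nullable(r[2])))
            or (tag == '.' and nullable(r[1]) and nullable(r[2])))

def mkeps(r):
    """A value showing how a nullable r matches the empty string.
    There are no clauses for ZERO and characters: not nullable."""
    tag = r[0]
    if tag == '1':
        return ('Empty',)
    if tag == '+':                      # preference for the left side
        if nullable(r[1]):
            return ('Left', mkeps(r[1]))
        return ('Right', mkeps(r[2]))
    if tag == '.':
        return ('Seq', mkeps(r[1]), mkeps(r[2]))
    if tag == '*':
        return ('Stars', [])            # zero iterations

# The left alternative is preferred whenever it is nullable:
print(mkeps(('+', ('1',), ('*', ('c', 'a')))))   # ('Left', ('Empty',))
```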

After the $\mkeps$-call, we inject back the characters one by one in
order to build the parse tree $v_i$ for how the regex $r_i$ matches the
string $s_i$ ($s_i = c_i \ldots c_{n-1}$) from the previous parse tree
$v_{i+1}$. After injecting back $n$ characters, we get the parse tree
for how $r_0$ matches $s$. For this Sulzmann and Lu defined a function
that reverses the ``chopping off'' of characters during the derivative
phase. The
r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value 
$v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
Other clauses can be understood in a similar way.
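Putting derivatives, $\textit{mkeps}$ and $\textit{inj}$ together yields
the whole lexing algorithm. The following self-contained Python sketch
(again over an illustrative tuple encoding of our own) can be checked
against the worked example that follows: lexing the string $abc$ with
$((((a+b)+ab)+c)+abc)^*$ produces the value
$\Stars\,[\Right(\Seq(a, \Seq(b, c)))]$.

```python
# Regexes: ('0',), ('1',), ('c', ch), ('+', r1, r2), ('.', r1, r2), ('*', r)
# Values:  ('Empty',), ('Char', c), ('Left', v), ('Right', v),
#          ('Seq', v1, v2), ('Stars', [v1, ..., vn])

def nullable(r):
    tag = r[0]
    return (tag in ('1', '*')
            or (tag == '+' and (nullable(r[1]) or nullable(r[2])))
            or (tag == '.' and nullable(r[1]) and nullable(r[2])))

def der(c, r):
    tag = r[0]
    if tag in ('0', '1'): return ('0',)
    if tag == 'c': return ('1',) if r[1] == c else ('0',)
    if tag == '+': return ('+', der(c, r[1]), der(c, r[2]))
    if tag == '.':
        if nullable(r[1]):               # the splitting clause
            return ('+', ('.', der(c, r[1]), r[2]), der(c, r[2]))
        return ('.', der(c, r[1]), r[2])
    if tag == '*': return ('.', der(c, r[1]), r)

def mkeps(r):
    tag = r[0]
    if tag == '1': return ('Empty',)
    if tag == '+':
        return ('Left', mkeps(r[1])) if nullable(r[1]) else ('Right', mkeps(r[2]))
    if tag == '.': return ('Seq', mkeps(r[1]), mkeps(r[2]))
    if tag == '*': return ('Stars', [])

def inj(r, c, v):
    """Reverse one derivative step: turn a value for der(c, r) into a
    value for r itself, putting the character c back in."""
    tag = r[0]
    if tag == 'c':                       # v is Empty
        return ('Char', c)
    if tag == '+':
        if v[0] == 'Left': return ('Left', inj(r[1], c, v[1]))
        return ('Right', inj(r[2], c, v[1]))
    if tag == '.':
        if v[0] == 'Seq':                # head was not nullable
            return ('Seq', inj(r[1], c, v[1]), v[2])
        if v[0] == 'Left':               # v = Left (Seq v1 v2)
            return ('Seq', inj(r[1], c, v[1][1]), v[1][2])
        # v = Right v2: the head matched the empty string
        return ('Seq', mkeps(r[1]), inj(r[2], c, v[1]))
    if tag == '*':                       # v = Seq v' (Stars vs)
        return ('Stars', [inj(r[1], c, v[1])] + v[2][1])

def lex(r, s):
    """Derivatives forward, injections backward (assumes s matches r)."""
    if not s:
        return mkeps(r)
    return inj(r, s[0], lex(der(s[0], r), s[1:]))

# The example from the text: ((((a+b)+ab)+c)+abc)* matched against abc.
a, b, c = ('c', 'a'), ('c', 'b'), ('c', 'c')
ab, abc = ('.', a, b), ('.', a, ('.', b, c))
r = ('*', ('+', ('+', ('+', ('+', a, b), ab), c), abc))
print(lex(r, 'abc'))
# ('Stars', [('Right', ('Seq', ('Char', 'a'),
#                       ('Seq', ('Char', 'b'), ('Char', 'c'))))])
```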

%\comment{Other word: insight?}
The following example gives an insight into $\textit{inj}$'s effect and
how Sulzmann and Lu's algorithm works as a whole. Suppose we have a
regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it
against the string $abc$ (when $abc$ is written as a regular expression,
the standard way of expressing it is $a \cdot (b \cdot c)$, but we
usually omit the parentheses and dots here for better readability). This
algorithm returns a POSIX value, which means it will produce the longest
matching. Consequently, it matches the string $abc$ in one star
iteration, using the longest alternative $abc$ in the sub-expression (we
shall use $r$ to denote this sub-expression for conciseness):

\[((((a+b)+ab)+c)+\underbrace{abc}_r)\] 

\noindent
Before $\textit{inj}$ is called, our lexer first builds derivatives
using the string $abc$ (we simplified some regular expressions like
$\ZERO \cdot b$ to $\ZERO$ for conciseness; we also omit parentheses if
they are clear from the context):

%Similarly, we allow
%$\textit{ALT}$ to take a list of regular expressions as an argument
%instead of just 2 operands to reduce the nested depth of
%$\textit{ALT}$

\begin{center}
\begin{tabular}{lcl}
$r^*$ & $\xrightarrow{\backslash a}$ & $r_1 = (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$\\
      & $\xrightarrow{\backslash b}$ & $r_2 = (\ZERO+\ZERO+\ONE \cdot \ONE + \ZERO + \ONE \cdot \ONE \cdot c) \cdot r^* +(\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*$\\
      & $\xrightarrow{\backslash c}$ & $r_3 = ((\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot \ONE) \cdot r^* + (\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^*) + $\\ 
\end{center}

\noindent
In case $r_3$ is nullable, we can call $\textit{mkeps}$ to construct a
parse tree for how $r_3$ matched the string $abc$. This function gives
the following value $v_3$: 

\begin{center}
$\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
\end{center}
The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of
$r_3$ (underlined):

\begin{center}\comment{better layout}
   $( \underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*} + (\ZERO+\ZERO+\ZERO + \ONE + \ZERO)
  \cdot r^*) +((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*+(\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* ).$
 \end{center}

\noindent
Note that the leftmost location of the term $(\ZERO+\ZERO+\ZERO + \ZERO
+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*$ (which corresponds to the
initial sub-match $abc$) allows $\textit{mkeps}$ to pick it up because
$\textit{mkeps}$ is defined to always choose the left one when it is
nullable. In the case of this example, $abc$ is preferred over $a$ or
$ab$. This $\Left(\Left(\ldots))$ location is generated by two
applications of the splitting clause

\begin{center}
     $(r_1 \cdot r_2)\backslash c \;\; (\textit{when} \; r_1 \;
     \textit{nullable}) = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$
\end{center}
       
\noindent
By this clause, we put $r_1 \backslash c \cdot r_2 $ at the
$\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This
 $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$
\end{center}

\noindent
tells us how the empty string $[]$ is matched with $(\ZERO+\ZERO+\ZERO +
\ZERO + \ONE \cdot \ONE \cdot \ONE) \cdot r^*$. We match $[]$ by a
sequence of two nullable regular expressions: the first one is an
alternative, in which we take the rightmost branch---whose language
contains the empty string. The second nullable regular expression is a
Kleene star. $\Stars$ tells us how it generates the nullable regular
expression: by 0 iterations to form $\ONE$. Now $\textit{inj}$ injects
characters back and incrementally
  $v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$
  \end{center}
   for how $r$ matched $abc$. This completes the algorithm.
   
%We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. 
Readers might have noticed that the parse tree information is actually
already available when doing derivatives. For example, immediately after
the operation $\backslash a$ we know that if we want to match a string
that starts with $a$, we can either take the initial match to be

 \begin{center}
\begin{enumerate}
    \item[1)] just $a$ or
    \item[2)] string $ab$ or 
    \item[3)] string $abc$.
\end{enumerate}
\end{center}

\noindent
In order to differentiate between these choices, we just need to
remember their positions---$a$ is on the left, $ab$ is in the middle,
and $abc$ is on the right. Which of these alternatives is chosen later
does not affect their relative position because the algorithm does not
change this order. If this parsing information can be determined and
does not change because of later derivatives, there is no point in
traversing this information twice. This leads to an optimisation---if we
store the information for parse trees inside the regular expression,
update it when we take derivatives of them, and collect the information

\section{Simplification of Regular Expressions}

Using bitcodes to guide parsing is not a novel idea. It was applied to
context-free grammars and then adapted by Henglein and Nielson for
efficient regular expression \comment{?}parsing using
DFAs~\cite{nielson11bcre}. Sulzmann and Lu took this idea of bitcodes a
step further by integrating bitcodes into derivatives. The reason why we
want to use bitcodes in this project is that we want to introduce more
aggressive simplification rules in order to keep the size of derivatives
small throughout. This is because the main drawback of building
successive
   761 A partial derivative of a regular expression $r$ is essentially a set of
   771 A partial derivative of a regular expression $r$ is essentially a set of
   762 regular expressions that are either $r$'s children expressions or a
   772 regular expressions that are either $r$'s children expressions or a
   763 concatenation of them. Antimirov has proved a tight bound of the size of
   773 concatenation of them. Antimirov has proved a tight bound of the size of
   764 \emph{all} partial derivatives no matter what the string looks like.
   774 \emph{all} partial derivatives no matter what the string looks like.
   765 Roughly speaking the size will be quadruple in the size of the regular
   775 Roughly speaking the size will be quadruple in the size of the regular
expression.\comment{Are you sure? I have just proved that the sum of
sizes in $pder$ is less or equal $(1 + size\;r)^3$. And this is surely
not the best bound.} If we want the size of derivatives in Sulzmann and
Lu's algorithm to stay equal or below this bound, we would need more
aggressive simplifications. Essentially we need to delete useless
$\ZERO$s and $\ONE$s, as well as deleting duplicates whenever possible.
For example, the parentheses in $(a+b) \cdot c + b \cdot c$ can be opened up to
get $a\cdot c + b \cdot c + b \cdot c$, and then simplified to just $a
\cdot c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+
\ONE) + (a +\ONE)$ to just $a^*+a+\ONE$. Adding these more aggressive
simplification rules helps us to achieve the same size bound as that of
the partial derivatives. 
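The two simplification examples above can be replayed in a small sketch. This is our own illustration in Python, not the paper's Scala/Isabelle development; the tuple encoding of regular expressions and the helper names (\textit{flatten\_alts}, \textit{distinct}, \textit{distribute}) are assumptions made for the example.

```python
# A sketch (ours, in Python) of the two simplification ideas just
# described: opening up alternatives and removing duplicates.
# Regexes as nested tuples: ("alt", [r1, ...]), ("seq", r1, r2),
# ("star", r), the constants "ZERO"/"ONE", or a plain character.

def flatten_alts(rs):
    """Splice nested alternatives into one flat list, dropping ZEROs."""
    out = []
    for r in rs:
        if isinstance(r, tuple) and r[0] == "alt":
            out.extend(flatten_alts(r[1]))
        elif r != "ZERO":
            out.append(r)
    return out

def distinct(rs):
    """Keep only the first occurrence of each alternative."""
    seen, out = set(), []
    for r in rs:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

def distribute(r):
    """Open up (r1+r2)*r3 into r1*r3 + r2*r3 (one level only)."""
    if isinstance(r, tuple) and r[0] == "seq" and \
       isinstance(r[1], tuple) and r[1][0] == "alt":
        return ("alt", [("seq", ri, r[2]) for ri in r[1][1]])
    return r

# (a+b)c + bc  ~>  ac + bc + bc  ~>  ac + bc
r1 = ("alt", [("seq", ("alt", ["a", "b"]), "c"), ("seq", "b", "c")])
opened = flatten_alts([distribute(x) for x in r1[1]])
assert distinct(opened) == [("seq", "a", "c"), ("seq", "b", "c")]

# (a*+a) + (a*+1) + (a+1)  ~>  a* + a + 1
r2 = ("alt", [("alt", [("star", "a"), "a"]),
              ("alt", [("star", "a"), "ONE"]),
              ("alt", ["a", "ONE"])])
assert distinct(flatten_alts(r2[1])) == [("star", "a"), "a", "ONE"]
```

Note that \textit{distinct} keeps the \emph{first} copy of each alternative, which matters for preserving the POSIX-preferred branch.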

In order to implement the idea of ``spilling out alternatives'' and to
make them compatible with the $\text{inj}$-mechanism, we use
\emph{bitcodes}. Bits and bitcodes (lists of bits) are just:

%This allows us to prove a tight
%bound on the size of regular expression during the running time of the
%algorithm if we can establish the connection between our simplification
%rules and partial derivatives.

bs ::= [] \mid b:bs    
$
\end{center}

\noindent
The $S$ and $Z$ are arbitrary names for the bits in order to avoid 
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes: 

     $\textit{STAR}\,(bs\,@\,bs')\,a$
\end{tabular}    
\end{center}  

\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:\comment{You need to be consistent with ALTS and ALT; ALT
is just an abbreviation; derivations and so on are defined for
ALTS}\comment{no, this is not the case: ALT is for two regexes, ALTS for a
list...\textcolor{blue}{This does not make sense to me. CU}}

 %\begin{definition}{bder}
\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
then the right branch of $ALTS$, which is $\textit{fuse} \; (\textit{bmkeps} \; a_1) \; (a_2
\backslash c)$, will collapse the regular expression $a_1$ (as it has
already been fully matched) and store the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing to it. The
bitsequence $bs$, which was initially attached to the head of $SEQ$, has
now been elevated to the top-level of $ALT$, as this information will be
needed whichever way the $SEQ$ is matched---no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:

  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In this definition $\_\backslash s$ is the generalisation of the derivative
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).
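As a concrete illustration of this generalisation, here is a minimal sketch. It is our own Python rendering for plain, un-annotated regular expressions under an ad-hoc tuple encoding; the annotated version folds $\textit{bder}$ over the string in exactly the same way.

```python
# A sketch (ours, in Python) of lifting the character derivative to
# strings for plain regexes: r \ s is a fold of der over s.
from functools import reduce

# Regexes: "ZERO", "ONE", a single character, ("alt", r1, r2),
# ("seq", r1, r2) or ("star", r).

def nullable(r):
    if r == "ONE":
        return True
    if r == "ZERO" or isinstance(r, str):    # a single character
        return False
    tag = r[0]
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    return True                              # star

def der(c, r):
    """Derivative w.r.t. a single character c."""
    if r in ("ZERO", "ONE"):
        return "ZERO"
    if isinstance(r, str):
        return "ONE" if r == c else "ZERO"
    tag = r[0]
    if tag == "alt":
        return ("alt", der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        left = ("seq", der(c, r[1]), r[2])
        return ("alt", left, der(c, r[2])) if nullable(r[1]) else left
    return ("seq", der(c, r[1]), r)          # star

def ders(s, r):
    """Derivative w.r.t. a string: fold the character derivative."""
    return reduce(lambda acc, c: der(c, acc), s, r)

r = ("seq", "a", ("star", "b"))              # a . b*
assert nullable(ders("abb", r))              # "abb" matches a . b*
assert not nullable(ders("ba", r))           # "ba" does not
```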

The main point of the bitcodes and annotated regular expressions is that
we can apply rather aggressive (in terms of size) simplification rules
When it detects that the regular expression is an alternative or a
sequence, it will try to simplify its children regular expressions
recursively and then see if one of the children turns into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\textit{ALTS}$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
$\textit{ALTS}$ and reduce as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy and removes all later
duplicates. Function $\textit{flatten}$ opens up nested \textit{ALTS}.
Its recursive definition is given below:

 \begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{flatten} \; (\textit{ALTS}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \;
    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise) 
\end{tabular}    
\end{center}  

\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. In terms of regular expressions, it
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
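The clauses of $\textit{flatten}$ can be sketched as follows. This is our own Python rendering under an assumed tuple encoding $(\textit{ALTS}, bs, as)$ of annotated alternatives; $\textit{fuse}$ here is a simplified stand-in for the bit-attaching function, not the formal development's code.

```python
# A sketch (ours, in Python) of flatten: splice nested ALTS, push their
# bits inwards with fuse, and drop ZEROs.

def fuse(bs, a):
    """Attach the bits bs to the front of the annotated regex a."""
    if a == "ZERO":
        return "ZERO"
    tag, bs2, *rest = a
    return (tag, bs + bs2, *rest)

def flatten(as_):
    if not as_:
        return []
    a, rest = as_[0], as_[1:]
    if a == "ZERO":                          # ZEROs are removed
        return flatten(rest)
    if isinstance(a, tuple) and a[0] == "ALTS":
        # splice the nested alternative, pushing its bits inwards
        return [fuse(a[1], c) for c in a[2]] + flatten(rest)
    return [a] + flatten(rest)

# a + (b + c) with empty bitcodes opens up to a + b + c
CHAR = lambda c: ("CHAR", [], c)
nested = [CHAR("a"), ("ALTS", [], [CHAR("b"), CHAR("c")]), "ZERO"]
assert flatten(nested) == [CHAR("a"), CHAR("b"), CHAR("c")]
```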

Suppose we apply simplification after each derivative step, and view
these two operations as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can use the previous natural
extension from derivative w.r.t.~character to derivative
  Figure 6 yields POSIX parse trees. We have tested this claim
  extensively by using the method in Figure~3 as a reference but yet
  have to work out all proof details.''
\end{quote}  

\noindent We would like to settle this correctness claim. It is relatively
straightforward to establish that after one simplification step, the part of a
nullable derivative that corresponds to a POSIX value remains intact and can
still be collected; in other words, we can show that\comment{Double-check....I
think this is not the case}
%\comment{If i remember correctly, you have proved this lemma.
%I feel this is indeed not true because you might place arbitrary 
%bits on the regex r, however if this is the case, did i remember wrongly that
%you proved something like simplification does not affect $\textit{bmkeps}$ results?
%Anyway, i have amended this a little bit so it does not allow arbitrary bits attached
%to a regex. Maybe it works now.}

\begin{center}
$\textit{bmkeps} \; a = \textit{bmkeps} \; (\textit{bsimp} \; a)$ \quad (provided $a$ is $\textit{bnullable}$ and $\textit{decode}(r, \textit{bmkeps}(a))$ is a $\textit{POSIX}$ value)
\end{center}

\comment{\textcolor{blue}{I proved $bmkeps\,(bsimp\,a) = bmkeps\,a$ provided $a$ is 
$\textit{bnullable}$}}

\noindent
as this basically comes down to proving that actions like removing the
additional $r$ in $r+r$ do not delete important POSIX information in
a regular expression. The hard part of this proof is to establish that

\begin{center}
	$\textit{bmkeps} \; \textit{blexer}_{simp}(s, \; r) = \textit{bmkeps} \; \textit{blexer} \; \textit{simp}(s, \; r)$
\end{center}\comment{This is not true either...look at the definion blexer/blexer-simp}

\noindent That is, if we build the derivative of the regular expression $r$ and then
simplify it, and repeat this process until we exhaust the string, we get a
regular expression $r''$ ($\textit{LHS}$) that provides the POSIX matching
information, which is exactly the same as the result $r'$ ($\textit{RHS}$ of the
$\textit{POSIX}$ information of a regular expression and how it is modified,
augmented and propagated 
during simplification in parallel with the regular expression that
has not been simplified in the subsequent derivative operations. To aid this,
we use the helper function \textit{retrieve} described by Sulzmann and Lu: \\definition
of retrieve TODO\comment{Did not read further}\\
This function assembles the bitcode that corresponds to a parse tree for how
the current derivative matches the suffix of the string (the characters that
have not yet appeared, but will appear as the successive derivatives go on).
How do we get this ``future'' information? By the value $v$, which is
computed by a pass of the algorithm that uses