updated and added comments
authorChristian Urban <urbanc@in.tum.de>
Tue, 23 Jul 2019 21:54:13 +0100
changeset 81 a0df84875788
parent 80 d9d61a648292
child 82 3153338ec6e4
updated and added comments
ninems/ninems.tex
--- a/ninems/ninems.tex	Tue Jul 23 09:26:22 2019 +0100
+++ b/ninems/ninems.tex	Tue Jul 23 21:54:13 2019 +0100
@@ -98,7 +98,7 @@
 Unfortunately these preconceptions are not supported by evidence: Take
 for example the regular expression $(a^*)^*\,b$ and ask whether
 strings of the form $aa..a$ match this regular
-expression. Obviously not---the expected $b$ in the last
+expression. Obviously this is not the case---the expected $b$ in the last
 position is missing. One would expect that modern regular expression
 matching engines can find this out very quickly. Alas, if one tries
 this example in JavaScript, Python or Java 8 with strings like 28
@@ -188,6 +188,7 @@
 report that they have found thousands of such evil regular expressions
 in the JavaScript and Python ecosystems \cite{Davis18}.
 
+\comment{Needs to be consistent: either exponential blowup; or quadratic blowup. Maybe you use the terminology superlinear, like in  the Davis et al paper}
 This exponential blowup in matching algorithms sometimes causes
 considerable grief in real life: for example on 20 July 2016 one evil
 regular expression brought the webpage
@@ -220,8 +221,8 @@
 support regular expressions that are not covered by the classical
 automata theory, and in this more general setting there are quite a few
 research questions still unanswered and fast algorithms still need to be
-developed (for example how to treat bounded repetitions, negation and
-back-references efficiently).
+developed (for example how to treat efficiently bounded repetitions, negation and
+back-references).
 %question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
 %how do they avoid dfas exponential states if they use them for fast matching?
 
@@ -229,28 +230,29 @@
 expressions and lexing, i.e.~the process of breaking up strings into
 sequences of tokens according to some regular expressions. In this
 setting one is not just interested in whether or not a regular
-expression matches a string, but also in \emph{how}.  Consider for example a regular expression
-$r_{key}$ for recognising keywords such as \textit{if}, \textit{then}
-and so on; and a regular expression $r_{id}$ for recognising
-identifiers (say, a single character followed by characters or
-numbers). One can then form the compound regular expression
-$(r_{key} + r_{id})^*$ and use it to tokenise strings.  But then how
-should the string \textit{iffoo} be tokenised?  It could be tokenised
-as a keyword followed by an identifier, or the entire string as a
-single identifier.  Similarly, how should the string \textit{if} be
+expression matches a string, but also in \emph{how}.  Consider for
+example a regular expression $r_{key}$ for recognising keywords such as
+\textit{if}, \textit{then} and so on; and a regular expression $r_{id}$
+for recognising identifiers (say, a single character followed by
+characters or numbers). One can then form the compound regular
+expression $(r_{key} + r_{id})^*$ and use it to tokenise strings.  But
+then how should the string \textit{iffoo} be tokenised?  It could be
+tokenised as a keyword followed by an identifier, or the entire string
+as a single identifier.  Similarly, how should the string \textit{if} be
 tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
 ``fire''---so is it an identifier or a keyword?  While in applications
 there is a well-known strategy to decide these questions, called POSIX
 matching, only relatively recently precise definitions of what POSIX
-matching actually means has been formalised
-\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. 
-Such a definition has also been given by Sulzmann and  Lu \cite{Sulzmann2014}, but the
-corresponding correctness proof turned out to be  faulty \cite{AusafDyckhoffUrban2016}.
-Roughly, POSIX matching means matching the longest initial substring.
-In the case of a tie, the initial sub-match is chosen according to some priorities attached to the
-regular expressions (e.g.~keywords have a higher priority than
-identifiers). This sounds rather simple, but according to Grathwohl et
-al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
+matching actually means have been formalised
+\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Such a
+definition has also been given by Sulzmann and  Lu \cite{Sulzmann2014},
+but the corresponding correctness proof turned out to be  faulty
+\cite{AusafDyckhoffUrban2016}. Roughly, POSIX matching means matching
+the longest initial substring. In the case of a tie, the initial
+sub-match is chosen according to some priorities attached to the regular
+expressions (e.g.~keywords have a higher priority than identifiers).
+This sounds rather simple, but according to Grathwohl et al \cite[Page
+36]{CrashCourse2014} this is not the case. They wrote:
 
 \begin{quote}
 \it{}``The POSIX strategy is more complicated than the greedy because of 
@@ -348,7 +350,7 @@
 
 
 Now if we want to find out whether a string $s$ matches with a regular
-expression $r$, build the derivatives of $r$ w.r.t.\ (in succession)
+expression $r$, we can build the derivatives of $r$ w.r.t.\ (in succession)
 all the characters of the string $s$. Finally, test whether the
 resulting regular expression can match the empty string.  If yes, then
 $r$ matches $s$, and no in the negative case. To implement this idea
@@ -389,7 +391,7 @@
 YES/NO answer for whether a string is being matched by a regular
 expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
 to allow generation of an actual matching, called a \emph{value} or
-sometimes also \emph{lexical values}.  These values and regular
+sometimes also \emph{lexical value}.  These values and regular
 expressions correspond to each other as illustrated in the following
 table:
 
@@ -421,20 +423,21 @@
 \end{center}
 
 \noindent
-No value  corresponds to $\ZERO$; $\Empty$ corresponds to
-$\ONE$; $\Char$ to the character regular expression; $\Seq$ to the
-sequence regular expression and so on. The idea of values is to encode
-a kind of lexical value for how the sub-parts of a regular expression match
-the sub-parts of a string. To see this, suppose a \emph{flatten} operation, written
-$|v|$ for values. We can use this function to extract the underlying string of a value
-$v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \, (\textit{Char
-y})|$ is the string $xy$.  Using flatten, we can describe how values
-encode parse trees: $\Seq\,v_1\, v_2$ encodes a tree with 2 children nodes
-that tells how the string $|v_1| @
-|v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the
-substring $|v_1|$ and, respectively, $r_2$ matches the substring
-$|v_2|$. Exactly how these two are matched is contained in the
-children nodes $v_1$ and $v_2$ of parent $\textit{Seq}$ . 
+No value  corresponds to $\ZERO$; $\Empty$ corresponds to $\ONE$;
+$\Char$ to the character regular expression; $\Seq$ to the sequence
+regular expression and so on. The idea of values is to encode a kind of
+lexical value for how the sub-parts of a regular expression match the
+sub-parts of a string. To see this, suppose a \emph{flatten} operation,
+written $|v|$ for values. We can use this function to extract the
+underlying string of a value $v$. For example, $|\mathit{Seq} \,
+(\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$.  Using
+flatten, we can describe how values encode \comment{Avoid the notion
+parse trees! Also later!!}parse trees: $\Seq\,v_1\, v_2$ encodes a tree with two
+children nodes that tells how the string $|v_1| @ |v_2|$ matches the
+regex $r_1 \cdot r_2$ whereby $r_1$ matches the substring $|v_1|$ and,
+respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two
+are matched is contained in the children nodes $v_1$ and $v_2$ of parent
+$\textit{Seq}$. 
 
 To give a concrete example of how values work, consider the string $xy$
 and the regular expression $(x + (y + xy))^*$. We can view this regular
@@ -490,9 +493,9 @@
 values incrementally by \emph{injecting} back the characters into the
 earlier values $v_n, \ldots, v_0$. This is the second phase of the
 algorithm from the right to left. For the first value $v_n$, we call the
-function $\textit{mkeps}$, which builds the parse tree for how the empty
-string has been matched by the (nullable) regular expression $r_n$. This
-function is defined as
+function $\textit{mkeps}$, which builds the \comment{Avoid}parse tree
+for how the empty string has been matched by the (nullable) regular
+expression $r_n$. This function is defined as
 
 	\begin{center}
 		\begin{tabular}{lcl}
@@ -511,9 +514,9 @@
 these regular expression cannot match the empty string. Note
 also that in case of alternatives we give preference to the
 regular expression on the left-hand side. This will become
-important later on.
+important later on about what value is calculated.
 
-After this, we inject back the characters one by one in order to build
+After the $\mkeps$-call, we inject back the characters one by one in order to build
 the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
 ($s_i = c_i \ldots c_{n-1}$ ) from the previous parse tree $v_{i+1}$.
 After injecting back $n$ characters, we get the parse tree for how $r_0$
@@ -572,26 +575,30 @@
 Other clauses can be understood in a similar way.
 
 %\comment{Other word: insight?}
-The following example gives an insight of $\textit{inj}$'s effect
-and how Sulzmann and Lu's algorithm works as a whole.
- Suppose we have a
-regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it against
-the string $abc$ (when $abc$ is written as a regular expression, the most
-standard way of expressing it should be $a \cdot (b \cdot c)$. We omit
-the parentheses and dots here for readability). 
-This algorithm returns a POSIX value, which means it
-will go for the longest matching, i.e.~it should match the string
-$abc$ in one star iteration, using the longest alternative $abc$ in the
-sub-expression $((((a+b)+ab)+c)+abc)$ (we use $r$ to denote this sub-expression
-for conciseness). 
-Before $\textit{inj}$ comes into play, 
-our lexer first builds derivative using string $abc$ (we simplified some regular expressions like
-$\ZERO \cdot b$ to $\ZERO$ for conciseness; we also omit parentheses if
-they are clear from the context):
+The following example gives an insight of $\textit{inj}$'s effect and
+how Sulzmann and Lu's algorithm works as a whole. Suppose we have a
+regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it
+against the string $abc$ (when $abc$ is written as a regular expression,
+the standard way of expressing it is $a \cdot (b \cdot c)$. But we
+usually omit the parentheses and dots here for better readability. This
+algorithm returns a POSIX value, which means it will produce the longest
+matching. Consequently, it matches the string $abc$ in one star
+iteration, using the longest alternative $abc$ in the sub-expression (we shall use $r$ to denote this
+sub-expression for conciseness):
+
+\[((((a+b)+ab)+c)+\underbrace{abc}_r)\] 
+
+\noindent
+Before $\textit{inj}$ is called, our lexer first builds derivative using
+string $abc$ (we simplified some regular expressions like $\ZERO \cdot
+b$ to $\ZERO$ for conciseness; we also omit parentheses if they are
+clear from the context):
+
 %Similarly, we allow
 %$\textit{ALT}$ to take a list of regular expressions as an argument
 %instead of just 2 operands to reduce the nested depth of
 %$\textit{ALT}$
+
 \begin{center}
 \begin{tabular}{lcl}
 $r^*$ & $\xrightarrow{\backslash a}$ & $r_1 = (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$\\
@@ -604,13 +611,15 @@
 \noindent
 In  case $r_3$ is nullable, we can call $\textit{mkeps}$ 
 to construct a parse tree for how $r_3$ matched the string $abc$. 
-$\textit{mkeps}$ gives the following value $v_3$: 
+This function gives the following value $v_3$: 
+
+
 \begin{center}
 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
 \end{center}
 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$(underlined):
 
-\begin{center}
+\begin{center}\comment{better layout}
    $( \underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*} + (\ZERO+\ZERO+\ZERO + \ONE + \ZERO)
   \cdot r^*) +((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*+(\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* ).$
  \end{center}
@@ -621,10 +630,10 @@
  $\textit{mkeps}$ to pick it up because $\textit{mkeps}$ is defined to always choose the
  left one when it is nullable. In the case of this example, $abc$ is
  preferred over $a$ or $ab$. This $\Left(\Left(\ldots))$ location is
- naturally generated by two applications of the splitting clause
+ generated by two applications of the splitting clause
 
 \begin{center}
-     $(r_1 \cdot r_2)\backslash c  (when \; r_1 \; nullable) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$
+     $(r_1 \cdot r_2)\backslash c  \;\;(when \; r_1 \; nullable) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$
 \end{center}
        
 \noindent
@@ -640,7 +649,7 @@
 
 \noindent
 tells us how the empty string $[]$ is matched with $(\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot
-\ONE \cdot \ONE) \cdot r^*$. We match $[]$ by a sequence of 2 nullable regular
+\ONE \cdot \ONE) \cdot r^*$. We match $[]$ by a sequence of two nullable regular
 expressions. The first one is an alternative, we take the rightmost
 alternative---whose language contains the empty string. The second
 nullable regular expression is a Kleene star. $\Stars$ tells us how it
@@ -669,10 +678,11 @@
    for how $r$ matched $abc$. This completes the algorithm.
    
 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. 
-Readers might have noticed that the parse tree information 
-is actually already available when doing derivatives. 
-For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with $a$,
- we can either take the initial match to be 
+Readers might have noticed that the parse tree information is actually
+already available when doing derivatives. For example, immediately after
+the operation $\backslash a$ we know that if we want to match a string
+that starts with $a$, we can either take the initial match to be 
+
  \begin{center}
 \begin{enumerate}
     \item[1)] just $a$ or
@@ -683,9 +693,9 @@
 
 \noindent
 In order to differentiate between these choices, we just need to
-remember their positions--$a$ is on the left, $ab$ is in the middle ,
-and $abc$ is on the right. Which one of these alternatives is chosen
-later does not affect their relative position because our algorithm does
+remember their positions---$a$ is on the left, $ab$ is in the middle ,
+and $abc$ is on the right. Which of these alternatives is chosen
+later does not affect their relative position because the algorithm does
 not change this order. If this parsing information can be determined and
 does not change because of later derivatives, there is no point in
 traversing this information twice. This leads to an optimisation---if we
@@ -707,7 +717,7 @@
 
 Using bitcodes to guide  parsing is not a novel idea. It was applied to
 context free grammars and then adapted by Henglein and Nielson for
-efficient regular expression parsing using DFAs~\cite{nielson11bcre}.
+efficient regular expression \comment{?}parsing using DFAs~\cite{nielson11bcre}.
 Sulzmann and Lu took this idea of bitcodes a step further by integrating
 bitcodes into derivatives. The reason why we want to use bitcodes in
 this project is that we want to introduce more aggressive
@@ -763,18 +773,23 @@
 concatenation of them. Antimirov has proved a tight bound of the size of
 \emph{all} partial derivatives no matter what the string looks like.
 Roughly speaking the size will be quadruple in the size of the regular
-expression. If we want the size of derivatives in Sulzmann and Lu's
-algorithm to stay equal or below this bound, we would need more
+expression.\comment{Are you sure? I have just proved that the sum of
+sizes in $pder$ is less or equal $(1 + size\;r)^3$. And this is surely
+not the best bound.} If we want the size of derivatives in Sulzmann and
+Lu's algorithm to stay equal or below this bound, we would need more
 aggressive simplifications. Essentially we need to delete useless
 $\ZERO$s and $\ONE$s, as well as deleting duplicates whenever possible.
 For example, the parentheses in $(a+b) \cdot c + bc$ can be opened up to
-get $a\cdot c +  b \cdot c + b \cdot c$, and then simplified to just $a \cdot
-c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+ \ONE) + (a
-+\ONE)$ to just $a^*+a+\ONE$. Adding these more aggressive simplification
-rules helps us to achieve the same size bound as that of the partial
-derivatives. In order to implement the idea of ``spilling out alternatives''
-and to make them compatible with the $\text{inj}$-mechanism, we use \emph{bitcodes}.
-Bits and bitcodes (lists of bits) are just:
+get $a\cdot c +  b \cdot c + b \cdot c$, and then simplified to just $a
+\cdot c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+
+\ONE) + (a +\ONE)$ to just $a^*+a+\ONE$. Adding these more aggressive
+simplification rules helps us to achieve the same size bound as that of
+the partial derivatives. 
+
+In order to implement the idea of ``spilling out alternatives'' and to
+make them compatible with the $\text{inj}$-mechanism, we use
+\emph{bitcodes}. Bits and bitcodes (lists of bits) are just:
+
 %This allows us to prove a tight
 %bound on the size of regular expression during the running time of the
 %algorithm if we can establish the connection between our simplification
@@ -792,7 +807,7 @@
 \end{center}
 
 \noindent
-The names $S$ and $Z$  here are quite arbitrary in order to avoid 
+The $S$ and $Z$ are arbitrary names for the bits in order to avoid 
 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
 bit-lists) can be used to encode values (or incomplete values) in a
 compact form. This can be straightforwardly seen in the following
@@ -925,12 +940,14 @@
 \end{center}  
 
 \noindent
-After internalise we do successive derivative operations on the
-annotated regular expressions. This derivative operation is the same as
-what we previously have for the simple regular expressions, except that
-we beed to take special care of the bitcodes:\comment{You need to be consitent with  ALTS and ALT; ALT is just 
-an abbreviation; derivations and so on are defined for ALTS}\comment{no this is not the case,
-ALT for 2 regexes, ALTS for a list}
+After internalising the regular expression, we perform successive
+derivative operations on the annotated regular expressions. This
+derivative operation is the same as what we had previously for the
+basic regular expressions, except that we beed to take care of
+the bitcodes:\comment{You need to be consitent with  ALTS and ALT; ALT
+is just an abbreviation; derivations and so on are defined for
+ALTS}\comment{no this is not the case, ALT for 2 regexes, ALTS for a
+list...\textcolor{blue}{This does not make sense to me. CU}}
 
  %\begin{definition}{bder}
 \begin{center}
@@ -968,8 +985,8 @@
 already been fully matched) and store the parsing information at the
 head of the regular expression $a_2 \backslash c$ by fusing to it. The
 bitsequence $bs$, which was initially attached to the head of $SEQ$, has
-now been elevated to the top-level of ALT, as this information will be
-needed whichever way the $SEQ$ is matched--no matter whether $c$ belongs
+now been elevated to the top-level of $ALT$, as this information will be
+needed whichever way the $SEQ$ is matched---no matter whether $c$ belongs
 to $a_1$ or $ a_2$. After building these derivatives and maintaining all
 the lexing information, we complete the lexing by collecting the
 bitcodes using a generalised version of the $\textit{mkeps}$ function
@@ -1011,7 +1028,7 @@
 \end{center}
 
 \noindent
-In this definition $_\backslash s$ is the  generalisation  of the derivative
+In this definition $\_\backslash s$ is the  generalisation  of the derivative
 operation from characters to strings (just like the derivatives for un-annotated
 regular expressions).
 
@@ -1062,10 +1079,10 @@
 recursively and then see if one of the children turn into $\ZERO$ or
 $\ONE$, which might trigger further simplification at the current level.
 The most involved part is the $\textit{ALTS}$ clause, where we use two
-auxiliary functions flatten and distinct to open up nested
+auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
 $\textit{ALTS}$ and reduce as many duplicates as possible. Function
-distinct  keeps the first occurring copy only and remove all later ones
-when detected duplicates. Function flatten opens up nested \textit{ALTS}.
+$\textit{distinct}$  keeps the first occurring copy only and remove all later ones
+when detected duplicates. Function $\textit{flatten}$ opens up nested \textit{ALTS}.
 Its recursive definition is given below:
 
  \begin{center}
@@ -1078,9 +1095,9 @@
 \end{center}  
 
 \noindent
-Here flatten behaves like the traditional functional programming flatten
-function, except that it also removes $\ZERO$s. What it does is
-basically removing parentheses like changing $a+(b+c)$ into $a+b+c$.
+Here $\textit{flatten}$ behaves like the traditional functional programming flatten
+function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
+removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
 
 Suppose we apply simplification after each derivative step, and view
 these two operations as an atomic one: $a \backslash_{simp}\,c \dn
@@ -1139,20 +1156,23 @@
   have to work out all proof details.''
 \end{quote}  
 
-\noindent We would settle this correctness claim. It is relatively
+\noindent We like to settle this correctness claim. It is relatively
 straightforward to establish that after one simplification step, the part of a
 nullable derivative that corresponds to a POSIX value remains intact and can
 still be collected, in other words, we can show that\comment{Double-check....I
-think this  is not the case}\comment{If i remember correctly, you have proved this lemma.
-I feel this is indeed not true because you might place arbitrary 
-bits on the regex r, however if this is the case, did i remember wrongly that
-you proved something like simplification does not affect $\textit{bmkeps}$ results?
-Anyway, i have amended this a little bit so it does not allow arbitrary bits attached
-to a regex. Maybe it works now.}
+think this  is not the case}
+%\comment{If i remember correctly, you have proved this lemma.
+%I feel this is indeed not true because you might place arbitrary 
+%bits on the regex r, however if this is the case, did i remember wrongly that
+%you proved something like simplification does not affect $\textit{bmkeps}$ results?
+%Anyway, i have amended this a little bit so it does not allow arbitrary bits attached
+%to a regex. Maybe it works now.}
 
 \begin{center}
 $\textit{bmkeps} \; a = \textit{bmkeps} \; \textit{bsimp} \; a\;( a\; \textit{bnullable} and \textit{decode}(r, \textit{bmkeps}(a)) is a \textit{POSIX} value)$
 \end{center}
+\comment{\textcolor{blue}{I proved $bmkeps\,(bsimp\,a) = bmkeps\,a$ provided $a$ is 
+$\textit{bnullable}$}}
 
 \noindent
 as this basically comes down to proving actions like removing the
@@ -1161,7 +1181,7 @@
 
 \begin{center}
 	$\textit{bmkeps} \; \textit{blexer}\_{simp}(s, \; r) = \textit{bmkeps} \; \textit{blexer} \; \textit{simp}(s, \; r)$
-\end{center}
+\end{center}\comment{This is not true either...look at the definion blexer/blexer-simp}
 
 \noindent That is, if we do derivative on regular expression $r$ and then
 simplify it, and repeat this process until we exhaust the string, we get a
@@ -1181,7 +1201,7 @@
 during simplification in parallel with the regularr expression that
 has not been simplified in the subsequent derivative operations.  To aid this,
 we use the helping function retrieve described by Sulzmann and Lu: \\definition
-of retrieve TODO\\
+of retrieve TODO\comment{Did not read further}\\
 This function assembles the bitcode that corresponds to a parse tree for how
 the current derivative matches the suffix of the string(the characters that
 have not yet appeared, but will appear as the successive derivatives go on,