# HG changeset patch
# User Christian Urban
# Date 1562580632 -3600
# Node ID cab5eab1f6f1ec2a61df5e9aefca730b2f1905ef
# Parent  4c7173b7ddca6c94cc92a6e5155cc987ab331725
more proofreading

diff -r 4c7173b7ddca -r cab5eab1f6f1 ninems/ninems.tex
--- a/ninems/ninems.tex	Sun Jul 07 22:23:35 2019 +0100
+++ b/ninems/ninems.tex	Mon Jul 08 11:10:32 2019 +0100
@@ -6,9 +6,13 @@
 \usepackage{amsmath}
 \usepackage[noend]{algpseudocode}
 \usepackage{enumitem}
+\usepackage{nccmath}
 \definecolor{darkblue}{rgb}{0,0,0.6}
 \hypersetup{colorlinks=true,allcolors=darkblue}
+\newcommand{\comment}[1]%
+{{\color{red}$\Rightarrow$}\marginpar{\raggedright\small{\bf\color{red}#1}}}
+
 % \documentclass{article}
 %\usepackage[utf8]{inputenc}
 %\usepackage[english]{babel}
@@ -83,6 +87,7 @@
 \section{Introduction}
 
+
 This PhD-project is about regular expression matching and lexing. Given
 the maturity of this topic, the reader might wonder: Surely, regular
 expressions must have already been studied to death?
@@ -447,12 +452,14 @@
 is generated assuming the regular expression matches the string.
 Pictorially, the algorithm is as follows:
 
+\begin{ceqn}
 \begin{equation}\label{graph:2}
 \begin{tikzcd}
 r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
 \end{tikzcd}
 \end{equation}
+\end{ceqn}
 
 \noindent
 For convenience, we shall employ the following notations: the regular
 expression we
@@ -545,7 +552,7 @@
 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
 Other clauses can be understood in a similar way.
 
-The following example gives a taste of $\textit{inj}$'s effect
+The following example gives an insight into $\textit{inj}$'s effect
 and how Sulzmann and Lu's algorithm works as a whole. Suppose we have a
 regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it
 against
@@ -575,46 +582,53 @@
 \end{center}
 
 \noindent
-Now when $nullable$ gives a $yes$ on $r_3$, we call $mkeps$
+Since $r_3$ is nullable, we can call $mkeps$
 to construct a parse tree for how $r_3$ matched the string $abc$.
 $mkeps$ gives the following value $v_3$:
 \begin{center}
 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
 \end{center}
 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of
 $r_3$ (underlined):
+
 \begin{center}
   $( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0 + 1 + 0)
 \cdot r^*) +((0+1+0 + 0 + 0) \cdot r^*+(0+0+0 + 1 + 0) \cdot r^* ).$
- \end{center}
- Note that the leftmost location of term $((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$
- (which corresponds to the initial sub-match $abc$)
- allows $mkeps$ to pick it up
- because $mkeps$ is defined to always choose the left one when it is nullable.
- In the case of this example, $abc$ is preferred over $a$ or $ab$.
- This $\Left(\Left(\ldots))$ location is naturally generated by
- two applications of the splitting clause
- \begin{center}
+ \end{center}
+
+\noindent
+ Note that the leftmost location of the term $(0+0+0 + 0 + 1 \cdot 1 \cdot
+ 1) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows
+ $mkeps$ to pick it up because $mkeps$ is defined to always choose the
+ left one when it is nullable. In the case of this example, $abc$ is
+ preferred over $a$ or $ab$.
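+
+To make the above more concrete, here is a small sketch of
+$\textit{nullable}$ and $\textit{mkeps}$ in Scala. This is only a
+sketch: the datatype and constructor names are ours and are not fixed by
+anything above.
+
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+abstract class Val
+case object Empty extends Val
+case class Chr(c: Char) extends Val
+case class Left(v: Val) extends Val
+case class Right(v: Val) extends Val
+case class Sequ(v1: Val, v2: Val) extends Val
+case class Stars(vs: List[Val]) extends Val
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+// mkeps is only called on nullable regular expressions, so ZERO
+// and CHAR need no clauses; for alternatives it prefers the left
+// branch whenever that is nullable -- this preference is what
+// makes abc win over a and ab in the example above
+def mkeps(r: Rexp): Val = r match {
+  case ONE => Empty
+  case ALT(r1, r2) =>
+    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
+  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
+  case STAR(_) => Stars(Nil)
+}
+\end{verbatim}
+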
+This $\Left(\Left(\ldots))$ location is
+naturally generated by two applications of the splitting clause
+
+\begin{center}
 $(r_1 \cdot r_2)\backslash c \;\, (\textit{when} \; r_1 \; \textit{nullable}) \,=\, (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$
- \end{center}
- By this clause, we put
-$r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$.
-This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match.
- Removing the outside $\Left(\Left(...))$, the inside sub-value
- \begin{center}
+\end{center}
+
+\noindent
+By this clause, we put $r_1 \backslash c \cdot r_2 $ at the
+\textit{front} and $r_2 \backslash c$ at the \textit{back}. This allows
+$mkeps$ to always pick, among two matches, the one with the longer
+initial sub-match. Removing the outside $\Left(\Left(\ldots))$, the
+inside sub-value
+
+\begin{center}
 $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$
- \end{center}
-tells us how the empty string $[]$ is matched with $(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$.
-We match $[]$ by a sequence of 2 nullable regular expressions.
-The first one is an alternative, we take the rightmost alternative---whose language
-contains the empty string. The second nullable regular
-expression is a Kleene star. $\Stars$ tells us how it
-generates the nullable regular expression: by 0 iterations to form $\epsilon$.
-Now $\textit{inj}$ injects characters back and
- incrementally builds a parse tree based on $v_3$.
-Using the value $v_3$, the character c, and the regular expression $r_2$,
-we can recover how $r_2$ matched the string $[c]$ :
- $\textit{inj} \; r_2 \; c \; v_3$ gives us
+\end{center}
+
+\noindent
+tells us how the empty string $[]$ is matched with $(0+0+0 + 0 + 1 \cdot
+1 \cdot 1) \cdot r^*$. We match $[]$ by a sequence of two nullable
+regular expressions: the first one is an alternative, of which we take
+the rightmost branch---the one whose language contains the empty string;
+the second one is a Kleene star, where $\Stars$ tells us how it
+generates $\epsilon$, namely by zero iterations. Now $\textit{inj}$
+injects characters back and incrementally builds a parse tree based on
+$v_3$. Using the value $v_3$, the character $c$, and the regular
+expression $r_2$, we can recover how $r_2$ matched the string $[c]$:
+$\textit{inj} \; r_2 \; c \; v_3$ gives us
 \begin{center}
 $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$
 \end{center}
@@ -646,12 +660,21 @@
 \item[3)] string $abc$.
 \end{enumerate}
 \end{center}
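+
+How the derivative, $\textit{mkeps}$ and $\textit{inj}$ fit together can
+be sketched in Scala as follows. Again, this is only a sketch, reusing
+the datatypes from the previous code block; \texttt{lex} is our name for
+the overall algorithm:
+
+\begin{verbatim}
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    // the splitting clause: (r1\c).r2 is deliberately put at
+    // the front, r2\c at the back
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// inj undoes one derivative step in the value
+def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
+  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
+  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
+  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
+  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
+  case (CHAR(_), Empty) => Chr(c)
+}
+
+// derivatives going forward, injections coming back
+// (cf. the picture in the previous section)
+def lex(r: Rexp, s: List[Char]): Val = s match {
+  case Nil => if (nullable(r)) mkeps(r) else sys.error("not matched")
+  case c :: cs => inj(r, c, lex(der(c, r), cs))
+}
+\end{verbatim}
+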
-In order to differentiate between these choices,
-we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right.
-Which one of these alternatives is chosen later does not affect their relative position because our algorithm
-does not change this order. If this parsing information can be determined and does
-not change because of later derivatives,
-there is no point in traversing this information twice. This leads to an optimisation---if we store the information for parse trees inside the regular expression, update it when we do derivative on them, and collect the information when finished with derivatives and call $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing the rest $n$ injections. This leads to Sulzmann and Lu's novel idea of using bit-codes in derivatives.
+
+\noindent
+In order to differentiate between these choices, we just need to
+remember their positions---$a$ is on the left, $ab$ is in the middle,
+and $abc$ is on the right. Which one of these alternatives is chosen
+later does not affect their relative position, because our algorithm
+does not change this order. If this parsing information can be
+determined and does not change because of later derivatives, there is no
+point in traversing this information twice. This leads to an
+optimisation: if we store the parse tree information inside the regular
+expression, update it whenever we take a derivative, and collect the
+information when finished with the derivatives by calling $mkeps$ to
+decide which branch is POSIX, then we can generate the parse tree in one
+pass, instead of performing the remaining $n$ injections. This leads to
+Sulzmann and Lu's novel idea of using bitcodes in derivatives.
 
 In the next section, we shall focus on the bit-coded algorithm and the
 process of simplification of regular expressions. This is needed in
@@ -662,61 +685,75 @@
 \section{Simplification of Regular Expressions}
-
-Using bit-codes to guide parsing is not a novel idea. It was applied to
+Using bitcodes to guide parsing is not a novel idea. It was applied to
 context-free grammars and then adapted by Henglein and Nielson for
-efficient regular expression parsing using DFAs \cite{nielson11bcre}. Sulzmann and
-Lu took a step further by integrating bitcodes into derivatives.
-The reason why we want to use bitcodes in
-this project is that we want to introduce aggressive
-simplifications.
-
-The main drawback of building
-successive derivatives according to Brzozowski's definition is that they
-can grow very quickly in size. This is mainly due to the fact that the
-derivative operation generates often ``useless'' $\ZERO$s and $\ONE$s in
+efficient regular expression parsing using DFAs~\cite{nielson11bcre}.
+Sulzmann and Lu took this idea a step further by integrating bitcodes
+into derivatives. The reason why we want to use bitcodes in this project
+is that we want to introduce more aggressive simplifications in order to
+keep the size of derivatives small throughout. This is because the main
+drawback of building successive derivatives according to Brzozowski's
+definition is that they can grow very quickly in size. This is mainly
+due to the fact that the derivative operation often generates
+``useless'' $\ZERO$s and $\ONE$s in
 derivatives. As a result, if implemented naively, both algorithms, by
-Brzozowski and by Sulzmann and Lu are excruciatingly slow.
-For example when starting with the regular expression $(a + aa)^*$ and building 12
+Brzozowski and by Sulzmann and Lu, are excruciatingly slow. For example,
+when starting with the regular expression $(a + aa)^*$ and building 12
 successive derivatives w.r.t.~the character $a$, one obtains a derivative
 regular expression with more than 8000 nodes (when viewed as a tree).
 Operations like derivative and $\nullable$ need to traverse such trees and
 consequently the bigger the size of the derivative the slower the
 algorithm.
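+
+This blow-up is easy to reproduce with the sketch from above: the
+following hypothetical snippet counts the nodes of successive
+derivatives of $(a + aa)^*$; the function \texttt{size} is ours:
+
+\begin{verbatim}
+// number of nodes of a regular expression, viewed as a tree
+def size(r: Rexp): Int = r match {
+  case ZERO | ONE | CHAR(_) => 1
+  case ALT(r1, r2) => 1 + size(r1) + size(r2)
+  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
+  case STAR(r1) => 1 + size(r1)
+}
+
+val r: Rexp = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
+
+// sizes of r, r\a, (r\a)\a, ... up to 12 derivatives, without
+// any simplification; the sizes grow very quickly
+val sizes = (1 to 12).scanLeft(r)((ri, _) => der('a', ri)).map(size)
+\end{verbatim}
+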
-Fortunately, one can simplify regular expressions
-after each derivative step. Various simplifications of regular
-expressions are possible, such as the simplifications of $\ZERO + r$, $r
-+ \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just $r$. These
-simplifications do not affect the answer for whether a regular
-expression matches a string or not, but fortunately also do not affect
-the POSIX strategy of how regular expressions match strings---although
-the latter is much harder to establish.
-The argument for complicating the data structures from basic regular
-expressions to those with bitcodes is that we can introduce
-simplification without making the algorithm crash or overly complex to
-reason about. The latter is crucial for a correctness proof.
-Some initial results in this
-regard have been obtained in \cite{AusafDyckhoffUrban2016}. However,
-what has not been achieved yet is correctness for the bit-coded algorithm
-that involves simplifications and a very tight bound for the size. Such
-a tight bound is suggested by work of Antimirov who proved that
-(partial) derivatives can be bound by the number of characters contained
-in the initial regular expression \cite{Antimirov95}.
-Antimirov defined the \emph{partial derivatives} of regular expressions to be this:
-%TODO definition of partial derivatives
+Fortunately, one can simplify regular expressions after each derivative
+step. Various simplifications of regular expressions are possible, such
+as the simplifications of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
+\cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
+affect the answer for whether a regular expression matches a string or
+not, but fortunately also do not affect the POSIX strategy of how
+regular expressions match strings---although the latter is much harder
+to establish. The reason for making the data structures more
+complicated---moving from basic regular expressions to ones annotated
+with bitcodes---is that simplification can then be introduced without
+breaking the algorithm and without making it overly complex to reason
+about. The latter is crucial for a correctness proof. Some initial
+results in this regard have been obtained in
+\cite{AusafDyckhoffUrban2016}.
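+
+These rules can be sketched in Scala as follows, applied bottom-up after
+each derivative step. Note that this sketch is for the plain matcher
+only: rewriting regular expressions like this loses the information
+needed for building values, which is what the bitcodes introduced below
+are meant to preserve.
+
+\begin{verbatim}
+def simp(r: Rexp): Rexp = r match {
+  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, s2) => s2                      // 0 + r  =>  r
+    case (s1, ZERO) => s1                      // r + 0  =>  r
+    case (s1, s2) => if (s1 == s2) s1          // r + r  =>  r
+                     else ALT(s1, s2)
+  }
+  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ONE, s2) => s2                       // 1 . r  =>  r
+    case (s1, ONE) => s1                       // r . 1  =>  r
+    case (s1, s2) => SEQ(s1, s2)
+  }
+  case _ => r
+}
+\end{verbatim}
+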
+Unfortunately, the simplification rules outlined above are not
+sufficient to prevent an explosion for all regular expressions. We
+believe a tighter bound can be achieved that prevents an explosion in
+all cases. Such a tighter bound is suggested by work of Antimirov, who
+proved that (partial) derivatives can be bound by the number of
+characters contained in the initial regular expression
+\cite{Antimirov95}. He defined the \emph{partial derivatives} of
+regular expressions as follows:
+
 \begin{center}
 \begin{tabular}{lcl}
  $\textit{pder} \; c \; 0$ & $\dn$ & $\emptyset$\\
  $\textit{pder} \; c \; 1$ & $\dn$ & $\emptyset$ \\
  $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \textit{then} \; \{ 1 \} \; \textit{else} \; \emptyset$ \\
  $\textit{pder} \; c \; (r_1+r_2)$ & $\dn$ & $\textit{pder} \; c \; r_1 \cup \textit{pder} \; c \; r_2$ \\
- $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; nullable \; r_1 \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} \cup pder \; c \; r_2 \; \textit{else} \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} $ \\
+ $\textit{pder} \; c \; (r_1 \cdot r_2)$ & $\dn$ & $\textit{if} \; \textit{nullable} \; r_1 $\\
+ & & $\textit{then} \; \{ r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1 \} \cup \textit{pder} \; c \; r_2$\\
+ & & $\textit{else} \; \{ r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1 \} $ \\
 $\textit{pder} \; c \; (r^*)$ & $\dn$ & $ \{ r' \cdot r^* \mid r' \in \textit{pder} \; c \; r \} $ \\
 \end{tabular}
 \end{center}
 
- A partial derivative of a regular expression $r$ is essentially a set of regular expressions that are either $r$'s children expressions or a concatenation of them.
-Antimirov has proved a nice size bound of the size of partial derivatives. Roughly speaking the size will not exceed the fourth power of the number of nodes in that regular expression. Interestingly, we observed from experiment that after the simplification step, our regular expression has the same size or is smaller than the partial derivatives. This allows us to prove a tight bound on the size of regular expression during the running time of the algorithm if we can establish the connection between our simplification rules and partial derivatives.
+
+\noindent
+A partial derivative of a regular expression $r$ is essentially a set of
+regular expressions that are either $r$'s child expressions or a
+concatenation of them. Antimirov proved a bound on the size of partial
+derivatives: roughly speaking, their size will not exceed the fourth
+power of the number of nodes in the initial regular expression.
+Interestingly, we observed in experiments that after each simplification
+step---which, besides the rules above, opens up nested alternatives and
+removes $\ZERO$s and duplicates---our regular expressions have the same
+size as, or are smaller than, the corresponding partial derivatives.
+This would allow us to prove a tight bound on the size of the regular
+expressions during the running of the algorithm, provided we can
+establish a connection between our simplification rules and partial
+derivatives.
 
 %We believe, and have generated test
 %data, that a similar bound can be obtained for the derivatives in
@@ -744,7 +781,21 @@
    code(\Stars\,vs)$
 \end{tabular}
 \end{center}
-Here code encodes a value into a bit-sequence by converting Left into $\Z$, Right into $\S$, the start point of a non-empty star iteration into $\S$, and the border where a local star terminates into $\Z$. This conversion is apparently lossy, as it throws away the character information, and does not decode the boundary between the two operands of the sequence constructor. Moreover, with only the bitcode we cannot even tell whether the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The reason for choosing this compact way of storing information is that the relatively small size of bits can be easily moved around. In order to recover the bitcode back into values, we will need the regular expression as the extra information and decode it back into value:\\
+
+Here \textit{code} encodes a value into a bit-sequence by converting
+$\Left$ into $\Z$, $\Right$ into $\S$, the start of a star iteration
+into $\Z$, and the border where a local star terminates into $\S$ (this
+matches the use of $\S$ by $\textit{bmkeps}$ below). This conversion is
+apparently lossy, as it throws away the character information and does
+not record the boundary between the two operands of the sequence
+constructor. Moreover, with only the bitcode we cannot even tell whether
+the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The reason for
+choosing this compact way of storing information is that the relatively
+small bit-sequences can be easily moved around.
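+
+A sketch of this encoding in Scala, with \texttt{Z} and \texttt{S} as
+bit tokens, extending our earlier datatypes; the placement of the star
+markers follows the description above:
+
+\begin{verbatim}
+abstract class Bit
+case object Z extends Bit
+case object S extends Bit
+
+def code(v: Val): List[Bit] = v match {
+  case Empty => Nil
+  case Chr(_) => Nil               // character information is lost
+  case Left(v1) => Z :: code(v1)   // Left  ~>  Z
+  case Right(v1) => S :: code(v1)  // Right ~>  S
+  case Sequ(v1, v2) => code(v1) ::: code(v2)
+  case Stars(Nil) => List(S)       // the star terminates here
+  case Stars(v1 :: vs) =>          // one more star iteration
+    Z :: (code(v1) ::: code(Stars(vs)))
+}
+\end{verbatim}
+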
+In order to recover values from bit-sequences, we will need the regular
+expression as extra information and decode the bits back into a
+value:\\
+
+
 %\begin{definition}[Bitdecoding of Values]\mbox{}
 \begin{center}
 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
@@ -775,8 +826,10 @@
 %\end{definition}
 
-Sulzmann and Lu's integrated the bitcodes into annotated regular expressions by attaching them to the head of every substructure of a regular expression\cite{Sulzmann2014}. They are
-defined by the following grammar:
+Sulzmann and Lu integrated the bitcodes into annotated regular
+expressions by attaching them to the head of every substructure of a
+regular expression~\cite{Sulzmann2014}. These annotated regular
+expressions are defined by the following grammar:
 
 \begin{center}
 \begin{tabular}{lcl}
@@ -795,9 +848,11 @@
 information about the (POSIX) value that should be generated by the
 Sulzmann and Lu algorithm.
 
-To do lexing using annotated regular expressions, we shall first transform the
-usual (un-annotated) regular expressions into annotated regular
-expressions:\\
+To do lexing using annotated regular expressions, we shall first
+transform the usual (un-annotated) regular expressions into annotated
+regular expressions. This operation is called \emph{internalisation} and
+is defined as follows:
+
 %\begin{definition}
 \begin{center}
 \begin{tabular}{lcl}
@@ -815,7 +870,10 @@
 \end{center}
 %\end{definition}
 
-Here $fuse$ is an auxiliary function that helps to attach bits to the front of an annotated regular expression. Its definition goes as follows:
+\noindent
+In the fourth clause, $\textit{fuse}$ is an auxiliary function that
+helps to attach bits to the front of an annotated regular expression.
+Its definition is as follows:
+
 \begin{center}
 \begin{tabular}{lcl}
   $\textit{fuse}\,bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
@@ -832,9 +890,13 @@
 \end{tabular}
 \end{center}
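+
+Internalisation and \textit{fuse} can be sketched in Scala as follows;
+the annotated datatype \texttt{ARexp} and its constructor names are
+ours:
+
+\begin{verbatim}
+abstract class ARexp
+case object AZERO extends ARexp
+case class AONE(bs: List[Bit]) extends ARexp
+case class ACHAR(bs: List[Bit], c: Char) extends ARexp
+case class AALT(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
+case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
+case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp
+
+// attach bits to the front of an annotated regular expression
+def fuse(bs: List[Bit], r: ARexp): ARexp = r match {
+  case AZERO => AZERO
+  case AONE(cs) => AONE(bs ::: cs)
+  case ACHAR(cs, c) => ACHAR(bs ::: cs, c)
+  case AALT(cs, r1, r2) => AALT(bs ::: cs, r1, r2)
+  case ASEQ(cs, r1, r2) => ASEQ(bs ::: cs, r1, r2)
+  case ASTAR(cs, r1) => ASTAR(bs ::: cs, r1)
+}
+
+// internalisation: alternatives record with Z/S which branch
+// was taken
+def internalise(r: Rexp): ARexp = r match {
+  case ZERO => AZERO
+  case ONE => AONE(Nil)
+  case CHAR(c) => ACHAR(Nil, c)
+  case ALT(r1, r2) => AALT(Nil, fuse(List(Z), internalise(r1)),
+                                fuse(List(S), internalise(r2)))
+  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
+  case STAR(r1) => ASTAR(Nil, internalise(r1))
+}
+\end{verbatim}
+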
-After internalise we do successive derivative operations on the annotated regular expression.
- This derivative operation is the same as what we previously have for the simple regular expressions, except that we take special care of the bits :\\
-%\begin{definition}{bder}
+\noindent
+After internalisation we perform successive derivative operations on
+the annotated regular expression. This derivative operation is the same
+as the one defined previously for simple regular expressions, except
+that we take special care of the bits:\\
+
+%\begin{definition}{bder}
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\
@@ -877,6 +939,8 @@
 \end{tabular}
 \end{center}
 %\end{definition}
+
+\noindent
 This function completes the value information by travelling along the
 path of the regular expression that corresponds to a POSIX value and
 collecting all the bits, using $\S$ to indicate the end of star
 iterations. If we take the bits produced by $bmkeps$ and decode them,
@@ -942,13 +1006,15 @@
 Function flatten opens up nested $\textit{ALT}$s. Its recursive
 definition is given below:
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
-  $\textit{flatten} \; (\textit{ALT}\;bs\,as) :: as'$ & $\dn$ & $(\textit{ map fuse}( \textit{bs, \_} ) \textit{ as}) \; +\!+ \; \textit{flatten} \; as' $ \\
+  $\textit{flatten} \; ((\textit{ALT}\;bs\,as) :: as')$ & $\dn$ & $(\textit{map} \;
+  (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
   $\textit{flatten} \; (\textit{ZERO} :: as')$ & $\dn$ & $ \textit{flatten} \;  as' $ \\
-  $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as' $
+  $\textit{flatten} \; (a :: as')$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise)
 \end{tabular}
 \end{center}
 
-
- Here flatten behaves like the traditional functional programming flatten function,
+\noindent
+Here \textit{flatten} behaves like the traditional functional
+programming flatten function, except that it also removes $\ZERO$s;
 what it does is basically removing parentheses, like changing $a+(b+c)$
 into $a+b+c$.
 
 Suppose we apply simplification after each derivative step,
@@ -978,7 +1044,12 @@
 
-We are currently engaged in 2 tasks related to this algorithm.
+\section{Current Work}
+
+We are currently engaged in two tasks related to this algorithm.
+
+\begin{itemize}
+\item
 The first one is proving that our simplification rules actually do not
 affect the POSIX value that should be generated by the algorithm
 according to the specification of a POSIX value
@@ -1003,14 +1074,14 @@
 \noindent
 We want to settle this correctness claim first.
-It is relatively straightforward to establish that after 1 simplification step, the part of derivative that corresponds to a POSIX value remains intact and can still be collected, in other words,
-bmkeps r = bmkeps simp r
+It is relatively straightforward to establish that after one
+simplification step the part of the derivative that corresponds to a
+POSIX value remains intact and can still be collected; in other words,
+provided $r$ is nullable, we have
+\mbox{$\textit{bmkeps} \; r \,=\, \textit{bmkeps} \; (\textit{simp} \; r)$,}
 as this basically comes down to proving that actions like removing the
 additional $r$ in $r+r$ do not delete important POSIX information in a
 regular expression. The hard core of this problem is to prove that
+\mbox{$\textit{bmkeps} \; (\textit{bders} \; r \; s) \,=\, \textit{bmkeps} \; (\textit{bders}_{simp} \; r \; s)$,}
+where $\textit{bders}_{simp}$ builds the derivatives w.r.t.~a string $s$
+with simplification applied after each step. That is, if we build the
+derivatives of a regular expression $r$ with and without interleaved
+simplification, we can still extract from them the same POSIX value, if
+there is one. This is not as straightforward as the previous
+proposition, as the two regular expressions $r$ and $\textit{simp}\;r$
+might become very different regular expressions after repeated
+applications of $\textit{simp}$ and derivative.
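+
+For reference, here is how the bitcoded derivative and
+$\textit{bmkeps}$, which the two equations above are about, look in our
+running Scala sketch. The same caveats as before apply: this is an
+illustration, not the verified definitions.
+
+\begin{verbatim}
+def bnullable(r: ARexp): Boolean = r match {
+  case AZERO => false
+  case AONE(_) => true
+  case ACHAR(_, _) => false
+  case AALT(_, r1, r2) => bnullable(r1) || bnullable(r2)
+  case ASEQ(_, r1, r2) => bnullable(r1) && bnullable(r2)
+  case ASTAR(_, _) => true
+}
+
+// collect the bits along the mkeps-path of a nullable
+// annotated regular expression
+def bmkeps(r: ARexp): List[Bit] = r match {
+  case AONE(bs) => bs
+  case AALT(bs, r1, r2) =>
+    if (bnullable(r1)) bs ::: bmkeps(r1) else bs ::: bmkeps(r2)
+  case ASEQ(bs, r1, r2) => bs ::: bmkeps(r1) ::: bmkeps(r2)
+  case ASTAR(bs, _) => bs ::: List(S)  // S ends the iterations
+}
+
+// the bitcoded derivative: like der, but fusing bits so that
+// no value information is lost
+def bder(c: Char, r: ARexp): ARexp = r match {
+  case AZERO => AZERO
+  case AONE(_) => AZERO
+  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
+  case AALT(bs, r1, r2) => AALT(bs, bder(c, r1), bder(c, r2))
+  case ASEQ(bs, r1, r2) =>
+    if (bnullable(r1))
+      AALT(bs, ASEQ(Nil, bder(c, r1), r2),
+               fuse(bmkeps(r1), bder(c, r2)))
+    else ASEQ(bs, bder(c, r1), r2)
+  case ASTAR(bs, r1) =>
+    ASEQ(bs, fuse(List(Z), bder(c, r1)), ASTAR(Nil, r1))
+}
+\end{verbatim}
+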
-The crucial point is to find the "gene" of a regular expression and how it is kept intact during simplification.
-To aid this, we are utilizing the helping function retrieve described by Sulzmann and Lu:
+The crucial point is to find the ``gene'' of a regular expression---the
+core information that determines the POSIX value---and how it is kept
+intact during simplification. To aid this, we use the helping function
+\textit{retrieve} described by Sulzmann and Lu:
 \\definition of retrieve\\
 This function assembles the bitcode that corresponds to a parse tree for
 how the current derivative matches the suffix of the string (the
 characters that have not yet appeared, but are stored in the value).
 Sulzmann and Lu used this to connect the bit-coded algorithm to the
 older algorithm by the following equation:\\
@@ -1028,8 +1099,13 @@
 $\textit{retrieve} \; r \backslash s \; v\; = \; \textit{retrieve} \; r \backslash_{simp} s \; v'$.\\
 This would show that our simplified version of the regular expression
 still contains all the bitcodes needed.
-The second task is to speed up the more aggressive simplification. Currently it is slower than a naive simplification(the naive version as implemented in ADU of course can explode in some cases).
-So it needs to be explored how to make it faster. Our possibility would be to explore again the connection to DFAs. This is very much work in progress.
+\item
+The second task is to speed up the more aggressive simplification.
+Currently it is slower than the naive simplification (the naive version
+as implemented in ADU~\cite{AusafDyckhoffUrban2016} can of course
+explode in some cases). It therefore needs to be explored how to make
+the aggressive simplification faster. One possibility would be to
+explore again the connection to DFAs. This is very much work in
+progress.
+\end{itemize}
 
 \section{Conclusion}