diff -r ff7945a988a3 -r 856d025dbc15 ChengsongTanPhdThesis/Chapters/Chapter2.tex --- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex Thu May 26 20:51:40 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Chapter2.tex Fri May 27 18:27:39 2022 +0100 @@ -1,8 +1,896 @@ % Chapter Template -\chapter{Chapter Title Here} % Main chapter title +\chapter{Regular Expressions and Sulzmanna and Lu's Lexing Algorithm Without Bitcodes} % Main chapter title + +\label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts +%and notations we +%use for describing the lexing algorithm by Sulzmann and Lu, +%and then give the algorithm and its variant, and discuss +%why more aggressive simplifications are needed. + + +\section{Preliminaries} + +Suppose we have an alphabet $\Sigma$, the strings whose characters +are from $\Sigma$ +can be expressed as $\Sigma^*$. + +We use patterns to define a set of strings concisely. Regular expressions +are one of such patterns systems: +The basic regular expressions are defined inductively + by the following grammar: +\[ r ::= \ZERO \mid \ONE + \mid c + \mid r_1 \cdot r_2 + \mid r_1 + r_2 + \mid r^* +\] + +The language or set of strings defined by regular expressions are defined as +%TODO: FILL in the other defs +\begin{center} +\begin{tabular}{lcl} +$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\ +$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) \cap L \; (r_2)$\\ +\end{tabular} +\end{center} +Which are also called the "language interpretation". + + + +The Brzozowski derivative w.r.t character $c$ is an operation on the regex, +where the operation transforms the regex to a new one containing +strings without the head character $c$. + +Formally, we define first such a transformation on any string set, which +we call semantic derivative: +\begin{center} +$\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$ +\end{center} +Mathematically, it can be expressed as the + +If the $\textit{StringSet}$ happen to have some structure, for example, +if it is regular, then we have that it + +% Derivatives of a +%regular expression, written $r \backslash c$, give a simple solution +%to the problem of matching a string $s$ with a regular +%expression $r$: if the derivative of $r$ w.r.t.\ (in +%succession) all the characters of the string matches the empty string, +%then $r$ matches $s$ (and {\em vice versa}). + +The the derivative of regular expression, denoted as +$r \backslash c$, is a function that takes parameters +$r$ and $c$, and returns another regular expression $r'$, +which is computed by the following recursive function: + +\begin{center} +\begin{tabular}{lcl} + $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ + $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ + $d \backslash c$ & $\dn$ & + $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ +$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ +$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\ + & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ + & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ + $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ +\end{tabular} +\end{center} +\noindent +\noindent + +The $\nullable$ function tests whether the empty string $""$ +is in the language of $r$: + + +\begin{center} + \begin{tabular}{lcl} + $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ + $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ + $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ + $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ + $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ + $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ + \end{tabular} +\end{center} +\noindent +The empty set does not contain any string and +therefore not the empty string, the empty string +regular expression contains the empty string +by definition, the character regular expression +is the singleton that contains character only, +and therefore does not contain the empty string, +the alternative regular expression(or "or" expression) +might have one of its children regular expressions +being nullable and any one of its children being nullable +would suffice. The sequence regular expression +would require both children to have the empty string +to compose an empty string and the Kleene star +operation naturally introduced the empty string. + +We can give the meaning of regular expressions derivatives +by language interpretation: + + + + +\begin{center} +\begin{tabular}{lcl} + $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ + $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ + $d \backslash c$ & $\dn$ & + $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ +$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ +$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\ + & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ + & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ + $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ +\end{tabular} +\end{center} +\noindent +\noindent +The function derivative, written $\backslash c$, +defines how a regular expression evolves into +a new regular expression after all the string it contains +is chopped off a certain head character $c$. +The most involved cases are the sequence +and star case. +The sequence case says that if the first regular expression +contains an empty string then the second component of the sequence +might be chosen as the target regular expression to be chopped +off its head character. +The star regular expression's derivative unwraps the iteration of +regular expression and attaches the star regular expression +to the sequence's second element to make sure a copy is retained +for possible more iterations in later phases of lexing. + + +The main property of the derivative operation +that enables us to reason about the correctness of +an algorithm using derivatives is + +\begin{center} +$c\!::\!s \in L(r)$ holds +if and only if $s \in L(r\backslash c)$. +\end{center} + +\noindent +We can generalise the derivative operation shown above for single characters +to strings as follows: + +\begin{center} +\begin{tabular}{lcl} +$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ +$r \backslash [\,] $ & $\dn$ & $r$ +\end{tabular} +\end{center} + +\noindent +and then define Brzozowski's regular-expression matching algorithm as: + +\[ +match\;s\;r \;\dn\; nullable(r\backslash s) +\] + +\noindent +Assuming the a string is given as a sequence of characters, say $c_0c_1..c_n$, +this algorithm presented graphically is as follows: + +\begin{equation}\label{graph:*} +\begin{tikzcd} +r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO} +\end{tikzcd} +\end{equation} + +\noindent +where we start with a regular expression $r_0$, build successive +derivatives until we exhaust the string and then use \textit{nullable} +to test whether the result can match the empty string. It can be +relatively easily shown that this matcher is correct (that is given +an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$). + +Beautiful and simple definition. + +If we implement the above algorithm naively, however, +the algorithm can be excruciatingly slow. + + +\begin{figure} +\centering +\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=10000, + ytick={0,1000,...,10000}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={JavaScript}, + legend pos=north west, + legend cell align=left] +\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data}; +\end{axis} +\end{tikzpicture}\\ +\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings + of the form $\underbrace{aa..a}_{n}$.} +\end{tabular} +\caption{EightThousandNodes} \label{fig:EightThousandNodes} +\end{figure} + + +(8000 node data to be added here) +For example, when starting with the regular +expression $(a + aa)^*$ and building a few successive derivatives (around 10) +w.r.t.~the character $a$, one obtains a derivative regular expression +with more than 8000 nodes (when viewed as a tree)\ref{EightThousandNodes}. +The reason why $(a + aa) ^*$ explodes so drastically is that without +pruning, the algorithm will keep records of all possible ways of matching: +\begin{center} +$(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$ +\end{center} + +\noindent +Each of the above alternative branches correspond to the match +$aa $, $a \quad a$ and $a \quad a \cdot (a)$(incomplete). +These different ways of matching will grow exponentially with the string length, +and without simplifications that throw away some of these very similar matchings, +it is no surprise that these expressions grow so quickly. +Operations like +$\backslash$ and $\nullable$ need to traverse such trees and +consequently the bigger the size of the derivative the slower the +algorithm. + +Brzozowski was quick in finding that during this process a lot useless +$\ONE$s and $\ZERO$s are generated and therefore not optimal. +He also introduced some "similarity rules" such +as $P+(Q+R) = (P+Q)+R$ to merge syntactically +different but language-equivalent sub-regexes to further decrease the size +of the intermediate regexes. + +More simplifications are possible, such as deleting duplicates +and opening up nested alternatives to trigger even more simplifications. +And suppose we apply simplification after each derivative step, and compose +these two operations together as an atomic one: $a \backslash_{simp}\,c \dn +\textit{simp}(a \backslash c)$. Then we can build +a matcher without having cumbersome regular expressions. + + +If we want the size of derivatives in the algorithm to +stay even lower, we would need more aggressive simplifications. +Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as +deleting duplicates whenever possible. For example, the parentheses in +$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b +\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another +example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just +$a^*+a+\ONE$. Adding these more aggressive simplification rules help us +to achieve a very tight size bound, namely, + the same size bound as that of the \emph{partial derivatives}. + +Building derivatives and then simplify them. +So far so good. But what if we want to +do lexing instead of just a YES/NO answer? +This requires us to go back again to the world +without simplification first for a moment. +Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and +elegant(arguably as beautiful as the original +derivatives definition) solution for this. + +\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu} + + +They first defined the datatypes for storing the +lexing information called a \emph{value} or +sometimes also \emph{lexical value}. These values and regular +expressions correspond to each other as illustrated in the following +table: + +\begin{center} + \begin{tabular}{c@{\hspace{20mm}}c} + \begin{tabular}{@{}rrl@{}} + \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\ + $r$ & $::=$ & $\ZERO$\\ + & $\mid$ & $\ONE$ \\ + & $\mid$ & $c$ \\ + & $\mid$ & $r_1 \cdot r_2$\\ + & $\mid$ & $r_1 + r_2$ \\ + \\ + & $\mid$ & $r^*$ \\ + \end{tabular} + & + \begin{tabular}{@{\hspace{0mm}}rrl@{}} + \multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\ + $v$ & $::=$ & \\ + & & $\Empty$ \\ + & $\mid$ & $\Char(c)$ \\ + & $\mid$ & $\Seq\,v_1\, v_2$\\ + & $\mid$ & $\Left(v)$ \\ + & $\mid$ & $\Right(v)$ \\ + & $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\ + \end{tabular} + \end{tabular} +\end{center} + +\noindent + +Building on top of Sulzmann and Lu's attempt to formalize the +notion of POSIX lexing rules \parencite{Sulzmann2014}, +Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled +POSIX matching as a ternary relation recursively defined in a +natural deduction style. +With the formally-specified rules for what a POSIX matching is, +they proved in Isabelle/HOL that the algorithm gives correct results. + +But having a correct result is still not enough, +we want at least some degree of $\mathbf{efficiency}$. + + + +One regular expression can have multiple lexical values. For example +for the regular expression $(a+b)^*$, it has a infinite list of +values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$, +$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$, +$\ldots$, and vice versa. +Even for the regular expression matching a certain string, there could +still be more than one value corresponding to it. +Take the example where $r= (a^*\cdot a^*)^*$ and the string +$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. +The number of different ways of matching +without allowing any value under a star to be flattened +to an empty string can be given by the following formula: +\begin{equation} + C_n = (n+1)+n C_1+\ldots + 2 C_{n-1} +\end{equation} +and a closed form formula can be calculated to be +\begin{equation} + C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}} +\end{equation} +which is clearly in exponential order. + +A lexer aimed at getting all the possible values has an exponential +worst case runtime. Therefore it is impractical to try to generate +all possible matches in a run. In practice, we are usually +interested about POSIX values, which by intuition always +\begin{itemize} +\item +match the leftmost regular expression when multiple options of matching +are available +\item +always match a subpart as much as possible before proceeding +to the next token. +\end{itemize} + + + For example, the above example has the POSIX value +$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$. +The output of an algorithm we want would be a POSIX matching +encoded as a value. +The reason why we are interested in $\POSIX$ values is that they can +be practically used in the lexing phase of a compiler front end. +For instance, when lexing a code snippet +$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized +as an identifier rather than a keyword. + +The contribution of Sulzmann and Lu is an extension of Brzozowski's +algorithm by a second phase (the first phase being building successive +derivatives---see \eqref{graph:*}). In this second phase, a POSIX value +is generated in case the regular expression matches the string. +Pictorially, the Sulzmann and Lu algorithm is as follows: + +\begin{ceqn} +\begin{equation}\label{graph:2} +\begin{tikzcd} +r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ +v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] +\end{tikzcd} +\end{equation} +\end{ceqn} + + +\noindent +For convenience, we shall employ the following notations: the regular +expression we start with is $r_0$, and the given string $s$ is composed +of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the +left to right, we build the derivatives $r_1$, $r_2$, \ldots according +to the characters $c_0$, $c_1$ until we exhaust the string and obtain +the derivative $r_n$. We test whether this derivative is +$\textit{nullable}$ or not. If not, we know the string does not match +$r$ and no value needs to be generated. If yes, we start building the +values incrementally by \emph{injecting} back the characters into the +earlier values $v_n, \ldots, v_0$. This is the second phase of the +algorithm from the right to left. For the first value $v_n$, we call the +function $\textit{mkeps}$, which builds a POSIX lexical value +for how the empty string has been matched by the (nullable) regular +expression $r_n$. This function is defined as + + \begin{center} + \begin{tabular}{lcl} + $\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\ + $\mkeps(r_{1}+r_{2})$ & $\dn$ + & \textit{if} $\nullable(r_{1})$\\ + & & \textit{then} $\Left(\mkeps(r_{1}))$\\ + & & \textit{else} $\Right(\mkeps(r_{2}))$\\ + $\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\ + $mkeps(r^*)$ & $\dn$ & $\Stars\,[]$ + \end{tabular} + \end{center} + + +\noindent +After the $\mkeps$-call, we inject back the characters one by one in order to build +the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ +($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. +After injecting back $n$ characters, we get the lexical value for how $r_0$ +matches $s$. The POSIX value is maintained throught out the process. +For this Sulzmann and Lu defined a function that reverses +the ``chopping off'' of characters during the derivative phase. The +corresponding function is called \emph{injection}, written +$\textit{inj}$; it takes three arguments: the first one is a regular +expression ${r_{i-1}}$, before the character is chopped off, the second +is a character ${c_{i-1}}$, the character we want to inject and the +third argument is the value ${v_i}$, into which one wants to inject the +character (it corresponds to the regular expression after the character +has been chopped off). The result of this function is a new value. The +definition of $\textit{inj}$ is as follows: -\label{ChapterX} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX} +\begin{center} +\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} + $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ + $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ + $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\ + $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ + $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ + $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\ + $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\ +\end{tabular} +\end{center} + +\noindent This definition is by recursion on the ``shape'' of regular +expressions and values. +The clauses basically do one thing--identifying the ``holes'' on +value to inject the character back into. +For instance, in the last clause for injecting back to a value +that would turn into a new star value that corresponds to a star, +we know it must be a sequence value. And we know that the first +value of that sequence corresponds to the child regex of the star +with the first character being chopped off--an iteration of the star +that had just been unfolded. This value is followed by the already +matched star iterations we collected before. So we inject the character +back to the first value and form a new value with this new iteration +being added to the previous list of iterations, all under the $Stars$ +top level. + +We have mentioned before that derivatives without simplification +can get clumsy, and this is true for values as well--they reflect +the regular expressions size by definition. + +One can introduce simplification on the regex and values, but have to +be careful in not breaking the correctness as the injection +function heavily relies on the structure of the regexes and values +being correct and match each other. +It can be achieved by recording some extra rectification functions +during the derivatives step, and applying these rectifications in +each run during the injection phase. +And we can prove that the POSIX value of how +regular expressions match strings will not be affected---although is much harder +to establish. +Some initial results in this regard have been +obtained in \cite{AusafDyckhoffUrban2016}. + + + +%Brzozowski, after giving the derivatives and simplification, +%did not explore lexing with simplification or he may well be +%stuck on an efficient simplificaiton with a proof. +%He went on to explore the use of derivatives together with +%automaton, and did not try lexing using derivatives. + +We want to get rid of complex and fragile rectification of values. +Can we not create those intermediate values $v_1,\ldots v_n$, +and get the lexing information that should be already there while +doing derivatives in one pass, without a second phase of injection? +In the meantime, can we make sure that simplifications +are easily handled without breaking the correctness of the algorithm? + +Sulzmann and Lu solved this problem by +introducing additional informtaion to the +regular expressions called \emph{bitcodes}. + +\subsection*{Bit-coded Algorithm} +Bits and bitcodes (lists of bits) are defined as: + +\begin{center} + $b ::= 1 \mid 0 \qquad +bs ::= [] \mid b::bs +$ +\end{center} + +\noindent +The $1$ and $0$ are not in bold in order to avoid +confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or +bit-lists) can be used to encode values (or potentially incomplete values) in a +compact form. This can be straightforwardly seen in the following +coding function from values to bitcodes: + +\begin{center} +\begin{tabular}{lcl} + $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ + $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ + $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\ + $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\ + $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\ + $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\ + $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\; + code(\Stars\,vs)$ +\end{tabular} +\end{center} + +\noindent +Here $\textit{code}$ encodes a value into a bitcodes by converting +$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty +star iteration by $1$. The border where a local star terminates +is marked by $0$. This coding is lossy, as it throws away the information about +characters, and also does not encode the ``boundary'' between two +sequence values. Moreover, with only the bitcode we cannot even tell +whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The +reason for choosing this compact way of storing information is that the +relatively small size of bits can be easily manipulated and ``moved +around'' in a regular expression. In order to recover values, we will +need the corresponding regular expression as an extra information. This +means the decoding function is defined as: + + +%\begin{definition}[Bitdecoding of Values]\mbox{} +\begin{center} +\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} + $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ + $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ + $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & + $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; + (\Left\,v, bs_1)$\\ + $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & + $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; + (\Right\,v, bs_1)$\\ + $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & + $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\ + & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\ + & & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\ + $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ + $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & + $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ + & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\ + & & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\ + + $\textit{decode}\,bs\,r$ & $\dn$ & + $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ + & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; + \textit{else}\;\textit{None}$ +\end{tabular} +\end{center} +%\end{definition} + +Sulzmann and Lu's integrated the bitcodes into regular expressions to +create annotated regular expressions \cite{Sulzmann2014}. +\emph{Annotated regular expressions} are defined by the following +grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$} + +\begin{center} +\begin{tabular}{lcl} + $\textit{a}$ & $::=$ & $\ZERO$\\ + & $\mid$ & $_{bs}\ONE$\\ + & $\mid$ & $_{bs}{\bf c}$\\ + & $\mid$ & $_{bs}\sum\,as$\\ + & $\mid$ & $_{bs}a_1\cdot a_2$\\ + & $\mid$ & $_{bs}a^*$ +\end{tabular} +\end{center} +%(in \textit{ALTS}) + +\noindent +where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular +expressions and $as$ for a list of annotated regular expressions. +The alternative constructor($\sum$) has been generalized to +accept a list of annotated regular expressions rather than just 2. +We will show that these bitcodes encode information about +the (POSIX) value that should be generated by the Sulzmann and Lu +algorithm. + + +To do lexing using annotated regular expressions, we shall first +transform the usual (un-annotated) regular expressions into annotated +regular expressions. This operation is called \emph{internalisation} and +defined as follows: + +%\begin{definition} +\begin{center} +\begin{tabular}{lcl} + $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\ + $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\ + $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\ + $(r_1 + r_2)^\uparrow$ & $\dn$ & + $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\, + \textit{fuse}\,[1]\,r_2^\uparrow]$\\ + $(r_1\cdot r_2)^\uparrow$ & $\dn$ & + $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\ + $(r^*)^\uparrow$ & $\dn$ & + $_{[]}(r^\uparrow)^*$\\ +\end{tabular} +\end{center} +%\end{definition} + +\noindent +We use up arrows here to indicate that the basic un-annotated regular +expressions are ``lifted up'' into something slightly more complex. In the +fourth clause, $\textit{fuse}$ is an auxiliary function that helps to +attach bits to the front of an annotated regular expression. Its +definition is as follows: + +\begin{center} +\begin{tabular}{lcl} + $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\ + $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ & + $_{bs @ bs'}\ONE$\\ + $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ & + $_{bs@bs'}{\bf c}$\\ + $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ & + $_{bs@bs'}\sum\textit{as}$\\ + $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ & + $_{bs@bs'}a_1 \cdot a_2$\\ + $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ & + $_{bs @ bs'}a^*$ +\end{tabular} +\end{center} + +\noindent +After internalising the regular expression, we perform successive +derivative operations on the annotated regular expressions. This +derivative operation is the same as what we had previously for the +basic regular expressions, except that we beed to take care of +the bitcodes: + + +\iffalse + %\begin{definition}{bder} +\begin{center} + \begin{tabular}{@{}lcl@{}} + $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ + $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ + $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ & + $\textit{if}\;c=d\; \;\textit{then}\; + \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ + $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ & + $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\ + $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & + $\textit{if}\;\textit{bnullable}\,a_1$\\ + & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\ + & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ + & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\ + $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ & + $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\, + (\textit{STAR}\,[]\,r)$ +\end{tabular} +\end{center} +%\end{definition} + +\begin{center} + \begin{tabular}{@{}lcl@{}} + $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ + $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ + $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ & + $\textit{if}\;c=d\; \;\textit{then}\; + _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\ + $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ & + $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\ + $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ & + $\textit{if}\;\textit{bnullable}\,a_1$\\ + & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\ + & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ + & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\ + $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ & + $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\, + (_{bs}\textit{STAR}\,[]\,r)$ +\end{tabular} +\end{center} +%\end{definition} +\fi + +\begin{center} + \begin{tabular}{@{}lcl@{}} + $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ + $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ + $(_{bs}{\bf d})\,\backslash c$ & $\dn$ & + $\textit{if}\;c=d\; \;\textit{then}\; + _{bs}\ONE\;\textit{else}\;\ZERO$\\ + $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ & + $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\ + $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & + $\textit{if}\;\textit{bnullable}\,a_1$\\ + & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ + & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ + & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\ + $(_{bs}a^*)\,\backslash c$ & $\dn$ & + $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot + (_{[]}r^*))$ +\end{tabular} +\end{center} + +%\end{definition} +\noindent +For instance, when we do derivative of $_{bs}a^*$ with respect to c, +we need to unfold it into a sequence, +and attach an additional bit $0$ to the front of $r \backslash c$ +to indicate one more star iteration. Also the sequence clause +is more subtle---when $a_1$ is $\textit{bnullable}$ (here +\textit{bnullable} is exactly the same as $\textit{nullable}$, except +that it is for annotated regular expressions, therefore we omit the +definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how +$a_1$ matches the string prior to character $c$ (more on this later), +then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \; a_1 (a_2 +\backslash c)$ will collapse the regular expression $a_1$(as it has +already been fully matched) and store the parsing information at the +head of the regular expression $a_2 \backslash c$ by fusing to it. The +bitsequence $\textit{bs}$, which was initially attached to the +first element of the sequence $a_1 \cdot a_2$, has +now been elevated to the top-level of $\sum$, as this information will be +needed whichever way the sequence is matched---no matter whether $c$ belongs +to $a_1$ or $ a_2$. After building these derivatives and maintaining all +the lexing information, we complete the lexing by collecting the +bitcodes using a generalised version of the $\textit{mkeps}$ function +for annotated regular expressions, called $\textit{bmkeps}$: + + +%\begin{definition}[\textit{bmkeps}]\mbox{} +\begin{center} +\begin{tabular}{lcl} + $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\ + $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ & + $\textit{if}\;\textit{bnullable}\,a$\\ + & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\ + & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\ + $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ & + $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ + $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ & + $bs \,@\, [0]$ +\end{tabular} +\end{center} +%\end{definition} + +\noindent +This function completes the value information by travelling along the +path of the regular expression that corresponds to a POSIX value and +collecting all the bitcodes, and using $S$ to indicate the end of star +iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and +decode them, we get the value we expect. The corresponding lexing +algorithm looks as follows: + +\begin{center} +\begin{tabular}{lcl} + $\textit{blexer}\;r\,s$ & $\dn$ & + $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\ + & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ + & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ + & & $\;\;\textit{else}\;\textit{None}$ +\end{tabular} +\end{center} + +\noindent +In this definition $\_\backslash s$ is the generalisation of the derivative +operation from characters to strings (just like the derivatives for un-annotated +regular expressions). + +Now we introduce the simplifications, which is why we introduce the +bitcodes in the first place. + +\subsection*{Simplification Rules} + +This section introduces aggressive (in terms of size) simplification rules +on annotated regular expressions +to keep derivatives small. Such simplifications are promising +as we have +generated test data that show +that a good tight bound can be achieved. We could only +partially cover the search space as there are infinitely many regular +expressions and strings. + +One modification we introduced is to allow a list of annotated regular +expressions in the $\sum$ constructor. This allows us to not just +delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but +also unnecessary ``copies'' of regular expressions (very similar to +simplifying $r + r$ to just $r$, but in a more general setting). Another +modification is that we use simplification rules inspired by Antimirov's +work on partial derivatives. They maintain the idea that only the first +``copy'' of a regular expression in an alternative contributes to the +calculation of a POSIX value. All subsequent copies can be pruned away from +the regular expression. A recursive definition of our simplification function +that looks somewhat similar to our Scala code is given below: +%\comment{Use $\ZERO$, $\ONE$ and so on. +%Is it $ALTS$ or $ALTS$?}\\ + +\begin{center} + \begin{tabular}{@{}lcl@{}} + + $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ + &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ + &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ + &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ + &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ + &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ + + $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\ + &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ + &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ + &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ + + $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ +\end{tabular} +\end{center} + +\noindent +The simplification does a pattern matching on the regular expression. +When it detected that the regular expression is an alternative or +sequence, it will try to simplify its child regular expressions +recursively and then see if one of the children turns into $\ZERO$ or +$\ONE$, which might trigger further simplification at the current level. +The most involved part is the $\sum$ clause, where we use two +auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested +alternatives and reduce as many duplicates as possible. Function +$\textit{distinct}$ keeps the first occurring copy only and removes all later ones +when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s. +Its recursive definition is given below: + + \begin{center} + \begin{tabular}{@{}lcl@{}} + $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; + (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ + $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\ + $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) +\end{tabular} +\end{center} + +\noindent +Here $\textit{flatten}$ behaves like the traditional functional programming flatten +function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it +removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. + +Having defined the $\simp$ function, +we can use the previous notation of natural +extension from derivative w.r.t.~character to derivative +w.r.t.~string:%\comment{simp in the [] case?} + +\begin{center} +\begin{tabular}{lcl} +$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ +$r \backslash_{simp} [\,] $ & $\dn$ & $r$ +\end{tabular} +\end{center} + +\noindent +to obtain an optimised version of the algorithm: + + \begin{center} +\begin{tabular}{lcl} + $\textit{blexer\_simp}\;r\,s$ & $\dn$ & + $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ + & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ + & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ + & & $\;\;\textit{else}\;\textit{None}$ +\end{tabular} +\end{center} + +\noindent +This algorithm keeps the regular expression size small, for example, +with this simplification our previous $(a + aa)^*$ example's 8000 nodes +will be reduced to just 6 and stays constant, no matter how long the +input string is. + + @@ -17,749 +905,91 @@ -%----------------------------------- -% SECTION ? -%----------------------------------- -\section{preparatory properties for getting a finiteness bound} -Before we get to the proof that says the intermediate result of our lexer will -remain finitely bounded, which is an important efficiency/liveness guarantee, -we shall first develop a few preparatory properties and definitions to -make the process of proving that a breeze. - -We define rewriting relations for $\rrexp$s, which allows us to do the -same trick as we did for the correctness proof, -but this time we will have stronger equalities established. -\subsection{"hrewrite" relation} -List of 1-step rewrite rules for regular expressions simplification without bitcodes: -\begin{center} -\begin{tabular}{lcl} -$r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$ -\end{tabular} -\end{center} +%---------------------------------------------------------------------------------------- +% SECTION correctness proof +%---------------------------------------------------------------------------------------- +\section{Correctness of Bit-coded Algorithm (Without Simplification) +We now give the proof the correctness of the algorithm with bit-codes. -And we define an "grewrite" relation that works on lists: -\begin{center} -\begin{tabular}{lcl} -$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$ -\end{tabular} -\end{center} - - - -With these relations we prove +Ausaf and Urban cleverly defined an auxiliary function called $\flex$, +defined as +\[ +\flex \; r \; f \; [] \; v \; = \; f\; v +\flex \; r \; f \; c :: s \; v = \flex r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v +\] +which accumulates the characters that needs to be injected back, +and does the injection in a stack-like manner (last taken derivative first injected). +$\flex$ is connected to the $\lexer$: \begin{lemma} -$rs \rightarrow rs' \implies \RALTS{rs} \rightarrow \RALTS{rs'}$ +$\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$ \end{lemma} -which enables us to prove "closed forms" equalities such as -\begin{lemma} -$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ -\end{lemma} - -But the most involved part of the above lemma is proving the following: -\begin{lemma} -$\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ -\end{lemma} -which is needed in proving things like -\begin{lemma} -$r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$ +$\flex$ provides us a bridge between $\lexer$ and $\blexer$. +What is even better about $\flex$ is that it allows us to +directly operate on the value $\mkeps (r\backslash v)$, +which is pivotal in the definition of $\lexer $ and $\blexer$, but not visible as an argument. +When the value created by $\mkeps$ becomes available, one can +prove some stepwise properties of lexing nicely: +\begin{lemma}\label{flexStepwise} +$\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $ \end{lemma} -Fortunately this is provable by induction on the list $rs$ - - - -%----------------------------------- -% SECTION 2 -%----------------------------------- - -\section{Finiteness Property} -In this section let us describe our argument for why the size of the simplified -derivatives with the aggressive simplification function is finitely bounded. - Suppose -we have a size function for bitcoded regular expressions, written -$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree - -\begin{center} -\begin{tabular}{ccc} -$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\ -\end{tabular} -\end{center} -(TODO: COMPLETE this defn and for $rs$) - -The size is based on a recursive function on the structure of the regex, -not the bitcodes. -Therefore we may as well talk about size of an annotated regular expression -in their un-annotated form: -\begin{center} -$\asize(a) = \size(\erase(a))$. -\end{center} -(TODO: turn equals to roughly equals) - -But there is a minor nuisance here: -the erase function actually messes with the structure of the regular expression: -\begin{center} -\begin{tabular}{ccc} -$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\ -\end{tabular} -\end{center} -(TODO: complete this definition with singleton r in alts) - -An alternative regular expression with an empty list of children - is turned into an $\ZERO$ during the -$\erase$ function, thereby changing the size and structure of the regex. -These will likely be fixable if we really want to use plain $\rexp$s for dealing -with size, but we choose a more straightforward (or stupid) method by -defining a new datatype that is similar to plain $\rexp$s but can take -non-binary arguments for its alternative constructor, - which we call $\rrexp$ to denote -the difference between it and plain regular expressions. -\[ \rrexp ::= \RZERO \mid \RONE - \mid \RCHAR{c} - \mid \RSEQ{r_1}{r_2} - \mid \RALTS{rs} - \mid \RSTAR{r} +And for $\blexer$ we have something with stepwise properties like $\flex$ as well, +called $\retrieve$: +\[ +\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs} \] -For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, -but keep everything else intact. -It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved -(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton -$\ALTS$). -We denote the operation of erasing the bits and turning an annotated regular expression -into an $\rrexp{}$ as $\rerase{}$. -\begin{center} -\begin{tabular}{lcl} -$\rerase{\AZERO}$ & $=$ & $\RZERO$\\ -$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ -$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$ -\end{tabular} -\end{center} -%TODO: FINISH definition of rerase -Similarly we could define the derivative and simplification on -$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, -except that now they can operate on alternatives taking multiple arguments. +$\retrieve$ takes bit-codes from annotated regular expressions +guided by a value. +$\retrieve$ is connected to the $\blexer$ in the following way: +\begin{lemma}\label{blexer_retrieve} +$\blexer \; r \; s = \decode \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$ +\end{lemma} +If you take derivative of an annotated regular expression, +you can $\retrieve$ the same bit-codes as before the derivative took place, +provided that you use the corresponding value: -\begin{center} -\begin{tabular}{lcr} -$\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\ -(other clauses omitted) -\end{tabular} -\end{center} - -Now that $\rrexp$s do not have bitcodes on them, we can do the -duplicate removal without an equivalence relation: -\begin{center} -\begin{tabular}{lcl} -$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ - & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$ -\end{tabular} -\end{center} -%TODO: definition of rsimp (maybe only the alternative clause) - - -The reason why these definitions mirror precisely the corresponding operations -on annotated regualar expressions is that we can calculate sizes more easily: - -\begin{lemma} -$\rsize{\rerase a} = \asize a$ +\begin{lemma}\label{retrieveStepwise} +$\retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$ \end{lemma} -This lemma says that the size of an r-erased regex is the same as the annotated regex. -this does not hold for plain $\rexp$s due to their ways of how alternatives are represented. -\begin{lemma} -$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$ -\end{lemma} -Putting these two together we get a property that allows us to estimate the -size of an annotated regular expression derivative using its un-annotated counterpart: -\begin{lemma} -$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$ +The other good thing about $\retrieve$ is that it can be connected to $\flex$: +%centralLemma1 +\begin{lemma}\label{flex_retrieve} +$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$ \end{lemma} -Unless stated otherwise in this chapter all $\textit{rexp}$s without - bitcodes are seen as $\rrexp$s. - We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably - as the former suits people's intuitive way of stating a binary alternative - regular expression. - - -\begin{theorem} -For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$ -\end{theorem} - -\noindent \begin{proof} -We prove this by induction on $r$. The base cases for $\AZERO$, -$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is -for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction -hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and -$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows -% -\begin{center} -\begin{tabular}{lcll} -& & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\ -& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} :: - [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\ -& $\leq$ & - $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) :: - [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\ -& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket + - \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\ -& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + - \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\ -& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5) -\end{tabular} -\end{center} - - -\noindent -where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$). -The reason why we could write the derivatives of a sequence this way is described in section 2. -The term (2) is used to control (1). -That is because one can obtain an overall -smaller regex list -by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it. -Section 3 is dedicated to its proof. -In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is -bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller -than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands -for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). -We reason similarly for $\STAR$.\medskip +By induction on $s$. The induction tactic is reverse induction on strings. +$v$ is allowed to be arbitrary. +The crucial point is to rewrite +\[ +\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) +\] +as +\[ +\retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\; \mkeps (r \backslash s@[c])) +\]. +This enables us to equate +\[ +\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) +\] +with +\[ +\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c]))) +\], +which in turn can be rewritten as +\[ +\flex \; r \; \textit{id} \; s@[c] \; (\mkeps (r\backslash s@[c])) +\]. \end{proof} -What guarantee does this bound give us? - -Whatever the regex is, it will not grow indefinitely. -Take our previous example $(a + aa)^*$ as an example: -\begin{center} -\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} -\begin{tikzpicture} -\begin{axis}[ - xlabel={number of $a$'s}, - x label style={at={(1.05,-0.05)}}, - ylabel={regex size}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax= 40, - ytick={0,10,...,40}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={$(a + aa)^*$}, - legend pos=north west, - legend cell align=left] -\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; -\end{axis} -\end{tikzpicture} -\end{tabular} -\end{center} -We are able to limit the size of the regex $(a + aa)^*$'s derivatives - with our simplification -rules very effectively. - - -As the graphs of some more randomly generated regexes show, the size of -the derivative might grow quickly at the start of the input, - but after a sufficiently long input string, the trend will stop. - - -%a few sample regular experessions' derivatives -%size change -%TODO: giving graphs showing a few regexes' size changes -%w;r;t the input characters number -%a*, aa*, aaa*, ..... -%randomly generated regexes -\begin{center} -\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} -\begin{tikzpicture} -\begin{axis}[ - xlabel={number of $a$'s}, - x label style={at={(1.05,-0.05)}}, - ylabel={regex size}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax=1000, - ytick={0,100,...,1000}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={regex1}, - legend pos=north west, - legend cell align=left] -\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; -\end{axis} -\end{tikzpicture} - & -\begin{tikzpicture} -\begin{axis}[ - xlabel={$n$}, - x label style={at={(1.05,-0.05)}}, - %ylabel={time in secs}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax=1000, - ytick={0,100,...,1000}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={regex2}, - legend pos=north west, - legend cell align=left] -\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; -\end{axis} -\end{tikzpicture} - & -\begin{tikzpicture} -\begin{axis}[ - xlabel={$n$}, - x label style={at={(1.05,-0.05)}}, - %ylabel={time in secs}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax=1000, - ytick={0,100,...,1000}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={regex3}, - legend pos=north west, - legend cell align=left] -\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; -\end{axis} -\end{tikzpicture}\\ -\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.} -\end{tabular} -\end{center} - - - - - -\noindent -Clearly we give in this finiteness argument (Step (5)) a very loose bound that is -far from the actual bound we can expect. -In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound -is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$. -Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$ -inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function -$f(x) = x * 2^x$. -This means the bound we have will surge up at least -tower-exponentially with a linear increase of the depth. -For a regex of depth $n$, the bound -would be approximately $4^n$. -%TODO: change this to tower exponential! - -We can do better than this, but this does not improve -the finiteness property we are proving. If we are interested in a polynomial bound, -one would hope to obtain a similar tight bound as for partial -derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with - $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in -partial derivatives). -To obtain the exact same bound would mean -we need to introduce simplifications, such as - $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, -which exist for partial derivatives. - -However, if we introduce them in our -setting we would lose the POSIX property of our calculated values. -A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$. -If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer -would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of -what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information -in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$ -occurs, and apply them in the right order once we get -a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value. -This is unlike the simplification we had before, where the rewriting rules -such as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value. -We will discuss better -bounds in the last section of this chapter.\\[-6.5mm] - - - - -%---------------------------------------------------------------------------------------- -% SECTION ?? -%---------------------------------------------------------------------------------------- - -\section{"Closed Forms" of regular expressions' derivatives w.r.t strings} -To embark on getting the "closed forms" of regexes, we first -define a few auxiliary definitions to make expressing them smoothly. - - \begin{center} - \begin{tabular}{ccc} - $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ -$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\ -$\sflataux r$ & $=$ & $ [r]$ -\end{tabular} -\end{center} -The intuition behind $\sflataux{\_}$ is to break up nested regexes -of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape -into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$. -It will return the singleton list $[r]$ otherwise. - -$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps -the output type a regular expression, not a list: - \begin{center} - \begin{tabular}{ccc} - $\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ -$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\ -$\sflat r$ & $=$ & $ [r]$ -\end{tabular} -\end{center} -$\sflataux{\_}$ and $\sflat{\_}$ is only recursive in terms of the - first element of the list of children of -an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for dealing with the regular expression -$r_1 \cdot r_2 \backslash s$. - -With $\sflat{\_}$ and $\sflataux{\_}$ we make -precise what "closed forms" we have for the sequence derivatives and their simplifications, -in other words, how can we construct $(r_1 \cdot r_2) \backslash s$ -and $\bderssimp{(r_1\cdot r_2)}{s}$, -if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$) -and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over -the substring of $s$? -First let's look at a series of derivatives steps on a sequence -regular expression, (assuming) that each time the first -component of the sequence is always nullable): -\begin{center} - -$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\ -$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad - \ldots$ - -\end{center} -%TODO: cite indian paper -Indianpaper have come up with a slightly more formal way of putting the above process: -\begin{center} -$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) + -\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots -+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$ -\end{center} -where $\delta(b, r)$ will produce $r$ -if $b$ evaluates to true, -and $\ZERO$ otherwise. - - But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical - equivalence. To make this intuition useful - for a formal proof, we need something -stronger than language equivalence. -With the help of $\sflat{\_}$ we can state the equation in Indianpaper -more rigorously: -\begin{lemma} -$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ -\end{lemma} - -The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string: - -\begin{center} -\begin{tabular}{lcl} -$\vsuf{[]}{\_} $ & $=$ & $[]$\\ -$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ - && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ -\end{tabular} -\end{center} -It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable, -and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in -the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$. -In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing -the entire result of $(r_1 \cdot r_2) \backslash s$, -it only stores all the $s''$ such that $r_2 \backslash s''$ are occurring terms in $(r_1\cdot r_2)\backslash s$. - -With this we can also add simplifications to both sides and get -\begin{lemma} -$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ -\end{lemma} -Together with the idempotency property of $\rsimp$, -%TODO: state the idempotency property of rsimp -\begin{lemma} -$\rsimp(r) = \rsimp (\rsimp(r))$ -\end{lemma} -it is possible to convert the above lemma to obtain a "closed form" -for derivatives nested with simplification: -\begin{lemma} -$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ -\end{lemma} -And now the reason we have (1) in section 1 is clear: -$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$, -is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$ - as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$. - -%---------------------------------------------------------------------------------------- -% SECTION 3 -%---------------------------------------------------------------------------------------- +With the above lemma we can now link $\flex$ and $\blexer$. -\section{interaction between $\distinctWith$ and $\flts$} -Note that it is not immediately obvious that -\begin{center} -$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $. -\end{center} -The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of -duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. - - -%---------------------------------------------------------------------------------------- -% SECTION ALTS CLOSED FORM -%---------------------------------------------------------------------------------------- -\section{A Closed Form for \textit{ALTS}} -Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. - - -There are a few key steps, one of these steps is -$rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {}))) -= rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {}))$ - -One might want to prove this by something a simple statement like: -$map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$. - -For this to hold we want the $\textit{distinct}$ function to pick up -the elements before and after derivatives correctly: -$r \in rset \equiv (rder x r) \in (rder x rset)$. -which essentially requires that the function $\backslash$ is an injective mapping. - -Unfortunately the function $\backslash c$ is not an injective mapping. - -\subsection{function $\backslash c$ is not injective (1-to-1)} -\begin{center} -The derivative $w.r.t$ character $c$ is not one-to-one. -Formally, - $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ -\end{center} -This property is trivially true for the -character regex example: -\begin{center} - $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ -\end{center} -But apart from the cases where the derivative -output is $\ZERO$, are there non-trivial results -of derivatives which contain strings? -The answer is yes. -For example, -\begin{center} - Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ - where $a$ is not nullable.\\ - $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ - $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ -\end{center} -We start with two syntactically different regexes, -and end up with the same derivative result. -This is not surprising as we have such -equality as below in the style of Arden's lemma:\\ -\begin{center} - $L(A^*B) = L(A\cdot A^* \cdot B + B)$ -\end{center} - -%---------------------------------------------------------------------------------------- -% SECTION 4 -%---------------------------------------------------------------------------------------- -\section{A Bound for the Star Regular Expression} -We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using -the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then -the property of the $\distinct$ function. -Now we try to get a bound on $r^* \backslash s$ as well. -Again, we first look at how a star's derivatives evolve, if they grow maximally: -\begin{center} - -$r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad -r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad -(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''} -\quad \ldots$ - -\end{center} -When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, -$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable, -the number of terms in $r^* \backslash s$ will grow exponentially, causing the size -of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not -count the possible size explosions of $r \backslash c$ themselves. - -Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like -$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ -into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$ -and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$). -For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$: -%TODO: definitions of and \hflataux \hflat - \begin{center} - \begin{tabular}{ccc} - $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ -$\hflataux r$ & $=$ & $ [r]$ -\end{tabular} -\end{center} - - \begin{center} - \begin{tabular}{ccc} - $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\ -$\hflat r$ & $=$ & $ r$ -\end{tabular} -\end{center}s -Again these definitions are tailor-made for dealing with alternatives that have -originated from a star's derivatives, so we don't attempt to open up all possible -regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2 -elements. -We give a predicate for such "star-created" regular expressions: -\begin{center} -\begin{tabular}{lcr} - & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\ - $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$ - \end{tabular} - \end{center} - - These definitions allows us the flexibility to talk about - regular expressions in their most convenient format, - for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $ - instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$. - These definitions help express that certain classes of syntatically - distinct regular expressions are actually the same under simplification. - This is not entirely true for annotated regular expressions: - %TODO: bsimp bders \neq bderssimp - \begin{center} - $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ - \end{center} - For bit-codes, the order in which simplification is applied - might cause a difference in the location they are placed. - If we want something like - \begin{center} - $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ - \end{center} - Some "canonicalization" procedure is required, - which either pushes all the common bitcodes to nodes - as senior as possible: - \begin{center} - $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ - \end{center} - or does the reverse. However bitcodes are not of interest if we are talking about - the $\llbracket r \rrbracket$ size of a regex. - Therefore for the ease and simplicity of producing a - proof for a size bound, we are happy to restrict ourselves to - unannotated regular expressions, and obtain such equalities as - \begin{lemma} - $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$ - \end{lemma} - - \begin{proof} - By using the rewriting relation $\rightsquigarrow$ - \end{proof} - %TODO: rsimp sflat -And from this we obtain a proof that a star's derivative will be the same -as if it had all its nested alternatives created during deriving being flattened out: - For example, - \begin{lemma} - $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ - \end{lemma} - \begin{proof} - By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$. - \end{proof} -% The simplification of a flattened out regular expression, provided it comes -%from the derivative of a star, is the same as the one nested. - - - - - - - - - -One might wonder the actual bound rather than the loose bound we gave -for the convenience of an easier proof. -How much can the regex $r^* \backslash s$ grow? -As earlier graphs have shown, -%TODO: reference that graph where size grows quickly - they can grow at a maximum speed - exponential $w.r.t$ the number of characters, -but will eventually level off when the string $s$ is long enough. -If they grow to a size exponential $w.r.t$ the original regex, our algorithm -would still be slow. -And unfortunately, we have concrete examples -where such regexes grew exponentially large before levelling off: -$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + -(\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum - size that is exponential on the number $n$ -under our current simplification rules: -%TODO: graph of a regex whose size increases exponentially. -\begin{center} -\begin{tikzpicture} - \begin{axis}[ - height=0.5\textwidth, - width=\textwidth, - xlabel=number of a's, - xtick={0,...,9}, - ylabel=maximum size, - ymode=log, - log basis y={2} -] - \addplot[mark=*,blue] table {re-chengsong.data}; - \end{axis} -\end{tikzpicture} -\end{center} - -For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ -to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + -(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion. -The exponential size is triggered by that the regex -$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$ -inside the $(\ldots) ^*$ having exponentially many -different derivatives, despite those difference being minor. -$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ -will therefore contain the following terms (after flattening out all nested -alternatives): -\begin{center} -$(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ -$(1 \leq m' \leq m )$ -\end{center} -These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix). - With each new input character taking the derivative against the intermediate result, more and more such distinct - terms will accumulate, -until the length reaches $L.C.M.(1, \ldots, n)$. -$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms -$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ - -$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ - where $m' \neq m''$ \\ - as they are slightly different. - This means that with our current simplification methods, - we will not be able to control the derivative so that - $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$ - as there are already exponentially many terms. - These terms are similar in the sense that the head of those terms - are all consisted of sub-terms of the form: - $(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $. - For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most - $n * (n + 1) / 2$ such terms. - For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives - can be described by 6 terms: - $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, - $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$. -The total number of different "head terms", $n * (n + 1) / 2$, - is proportional to the number of characters in the regex -$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. -This suggests a slightly different notion of size, which we call the -alphabetic width: -%TODO: -(TODO: Alphabetic width def.) - - -Antimirov\parencite{Antimirov95} has proven that -$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$. -where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms -created by doing derivatives of $r$ against all possible strings. -If we can make sure that at any moment in our lexing algorithm our -intermediate result hold at most one copy of each of the -subterms then we can get the same bound as Antimirov's. -This leads to the algorithm in the next section. - -%---------------------------------------------------------------------------------------- -% SECTION 5 -%---------------------------------------------------------------------------------------- -\section{A Stronger Version of Simplification Inspired by Antimirov} -%TODO: search for isabelle proofs of algorithms that check equivalence - - - - - +\begin{lemma}\label{flex_blexer} +$\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s) = \blexer \; r \; s$ +\end{lemma} +\begin{proof} +Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}. +\end{proof} +Finally