diff -r 50e590823220 -r 8016a2480704 ChengsongTanPhdThesis/Chapters/Bitcoded1.tex --- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Mon Jun 06 23:17:45 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Thu Jun 09 12:57:53 2022 +0100 @@ -1,17 +1,17 @@ % Chapter Template % Main chapter title -\chapter{Correctness of Bit-coded Algorithm without Simplification} +\chapter{Bit-coded Algorithm of Sulzmann and Lu} \label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX} %Then we illustrate how the algorithm without bitcodes falls short for such aggressive %simplifications and therefore introduce our version of the bitcoded algorithm and %its correctness proof in %Chapter 3\ref{Chapter3}. - +In this chapter, we are going to introduce the bit-coded algorithm +introduced by Sulzmann and Lu to address the problem of +under-simplified regular expressions. \section{Bit-coded Algorithm} - - The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure}, stores information of previous lexing steps on a stack, in the form of regular expressions @@ -135,8 +135,9 @@ \end{envForCaption} \noindent -The $S$ and $Z$ are not in bold in order to avoid -confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or +Using $S$ and $Z$ rather than $1$ and $0$ is to avoid +confusion with the regular expressions $\ZERO$ and $\ONE$. +Bitcodes (or bit-lists) can be used to encode values (or potentially incomplete values) in a compact form. This can be straightforwardly seen in the following coding function from values to bitcodes: @@ -203,15 +204,31 @@ & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; \textit{else}\;\textit{None}$ \end{tabular} -\end{center} -\caption{Bit-decoding of values} +\end{center} \end{envForCaption} %\end{definition} -Sulzmann and Lu's integrated the bitcodes into regular expressions to -create annotated regular expressions \cite{Sulzmann2014}. -\emph{Annotated regular expressions} are defined by the following -grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$} +\noindent +$\decode'$ does most of the job while $\decode$ throws +away leftover bit-codes and returns the value only. +$\decode$ is terminating as $\decode'$ is terminating. +We have the property that $\decode$ and $\code$ are +reverse operations of one another: +\begin{lemma} +\[\vdash v : r \implies \decode \; (\code \; v) \; r = \textit{Some}(v) \] +\end{lemma} +\begin{proof} +By proving a more general version of the lemma, on $\decode'$: +\[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \] +Then setting $ds$ to be $[]$ and unfolding $\decode$ definition +we get the lemma. +\end{proof} +With the $\code$ and $\decode$ functions in hand, we know how to +switch between bit-codes and value--the two different representations of +lexing information. +The next step is to integrate this information into the working regular expression. +Attaching bits to the front of regular expressions is the solution Sulzamann and Lu +gave for storing partial values on the fly: \begin{center} \begin{tabular}{lcl} @@ -226,13 +243,130 @@ %(in \textit{ALTS}) \noindent -where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular +We call these regular expressions carrying bit-codes \emph{Annotated regular expressions}. +$bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular expressions and $as$ for a list of annotated regular expressions. -The alternative constructor($\sum$) has been generalized to +The alternative constructor ($\sum$) has been generalised to accept a list of annotated regular expressions rather than just 2. -We will show that these bitcodes encode information about -the (POSIX) value that should be generated by the Sulzmann and Lu -algorithm. +%We will show that these bitcodes encode information about +%the ($\POSIX$) value that should be generated by the Sulzmann and Lu +%algorithm. +The most central question is how these partial lexing information +represented as bit-codes is augmented and carried around +during a derivative is taken. + +This is done by adding bitcodes to the +derivatives, for example when one more star iteratoin is taken (we +call the operation of derivatives on annotated regular expressions $\bder$ +because it is derivatives on regexes with bitcodes): +\begin{center} + \begin{tabular}{@{}lcl@{}} + $\bder \; c\; (_{bs}a^*) $ & $\dn$ & + $_{bs}(\textit{fuse}\, [Z] \; \bder \; c \; a)\cdot + (_{[]}a^*))$ +\end{tabular} +\end{center} + +\noindent +For most time we use the infix notation $\backslash$ to mean $\bder$ for brevity when +there is no danger of confusion with derivatives on plain regular expressions, +for example, the above can be expressed as +\begin{center} + \begin{tabular}{@{}lcl@{}} + $(_{bs}a^*)\,\backslash c$ & $\dn$ & + $_{bs}(\textit{fuse}\, [Z] \; a\,\backslash c)\cdot + (_{[]}a^*))$ +\end{tabular} +\end{center} + + +Using the picture we used earlier to depict this, the transformation when +taking a derivative w.r.t a star is like below: +\centering +\begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}} +\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] + \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] + {$bs$ + \nodepart{two} $a^*$ }; +%\caption{term 1 \ref{term:1}'s matching configuration} +\end{tikzpicture} +& +\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] + \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] + {$v_{\text{previous iterations}}$ + \nodepart{two} $a^*$}; +%\caption{term 1 \ref{term:1}'s matching configuration} +\end{tikzpicture} +\\ +\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] + \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] + { $bs$ + [Z] + \nodepart{two} $(a\backslash c )\cdot a^*$ }; +%\caption{term 1 \ref{term:1}'s matching configuration} +\end{tikzpicture} +& +\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] + \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] + {$v_{\text{previous iterations}}$ + 1 more iteration + \nodepart{two} $(a\backslash c )\cdot a^*$ }; +%\caption{term 1 \ref{term:1}'s matching configuration} +\end{tikzpicture} +\end{tabular} +\noindent +The operation $\fuse$ is just to attach bit-codes +to the front of an annotated regular expression: +\begin{center} +\begin{tabular}{lcl} + $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\ + $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ & + $_{bs @ bs'}\ONE$\\ + $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ & + $_{bs@bs'}{\bf c}$\\ + $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ & + $_{bs@bs'}\sum\textit{as}$\\ + $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ & + $_{bs@bs'}a_1 \cdot a_2$\\ + $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ & + $_{bs @ bs'}a^*$ +\end{tabular} +\end{center} + +\noindent +Another place in the $\bder$ function where it differs +from normal derivatives on un-annotated regular expressions +is the sequence case: +\begin{center} + \begin{tabular}{@{}lcl@{}} + + $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & + $\textit{if}\;\textit{bnullable}\,a_1$\\ + & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ + & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ + & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$ +\end{tabular} +\end{center} +Here + + +\begin{center} + \begin{tabular}{@{}lcl@{}} + $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ + $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ + $(_{bs}{\bf d})\,\backslash c$ & $\dn$ & + $\textit{if}\;c=d\; \;\textit{then}\; + _{bs}\ONE\;\textit{else}\;\ZERO$\\ + $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ & + $_{bs}\sum\;(\textit{map} (\_\backslash c) as )$\\ + $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & + $\textit{if}\;\textit{bnullable}\,a_1$\\ + & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ + & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ + & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\ + $(_{bs}a^*)\,\backslash c$ & $\dn$ & + $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot + (_{[]}r^*))$ +\end{tabular} +\end{center} To do lexing using annotated regular expressions, we shall first @@ -264,21 +398,7 @@ attach bits to the front of an annotated regular expression. Its definition is as follows: -\begin{center} -\begin{tabular}{lcl} - $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\ - $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ & - $_{bs @ bs'}\ONE$\\ - $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ & - $_{bs@bs'}{\bf c}$\\ - $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ & - $_{bs@bs'}\sum\textit{as}$\\ - $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ & - $_{bs@bs'}a_1 \cdot a_2$\\ - $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ & - $_{bs @ bs'}a^*$ -\end{tabular} -\end{center} + \noindent After internalising the regular expression, we perform successive @@ -288,70 +408,8 @@ the bitcodes: -\iffalse - %\begin{definition}{bder} -\begin{center} - \begin{tabular}{@{}lcl@{}} - $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ - $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ - $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ & - $\textit{if}\;c=d\; \;\textit{then}\; - \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ - $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ & - $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\ - $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & - $\textit{if}\;\textit{bnullable}\,a_1$\\ - & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\ - & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ - & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\ - $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ & - $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\, - (\textit{STAR}\,[]\,r)$ -\end{tabular} -\end{center} -%\end{definition} -\begin{center} - \begin{tabular}{@{}lcl@{}} - $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ - $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ - $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ & - $\textit{if}\;c=d\; \;\textit{then}\; - _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\ - $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ & - $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\ - $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ & - $\textit{if}\;\textit{bnullable}\,a_1$\\ - & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\ - & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ - & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\ - $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ & - $_{bs}\textit{SEQ}\;(\textit{fuse}\, [Z] \; r\,\backslash c )\, - (_{bs}\textit{STAR}\,[]\,r)$ -\end{tabular} -\end{center} -%\end{definition} -\fi -\begin{center} - \begin{tabular}{@{}lcl@{}} - $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ - $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ - $(_{bs}{\bf d})\,\backslash c$ & $\dn$ & - $\textit{if}\;c=d\; \;\textit{then}\; - _{bs}\ONE\;\textit{else}\;\ZERO$\\ - $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ & - $_{bs}\sum\;(\textit{map} (\_\backslash c) as )$\\ - $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & - $\textit{if}\;\textit{bnullable}\,a_1$\\ - & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ - & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ - & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\ - $(_{bs}a^*)\,\backslash c$ & $\dn$ & - $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot - (_{[]}r^*))$ -\end{tabular} -\end{center} %\end{definition} \noindent @@ -417,119 +475,6 @@ operation from characters to strings (just like the derivatives for un-annotated regular expressions). -Now we introduce the simplifications, which is why we introduce the -bitcodes in the first place. - -\subsection*{Simplification Rules} - -This section introduces aggressive (in terms of size) simplification rules -on annotated regular expressions -to keep derivatives small. Such simplifications are promising -as we have -generated test data that show -that a good tight bound can be achieved. We could only -partially cover the search space as there are infinitely many regular -expressions and strings. - -One modification we introduced is to allow a list of annotated regular -expressions in the $\sum$ constructor. This allows us to not just -delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but -also unnecessary ``copies'' of regular expressions (very similar to -simplifying $r + r$ to just $r$, but in a more general setting). Another -modification is that we use simplification rules inspired by Antimirov's -work on partial derivatives. They maintain the idea that only the first -``copy'' of a regular expression in an alternative contributes to the -calculation of a POSIX value. All subsequent copies can be pruned away from -the regular expression. A recursive definition of our simplification function -that looks somewhat similar to our Scala code is given below: -%\comment{Use $\ZERO$, $\ONE$ and so on. -%Is it $ALTS$ or $ALTS$?}\\ - -\begin{center} - \begin{tabular}{@{}lcl@{}} - - $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ - &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ - &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ - &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ - &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ - &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ - - $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\ - &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ - &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ - &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ - - $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ -\end{tabular} -\end{center} - -\noindent -The simplification does a pattern matching on the regular expression. -When it detected that the regular expression is an alternative or -sequence, it will try to simplify its child regular expressions -recursively and then see if one of the children turns into $\ZERO$ or -$\ONE$, which might trigger further simplification at the current level. -The most involved part is the $\sum$ clause, where we use two -auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested -alternatives and reduce as many duplicates as possible. Function -$\textit{distinct}$ keeps the first occurring copy only and removes all later ones -when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s. -Its recursive definition is given below: - - \begin{center} - \begin{tabular}{@{}lcl@{}} - $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; - (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ - $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\ - $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) -\end{tabular} -\end{center} - -\noindent -Here $\textit{flatten}$ behaves like the traditional functional programming flatten -function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it -removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. - -Having defined the $\simp$ function, -we can use the previous notation of natural -extension from derivative w.r.t.~character to derivative -w.r.t.~string:%\comment{simp in the [] case?} - -\begin{center} -\begin{tabular}{lcl} -$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ -$r \backslash_{simp} [\,] $ & $\dn$ & $r$ -\end{tabular} -\end{center} - -\noindent -to obtain an optimised version of the algorithm: - - \begin{center} -\begin{tabular}{lcl} - $\textit{blexer\_simp}\;r\,s$ & $\dn$ & - $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ - & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ - & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ - & & $\;\;\textit{else}\;\textit{None}$ -\end{tabular} -\end{center} - -\noindent -This algorithm keeps the regular expression size small, for example, -with this simplification our previous $(a + aa)^*$ example's 8000 nodes -will be reduced to just 17 and stays constant, no matter how long the -input string is. - - - - - - - - - %-----------------------------------