# HG changeset patch # User Chengsong # Date 1653939375 -3600 # Node ID 823d9b19d21c1db6c55379503ce262c759eb3905 # Parent 96e93df60954a90ee25338b9f99efce2e15b4980 all comments addressed diff -r 96e93df60954 -r 823d9b19d21c ChengsongTanPhdThesis/Chapters/Chapter1.tex --- a/ChengsongTanPhdThesis/Chapters/Chapter1.tex Mon May 30 17:24:52 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Chapter1.tex Mon May 30 20:36:15 2022 +0100 @@ -69,6 +69,7 @@ \def\distinctWith{\textit{distinctWith}} \def\lf{\textit{lf}} \def\PD{\textit{PD}} +\def\suffix{\textit{Suffix}} \def\size{\mathit{size}} diff -r 96e93df60954 -r 823d9b19d21c ChengsongTanPhdThesis/Chapters/Chapter2.tex --- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex Mon May 30 17:24:52 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Chapter2.tex Mon May 30 20:36:15 2022 +0100 @@ -277,10 +277,10 @@ \end{definition} \noindent -Assuming the a string is given as a sequence of characters, say $c_0c_1..c_n$, +Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, this algorithm presented graphically is as follows: -\begin{equation}\label{graph:*} +\begin{equation}\label{graph:successive_ders} \begin{tikzcd} r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO} \end{tikzcd} @@ -477,26 +477,145 @@ always match a subpart as much as possible before proceeding to the next token. \end{itemize} -The formal definition of a $\POSIX$ value can be described +The formal definition of a $\POSIX$ value $v$ for a regular expression +$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified in the following set of rules: -\ - +(TODO: write the entire set of inference rules for POSIX ) +\newcommand*{\inference}[3][t]{% + \begingroup + \def\and{\\}% + \begin{tabular}[#1]{@{\enspace}c@{\enspace}} + #2 \\ + \hline + #3 + \end{tabular}% + \endgroup +} +\begin{center} +\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$ }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ } +\end{center} - For example, the above example has the POSIX value -$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$. -The output of an algorithm we want would be a POSIX matching -encoded as a value. The reason why we are interested in $\POSIX$ values is that they can be practically used in the lexing phase of a compiler front end. For instance, when lexing a code snippet $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized as an identifier rather than a keyword. + For example, the above $r= (a^*\cdot a^*)^*$ and +$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value +$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$. +The output of an algorithm we want would be a POSIX matching +encoded as a value. + + + + The contribution of Sulzmann and Lu is an extension of Brzozowski's algorithm by a second phase (the first phase being building successive -derivatives---see \eqref{graph:*}). In this second phase, a POSIX value -is generated in case the regular expression matches the string. -Pictorially, the Sulzmann and Lu algorithm is as follows: +derivatives---see \eqref{graph:successive_ders}). 
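+To make this first phase concrete, the successive-derivative matching of
+\eqref{graph:successive_ders} can be sketched in Scala roughly as follows
+(an illustrative, unoptimised rendering; the datatype and function names are
+chosen for this sketch and need not coincide with the accompanying
+implementation):
+\begin{verbatim}
+// basic regular expressions
+abstract class Rexp
+case object ZERO extends Rexp                 // matches nothing
+case object ONE extends Rexp                  // matches only the empty string
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+// can r match the empty string?
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+// Brzozowski derivative of r with respect to the character c
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// r \ c0 \ c1 ... \ cn, followed by the nullable test
+def ders(s: List[Char], r: Rexp): Rexp = s match {
+  case Nil => r
+  case c :: cs => ders(cs, der(c, r))
+}
+
+def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
+\end{verbatim}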
In this second phase, a POSIX value
+is generated in case the regular expression matches the string.
+How can we construct a value out of regular expressions and character
+sequences only?
+Two functions are involved: $\inj$ and $\mkeps$.
+The function $\mkeps$ constructs a value from the last
+one of the successive derivatives:
+\begin{ceqn}
+\begin{equation}\label{graph:mkeps}
+\begin{tikzcd}
+r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
+ & & & v_n
+\end{tikzcd}
+\end{equation}
+\end{ceqn}
+
+It tells us how an empty string can be matched by a
+regular expression, in a $\POSIX$ way:
+
+ \begin{center}
+ \begin{tabular}{lcl}
+   $\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
+   $\mkeps(r_{1}+r_{2})$	& $\dn$
+   & \textit{if} $\nullable(r_{1})$\\
+   & & \textit{then} $\Left(\mkeps(r_{1}))$\\
+   & & \textit{else} $\Right(\mkeps(r_{2}))$\\
+   $\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
+   $\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
+ \end{tabular}
+ \end{center}
+
+
+\noindent
+We favour the left alternative to match an empty string when there is a choice.
+When a star is used to match the empty string,
+we simply give the $\Stars$ constructor an empty list, meaning
+no iterations are taken.
+
+
+After the $\mkeps$-call, we inject back the characters one by one in order to build
+the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
+($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
+After injecting back $n$ characters, we get the lexical value for how $r_0$
+matches $s$. The POSIX value is maintained throughout the process.
+For this, Sulzmann and Lu defined a function that reverses
+the ``chopping off'' of characters during the derivative phase. The
+corresponding function is called \emph{injection}, written
+$\textit{inj}$; it takes three arguments: the first one is a regular
+expression ${r_{i-1}}$, before the character is chopped off, the second
+is a character ${c_{i-1}}$, the character we want to inject, and the
+third argument is the value ${v_i}$, into which one wants to inject the
+character (it corresponds to the regular expression after the character
+has been chopped off). The result of this function is a new value.
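+
+Before turning to the picture and the definition of $\textit{inj}$, the
+$\mkeps$ clauses above can be rendered directly in code. A small Scala sketch
+(reusing the illustrative \texttt{Rexp} datatype and \texttt{nullable}
+function from the earlier sketch; the value constructors below are again
+chosen only for illustration):
+\begin{verbatim}
+// lexical values, mirroring the shape of regular expressions
+abstract class Val
+case object Empty extends Val
+case class Chr(c: Char) extends Val
+case class Sequ(v1: Val, v2: Val) extends Val
+case class Left(v: Val) extends Val
+case class Right(v: Val) extends Val
+case class Stars(vs: List[Val]) extends Val
+
+// mkeps: how a nullable regular expression matches the empty string,
+// the POSIX way (prefer the left alternative, take zero star iterations)
+def mkeps(r: Rexp): Val = r match {
+  case ONE => Empty
+  case ALT(r1, r2) =>
+    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
+  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
+  case STAR(_) => Stars(Nil)
+}
+\end{verbatim}
+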
+\begin{ceqn} +\begin{equation}\label{graph:inj} +\begin{tikzcd} +r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ +v_1 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed] +\end{tikzcd} +\end{equation} +\end{ceqn} + + +\noindent +The +definition of $\textit{inj}$ is as follows: + +\begin{center} +\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} + $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ + $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ + $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\ + $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ + $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ + $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\ + $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\ +\end{tabular} +\end{center} + +\noindent This definition is by recursion on the ``shape'' of regular +expressions and values. +The clauses basically do one thing--identifying the ``holes'' on +value to inject the character back into. +For instance, in the last clause for injecting back to a value +that would turn into a new star value that corresponds to a star, +we know it must be a sequence value. And we know that the first +value of that sequence corresponds to the child regex of the star +with the first character being chopped off--an iteration of the star +that had just been unfolded. This value is followed by the already +matched star iterations we collected before. So we inject the character +back to the first value and form a new value with this new iteration +being added to the previous list of iterations, all under the $\Stars$ +top level. + +Putting all the functions $\inj$, $\mkeps$, $\backslash$ together, +we have a lexer with the following recursive definition: +\begin{center} +\begin{tabular}{lcr} +$\lexer \; r \; [] $ & $=$ & $\mkeps \; r$\\ +$\lexer \; r \;c::s$ & $=$ & $\inj \; r \; c (\lexer (r\backslash c) s)$ +\end{tabular} +\end{center} + +Pictorially, the algorithm is as follows: \begin{ceqn} \begin{equation}\label{graph:2} @@ -524,63 +643,8 @@ for how the empty string has been matched by the (nullable) regular expression $r_n$. This function is defined as - \begin{center} - \begin{tabular}{lcl} - $\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\ - $\mkeps(r_{1}+r_{2})$ & $\dn$ - & \textit{if} $\nullable(r_{1})$\\ - & & \textit{then} $\Left(\mkeps(r_{1}))$\\ - & & \textit{else} $\Right(\mkeps(r_{2}))$\\ - $\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\ - $mkeps(r^*)$ & $\dn$ & $\Stars\,[]$ - \end{tabular} - \end{center} -\noindent -After the $\mkeps$-call, we inject back the characters one by one in order to build -the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ -($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. -After injecting back $n$ characters, we get the lexical value for how $r_0$ -matches $s$. The POSIX value is maintained throught out the process. -For this Sulzmann and Lu defined a function that reverses -the ``chopping off'' of characters during the derivative phase. 
The -corresponding function is called \emph{injection}, written -$\textit{inj}$; it takes three arguments: the first one is a regular -expression ${r_{i-1}}$, before the character is chopped off, the second -is a character ${c_{i-1}}$, the character we want to inject and the -third argument is the value ${v_i}$, into which one wants to inject the -character (it corresponds to the regular expression after the character -has been chopped off). The result of this function is a new value. The -definition of $\textit{inj}$ is as follows: - -\begin{center} -\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} - $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ - $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ - $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\ - $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ - $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ - $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\ - $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\ -\end{tabular} -\end{center} - -\noindent This definition is by recursion on the ``shape'' of regular -expressions and values. -The clauses basically do one thing--identifying the ``holes'' on -value to inject the character back into. -For instance, in the last clause for injecting back to a value -that would turn into a new star value that corresponds to a star, -we know it must be a sequence value. And we know that the first -value of that sequence corresponds to the child regex of the star -with the first character being chopped off--an iteration of the star -that had just been unfolded. This value is followed by the already -matched star iterations we collected before. So we inject the character -back to the first value and form a new value with this new iteration -being added to the previous list of iterations, all under the $Stars$ -top level. - We have mentioned before that derivatives without simplification can get clumsy, and this is true for values as well--they reflect the regular expressions size by definition. @@ -617,411 +681,9 @@ introducing additional informtaion to the regular expressions called \emph{bitcodes}. -\subsection*{Bit-coded Algorithm} -Bits and bitcodes (lists of bits) are defined as: - -\begin{center} - $b ::= 1 \mid 0 \qquad -bs ::= [] \mid b::bs -$ -\end{center} - -\noindent -The $1$ and $0$ are not in bold in order to avoid -confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or -bit-lists) can be used to encode values (or potentially incomplete values) in a -compact form. This can be straightforwardly seen in the following -coding function from values to bitcodes: - -\begin{center} -\begin{tabular}{lcl} - $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ - $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ - $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\ - $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\ - $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\ - $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\ - $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\; - code(\Stars\,vs)$ -\end{tabular} -\end{center} - -\noindent -Here $\textit{code}$ encodes a value into a bitcodes by converting -$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty -star iteration by $1$. 
The border where a local star terminates -is marked by $0$. This coding is lossy, as it throws away the information about -characters, and also does not encode the ``boundary'' between two -sequence values. Moreover, with only the bitcode we cannot even tell -whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The -reason for choosing this compact way of storing information is that the -relatively small size of bits can be easily manipulated and ``moved -around'' in a regular expression. In order to recover values, we will -need the corresponding regular expression as an extra information. This -means the decoding function is defined as: - - -%\begin{definition}[Bitdecoding of Values]\mbox{} -\begin{center} -\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} - $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ - $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ - $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & - $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; - (\Left\,v, bs_1)$\\ - $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & - $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; - (\Right\,v, bs_1)$\\ - $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & - $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\ - & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\ - & & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\ - $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ - $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & - $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ - & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\ - & & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\ - - $\textit{decode}\,bs\,r$ & $\dn$ & - $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ - & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; - \textit{else}\;\textit{None}$ -\end{tabular} -\end{center} -%\end{definition} - -Sulzmann and Lu's integrated the bitcodes into regular expressions to -create annotated regular expressions \cite{Sulzmann2014}. -\emph{Annotated regular expressions} are defined by the following -grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$} - -\begin{center} -\begin{tabular}{lcl} - $\textit{a}$ & $::=$ & $\ZERO$\\ - & $\mid$ & $_{bs}\ONE$\\ - & $\mid$ & $_{bs}{\bf c}$\\ - & $\mid$ & $_{bs}\sum\,as$\\ - & $\mid$ & $_{bs}a_1\cdot a_2$\\ - & $\mid$ & $_{bs}a^*$ -\end{tabular} -\end{center} -%(in \textit{ALTS}) - -\noindent -where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular -expressions and $as$ for a list of annotated regular expressions. -The alternative constructor($\sum$) has been generalized to -accept a list of annotated regular expressions rather than just 2. -We will show that these bitcodes encode information about -the (POSIX) value that should be generated by the Sulzmann and Lu -algorithm. - - -To do lexing using annotated regular expressions, we shall first -transform the usual (un-annotated) regular expressions into annotated -regular expressions. 
This operation is called \emph{internalisation} and -defined as follows: - -%\begin{definition} -\begin{center} -\begin{tabular}{lcl} - $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\ - $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\ - $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\ - $(r_1 + r_2)^\uparrow$ & $\dn$ & - $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\, - \textit{fuse}\,[1]\,r_2^\uparrow]$\\ - $(r_1\cdot r_2)^\uparrow$ & $\dn$ & - $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\ - $(r^*)^\uparrow$ & $\dn$ & - $_{[]}(r^\uparrow)^*$\\ -\end{tabular} -\end{center} -%\end{definition} - -\noindent -We use up arrows here to indicate that the basic un-annotated regular -expressions are ``lifted up'' into something slightly more complex. In the -fourth clause, $\textit{fuse}$ is an auxiliary function that helps to -attach bits to the front of an annotated regular expression. Its -definition is as follows: - -\begin{center} -\begin{tabular}{lcl} - $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\ - $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ & - $_{bs @ bs'}\ONE$\\ - $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ & - $_{bs@bs'}{\bf c}$\\ - $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ & - $_{bs@bs'}\sum\textit{as}$\\ - $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ & - $_{bs@bs'}a_1 \cdot a_2$\\ - $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ & - $_{bs @ bs'}a^*$ -\end{tabular} -\end{center} - -\noindent -After internalising the regular expression, we perform successive -derivative operations on the annotated regular expressions. This -derivative operation is the same as what we had previously for the -basic regular expressions, except that we beed to take care of -the bitcodes: - - -\iffalse - %\begin{definition}{bder} -\begin{center} - \begin{tabular}{@{}lcl@{}} - $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ - $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ - $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ & - $\textit{if}\;c=d\; \;\textit{then}\; - \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ - $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ & - $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\ - $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & - $\textit{if}\;\textit{bnullable}\,a_1$\\ - & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\ - & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ - & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\ - $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ & - $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\, - (\textit{STAR}\,[]\,r)$ -\end{tabular} -\end{center} -%\end{definition} - -\begin{center} - \begin{tabular}{@{}lcl@{}} - $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ - $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ - $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ & - $\textit{if}\;c=d\; \;\textit{then}\; - _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\ - $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ & - $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\ - $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ & - $\textit{if}\;\textit{bnullable}\,a_1$\\ - & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\ - & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ - & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\ - 
$(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ & - $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\, - (_{bs}\textit{STAR}\,[]\,r)$ -\end{tabular} -\end{center} -%\end{definition} -\fi - -\begin{center} - \begin{tabular}{@{}lcl@{}} - $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ - $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ - $(_{bs}{\bf d})\,\backslash c$ & $\dn$ & - $\textit{if}\;c=d\; \;\textit{then}\; - _{bs}\ONE\;\textit{else}\;\ZERO$\\ - $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ & - $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\ - $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & - $\textit{if}\;\textit{bnullable}\,a_1$\\ - & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ - & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ - & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\ - $(_{bs}a^*)\,\backslash c$ & $\dn$ & - $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot - (_{[]}r^*))$ -\end{tabular} -\end{center} - -%\end{definition} -\noindent -For instance, when we do derivative of $_{bs}a^*$ with respect to c, -we need to unfold it into a sequence, -and attach an additional bit $0$ to the front of $r \backslash c$ -to indicate one more star iteration. Also the sequence clause -is more subtle---when $a_1$ is $\textit{bnullable}$ (here -\textit{bnullable} is exactly the same as $\textit{nullable}$, except -that it is for annotated regular expressions, therefore we omit the -definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how -$a_1$ matches the string prior to character $c$ (more on this later), -then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \; a_1 (a_2 -\backslash c)$ will collapse the regular expression $a_1$(as it has -already been fully matched) and store the parsing information at the -head of the regular expression $a_2 \backslash c$ by fusing to it. The -bitsequence $\textit{bs}$, which was initially attached to the -first element of the sequence $a_1 \cdot a_2$, has -now been elevated to the top-level of $\sum$, as this information will be -needed whichever way the sequence is matched---no matter whether $c$ belongs -to $a_1$ or $ a_2$. After building these derivatives and maintaining all -the lexing information, we complete the lexing by collecting the -bitcodes using a generalised version of the $\textit{mkeps}$ function -for annotated regular expressions, called $\textit{bmkeps}$: - - -%\begin{definition}[\textit{bmkeps}]\mbox{} -\begin{center} -\begin{tabular}{lcl} - $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\ - $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ & - $\textit{if}\;\textit{bnullable}\,a$\\ - & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\ - & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\ - $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ & - $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ - $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ & - $bs \,@\, [0]$ -\end{tabular} -\end{center} -%\end{definition} - -\noindent -This function completes the value information by travelling along the -path of the regular expression that corresponds to a POSIX value and -collecting all the bitcodes, and using $S$ to indicate the end of star -iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and -decode them, we get the value we expect. 
The corresponding lexing -algorithm looks as follows: - -\begin{center} -\begin{tabular}{lcl} - $\textit{blexer}\;r\,s$ & $\dn$ & - $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\ - & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ - & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ - & & $\;\;\textit{else}\;\textit{None}$ -\end{tabular} -\end{center} - -\noindent -In this definition $\_\backslash s$ is the generalisation of the derivative -operation from characters to strings (just like the derivatives for un-annotated -regular expressions). - -Now we introduce the simplifications, which is why we introduce the -bitcodes in the first place. - -\subsection*{Simplification Rules} - -This section introduces aggressive (in terms of size) simplification rules -on annotated regular expressions -to keep derivatives small. Such simplifications are promising -as we have -generated test data that show -that a good tight bound can be achieved. We could only -partially cover the search space as there are infinitely many regular -expressions and strings. - -One modification we introduced is to allow a list of annotated regular -expressions in the $\sum$ constructor. This allows us to not just -delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but -also unnecessary ``copies'' of regular expressions (very similar to -simplifying $r + r$ to just $r$, but in a more general setting). Another -modification is that we use simplification rules inspired by Antimirov's -work on partial derivatives. They maintain the idea that only the first -``copy'' of a regular expression in an alternative contributes to the -calculation of a POSIX value. All subsequent copies can be pruned away from -the regular expression. A recursive definition of our simplification function -that looks somewhat similar to our Scala code is given below: -%\comment{Use $\ZERO$, $\ONE$ and so on. -%Is it $ALTS$ or $ALTS$?}\\ - -\begin{center} - \begin{tabular}{@{}lcl@{}} - - $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ - &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ - &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ - &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ - &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ - &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ - - $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\ - &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ - &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ - &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ - - $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ -\end{tabular} -\end{center} - -\noindent -The simplification does a pattern matching on the regular expression. -When it detected that the regular expression is an alternative or -sequence, it will try to simplify its child regular expressions -recursively and then see if one of the children turns into $\ZERO$ or -$\ONE$, which might trigger further simplification at the current level. -The most involved part is the $\sum$ clause, where we use two -auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested -alternatives and reduce as many duplicates as possible. 
Function -$\textit{distinct}$ keeps the first occurring copy only and removes all later ones -when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s. -Its recursive definition is given below: - - \begin{center} - \begin{tabular}{@{}lcl@{}} - $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; - (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ - $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\ - $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) -\end{tabular} -\end{center} - -\noindent -Here $\textit{flatten}$ behaves like the traditional functional programming flatten -function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it -removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. - -Having defined the $\simp$ function, -we can use the previous notation of natural -extension from derivative w.r.t.~character to derivative -w.r.t.~string:%\comment{simp in the [] case?} - -\begin{center} -\begin{tabular}{lcl} -$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ -$r \backslash_{simp} [\,] $ & $\dn$ & $r$ -\end{tabular} -\end{center} - -\noindent -to obtain an optimised version of the algorithm: - - \begin{center} -\begin{tabular}{lcl} - $\textit{blexer\_simp}\;r\,s$ & $\dn$ & - $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ - & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ - & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ - & & $\;\;\textit{else}\;\textit{None}$ -\end{tabular} -\end{center} - -\noindent -This algorithm keeps the regular expression size small, for example, -with this simplification our previous $(a + aa)^*$ example's 8000 nodes -will be reduced to just 6 and stays constant, no matter how long the -input string is. - - -%----------------------------------- -% SUBSECTION 1 -%----------------------------------- -\section{Specifications of Certain Functions to be Used} -Here we give some functions' definitions, -which we will use later. -\begin{center} -\begin{tabular}{ccc} -$\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$ -\end{tabular} -\end{center} - - - diff -r 96e93df60954 -r 823d9b19d21c ChengsongTanPhdThesis/Chapters/ChapterBitcoded1.tex --- a/ChengsongTanPhdThesis/Chapters/ChapterBitcoded1.tex Mon May 30 17:24:52 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/ChapterBitcoded1.tex Mon May 30 20:36:15 2022 +0100 @@ -9,47 +9,416 @@ %its correctness proof in %Chapter 3\ref{Chapter3}. +\section*{Bit-coded Algorithm} +Bits and bitcodes (lists of bits) are defined as: +\begin{center} + $b ::= 1 \mid 0 \qquad +bs ::= [] \mid b::bs +$ +\end{center} + +\noindent +The $1$ and $0$ are not in bold in order to avoid +confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or +bit-lists) can be used to encode values (or potentially incomplete values) in a +compact form. 
This can be straightforwardly seen in the following +coding function from values to bitcodes: + +\begin{center} +\begin{tabular}{lcl} + $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ + $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ + $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\ + $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\ + $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\ + $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\ + $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\; + code(\Stars\,vs)$ +\end{tabular} +\end{center} + +\noindent +Here $\textit{code}$ encodes a value into a bitcodes by converting +$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty +star iteration by $1$. The border where a local star terminates +is marked by $0$. This coding is lossy, as it throws away the information about +characters, and also does not encode the ``boundary'' between two +sequence values. Moreover, with only the bitcode we cannot even tell +whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The +reason for choosing this compact way of storing information is that the +relatively small size of bits can be easily manipulated and ``moved +around'' in a regular expression. In order to recover values, we will +need the corresponding regular expression as an extra information. This +means the decoding function is defined as: + + +%\begin{definition}[Bitdecoding of Values]\mbox{} +\begin{center} +\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} + $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ + $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ + $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & + $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; + (\Left\,v, bs_1)$\\ + $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & + $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; + (\Right\,v, bs_1)$\\ + $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & + $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\ + & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\ + & & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\ + $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ + $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & + $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ + & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\ + & & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\ + + $\textit{decode}\,bs\,r$ & $\dn$ & + $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ + & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; + \textit{else}\;\textit{None}$ +\end{tabular} +\end{center} +%\end{definition} + +Sulzmann and Lu's integrated the bitcodes into regular expressions to +create annotated regular expressions \cite{Sulzmann2014}. +\emph{Annotated regular expressions} are defined by the following +grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$} + +\begin{center} +\begin{tabular}{lcl} + $\textit{a}$ & $::=$ & $\ZERO$\\ + & $\mid$ & $_{bs}\ONE$\\ + & $\mid$ & $_{bs}{\bf c}$\\ + & $\mid$ & $_{bs}\sum\,as$\\ + & $\mid$ & $_{bs}a_1\cdot a_2$\\ + & $\mid$ & $_{bs}a^*$ +\end{tabular} +\end{center} +%(in \textit{ALTS}) + +\noindent +where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular +expressions and $as$ for a list of annotated regular expressions. 
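+
+In code, annotated regular expressions become a datatype in which every
+constructor carries a bit-list. A sketch in Scala (the constructor names
+follow those visible in the accompanying \texttt{blexer2.sc}, such as
+\texttt{AZERO}, \texttt{AONE}, \texttt{AALTS} and \texttt{ASEQ}; representing
+bits as integers is only for illustration):
+\begin{verbatim}
+type Bits = List[Int]                  // the bitcodes 0/1 from above
+
+abstract class ARexp
+case object AZERO extends ARexp
+case class AONE(bs: Bits) extends ARexp
+case class ACHAR(bs: Bits, c: Char) extends ARexp
+case class AALTS(bs: Bits, rs: List[ARexp]) extends ARexp
+case class ASEQ(bs: Bits, r1: ARexp, r2: ARexp) extends ARexp
+case class ASTAR(bs: Bits, r: ARexp) extends ARexp
+\end{verbatim}
+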
+The alternative constructor($\sum$) has been generalized to +accept a list of annotated regular expressions rather than just 2. +We will show that these bitcodes encode information about +the (POSIX) value that should be generated by the Sulzmann and Lu +algorithm. + + +To do lexing using annotated regular expressions, we shall first +transform the usual (un-annotated) regular expressions into annotated +regular expressions. This operation is called \emph{internalisation} and +defined as follows: + +%\begin{definition} +\begin{center} +\begin{tabular}{lcl} + $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\ + $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\ + $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\ + $(r_1 + r_2)^\uparrow$ & $\dn$ & + $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\, + \textit{fuse}\,[1]\,r_2^\uparrow]$\\ + $(r_1\cdot r_2)^\uparrow$ & $\dn$ & + $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\ + $(r^*)^\uparrow$ & $\dn$ & + $_{[]}(r^\uparrow)^*$\\ +\end{tabular} +\end{center} +%\end{definition} + +\noindent +We use up arrows here to indicate that the basic un-annotated regular +expressions are ``lifted up'' into something slightly more complex. In the +fourth clause, $\textit{fuse}$ is an auxiliary function that helps to +attach bits to the front of an annotated regular expression. Its +definition is as follows: + +\begin{center} +\begin{tabular}{lcl} + $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\ + $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ & + $_{bs @ bs'}\ONE$\\ + $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ & + $_{bs@bs'}{\bf c}$\\ + $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ & + $_{bs@bs'}\sum\textit{as}$\\ + $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ & + $_{bs@bs'}a_1 \cdot a_2$\\ + $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ & + $_{bs @ bs'}a^*$ +\end{tabular} +\end{center} + +\noindent +After internalising the regular expression, we perform successive +derivative operations on the annotated regular expressions. This +derivative operation is the same as what we had previously for the +basic regular expressions, except that we beed to take care of +the bitcodes: -%---------------------------------------------------------------------------------------- -% SECTION common identities -%---------------------------------------------------------------------------------------- -\section{Common Identities In Simplification-Related Functions} -Need to prove these before starting on the big correctness proof. 
+\iffalse + %\begin{definition}{bder} +\begin{center} + \begin{tabular}{@{}lcl@{}} + $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ + $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ + $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ & + $\textit{if}\;c=d\; \;\textit{then}\; + \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ + $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ & + $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\ + $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & + $\textit{if}\;\textit{bnullable}\,a_1$\\ + & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\ + & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ + & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\ + $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ & + $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\, + (\textit{STAR}\,[]\,r)$ +\end{tabular} +\end{center} +%\end{definition} + +\begin{center} + \begin{tabular}{@{}lcl@{}} + $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ + $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ + $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ & + $\textit{if}\;c=d\; \;\textit{then}\; + _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\ + $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ & + $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\ + $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ & + $\textit{if}\;\textit{bnullable}\,a_1$\\ + & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\ + & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ + & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\ + $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ & + $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\, + (_{bs}\textit{STAR}\,[]\,r)$ +\end{tabular} +\end{center} +%\end{definition} +\fi -%----------------------------------- -% SUBSECTION -%----------------------------------- -\subsection{Idempotency of $\simp$} +\begin{center} + \begin{tabular}{@{}lcl@{}} + $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ + $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ + $(_{bs}{\bf d})\,\backslash c$ & $\dn$ & + $\textit{if}\;c=d\; \;\textit{then}\; + _{bs}\ONE\;\textit{else}\;\ZERO$\\ + $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ & + $_{bs}\sum\;(\textit{map} (\_\backslash c) as )$\\ + $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & + $\textit{if}\;\textit{bnullable}\,a_1$\\ + & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ + & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ + & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\ + $(_{bs}a^*)\,\backslash c$ & $\dn$ & + $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot + (_{[]}r^*))$ +\end{tabular} +\end{center} + +%\end{definition} +\noindent +For instance, when we do derivative of $_{bs}a^*$ with respect to c, +we need to unfold it into a sequence, +and attach an additional bit $0$ to the front of $r \backslash c$ +to indicate one more star iteration. Also the sequence clause +is more subtle---when $a_1$ is $\textit{bnullable}$ (here +\textit{bnullable} is exactly the same as $\textit{nullable}$, except +that it is for annotated regular expressions, therefore we omit the +definition). 
Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how +$a_1$ matches the string prior to character $c$ (more on this later), +then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \; a_1 (a_2 +\backslash c)$ will collapse the regular expression $a_1$(as it has +already been fully matched) and store the parsing information at the +head of the regular expression $a_2 \backslash c$ by fusing to it. The +bitsequence $\textit{bs}$, which was initially attached to the +first element of the sequence $a_1 \cdot a_2$, has +now been elevated to the top-level of $\sum$, as this information will be +needed whichever way the sequence is matched---no matter whether $c$ belongs +to $a_1$ or $ a_2$. After building these derivatives and maintaining all +the lexing information, we complete the lexing by collecting the +bitcodes using a generalised version of the $\textit{mkeps}$ function +for annotated regular expressions, called $\textit{bmkeps}$: + + +%\begin{definition}[\textit{bmkeps}]\mbox{} +\begin{center} +\begin{tabular}{lcl} + $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\ + $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ & + $\textit{if}\;\textit{bnullable}\,a$\\ + & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\ + & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\ + $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ & + $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ + $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ & + $bs \,@\, [0]$ +\end{tabular} +\end{center} +%\end{definition} + +\noindent +This function completes the value information by travelling along the +path of the regular expression that corresponds to a POSIX value and +collecting all the bitcodes, and using $S$ to indicate the end of star +iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and +decode them, we get the value we expect. The corresponding lexing +algorithm looks as follows: -\begin{equation} - \simp \;r = \simp\; \simp \; r -\end{equation} -This property means we do not have to repeatedly -apply simplification in each step, which justifies -our definition of $\blexersimp$. -It will also be useful in future proofs where properties such as -closed forms are needed. -The proof is by structural induction on $r$. +\begin{center} +\begin{tabular}{lcl} + $\textit{blexer}\;r\,s$ & $\dn$ & + $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\ + & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ + & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ + & & $\;\;\textit{else}\;\textit{None}$ +\end{tabular} +\end{center} + +\noindent +In this definition $\_\backslash s$ is the generalisation of the derivative +operation from characters to strings (just like the derivatives for un-annotated +regular expressions). + +Now we introduce the simplifications, which is why we introduce the +bitcodes in the first place. + +\subsection*{Simplification Rules} + +This section introduces aggressive (in terms of size) simplification rules +on annotated regular expressions +to keep derivatives small. Such simplifications are promising +as we have +generated test data that show +that a good tight bound can be achieved. We could only +partially cover the search space as there are infinitely many regular +expressions and strings. + +One modification we introduced is to allow a list of annotated regular +expressions in the $\sum$ constructor. 
This allows us to not just +delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but +also unnecessary ``copies'' of regular expressions (very similar to +simplifying $r + r$ to just $r$, but in a more general setting). Another +modification is that we use simplification rules inspired by Antimirov's +work on partial derivatives. They maintain the idea that only the first +``copy'' of a regular expression in an alternative contributes to the +calculation of a POSIX value. All subsequent copies can be pruned away from +the regular expression. A recursive definition of our simplification function +that looks somewhat similar to our Scala code is given below: +%\comment{Use $\ZERO$, $\ONE$ and so on. +%Is it $ALTS$ or $ALTS$?}\\ + +\begin{center} + \begin{tabular}{@{}lcl@{}} + + $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ + &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ + &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ + &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ + &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ + &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ + + $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\ + &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ + &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ + &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ -%----------------------------------- -% SUBSECTION -%----------------------------------- -\subsection{Syntactic Equivalence Under $\simp$} -We prove that minor differences can be annhilated -by $\simp$. -For example, + $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ +\end{tabular} +\end{center} + +\noindent +The simplification does a pattern matching on the regular expression. +When it detected that the regular expression is an alternative or +sequence, it will try to simplify its child regular expressions +recursively and then see if one of the children turns into $\ZERO$ or +$\ONE$, which might trigger further simplification at the current level. +The most involved part is the $\sum$ clause, where we use two +auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested +alternatives and reduce as many duplicates as possible. Function +$\textit{distinct}$ keeps the first occurring copy only and removes all later ones +when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s. +Its recursive definition is given below: + + \begin{center} + \begin{tabular}{@{}lcl@{}} + $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; + (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ + $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\ + $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) +\end{tabular} +\end{center} + +\noindent +Here $\textit{flatten}$ behaves like the traditional functional programming flatten +function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it +removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. 
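+
+As a cross-check of the two definitions above, here is how $\textit{flatten}$
+and $\textit{simp}$ might look in Scala, building on the \texttt{ARexp}
+datatype sketched earlier (the \texttt{fuse} helper from the previous section
+is included so the snippet is self-contained; for brevity duplicates are
+removed with a plain \texttt{distinct}, whereas the actual implementation
+compares alternatives with their bitcodes erased):
+\begin{verbatim}
+// prepend bits to the front of an annotated regular expression
+def fuse(bs: Bits, r: ARexp): ARexp = r match {
+  case AZERO => AZERO
+  case AONE(bs1) => AONE(bs ++ bs1)
+  case ACHAR(bs1, c) => ACHAR(bs ++ bs1, c)
+  case AALTS(bs1, rs) => AALTS(bs ++ bs1, rs)
+  case ASEQ(bs1, r1, r2) => ASEQ(bs ++ bs1, r1, r2)
+  case ASTAR(bs1, r1) => ASTAR(bs ++ bs1, r1)
+}
+
+// open up nested alternatives and drop AZEROs
+def flatten(rs: List[ARexp]): List[ARexp] = rs match {
+  case Nil => Nil
+  case AZERO :: rest => flatten(rest)
+  case AALTS(bs, rs1) :: rest => rs1.map(fuse(bs, _)) ++ flatten(rest)
+  case r :: rest => r :: flatten(rest)
+}
+
+// the simplification function, mirroring the clauses above
+def simp(r: ARexp): ARexp = r match {
+  case ASEQ(bs, r1, r2) => (simp(r1), simp(r2)) match {
+    case (AZERO, _) => AZERO
+    case (_, AZERO) => AZERO
+    case (AONE(bs1), r2s) => fuse(bs ++ bs1, r2s)
+    case (r1s, AONE(_)) => fuse(bs, r1s)  // bits of the ONE asserted empty (cf. blexer2.sc)
+    case (r1s, r2s) => ASEQ(bs, r1s, r2s)
+  }
+  case AALTS(bs, rs) => flatten(rs.map(simp)).distinct match {
+    case Nil => AZERO
+    case r1 :: Nil => fuse(bs, r1)
+    case rs1 => AALTS(bs, rs1)
+  }
+  case r1 => r1
+}
+\end{verbatim}
+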
+ +Having defined the $\simp$ function, +we can use the previous notation of natural +extension from derivative w.r.t.~character to derivative +w.r.t.~string:%\comment{simp in the [] case?} + \begin{center} -$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = - \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$ +\begin{tabular}{lcl} +$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ +$r \backslash_{simp} [\,] $ & $\dn$ & $r$ +\end{tabular} \end{center} +\noindent +to obtain an optimised version of the algorithm: + + \begin{center} +\begin{tabular}{lcl} + $\textit{blexer\_simp}\;r\,s$ & $\dn$ & + $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ + & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ + & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ + & & $\;\;\textit{else}\;\textit{None}$ +\end{tabular} +\end{center} + +\noindent +This algorithm keeps the regular expression size small, for example, +with this simplification our previous $(a + aa)^*$ example's 8000 nodes +will be reduced to just 6 and stays constant, no matter how long the +input string is. + + + + + + +%----------------------------------- +% SUBSECTION 1 +%----------------------------------- +\section{Specifications of Some Helper Functions} +Here we give some functions' definitions, +which we will use later. +\begin{center} +\begin{tabular}{ccc} +$\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$ +\end{tabular} +\end{center} + + %---------------------------------------------------------------------------------------- % SECTION correctness proof %---------------------------------------------------------------------------------------- @@ -134,3 +503,4 @@ Finally + diff -r 96e93df60954 -r 823d9b19d21c ChengsongTanPhdThesis/Chapters/ChapterFinite.tex --- a/ChengsongTanPhdThesis/Chapters/ChapterFinite.tex Mon May 30 17:24:52 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/ChapterFinite.tex Mon May 30 20:36:15 2022 +0100 @@ -7,67 +7,29 @@ %of our bitcoded algorithm, that is a finite bound on the size of any %regex's derivatives. - -%----------------------------------- -% SECTION ? -%----------------------------------- -\section{preparatory properties for getting a finiteness bound} -Before we get to the proof that says the intermediate result of our lexer will -remain finitely bounded, which is an important efficiency/liveness guarantee, -we shall first develop a few preparatory properties and definitions to -make the process of proving that a breeze. - -We define rewriting relations for $\rrexp$s, which allows us to do the -same trick as we did for the correctness proof, -but this time we will have stronger equalities established. -\subsection{"hrewrite" relation} -List of 1-step rewrite rules for regular expressions simplification without bitcodes: -\begin{center} -\begin{tabular}{lcl} -$r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$ -\end{tabular} -\end{center} - -And we define an "grewrite" relation that works on lists: -\begin{center} -\begin{tabular}{lcl} -$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$ -\end{tabular} -\end{center} - - +In this chapter we give a guarantee in terms of time complexity: +given a regular expression $r$, for any string $s$ +our algorithm's internal data structure is finitely bounded. 
+To obtain such a proof, we need to
+\begin{itemize}
+\item
+Define a new datatype for regular expressions that makes it easy
+to reason about the size of an annotated regular expression.
+\item
+Establish a set of equalities for this new datatype that enables one to
+rewrite $\bderssimp{r_1 \cdot r_2}{s}$ and $\bderssimp{r^*}{s}$ etc.\
+in terms of their child regexes $r_1$, $r_2$, and $r$.
+\item
+Use those equalities to derive the actual rewriting equations, which we call
+``closed forms''.
+\item
+Bound the closed forms, thereby bounding the original
+$\blexersimp$'s internal data structures.
+\end{itemize}
-With these relations we prove
-\begin{lemma}
-$rs \rightarrow rs' \implies \RALTS{rs} \rightarrow \RALTS{rs'}$
-\end{lemma}
-which enables us to prove "closed forms" equalities such as
-\begin{lemma}
-$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
-\end{lemma}
+\section{The $\mathbf{r}$-rexp Datatype and the Size Functions}
-But the most involved part of the above lemma is proving the following:
-\begin{lemma}
-$\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $
-\end{lemma}
-which is needed in proving things like
-\begin{lemma}
-$r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$
-\end{lemma}
-
-Fortunately this is provable by induction on the list $rs$
-
-
-
-%-----------------------------------
-% SECTION 2
-%-----------------------------------
-
-\section{Finiteness Property}
-In this section let us describe our argument for why the size of the simplified
-derivatives with the aggressive simplification function is finitely bounded.
- Suppose
-we have a size function for bitcoded regular expressions, written
+We have a size function for bitcoded regular expressions, written
 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as
 a tree
 
 \begin{center}
@@ -170,6 +132,65 @@
 regular expression.
 
 
+
+%-----------------------------------
+% SECTION ?
+%-----------------------------------
+\section{Preparatory Properties for Getting a Finiteness Bound}
+Before we get to the proof that the intermediate results of our lexer
+remain finitely bounded, which is an important efficiency/liveness guarantee,
+we shall first develop a few preparatory properties and definitions that
+make that proof more straightforward.
+
+We define rewriting relations for $\rrexp$s, which allow us to use the
+same technique as in the correctness proof,
+but this time with stronger equalities established.
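+
+To fix intuitions before the rewrite rules below: the $\rrexp$s can be thought
+of as annotated regular expressions with the bitcodes erased, together with a
+node-counting size measure. A hedged Scala sketch (only the \textit{RALTS}
+constructor is named in the text above; the remaining constructor names and
+the use of \texttt{Int} for sizes are illustrative):
+\begin{verbatim}
+abstract class RRexp
+case object RZERO extends RRexp
+case object RONE extends RRexp
+case class RCHAR(c: Char) extends RRexp
+case class RALTS(rs: List[RRexp]) extends RRexp
+case class RSEQ(r1: RRexp, r2: RRexp) extends RRexp
+case class RSTAR(r: RRexp) extends RRexp
+
+// number of nodes of r, viewed as a tree
+def size(r: RRexp): Int = r match {
+  case RZERO | RONE | RCHAR(_) => 1
+  case RALTS(rs) => 1 + rs.map(size).sum
+  case RSEQ(r1, r2) => 1 + size(r1) + size(r2)
+  case RSTAR(r1) => 1 + size(r1)
+}
+\end{verbatim}
+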
+\subsection{"hrewrite" relation} +List of 1-step rewrite rules for regular expressions simplification without bitcodes: +\begin{figure} +\caption{the "h-rewrite" rules} +\[ +r_1 \cdot \ZERO \rightarrow_h \ZERO \] + +\[ +\ZERO \cdot r_2 \rightarrow \ZERO +\] +\end{figure} +And we define an "grewrite" relation that works on lists: +\begin{center} +\begin{tabular}{lcl} +$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$ +\end{tabular} +\end{center} + + + +With these relations we prove +\begin{lemma} +$rs \rightarrow rs' \implies \RALTS{rs} \rightarrow \RALTS{rs'}$ +\end{lemma} +which enables us to prove "closed forms" equalities such as +\begin{lemma} +$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\suffix \; s \; r_1 ))}$ +\end{lemma} + +But the most involved part of the above lemma is proving the following: +\begin{lemma} +$\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ +\end{lemma} +which is needed in proving things like +\begin{lemma} +$r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$ +\end{lemma} + +Fortunately this is provable by induction on the list $rs$ + + + +%----------------------------------- +% SECTION 2 +%----------------------------------- + \begin{theorem} For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$ \end{theorem} diff -r 96e93df60954 -r 823d9b19d21c thys2/blexer2.sc --- a/thys2/blexer2.sc Mon May 30 17:24:52 2022 +0100 +++ b/thys2/blexer2.sc Mon May 30 20:36:15 2022 +0100 @@ -521,6 +521,7 @@ case (AZERO, _) => AZERO case (_, AZERO) => AZERO case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s) + case (r1s, AONE(bs2)) => fuse(bs1, r1s) //asserted bs2 == Nil case (r1s, r2s) => ASEQ(bs1, r1s, r2s) } case AALTS(bs1, rs) => { @@ -732,9 +733,9 @@ res.toSet } -def ABIncludedByC[A, B, C](a: A, b: B, c: C, f: (A, B) => C, inclusionPred: (C, C) => Boolean) : Boolean = { +def ABIncludedByC[A, B, C](a: A, b: B, c: C, f: (A, B) => C, subseteqPred: (C, C) => Boolean) : Boolean = { - inclusionPred(f(a, b), c) + subseteqPred(f(a, b), c) } def rexpListInclusion(rs1: Set[Rexp], rs2: Set[Rexp]) : Boolean = { // println("r+ctx---------") @@ -747,7 +748,7 @@ // println("end------------------") res } -def pAKC6(acc: Set[Rexp], r: ARexp, ctx: List[Rexp]) : ARexp = { +def prune6(acc: Set[Rexp], r: ARexp, ctx: List[Rexp]) : ARexp = { // println("pakc--------r") // println(shortRexpOutput(erase(r))) // println("ctx---------") @@ -766,7 +767,7 @@ else{ r match { case ASEQ(bs, r1, r2) => - (pAKC6(acc, r1, erase(r2) :: ctx)) match{ + (prune6(acc, r1, erase(r2) :: ctx)) match{ case AZERO => AZERO case AONE(bs1) => @@ -782,7 +783,7 @@ // rs0.foreach(r => println(shortRexpOutput(erase(r)))) // println(s"acc is ") // acc.foreach(r => println(shortRexpOutput(r))) - rs0.map(r => pAKC6(acc, r, ctx)).filter(_ != AZERO) match { + rs0.map(r => prune6(acc, r, ctx)).filter(_ != AZERO) match { case Nil => // println("after pruning Nil") AZERO @@ -809,7 +810,7 @@ distinctBy6(xs, acc) } else{ - val pruned = pAKC6(acc, x, Nil) + val pruned = prune6(acc, x, Nil) val newTerms = strongBreakIntoTerms(erase(pruned)) pruned match { case AZERO => @@ -1236,7 +1237,7 @@ //STAR(SEQ(ALTS(STAR(STAR(STAR(STAR(CHAR(c))))),ALTS(CHAR(c),CHAR(b))),ALTS(ZERO,SEQ(ALTS(ALTS(STAR(CHAR(c)),SEQ(CHAR(b),CHAR(a))),CHAR(c)),STAR(ALTS(ALTS(ONE,CHAR(a)),STAR(CHAR(c)))))))) //(depth5) //STAR(SEQ(ALTS(ALTS(STAR(CHAR(b)),SEQ(ONE,CHAR(b))),SEQ(STAR(CHAR(a)),CHAR(b))),ALTS(ZERO,ALTS(STAR(CHAR(b)),STAR(CHAR(a)))))) - 
test(rexp(4), 100000) { (r: Rexp) => + test(rexp(8), 10000) { (r: Rexp) => // ALTS(SEQ(SEQ(ONE,CHAR('a')),STAR(CHAR('a'))),SEQ(ALTS(CHAR('c'),ONE),STAR(ZERO))))))), 1) { (r: Rexp) => val ss = stringsFromRexp(r) val boolList = ss.filter(s => s != "").map(s => { @@ -1256,7 +1257,7 @@ } // small() -// generator_test() +generator_test() def counterexample_check() { val r = SEQ(STAR(CHAR('c')),STAR(SEQ(STAR(CHAR('c')),ONE)))//STAR(SEQ(ALTS(STAR(CHAR('c')),CHAR('c')),SEQ(ALTS(CHAR('c'),ONE),ONE))) @@ -1275,11 +1276,13 @@ } // counterexample_check() def linform_test() { - val r = STAR(SEQ(STAR(CHAR('c')), ONE)) + val r = STAR(SEQ(STAR(CHAR('c')),ONE))//SEQ(STAR(CHAR('c')),STAR(SEQ(STAR(CHAR('c')),ONE))) // val r_linforms = lf(r) println(r_linforms.size) + val pderuniv = pderUNIV(r) + println(pderuniv.size) } -linform_test() +// linform_test() // 1 def newStrong_test() { val r2 = (CHAR('b') | ONE)