diff -r 96e93df60954 -r 823d9b19d21c ChengsongTanPhdThesis/Chapters/Chapter2.tex --- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex Mon May 30 17:24:52 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Chapter2.tex Mon May 30 20:36:15 2022 +0100 @@ -277,10 +277,10 @@ \end{definition} \noindent -Assuming the a string is given as a sequence of characters, say $c_0c_1..c_n$, +Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, this algorithm presented graphically is as follows: -\begin{equation}\label{graph:*} +\begin{equation}\label{graph:successive_ders} \begin{tikzcd} r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO} \end{tikzcd} @@ -477,26 +477,145 @@ always match a subpart as much as possible before proceeding to the next token. \end{itemize} -The formal definition of a $\POSIX$ value can be described +The formal definition of a $\POSIX$ value $v$ for a regular expression +$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified in the following set of rules: -\ - +(TODO: write the entire set of inference rules for POSIX ) +\newcommand*{\inference}[3][t]{% + \begingroup + \def\and{\\}% + \begin{tabular}[#1]{@{\enspace}c@{\enspace}} + #2 \\ + \hline + #3 + \end{tabular}% + \endgroup +} +\begin{center} +\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$ }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ } +\end{center} - For example, the above example has the POSIX value -$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$. -The output of an algorithm we want would be a POSIX matching -encoded as a value. The reason why we are interested in $\POSIX$ values is that they can be practically used in the lexing phase of a compiler front end. For instance, when lexing a code snippet $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized as an identifier rather than a keyword. + For example, the above $r= (a^*\cdot a^*)^*$ and +$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value +$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$. +The output of an algorithm we want would be a POSIX matching +encoded as a value. + + + + The contribution of Sulzmann and Lu is an extension of Brzozowski's algorithm by a second phase (the first phase being building successive -derivatives---see \eqref{graph:*}). In this second phase, a POSIX value -is generated in case the regular expression matches the string. -Pictorially, the Sulzmann and Lu algorithm is as follows: +derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value +is generated in case the regular expression matches the string. +How can we construct a value out of regular expressions and character +sequences only? +Two functions are involved: $\inj$ and $\mkeps$. +The function $\mkeps$ constructs a value from the last +one of all the successive derivatives: +\begin{ceqn} +\begin{equation}\label{graph:mkeps} +\begin{tikzcd} +r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\ + & & & v_n +\end{tikzcd} +\end{equation} +\end{ceqn} + +It tells us how can an empty string be matched by a +regular expression, in a $\POSIX$ way: + + \begin{center} + \begin{tabular}{lcl} + $\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\ + $\mkeps(r_{1}+r_{2})$ & $\dn$ + & \textit{if} $\nullable(r_{1})$\\ + & & \textit{then} $\Left(\mkeps(r_{1}))$\\ + & & \textit{else} $\Right(\mkeps(r_{2}))$\\ + $\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\ + $mkeps(r^*)$ & $\dn$ & $\Stars\,[]$ + \end{tabular} + \end{center} + + +\noindent +We favour the left to match an empty string in case there is a choice. +When there is a star for us to match the empty string, +we simply give the $\Stars$ constructor an empty list, meaning +no iterations are taken. + + +After the $\mkeps$-call, we inject back the characters one by one in order to build +the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ +($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. +After injecting back $n$ characters, we get the lexical value for how $r_0$ +matches $s$. The POSIX value is maintained throught out the process. +For this Sulzmann and Lu defined a function that reverses +the ``chopping off'' of characters during the derivative phase. The +corresponding function is called \emph{injection}, written +$\textit{inj}$; it takes three arguments: the first one is a regular +expression ${r_{i-1}}$, before the character is chopped off, the second +is a character ${c_{i-1}}$, the character we want to inject and the +third argument is the value ${v_i}$, into which one wants to inject the +character (it corresponds to the regular expression after the character +has been chopped off). The result of this function is a new value. +\begin{ceqn} +\begin{equation}\label{graph:inj} +\begin{tikzcd} +r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ +v_1 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed] +\end{tikzcd} +\end{equation} +\end{ceqn} + + +\noindent +The +definition of $\textit{inj}$ is as follows: + +\begin{center} +\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} + $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ + $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ + $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\ + $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ + $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ + $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\ + $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\ +\end{tabular} +\end{center} + +\noindent This definition is by recursion on the ``shape'' of regular +expressions and values. +The clauses basically do one thing--identifying the ``holes'' on +value to inject the character back into. +For instance, in the last clause for injecting back to a value +that would turn into a new star value that corresponds to a star, +we know it must be a sequence value. And we know that the first +value of that sequence corresponds to the child regex of the star +with the first character being chopped off--an iteration of the star +that had just been unfolded. This value is followed by the already +matched star iterations we collected before. So we inject the character +back to the first value and form a new value with this new iteration +being added to the previous list of iterations, all under the $\Stars$ +top level. + +Putting all the functions $\inj$, $\mkeps$, $\backslash$ together, +we have a lexer with the following recursive definition: +\begin{center} +\begin{tabular}{lcr} +$\lexer \; r \; [] $ & $=$ & $\mkeps \; r$\\ +$\lexer \; r \;c::s$ & $=$ & $\inj \; r \; c (\lexer (r\backslash c) s)$ +\end{tabular} +\end{center} + +Pictorially, the algorithm is as follows: \begin{ceqn} \begin{equation}\label{graph:2} @@ -524,63 +643,8 @@ for how the empty string has been matched by the (nullable) regular expression $r_n$. This function is defined as - \begin{center} - \begin{tabular}{lcl} - $\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\ - $\mkeps(r_{1}+r_{2})$ & $\dn$ - & \textit{if} $\nullable(r_{1})$\\ - & & \textit{then} $\Left(\mkeps(r_{1}))$\\ - & & \textit{else} $\Right(\mkeps(r_{2}))$\\ - $\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\ - $mkeps(r^*)$ & $\dn$ & $\Stars\,[]$ - \end{tabular} - \end{center} -\noindent -After the $\mkeps$-call, we inject back the characters one by one in order to build -the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ -($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. -After injecting back $n$ characters, we get the lexical value for how $r_0$ -matches $s$. The POSIX value is maintained throught out the process. -For this Sulzmann and Lu defined a function that reverses -the ``chopping off'' of characters during the derivative phase. The -corresponding function is called \emph{injection}, written -$\textit{inj}$; it takes three arguments: the first one is a regular -expression ${r_{i-1}}$, before the character is chopped off, the second -is a character ${c_{i-1}}$, the character we want to inject and the -third argument is the value ${v_i}$, into which one wants to inject the -character (it corresponds to the regular expression after the character -has been chopped off). The result of this function is a new value. The -definition of $\textit{inj}$ is as follows: - -\begin{center} -\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} - $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ - $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ - $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\ - $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ - $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ - $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\ - $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\ -\end{tabular} -\end{center} - -\noindent This definition is by recursion on the ``shape'' of regular -expressions and values. -The clauses basically do one thing--identifying the ``holes'' on -value to inject the character back into. -For instance, in the last clause for injecting back to a value -that would turn into a new star value that corresponds to a star, -we know it must be a sequence value. And we know that the first -value of that sequence corresponds to the child regex of the star -with the first character being chopped off--an iteration of the star -that had just been unfolded. This value is followed by the already -matched star iterations we collected before. So we inject the character -back to the first value and form a new value with this new iteration -being added to the previous list of iterations, all under the $Stars$ -top level. - We have mentioned before that derivatives without simplification can get clumsy, and this is true for values as well--they reflect the regular expressions size by definition. @@ -617,411 +681,9 @@ introducing additional informtaion to the regular expressions called \emph{bitcodes}. -\subsection*{Bit-coded Algorithm} -Bits and bitcodes (lists of bits) are defined as: - -\begin{center} - $b ::= 1 \mid 0 \qquad -bs ::= [] \mid b::bs -$ -\end{center} - -\noindent -The $1$ and $0$ are not in bold in order to avoid -confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or -bit-lists) can be used to encode values (or potentially incomplete values) in a -compact form. This can be straightforwardly seen in the following -coding function from values to bitcodes: - -\begin{center} -\begin{tabular}{lcl} - $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ - $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ - $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\ - $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\ - $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\ - $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\ - $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\; - code(\Stars\,vs)$ -\end{tabular} -\end{center} - -\noindent -Here $\textit{code}$ encodes a value into a bitcodes by converting -$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty -star iteration by $1$. The border where a local star terminates -is marked by $0$. This coding is lossy, as it throws away the information about -characters, and also does not encode the ``boundary'' between two -sequence values. Moreover, with only the bitcode we cannot even tell -whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The -reason for choosing this compact way of storing information is that the -relatively small size of bits can be easily manipulated and ``moved -around'' in a regular expression. In order to recover values, we will -need the corresponding regular expression as an extra information. This -means the decoding function is defined as: - - -%\begin{definition}[Bitdecoding of Values]\mbox{} -\begin{center} -\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} - $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ - $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ - $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & - $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; - (\Left\,v, bs_1)$\\ - $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & - $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; - (\Right\,v, bs_1)$\\ - $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & - $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\ - & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\ - & & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\ - $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ - $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & - $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ - & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\ - & & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\ - - $\textit{decode}\,bs\,r$ & $\dn$ & - $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ - & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; - \textit{else}\;\textit{None}$ -\end{tabular} -\end{center} -%\end{definition} - -Sulzmann and Lu's integrated the bitcodes into regular expressions to -create annotated regular expressions \cite{Sulzmann2014}. -\emph{Annotated regular expressions} are defined by the following -grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$} - -\begin{center} -\begin{tabular}{lcl} - $\textit{a}$ & $::=$ & $\ZERO$\\ - & $\mid$ & $_{bs}\ONE$\\ - & $\mid$ & $_{bs}{\bf c}$\\ - & $\mid$ & $_{bs}\sum\,as$\\ - & $\mid$ & $_{bs}a_1\cdot a_2$\\ - & $\mid$ & $_{bs}a^*$ -\end{tabular} -\end{center} -%(in \textit{ALTS}) - -\noindent -where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular -expressions and $as$ for a list of annotated regular expressions. -The alternative constructor($\sum$) has been generalized to -accept a list of annotated regular expressions rather than just 2. -We will show that these bitcodes encode information about -the (POSIX) value that should be generated by the Sulzmann and Lu -algorithm. - - -To do lexing using annotated regular expressions, we shall first -transform the usual (un-annotated) regular expressions into annotated -regular expressions. This operation is called \emph{internalisation} and -defined as follows: - -%\begin{definition} -\begin{center} -\begin{tabular}{lcl} - $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\ - $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\ - $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\ - $(r_1 + r_2)^\uparrow$ & $\dn$ & - $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\, - \textit{fuse}\,[1]\,r_2^\uparrow]$\\ - $(r_1\cdot r_2)^\uparrow$ & $\dn$ & - $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\ - $(r^*)^\uparrow$ & $\dn$ & - $_{[]}(r^\uparrow)^*$\\ -\end{tabular} -\end{center} -%\end{definition} - -\noindent -We use up arrows here to indicate that the basic un-annotated regular -expressions are ``lifted up'' into something slightly more complex. In the -fourth clause, $\textit{fuse}$ is an auxiliary function that helps to -attach bits to the front of an annotated regular expression. Its -definition is as follows: - -\begin{center} -\begin{tabular}{lcl} - $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\ - $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ & - $_{bs @ bs'}\ONE$\\ - $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ & - $_{bs@bs'}{\bf c}$\\ - $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ & - $_{bs@bs'}\sum\textit{as}$\\ - $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ & - $_{bs@bs'}a_1 \cdot a_2$\\ - $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ & - $_{bs @ bs'}a^*$ -\end{tabular} -\end{center} - -\noindent -After internalising the regular expression, we perform successive -derivative operations on the annotated regular expressions. This -derivative operation is the same as what we had previously for the -basic regular expressions, except that we beed to take care of -the bitcodes: - - -\iffalse - %\begin{definition}{bder} -\begin{center} - \begin{tabular}{@{}lcl@{}} - $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ - $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ - $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ & - $\textit{if}\;c=d\; \;\textit{then}\; - \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ - $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ & - $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\ - $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & - $\textit{if}\;\textit{bnullable}\,a_1$\\ - & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\ - & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ - & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\ - $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ & - $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\, - (\textit{STAR}\,[]\,r)$ -\end{tabular} -\end{center} -%\end{definition} - -\begin{center} - \begin{tabular}{@{}lcl@{}} - $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ - $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ - $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ & - $\textit{if}\;c=d\; \;\textit{then}\; - _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\ - $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ & - $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\ - $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ & - $\textit{if}\;\textit{bnullable}\,a_1$\\ - & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\ - & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ - & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\ - $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ & - $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\, - (_{bs}\textit{STAR}\,[]\,r)$ -\end{tabular} -\end{center} -%\end{definition} -\fi - -\begin{center} - \begin{tabular}{@{}lcl@{}} - $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ - $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ - $(_{bs}{\bf d})\,\backslash c$ & $\dn$ & - $\textit{if}\;c=d\; \;\textit{then}\; - _{bs}\ONE\;\textit{else}\;\ZERO$\\ - $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ & - $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\ - $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & - $\textit{if}\;\textit{bnullable}\,a_1$\\ - & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ - & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ - & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\ - $(_{bs}a^*)\,\backslash c$ & $\dn$ & - $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot - (_{[]}r^*))$ -\end{tabular} -\end{center} - -%\end{definition} -\noindent -For instance, when we do derivative of $_{bs}a^*$ with respect to c, -we need to unfold it into a sequence, -and attach an additional bit $0$ to the front of $r \backslash c$ -to indicate one more star iteration. Also the sequence clause -is more subtle---when $a_1$ is $\textit{bnullable}$ (here -\textit{bnullable} is exactly the same as $\textit{nullable}$, except -that it is for annotated regular expressions, therefore we omit the -definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how -$a_1$ matches the string prior to character $c$ (more on this later), -then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \; a_1 (a_2 -\backslash c)$ will collapse the regular expression $a_1$(as it has -already been fully matched) and store the parsing information at the -head of the regular expression $a_2 \backslash c$ by fusing to it. The -bitsequence $\textit{bs}$, which was initially attached to the -first element of the sequence $a_1 \cdot a_2$, has -now been elevated to the top-level of $\sum$, as this information will be -needed whichever way the sequence is matched---no matter whether $c$ belongs -to $a_1$ or $ a_2$. After building these derivatives and maintaining all -the lexing information, we complete the lexing by collecting the -bitcodes using a generalised version of the $\textit{mkeps}$ function -for annotated regular expressions, called $\textit{bmkeps}$: - - -%\begin{definition}[\textit{bmkeps}]\mbox{} -\begin{center} -\begin{tabular}{lcl} - $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\ - $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ & - $\textit{if}\;\textit{bnullable}\,a$\\ - & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\ - & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\ - $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ & - $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ - $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ & - $bs \,@\, [0]$ -\end{tabular} -\end{center} -%\end{definition} - -\noindent -This function completes the value information by travelling along the -path of the regular expression that corresponds to a POSIX value and -collecting all the bitcodes, and using $S$ to indicate the end of star -iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and -decode them, we get the value we expect. The corresponding lexing -algorithm looks as follows: - -\begin{center} -\begin{tabular}{lcl} - $\textit{blexer}\;r\,s$ & $\dn$ & - $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\ - & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ - & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ - & & $\;\;\textit{else}\;\textit{None}$ -\end{tabular} -\end{center} - -\noindent -In this definition $\_\backslash s$ is the generalisation of the derivative -operation from characters to strings (just like the derivatives for un-annotated -regular expressions). - -Now we introduce the simplifications, which is why we introduce the -bitcodes in the first place. - -\subsection*{Simplification Rules} - -This section introduces aggressive (in terms of size) simplification rules -on annotated regular expressions -to keep derivatives small. Such simplifications are promising -as we have -generated test data that show -that a good tight bound can be achieved. We could only -partially cover the search space as there are infinitely many regular -expressions and strings. - -One modification we introduced is to allow a list of annotated regular -expressions in the $\sum$ constructor. This allows us to not just -delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but -also unnecessary ``copies'' of regular expressions (very similar to -simplifying $r + r$ to just $r$, but in a more general setting). Another -modification is that we use simplification rules inspired by Antimirov's -work on partial derivatives. They maintain the idea that only the first -``copy'' of a regular expression in an alternative contributes to the -calculation of a POSIX value. All subsequent copies can be pruned away from -the regular expression. A recursive definition of our simplification function -that looks somewhat similar to our Scala code is given below: -%\comment{Use $\ZERO$, $\ONE$ and so on. -%Is it $ALTS$ or $ALTS$?}\\ - -\begin{center} - \begin{tabular}{@{}lcl@{}} - - $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ - &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ - &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ - &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ - &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ - &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ - - $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\ - &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ - &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ - &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ - - $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ -\end{tabular} -\end{center} - -\noindent -The simplification does a pattern matching on the regular expression. -When it detected that the regular expression is an alternative or -sequence, it will try to simplify its child regular expressions -recursively and then see if one of the children turns into $\ZERO$ or -$\ONE$, which might trigger further simplification at the current level. -The most involved part is the $\sum$ clause, where we use two -auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested -alternatives and reduce as many duplicates as possible. Function -$\textit{distinct}$ keeps the first occurring copy only and removes all later ones -when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s. -Its recursive definition is given below: - - \begin{center} - \begin{tabular}{@{}lcl@{}} - $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; - (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ - $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\ - $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) -\end{tabular} -\end{center} - -\noindent -Here $\textit{flatten}$ behaves like the traditional functional programming flatten -function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it -removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. - -Having defined the $\simp$ function, -we can use the previous notation of natural -extension from derivative w.r.t.~character to derivative -w.r.t.~string:%\comment{simp in the [] case?} - -\begin{center} -\begin{tabular}{lcl} -$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ -$r \backslash_{simp} [\,] $ & $\dn$ & $r$ -\end{tabular} -\end{center} - -\noindent -to obtain an optimised version of the algorithm: - - \begin{center} -\begin{tabular}{lcl} - $\textit{blexer\_simp}\;r\,s$ & $\dn$ & - $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ - & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ - & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ - & & $\;\;\textit{else}\;\textit{None}$ -\end{tabular} -\end{center} - -\noindent -This algorithm keeps the regular expression size small, for example, -with this simplification our previous $(a + aa)^*$ example's 8000 nodes -will be reduced to just 6 and stays constant, no matter how long the -input string is. - - -%----------------------------------- -% SUBSECTION 1 -%----------------------------------- -\section{Specifications of Certain Functions to be Used} -Here we give some functions' definitions, -which we will use later. -\begin{center} -\begin{tabular}{ccc} -$\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$ -\end{tabular} -\end{center} - - -