diff -r 50e590823220 -r 8016a2480704 ChengsongTanPhdThesis/Chapters/Bitcoded2.tex --- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Mon Jun 06 23:17:45 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Thu Jun 09 12:57:53 2022 +0100 @@ -11,6 +11,119 @@ +Now we introduce the simplifications, which is why we introduce the +bitcodes in the first place. + +\subsection*{Simplification Rules} + +This section introduces aggressive (in terms of size) simplification rules +on annotated regular expressions +to keep derivatives small. Such simplifications are promising +as we have +generated test data that show +that a good tight bound can be achieved. We could only +partially cover the search space as there are infinitely many regular +expressions and strings. + +One modification we introduced is to allow a list of annotated regular +expressions in the $\sum$ constructor. This allows us to not just +delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but +also unnecessary ``copies'' of regular expressions (very similar to +simplifying $r + r$ to just $r$, but in a more general setting). Another +modification is that we use simplification rules inspired by Antimirov's +work on partial derivatives. They maintain the idea that only the first +``copy'' of a regular expression in an alternative contributes to the +calculation of a POSIX value. All subsequent copies can be pruned away from +the regular expression. A recursive definition of our simplification function +that looks somewhat similar to our Scala code is given below: +%\comment{Use $\ZERO$, $\ONE$ and so on. +%Is it $ALTS$ or $ALTS$?}\\ + +\begin{center} + \begin{tabular}{@{}lcl@{}} + + $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ + &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ + &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ + &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ + &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ + &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ + + $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\ + &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ + &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ + &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ + + $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ +\end{tabular} +\end{center} + +\noindent +The simplification does a pattern matching on the regular expression. +When it detected that the regular expression is an alternative or +sequence, it will try to simplify its child regular expressions +recursively and then see if one of the children turns into $\ZERO$ or +$\ONE$, which might trigger further simplification at the current level. +The most involved part is the $\sum$ clause, where we use two +auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested +alternatives and reduce as many duplicates as possible. Function +$\textit{distinct}$ keeps the first occurring copy only and removes all later ones +when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s. +Its recursive definition is given below: + + \begin{center} + \begin{tabular}{@{}lcl@{}} + $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; + (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ + $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\ + $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) +\end{tabular} +\end{center} + +\noindent +Here $\textit{flatten}$ behaves like the traditional functional programming flatten +function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it +removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. + +Having defined the $\simp$ function, +we can use the previous notation of natural +extension from derivative w.r.t.~character to derivative +w.r.t.~string:%\comment{simp in the [] case?} + +\begin{center} +\begin{tabular}{lcl} +$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ +$r \backslash_{simp} [\,] $ & $\dn$ & $r$ +\end{tabular} +\end{center} + +\noindent +to obtain an optimised version of the algorithm: + + \begin{center} +\begin{tabular}{lcl} + $\textit{blexer\_simp}\;r\,s$ & $\dn$ & + $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ + & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ + & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ + & & $\;\;\textit{else}\;\textit{None}$ +\end{tabular} +\end{center} + +\noindent +This algorithm keeps the regular expression size small, for example, +with this simplification our previous $(a + aa)^*$ example's 8000 nodes +will be reduced to just 17 and stays constant, no matter how long the +input string is. + + + + + + + + + %---------------------------------------------------------------------------------------- % SECTION common identities