% etnms/20200205.tex
\documentclass[a4paper,UKenglish]{lipics}
\usepackage{graphicx}
\usepackage{data}
\usepackage{tikz-cd}
\usepackage{tikz}
\usetikzlibrary{graphs}
\usetikzlibrary{graphdrawing}
\usegdlibrary{trees}
%\usepackage{algorithm}
\usepackage{amsmath}
\usepackage{xcolor}
\usepackage[noend]{algpseudocode}
\usepackage{enumitem}
\usepackage{nccmath}
\usepackage{soul}

\definecolor{darkblue}{rgb}{0,0,0.6}
\hypersetup{colorlinks=true,allcolors=darkblue}
\newcommand{\comment}[1]%
{{\color{red}$\Rightarrow$}\marginpar{\raggedright\small{\bf\color{red}#1}}}

% \documentclass{article}
%\usepackage[utf8]{inputenc}
%\usepackage[english]{babel}
%\usepackage{listings}
% \usepackage{amsthm}
%\usepackage{hyperref}
% \usepackage[margin=0.5in]{geometry}
%\usepackage{pmboxdraw}

\title{POSIX Regular Expression Matching and Lexing}
\author{Chengsong Tan}
\affil{King's College London\\
London, UK\\
\texttt{chengsong.tan@kcl.ac.uk}}
\authorrunning{Chengsong Tan}
\Copyright{Chengsong Tan}

\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\def\erase{\textit{erase}}
\def\bders{\textit{bders}}
\def\lexer{\mathit{lexer}}
\def\blexer{\textit{blexer}}
\def\fuse{\textit{fuse}}
\def\flatten{\textit{flatten}}
\def\map{\textit{map}}
\def\blexers{\mathit{blexer\_simp}}
\def\simp{\mathit{simp}}
\def\mkeps{\mathit{mkeps}}
\def\bmkeps{\textit{bmkeps}}
\def\inj{\mathit{inj}}
\def\Empty{\mathit{Empty}}
\def\Left{\mathit{Left}}
\def\Right{\mathit{Right}}
\def\Stars{\mathit{Stars}}
\def\Char{\mathit{Char}}
\def\Seq{\mathit{Seq}}
\def\Der{\mathit{Der}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
\def\flex{\textit{flex}}
\def\rup{r^\uparrow}
\def\retrieve{\textit{retrieve}}
\def\AALTS{\textit{AALTS}}
\def\AONE{\textit{AONE}}
%\theoremstyle{theorem}
%\newtheorem{theorem}{Theorem}
%\theoremstyle{lemma}
%\newtheorem{lemma}{Lemma}
%\newcommand{\lemmaautorefname}{Lemma}
%\theoremstyle{definition}
%\newtheorem{definition}{Definition}
\algnewcommand\algorithmicswitch{\textbf{switch}}
\algnewcommand\algorithmiccase{\textbf{case}}
\algnewcommand\algorithmicassert{\texttt{assert}}
\algnewcommand\Assert[1]{\State \algorithmicassert(#1)}%
% New "environments"
\algdef{SE}[SWITCH]{Switch}{EndSwitch}[1]{\algorithmicswitch\ #1\ \algorithmicdo}{\algorithmicend\ \algorithmicswitch}%
\algdef{SE}[CASE]{Case}{EndCase}[1]{\algorithmiccase\ #1}{\algorithmicend\ \algorithmiccase}%
\algtext*{EndSwitch}%
\algtext*{EndCase}%

       
\begin{document}

\maketitle

       
\begin{abstract}
  Brzozowski introduced in 1964 a beautifully simple algorithm for
  regular expression matching based on the notion of derivatives of
  regular expressions. In 2014, Sulzmann and Lu extended this
  algorithm to not just give a YES/NO answer for whether or not a
  regular expression matches a string, but in case it does, to also
  answer with \emph{how} it matches the string.  This is important for
  applications such as lexing (tokenising a string). The problem is to
  make the algorithm by Sulzmann and Lu fast on all inputs without
  breaking its correctness. Being fast depends on a complete set of
  simplification rules, some of which
  have been put forward by Sulzmann and Lu. We have extended their
  rules in order to obtain a tight bound on the size of regular expressions.
  We have tested these extended rules, but have not
  formally established their correctness. We have also not yet looked
  at extended regular expressions, such as bounded repetitions,
  negation and back-references.
\end{abstract}

       
\section{Recapitulation of Concepts From the Last Report}

\subsection*{Regular Expressions and Derivatives}

Suppose (basic) regular expressions are given by the following grammar:

\[			r ::=   \ZERO \mid  \ONE
			 \mid  c
			 \mid  r_1 \cdot r_2
			 \mid  r_1 + r_2
			 \mid r^*
\]

\noindent
The ingenious contribution of Brzozowski is the notion of \emph{derivatives} of
regular expressions, written~$\_ \backslash \_$. It uses the auxiliary notion of
$\nullable$ defined below.

       
\begin{center}
		\begin{tabular}{lcl}
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
		\end{tabular}
	\end{center}

\begin{center}
\begin{tabular}{lcl}
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
		$d \backslash c$     & $\dn$ &
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, \nullable(r_1)$\\
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}

       
%Assuming the classic notion of a
%\emph{language} of a regular expression, written $L(\_)$, t

\noindent
The main property of the derivative operation is that

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

       
\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and then define Brzozowski's regular-expression matching algorithm as:

\[
\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)
\]

       
\noindent
Assuming that a string is given as a sequence of characters, say $c_0c_1\ldots c_{n-1}$,
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct (that is, given
a string $s = c_0...c_{n-1}$ and a regular expression $r_0$, it generates
YES if and only if $s \in L(r_0)$).

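For concreteness, the matcher above can be sketched in a few lines of Python. The tuple encoding of regular expressions used below is our own choice for illustration and not notation from this report:

```python
# A sketch of Brzozowski's matcher, assuming a tuple encoding of
# regular expressions: ("zero",), ("one",), ("chr", c),
# ("alt", r1, r2), ("seq", r1, r2), ("star", r).
ZERO, ONE = ("zero",), ("one",)

def chr_(c):     return ("chr", c)
def alt(r1, r2): return ("alt", r1, r2)
def seq(r1, r2): return ("seq", r1, r2)
def star(r):     return ("star", r)

def nullable(r):
    # can r match the empty string?
    t = r[0]
    if t == "alt": return nullable(r[1]) or nullable(r[2])
    if t == "seq": return nullable(r[1]) and nullable(r[2])
    return t in ("one", "star")

def der(c, r):
    # the derivative of r with respect to the character c
    t = r[0]
    if t == "chr": return ONE if r[1] == c else ZERO
    if t == "alt": return alt(der(c, r[1]), der(c, r[2]))
    if t == "seq":
        if nullable(r[1]):
            return alt(seq(der(c, r[1]), r[2]), der(c, r[2]))
        return seq(der(c, r[1]), r[2])
    if t == "star": return seq(der(c, r[1]), r)
    return ZERO                     # ZERO and ONE

def matches(s, r):
    # build successive derivatives, then test for nullability
    for c in s:
        r = der(c, r)
    return nullable(r)
```

For example, $(a+aa)^*$ becomes \texttt{star(alt(chr\_("a"), seq(chr\_("a"), chr\_("a"))))}, and \texttt{matches} returns \texttt{True} on \texttt{"aaa"} and \texttt{False} on \texttt{"ab"}.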
       
\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}

One limitation of Brzozowski's algorithm is that it only produces a
YES/NO answer for whether a string is matched by a regular
expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
to allow generation of an actual matching, called a \emph{value} or
sometimes also \emph{lexical value}.  These values and regular
expressions correspond to each other as illustrated in the following
table:

       
\begin{center}
	\begin{tabular}{c@{\hspace{20mm}}c}
		\begin{tabular}{@{}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
			$r$ & $::=$  & $\ZERO$\\
			& $\mid$ & $\ONE$   \\
			& $\mid$ & $c$          \\
			& $\mid$ & $r_1 \cdot r_2$\\
			& $\mid$ & $r_1 + r_2$   \\
			\\
			& $\mid$ & $r^*$         \\
		\end{tabular}
		&
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
			$v$ & $::=$  & \\
			&        & $\Empty$   \\
			& $\mid$ & $\Char(c)$          \\
			& $\mid$ & $\Seq\,v_1\, v_2$\\
			& $\mid$ & $\Left(v)$   \\
			& $\mid$ & $\Right(v)$  \\
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
		\end{tabular}
	\end{tabular}
\end{center}

       
\noindent
The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
is generated in case the regular expression matches the string.
Pictorially, the Sulzmann and Lu algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}

       
\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase, from left
to right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm, from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds the lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as

       
	\begin{center}
		\begin{tabular}{lcl}
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
			$\mkeps(r_{1}+r_{2})$	& $\dn$
			& \textit{if} $\nullable(r_{1})$\\
			& & \textit{then} $\Left(\mkeps(r_{1}))$\\
			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
			$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
		\end{tabular}
	\end{center}

       

\noindent
After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off; the second
is a character ${c_{i-1}}$, the character we want to inject; and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows:

       
\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,\Empty$            & $\dn$ & $\Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$  & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$         & $\dn$  & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

       
\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.

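To make the interplay of the two phases concrete, here is a self-contained Python sketch of $\textit{mkeps}$ and $\textit{inj}$ on top of the derivative operation. The tuple encodings of regular expressions and values are our illustrative assumption, not the report's notation:

```python
# Regexes: ("zero",), ("one",), ("chr", c), ("alt", r1, r2),
# ("seq", r1, r2), ("star", r).  Values: ("Empty",), ("Char", c),
# ("Left", v), ("Right", v), ("Seq", v1, v2), ("Stars", vs).
def nullable(r):
    t = r[0]
    if t == "alt": return nullable(r[1]) or nullable(r[2])
    if t == "seq": return nullable(r[1]) and nullable(r[2])
    return t in ("one", "star")

def der(c, r):
    t = r[0]
    if t == "chr": return ("one",) if r[1] == c else ("zero",)
    if t == "alt": return ("alt", der(c, r[1]), der(c, r[2]))
    if t == "seq":
        if nullable(r[1]):
            return ("alt", ("seq", der(c, r[1]), r[2]), der(c, r[2]))
        return ("seq", der(c, r[1]), r[2])
    if t == "star": return ("seq", der(c, r[1]), r)
    return ("zero",)

def mkeps(r):
    # how the nullable regex r matches the empty string
    t = r[0]
    if t == "one":  return ("Empty",)
    if t == "alt":
        return ("Left", mkeps(r[1])) if nullable(r[1]) else ("Right", mkeps(r[2]))
    if t == "seq":  return ("Seq", mkeps(r[1]), mkeps(r[2]))
    if t == "star": return ("Stars", [])

def inj(r, c, v):
    # inject c back into a value for r\c, obtaining a value for r
    t = r[0]
    if t == "chr":                     return ("Char", c)
    if t == "alt" and v[0] == "Left":  return ("Left", inj(r[1], c, v[1]))
    if t == "alt" and v[0] == "Right": return ("Right", inj(r[2], c, v[1]))
    if t == "seq" and v[0] == "Seq":   return ("Seq", inj(r[1], c, v[1]), v[2])
    if t == "seq" and v[0] == "Left":  # v = Left(Seq(v1, v2))
        return ("Seq", inj(r[1], c, v[1][1]), v[1][2])
    if t == "seq" and v[0] == "Right": return ("Seq", mkeps(r[1]), inj(r[2], c, v[1]))
    if t == "star":                    # v = Seq(v1, Stars vs)
        return ("Stars", [inj(r[1], c, v[1])] + v[2][1])

def lexer(r, s):
    # phase 1: derivatives; phase 2: mkeps followed by injections
    if s == "":
        return mkeps(r) if nullable(r) else None
    v = lexer(der(s[0], r), s[1:])
    return None if v is None else inj(r, s[0], v)
```

Lexing the string $ab$ with the regular expression $a\cdot b$, for instance, produces the value $\Seq(\Char(a),\Char(b))$, while a non-matching string yields no value.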
       
\subsection*{Simplification of Regular Expressions}

The main drawback of building successive derivatives according
to Brzozowski's definition is that they can grow very quickly in size.
This is mainly due to the fact that the derivative operation often
generates ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result, if
implemented naively, both algorithms by Brzozowski and by Sulzmann and Lu
are excruciatingly slow. For example, when starting with the regular
expression $(a + aa)^*$ and building 12 successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree). Operations like
$\textit{der}$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative, the slower the
algorithm.

       
Fortunately, one can simplify regular expressions after each derivative
step. Various simplifications of regular expressions are possible, such
as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
\cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
affect the answer for whether a regular expression matches a string or
not, but fortunately also do not affect the POSIX strategy of how
regular expressions match strings---although the latter is much harder
to establish. Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.

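The effect of such simplifications can be illustrated with a small self-contained Python sketch that applies them bottom-up after each derivative step. The tuple encoding and the node-counting size measure are our own illustrative choices, and the $\ZERO\cdot r \Rightarrow \ZERO$ annihilation rule is a standard addition beyond the rules listed above:

```python
def nullable(r):
    t = r[0]
    if t == "alt": return nullable(r[1]) or nullable(r[2])
    if t == "seq": return nullable(r[1]) and nullable(r[2])
    return t in ("one", "star")

def der(c, r):
    t = r[0]
    if t == "chr": return ("one",) if r[1] == c else ("zero",)
    if t == "alt": return ("alt", der(c, r[1]), der(c, r[2]))
    if t == "seq":
        if nullable(r[1]):
            return ("alt", ("seq", der(c, r[1]), r[2]), der(c, r[2]))
        return ("seq", der(c, r[1]), r[2])
    if t == "star": return ("seq", der(c, r[1]), r)
    return ("zero",)

def simp(r):
    # basic simplification rules, applied bottom-up
    t = r[0]
    if t == "seq":
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ("zero",) or r2 == ("zero",): return ("zero",)  # 0.r / r.0 -> 0
        if r1 == ("one",): return r2                             # 1.r -> r
        if r2 == ("one",): return r1                             # r.1 -> r
        return ("seq", r1, r2)
    if t == "alt":
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ("zero",): return r2                            # 0+r -> r
        if r2 == ("zero",): return r1                            # r+0 -> r
        if r1 == r2: return r1                                   # r+r -> r
        return ("alt", r1, r2)
    return r

def size(r):
    # number of nodes when the regex is viewed as a tree
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

def ders(s, r, simplify=False):
    for c in s:
        r = der(c, r)
        if simplify:
            r = simp(r)
    return r
```

Running this on $(a+aa)^*$ and twelve derivatives w.r.t.~$a$ shows the simplified derivative staying far smaller than the naive one, while agreeing on nullability.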
       
If we want the size of derivatives in Sulzmann and Lu's algorithm to
stay below this bound, we need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, which can then be simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
achieve the same size bound as that of the partial derivatives.

       
In order to implement the idea of ``spilling out alternatives'' and to
make them compatible with the $\textit{inj}$-mechanism, we use
\emph{bitcodes}. They were first introduced by Sulzmann and Lu.
Here bits and bitcodes (lists of bits) are defined as:

\begin{center}
		$b ::=   1 \mid  0 \qquad
bs ::= [] \mid b:bs
$
\end{center}

       
\noindent
The $1$ and $0$ are not in bold in order to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes:

\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
                                                 code(\Stars\,vs)$
\end{tabular}
\end{center}

       
\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
star iteration by $1$. The border where a local star terminates
is marked by $0$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we will
need the corresponding regular expression as extra information. This
means the decoding function is defined as:

       
%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       (\Left\,v, bs_1)$\\
  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       (\Right\,v, bs_1)$\\
  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ &
         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\

  $\textit{decode}\,bs\,r$ & $\dn$ &
     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       \textit{else}\;\textit{None}$
\end{tabular}
\end{center}
%\end{definition}

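Both coding functions can be sketched directly in Python and checked to be inverse to each other on well-formed inputs. The tuple encodings of values and regular expressions are our illustrative assumption:

```python
# Values: ("Empty",), ("Char", c), ("Left", v), ("Right", v),
# ("Seq", v1, v2), ("Stars", vs).  Regexes: ("one",), ("chr", c),
# ("alt", r1, r2), ("seq", r1, r2), ("star", r).
def code(v):
    t = v[0]
    if t in ("Empty", "Char"):
        return []                       # characters are thrown away
    if t == "Left":  return [0] + code(v[1])
    if t == "Right": return [1] + code(v[1])
    if t == "Seq":   return code(v[1]) + code(v[2])
    if t == "Stars":
        if not v[1]:
            return [0]                  # 0 terminates a star
        return [1] + code(v[1][0]) + code(("Stars", v[1][1:]))

def decode1(bs, r):
    # decode': returns (value, remaining bits), guided by r
    t = r[0]
    if t == "one": return ("Empty",), bs
    if t == "chr": return ("Char", r[1]), bs
    if t == "alt":
        if bs[0] == 0:
            v, rest = decode1(bs[1:], r[1])
            return ("Left", v), rest
        v, rest = decode1(bs[1:], r[2])
        return ("Right", v), rest
    if t == "seq":
        v1, bs1 = decode1(bs, r[1])
        v2, bs2 = decode1(bs1, r[2])
        return ("Seq", v1, v2), bs2
    if t == "star":
        if bs[0] == 0:
            return ("Stars", []), bs[1:]
        v, bs1 = decode1(bs[1:], r[1])
        vs, bs2 = decode1(bs1, r)       # vs is ("Stars", [...])
        return ("Stars", [v] + vs[1]), bs2

def decode(bs, r):
    v, rest = decode1(bs, r)
    return v if rest == [] else None    # None plays the role of "None" above
```

For the value $\Stars\,[\Left(\Char(a)), \Right(\Char(b))]$ and the regular expression $(a+b)^*$, the bitcode is $[1,0,1,1,0]$ and decoding it against the regular expression recovers the value.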
       
Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}

\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$  & $\ZERO$\\
                  & $\mid$ & $_{bs}\ONE$\\
                  & $\mid$ & $_{bs}{\bf c}$\\
                  & $\mid$ & $_{bs}\oplus\,as$\\
                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
                  & $\mid$ & $_{bs}a^*$
\end{tabular}
\end{center}
%(in \textit{ALTS})

\noindent
where $bs$ stands for bitcodes, $a$ for \textbf{a}nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\oplus$) has been generalised to
accept a list of annotated regular expressions rather than just two.
We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.

       
To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
is defined as follows:

%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
  $(r_1 + r_2)^\uparrow$ & $\dn$ &
  $_{[]}\oplus[\textit{fuse}\,[0]\,r_1^\uparrow,\,
  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
  $(r^*)^\uparrow$ & $\dn$ &
         $_{[]}(r^\uparrow)^*$\\
\end{tabular}
\end{center}
%\end{definition}

       
\noindent
We use up arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In the
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
attach bits to the front of an annotated regular expression. Its
definition is as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
     $_{bs @ bs'}\ONE$\\
  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
     $_{bs@bs'}{\bf c}$\\
  $\textit{fuse}\;bs\,_{bs'}\oplus\textit{as}$ & $\dn$ &
     $_{bs@bs'}\oplus\textit{as}$\\
  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
     $_{bs@bs'}a_1 \cdot a_2$\\
  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
     $_{bs @ bs'}a^*$
\end{tabular}
\end{center}

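Internalisation together with $\textit{fuse}$ can be sketched as follows in Python. The tuple encoding of annotated regular expressions, whose second component holds the bitcode list, is our own illustrative choice:

```python
# Annotated regexes: ("AZERO",), ("AONE", bs), ("ACHAR", bs, c),
# ("AALTS", bs, [a1, a2, ...]), ("ASEQ", bs, a1, a2), ("ASTAR", bs, a).
def fuse(bs, a):
    # prepend the bits bs to the bitcode at the head of a
    t = a[0]
    if t == "AZERO": return a
    if t == "AONE":  return ("AONE", bs + a[1])
    if t == "ACHAR": return ("ACHAR", bs + a[1], a[2])
    if t == "AALTS": return ("AALTS", bs + a[1], a[2])
    if t == "ASEQ":  return ("ASEQ", bs + a[1], a[2], a[3])
    if t == "ASTAR": return ("ASTAR", bs + a[1], a[2])

def internalise(r):
    # lift an un-annotated regex (tuple encoding as before) into an
    # annotated one; alternatives record 0/1 for left/right branches
    t = r[0]
    if t == "zero": return ("AZERO",)
    if t == "one":  return ("AONE", [])
    if t == "chr":  return ("ACHAR", [], r[1])
    if t == "alt":
        return ("AALTS", [], [fuse([0], internalise(r[1])),
                              fuse([1], internalise(r[2]))])
    if t == "seq":  return ("ASEQ", [], internalise(r[1]), internalise(r[2]))
    if t == "star": return ("ASTAR", [], internalise(r[1]))
```

Internalising $a+b$, for instance, yields an $\oplus$ whose two children carry the bitcodes $[0]$ and $[1]$ respectively.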
       
\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:

       
\iffalse
 %\begin{definition}{bder}
\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
  $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
  $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
       (\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
%\end{definition}

\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\
  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
  $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
       (_{bs}\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
%\end{definition}
\fi

       
\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         _{bs}\ONE\;\textit{else}\;\ZERO$\\
  $(_{bs}\oplus \;\textit{as})\,\backslash c$ & $\dn$ &
  $_{bs}\oplus\;(\textit{as.map}(\backslash c))$\\
  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
					       & &$\textit{then}\;_{bs}\oplus\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
					       & &$\phantom{\textit{then}\;_{bs}\oplus\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
      $_{bs}(\textit{fuse}\, [0] \; (a\,\backslash c))\cdot
       (_{[]}a^*)$
\end{tabular}
\end{center}

%\end{definition}
       
\noindent
For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
we need to unfold it into a sequence,
and attach an additional bit $0$ to the front of $a \backslash c$
to indicate that there is one more star iteration. The sequence clause
is more subtle---when $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it is for annotated regular expressions, therefore we omit the
definition), the derivative becomes an alternative. Assuming that
$\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later),
the right branch of this alternative, which is $\textit{fuse} \; (\bmkeps \;  a_1)\; (a_2
\backslash c)$, collapses the regular expression $a_1$ (as it has
already been fully matched) and stores the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing it there. The
bitsequence $\textit{bs}$, which was initially attached to the
first element of the sequence $a_1 \cdot a_2$, has
now been elevated to the top-level of $\oplus$, as this information will be
needed whichever way the sequence is matched---no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:

       
   593 
       
   594 %\begin{definition}[\textit{bmkeps}]\mbox{}
       
   595 \begin{center}
       
   596 \begin{tabular}{lcl}
       
   597   $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
       
   598   $\textit{bmkeps}\,(_{bs}\oplus a::\textit{as})$ & $\dn$ &
       
   599      $\textit{if}\;\textit{bnullable}\,a$\\
       
   600   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
       
  & &$\textit{else}\;\textit{bmkeps}\,(_{bs}\oplus \textit{as})$\\
       
   602   $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
       
   603      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
       
   604   $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
       
   605      $bs \,@\, [0]$
       
   606 \end{tabular}    
       
   607 \end{center}    
       
   608 %\end{definition}
       
   609 
       
   610 \noindent
       
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value,
collecting all the bitcodes and appending the bit $0$ to indicate the
end of the star iterations. If we take the bitcodes produced by
$\textit{bmkeps}$ and decode them, we get the value we expect. The
corresponding lexing algorithm looks as follows:
       
   617 
       
   618 \begin{center}
       
   619 \begin{tabular}{lcl}
       
   620   $\textit{blexer}\;r\,s$ & $\dn$ &
       
   621       $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
       
   622   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   623   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   624   & & $\;\;\textit{else}\;\textit{None}$
       
   625 \end{tabular}
       
   626 \end{center}
       
   627 
       
   628 \noindent
       
   629 In this definition $\_\backslash s$ is the  generalisation  of the derivative
       
   630 operation from characters to strings (just like the derivatives for un-annotated
       
   631 regular expressions).
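To make the pipeline above concrete, here is a small executable sketch in Python (a sketch of ours; the report's implementation is in Scala, and the tuple encoding as well as the omission of $\textit{decode}$ are simplifications). It follows the derivative, $\textit{bmkeps}$ and $\textit{blexer}$ clauses above and returns the raw bitcode instead of a decoded value:

```python
# Annotated regular expressions as tuples:
#   ("ZERO",), ("ONE", bs), ("CHAR", bs, c),
#   ("ALTS", bs, [a1, ..., an]), ("SEQ", bs, a1, a2), ("STAR", bs, a)
# where bs is a list of bits (0 or 1).

def fuse(bs, a):
    """Prepend the bitsequence bs to the annotated regex a."""
    tag = a[0]
    if tag == "ZERO": return a
    if tag == "ONE":  return ("ONE", bs + a[1])
    if tag == "CHAR": return ("CHAR", bs + a[1], a[2])
    if tag == "ALTS": return ("ALTS", bs + a[1], a[2])
    if tag == "SEQ":  return ("SEQ", bs + a[1], a[2], a[3])
    return ("STAR", bs + a[1], a[2])

def internalise(r):
    """r^up: attach empty bitcodes; mark alternatives with 0 (left) / 1 (right)."""
    tag = r[0]
    if tag == "zero": return ("ZERO",)
    if tag == "one":  return ("ONE", [])
    if tag == "char": return ("CHAR", [], r[1])
    if tag == "alt":
        return ("ALTS", [], [fuse([0], internalise(r[1])),
                             fuse([1], internalise(r[2]))])
    if tag == "seq":  return ("SEQ", [], internalise(r[1]), internalise(r[2]))
    return ("STAR", [], internalise(r[1]))

def bnullable(a):
    tag = a[0]
    if tag in ("ONE", "STAR"): return True
    if tag == "ALTS": return any(bnullable(x) for x in a[2])
    if tag == "SEQ":  return bnullable(a[2]) and bnullable(a[3])
    return False

def bmkeps(a):
    """Collect the bitcode along the POSIX path of a bnullable regex."""
    tag = a[0]
    if tag == "ONE": return a[1]
    if tag == "ALTS":
        first, rest = a[2][0], a[2][1:]
        if bnullable(first): return a[1] + bmkeps(first)
        return bmkeps(("ALTS", a[1], rest))   # bs is added only once
    if tag == "SEQ": return a[1] + bmkeps(a[2]) + bmkeps(a[3])
    return a[1] + [0]                         # STAR: end-of-iterations marker

def bder(c, a):
    tag = a[0]
    if tag in ("ZERO", "ONE"): return ("ZERO",)
    if tag == "CHAR":
        return ("ONE", a[1]) if a[2] == c else ("ZERO",)
    if tag == "ALTS":
        return ("ALTS", a[1], [bder(c, x) for x in a[2]])
    if tag == "SEQ":
        bs, a1, a2 = a[1], a[2], a[3]
        if bnullable(a1):
            return ("ALTS", bs, [("SEQ", [], bder(c, a1), a2),
                                 fuse(bmkeps(a1), bder(c, a2))])
        return ("SEQ", bs, bder(c, a1), a2)
    # STAR: unfold one iteration, marking it with bit 0 as in the text
    return ("SEQ", a[1], fuse([0], bder(c, a[2])), ("STAR", [], a[2]))

def blexer(r, s):
    """Return the bitcode of the POSIX match, or None (decode is omitted)."""
    a = internalise(r)
    for c in s:
        a = bder(c, a)
    return bmkeps(a) if bnullable(a) else None
```

For example, `blexer(("alt", ("char", "a"), ("char", "b")), "b")` yields the bitcode `[1]`, recording that the right alternative matched.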
       
   632 
       
   633 
       
   634 \subsection*{Our Simplification Rules}
       
   635 
       
   636 The main point of the bitcodes and annotated regular expressions is that
       
   637 we can apply rather aggressive (in terms of size) simplification rules
       
   638 in order to keep derivatives small. We have developed such
       
   639 ``aggressive'' simplification rules and generated test data that show
       
   640 that the expected bound can be achieved. Obviously we could only
       
   641 partially cover  the search space as there are infinitely many regular
       
   642 expressions and strings. 
       
   643 
       
   644 One modification we introduced is to allow a list of annotated regular
       
   645 expressions in the $\oplus$ constructor. This allows us to not just
       
   646 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
       
   647 also unnecessary ``copies'' of regular expressions (very similar to
       
   648 simplifying $r + r$ to just $r$, but in a more general setting). Another
       
   649 modification is that we use simplification rules inspired by Antimirov's
       
   650 work on partial derivatives. They maintain the idea that only the first
       
   651 ``copy'' of a regular expression in an alternative contributes to the
       
   652 calculation of a POSIX value. All subsequent copies can be pruned away from
       
   653 the regular expression. A recursive definition of our  simplification function 
       
   654 that looks somewhat similar to our Scala code is given below:
       
   655 %\comment{Use $\ZERO$, $\ONE$ and so on. 
       
   656 %Is it $ALTS$ or $ALTS$?}\\
       
   657 
       
   658 \begin{center}
       
   659   \begin{tabular}{@{}lcl@{}}
       
   660    
       
   661   $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
       
   662    &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
       
   663    &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
       
   664    &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
       
   665    &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
       
   666    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
       
   667 
       
   668   $\textit{simp} \; (_{bs}\oplus \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\
       
   669   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
       
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse}\;bs\;a$ \\
       
   671    &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\oplus \textit{as'}$\\ 
       
   672 
       
   673    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
       
   674 \end{tabular}    
       
   675 \end{center}    
       
   676 
       
   677 \noindent
       
The simplification function pattern-matches on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it first simplifies its children recursively and then checks
whether one of the children has turned into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\oplus$ clause, where we use two
auxiliary functions, $\textit{flatten}$ and $\textit{distinct}$, to open up nested
alternatives and remove as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy and removes all
later duplicates. Function $\textit{flatten}$ opens up nested $\oplus$s.
Its recursive definition is given below:
       
   689 
       
   690  \begin{center}
       
   691   \begin{tabular}{@{}lcl@{}}
       
  $\textit{flatten} \; ((_{bs}\oplus \textit{as}) :: \textit{as'})$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; \textit{as'} $ \\
  $\textit{flatten} \; (\ZERO :: \textit{as'})$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
  $\textit{flatten} \; (a :: \textit{as'})$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) \\
  $\textit{flatten} \; [\,]$ & $\dn$ & $[\,]$
       
   696 \end{tabular}    
       
   697 \end{center}  
       
   698 
       
   699 \noindent
       
   700 Here $\textit{flatten}$ behaves like the traditional functional programming flatten
       
   701 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
       
   702 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
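The $\textit{simp}$, $\textit{flatten}$ and $\textit{distinct}$ clauses above can be sketched in Python as follows (a sketch of ours, not the report's Scala code; we assume that $\textit{distinct}$ compares regular expressions after erasing their bitcodes, and that the bits on a discarded $\ONE$ are carried over when fusing):

```python
# Annotated regexes as tuples: ("ZERO",), ("ONE", bs), ("CHAR", bs, c),
# ("ALTS", bs, [a1, ...]), ("SEQ", bs, a1, a2), ("STAR", bs, a).

def fuse(bs, a):
    tag = a[0]
    if tag == "ZERO": return a
    if tag == "ONE":  return ("ONE", bs + a[1])
    if tag == "CHAR": return ("CHAR", bs + a[1], a[2])
    if tag == "ALTS": return ("ALTS", bs + a[1], a[2])
    if tag == "SEQ":  return ("SEQ", bs + a[1], a[2], a[3])
    return ("STAR", bs + a[1], a[2])

def erase(a):
    """Strip the bitcodes; used as the comparison key in distinct."""
    tag = a[0]
    if tag in ("ZERO", "ONE"): return (tag,)
    if tag == "CHAR": return ("CHAR", a[2])
    if tag == "ALTS": return ("ALTS", tuple(erase(x) for x in a[2]))
    if tag == "SEQ":  return ("SEQ", erase(a[2]), erase(a[3]))
    return ("STAR", erase(a[2]))

def flatten(rs):
    """Open up nested ALTS one level and drop ZEROs."""
    out = []
    for a in rs:
        if a[0] == "ZERO":
            continue
        elif a[0] == "ALTS":
            out.extend(fuse(a[1], x) for x in a[2])
        else:
            out.append(a)
    return out

def distinct(rs):
    """Keep only the first occurring copy of each regex (up to erasure)."""
    seen, out = set(), []
    for a in rs:
        key = erase(a)
        if key not in seen:
            seen.add(key)
            out.append(a)
    return out

def simp(a):
    tag = a[0]
    if tag == "SEQ":
        bs, s1, s2 = a[1], simp(a[2]), simp(a[3])
        if s1[0] == "ZERO" or s2[0] == "ZERO": return ("ZERO",)
        if s1[0] == "ONE": return fuse(bs + s1[1], s2)
        if s2[0] == "ONE": return fuse(bs + s2[1], s1)
        return ("SEQ", bs, s1, s2)
    if tag == "ALTS":
        rs = distinct(flatten([simp(x) for x in a[2]]))
        if rs == []:     return ("ZERO",)
        if len(rs) == 1: return fuse(a[1], rs[0])
        return ("ALTS", a[1], rs)
    return a
```

For instance, `simp` collapses the alternative ${}_{[]}\oplus[\ZERO,\; {}_{[0]}a,\; {}_{[1]}a]$ to the single annotated regex ${}_{[0]}a$: the $\ZERO$ is flattened away and the second copy of $a$ is pruned, keeping only the first copy's bits.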
       
   703 
       
   704 Suppose we apply simplification after each derivative step, and view
       
   705 these two operations as an atomic one: $a \backslash_{simp}\,c \dn
       
   706 \textit{simp}(a \backslash c)$. Then we can use the previous natural
       
   707 extension from derivative w.r.t.~character to derivative
       
   708 w.r.t.~string:%\comment{simp in  the [] case?}
       
   709 
       
   710 \begin{center}
       
   711 \begin{tabular}{lcl}
       
   712 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
       
   713 $r \backslash_{simp} [\,] $ & $\dn$ & $r$
       
   714 \end{tabular}
       
   715 \end{center}
       
   716 
       
   717 \noindent
       
With this we obtain an optimised version of the algorithm:
       
   719 
       
   720  \begin{center}
       
   721 \begin{tabular}{lcl}
       
   722   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
       
   723       $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
       
   724   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   725   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   726   & & $\;\;\textit{else}\;\textit{None}$
       
   727 \end{tabular}
       
   728 \end{center}
       
   729 
       
   730 \noindent
       
This algorithm keeps the size of the regular expressions small. For
example, with this simplification the derivative in our earlier
$(a + aa)^*$ example shrinks from 8668 nodes to just 6, and the size
stays constant, no matter how long the input string is.
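This size behaviour can be checked experimentally. The following self-contained Python sketch (our own; the exact node counts differ from the report's 8668 and 6, since the counting of nodes and the precise simplification steps may differ slightly) re-implements the derivative and $\textit{simp}$ for $(a+aa)^*$ and compares the growth with and without simplification:

```python
# Annotated regexes as tuples (see the definitions in the text):
# ("ZERO",), ("ONE", bs), ("CHAR", bs, c), ("ALTS", bs, [..]),
# ("SEQ", bs, a1, a2), ("STAR", bs, a).

def fuse(bs, a):
    # prepend bs to the bits of a (ZERO carries no bits)
    return a if a[0] == "ZERO" else (a[0], bs + a[1]) + a[2:]

def bnullable(a):
    if a[0] in ("ONE", "STAR"): return True
    if a[0] == "ALTS": return any(bnullable(x) for x in a[2])
    if a[0] == "SEQ":  return bnullable(a[2]) and bnullable(a[3])
    return False

def bmkeps(a):
    if a[0] == "ONE":  return a[1]
    if a[0] == "SEQ":  return a[1] + bmkeps(a[2]) + bmkeps(a[3])
    if a[0] == "STAR": return a[1] + [0]
    first = [x for x in a[2] if bnullable(x)][0]   # first nullable branch
    return a[1] + bmkeps(first)

def bder(c, a):
    if a[0] in ("ZERO", "ONE"): return ("ZERO",)
    if a[0] == "CHAR": return ("ONE", a[1]) if a[2] == c else ("ZERO",)
    if a[0] == "ALTS": return ("ALTS", a[1], [bder(c, x) for x in a[2]])
    if a[0] == "SEQ":
        if bnullable(a[2]):
            return ("ALTS", a[1], [("SEQ", [], bder(c, a[2]), a[3]),
                                   fuse(bmkeps(a[2]), bder(c, a[3]))])
        return ("SEQ", a[1], bder(c, a[2]), a[3])
    return ("SEQ", a[1], fuse([0], bder(c, a[2])), ("STAR", [], a[2]))

def erase(a):
    if a[0] == "ALTS": return ("ALTS", tuple(erase(x) for x in a[2]))
    if a[0] == "SEQ":  return ("SEQ", erase(a[2]), erase(a[3]))
    if a[0] == "STAR": return ("STAR", erase(a[2]))
    return (a[0],) if a[0] in ("ZERO", "ONE") else ("CHAR", a[2])

def simp(a):
    if a[0] == "SEQ":
        bs, s1, s2 = a[1], simp(a[2]), simp(a[3])
        if "ZERO" in (s1[0], s2[0]): return ("ZERO",)
        if s1[0] == "ONE": return fuse(bs + s1[1], s2)
        if s2[0] == "ONE": return fuse(bs + s2[1], s1)
        return ("SEQ", bs, s1, s2)
    if a[0] == "ALTS":
        rs, seen = [], set()
        for x in [simp(y) for y in a[2]]:          # flatten + distinct
            for z in ((fuse(x[1], w) for w in x[2]) if x[0] == "ALTS" else [x]):
                if z[0] != "ZERO" and erase(z) not in seen:
                    seen.add(erase(z)); rs.append(z)
        if rs == []:     return ("ZERO",)
        if len(rs) == 1: return fuse(a[1], rs[0])
        return ("ALTS", a[1], rs)
    return a

def size(a):
    if a[0] == "ALTS": return 1 + sum(size(x) for x in a[2])
    if a[0] == "SEQ":  return 1 + size(a[2]) + size(a[3])
    if a[0] == "STAR": return 1 + size(a[2])
    return 1

# (a + aa)^*, already internalised:
rup = ("STAR", [], ("ALTS", [], [("CHAR", [0], "a"),
                                 ("SEQ", [1], ("CHAR", [], "a"), ("CHAR", [], "a"))]))

plain, simped = rup, rup
plain_sizes, simped_sizes = [], []
for _ in range(10):                 # unsimplified: sizes explode quickly
    plain = bder("a", plain)
    plain_sizes.append(size(plain))
for _ in range(20):                 # simplified: size stabilises
    simped = simp(bder("a", simped))
    simped_sizes.append(size(simped))
```

Running this, the unsimplified derivative grows into the thousands of nodes within ten characters, while the simplified derivative quickly stabilises at a small constant size.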
       
   735 
       
   736 
       
   737 
       
   738 \section{Introduction}
       
   739 
       
   740 While we believe derivatives of regular expressions, written
       
   741 $r\backslash s$, are a beautiful concept (in terms of ease of
       
   742 implementing them in functional programming languages and in terms of
       
   743 reasoning about them formally), they have one major drawback: every
       
   744 derivative step can make regular expressions grow drastically in
       
size. This in turn has a negative effect on the runtime of the
corresponding lexing algorithms. Consider for example the regular
expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The
corresponding derivative already contains 8668 nodes, where we assume
the derivative is given as a tree. The reason for the poor runtime of
       
   750 the derivative-based lexing algorithms is that they need to traverse
       
   751 such trees over and over again. The solution is to find a complete set
       
   752 of simplification rules that keep the sizes of derivatives uniformly
       
   753 small.
       
   754 
       
   755 For reasons beyond this report, it turns out that a complete set of
       
   756 simplification rules depends on values being encoded as
       
   757 bitsequences.\footnote{Values are the results the lexing algorithms
       
   758   generate; they encode how a regular expression matched a string.} We
       
   759 already know that the lexing algorithm using bitsequences but
       
\emph{without} simplification is correct, albeit horribly
slow. Therefore in the past 6 months I have been trying to prove that the
algorithm using bitsequences plus our simplification rules is
also correct. Formally this amounts to showing that
       
   764 
       
   765 \begin{equation}\label{mainthm}
       
   766 \blexers \; r \; s = \blexer \;r\;s
       
   767 \end{equation}
       
   768 
       
   769 \noindent
       
   770 whereby $\blexers$ simplifies (makes derivatives smaller) in each
       
   771 step, whereas with $\blexer$ the size can grow exponentially. This
       
would be an important milestone for my thesis, because we already
have a very good idea how to establish that our set of simplification
rules keeps the size of derivatives below a relatively tight bound.
       
   775 
       
   776 In order to prove the main theorem in \eqref{mainthm}, we need to prove that the
       
   777 two functions produce the same output. The definition of these two  functions 
       
   778 is shown below.
       
   779 
       
   780 \begin{center}
       
   781 \begin{tabular}{lcl}
       
   782   $\textit{blexer}\;r\,s$ & $\dn$ &
       
   783       $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
       
   784   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   785   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   786   & & $\;\;\textit{else}\;\textit{None}$
       
   787 \end{tabular}
       
   788 \end{center}
       
   789 
       
   790 \begin{center}
       
   791 \begin{tabular}{lcl}
       
   792   $\blexers \; r \, s$ &$\dn$ &
       
   793     $\textit{let} \; a = (r^\uparrow)\backslash_{simp}\, s\; \textit{in}$\\
       
   794   & & $\; \; \textit{if} \; \textit{bnullable}(a)$\\
       
   795   & & $\; \; \textit{then} \; \textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   796   & & $\;\;   \textit{else}\;\textit{None}$
       
   797 \end{tabular}
       
   798 \end{center}
       
   799 \noindent
       
In these definitions $(r^\uparrow)$ is a coding function (it
internalises a regular expression into its annotated form) and is the
same in both definitions; similarly $\textit{decode}$ and
$\textit{bmkeps}$ are the same in both. Our main
theorem~\eqref{mainthm} therefore boils down to proving the following
two propositions (depending on which branch the if-else clause
takes). They establish that the derivatives \emph{with} simplification
do not change the computed result:
       
   807 
       
   808 \begin{itemize}
       
\item{(a)} If a string $s$ is in the language $L(r)$, then \\
$\textit{bmkeps}\, ((r^\uparrow)\backslash_{simp}\,s) = \textit{bmkeps}\, ((r^\uparrow)\backslash s)$;\\
\item{(b)} If a string $s$ is not in the language $L(r)$, then 
$\rup \backslash_{simp} \,s$ is not nullable.
       
   813 \end{itemize}
       
   814 
       
   815 \noindent
       
   816 We have already proved the second part  in Isabelle. This is actually
       
not too difficult because we can show that simplification does not
change the language of the regular expression.
       
   819 
       
   820 If we can prove the first part, that is the bitsequence algorithm with
       
   821 simplification produces the same result as the one without
       
simplification, then we are done. Unfortunately that part requires
more effort, because simplification must not only preserve the
language, but also the value (that is, the computed result).
       
   826 
       
   827 %\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
       
   828 %Do you want to keep this? You essentially want to say that the old
       
   829 %method used retrieve, which unfortunately cannot be adopted to 
       
   830 %the simplification rules. You could just say that and give an example.
       
   831 %However you have to think about how you give the example....nobody knows
       
   832 %about AZERO etc yet. Maybe it might be better to use normal regexes
       
   833 %like $a + aa$, but annotate bitsequences as subscript like $_1(_0a + _1aa)$.
       
   834 
       
   835 %\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
       
   836 %REPLY:\\
       
   837 %Yes, I am essentially saying that the old method
       
   838 %cannot be adopted without adjustments.
       
   839 %But this does not mean we should skip
       
   840 %the proof of the bit-coded algorithm
       
   841 %as it is still the main direction we are looking into
       
   842 %to prove things. We are trying to modify
       
   843 %the old proof to suit our needs, but not give 
       
   844 %up it totally, that is why i believe the old 
       
   845 %proof is fundamental in understanding
       
   846 %what we are doing in the past 6 months.
       
   847 %\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
       
   848 
       
   849 \subsubsection*{Existing Proof}
       
   850 
       
For this we started by looking at the original proof that
established that the bitsequence algorithm produces the same result
as the algorithm not using bitsequences. Formally this proof
established
       
   855 
       
   856 \begin{equation}\label{lexer}
       
\blexer \; (r^\uparrow) \; s = \lexer \;r \;s
       
   858 \end{equation}
       
   859 
       
   860 %\noindent
       
   861 %might provide us insight into proving 
       
   862 %\begin{center}
       
   863 %$\blexer \; r^\uparrow \;s = \blexers \; r^\uparrow \;s$
       
   864 %\end{center}
       
   865 
       
   866 \noindent
       
   867 The proof uses two ``tricks''. One is that it uses a \flex-function
       
   868 
       
   869 \begin{center}
       
   870 \begin{tabular}{lcl}
       
   871 $\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \;  (r\backslash c) \;(\lambda v. f (inj \; r \; c \; v)) \;s$ \\
       
   872 $\textit{flex} \;r\; f\;  [\,] $ & $\dn$ & $f$
       
   873 \end{tabular}
       
   874 \end{center}
       
   875 
       
   876 \noindent
       
and then proves for the right-hand side in \eqref{lexer} that

\begin{center}
$\lexer \;r\; s = \flex \;r\;\textit{id}\;s \;(\mkeps \; (r\backslash s))$.
\end{center}
       
   882 
       
   883 
       
   884 
       
   885 
       
   886 \noindent
       
The $\flex$-function essentially does lexing by
stacking up injection functions while doing derivatives,
explicitly recording the order in which characters are
injected back in each step.
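The way $\flex$ stacks up injection functions can be made concrete with a small Python sketch (ours, for unannotated regular expressions and values; $\textit{inj}$ and $\mkeps$ follow their standard definitions, which this report assumes as known):

```python
# Regexes: ("zero",), ("one",), ("char", c), ("alt", r1, r2),
#          ("seq", r1, r2), ("star", r)
# Values:  ("Empty",), ("Char", c), ("Left", v), ("Right", v),
#          ("Seq", v1, v2), ("Stars", [v1, ...])

def nullable(r):
    t = r[0]
    if t in ("one", "star"): return True
    if t == "alt": return nullable(r[1]) or nullable(r[2])
    if t == "seq": return nullable(r[1]) and nullable(r[2])
    return False

def der(c, r):
    t = r[0]
    if t in ("zero", "one"): return ("zero",)
    if t == "char": return ("one",) if r[1] == c else ("zero",)
    if t == "alt":  return ("alt", der(c, r[1]), der(c, r[2]))
    if t == "seq":
        if nullable(r[1]):
            return ("alt", ("seq", der(c, r[1]), r[2]), der(c, r[2]))
        return ("seq", der(c, r[1]), r[2])
    return ("seq", der(c, r[1]), r)            # star unfolds once

def mkeps(r):
    t = r[0]
    if t == "one":  return ("Empty",)
    if t == "alt":
        return ("Left", mkeps(r[1])) if nullable(r[1]) else ("Right", mkeps(r[2]))
    if t == "seq":  return ("Seq", mkeps(r[1]), mkeps(r[2]))
    return ("Stars", [])                       # star

def inj(r, c, v):
    """Transform a value for der(c, r) into a value for r."""
    t = r[0]
    if t == "char": return ("Char", r[1])
    if t == "alt":
        return (v[0], inj(r[1] if v[0] == "Left" else r[2], c, v[1]))
    if t == "seq":
        if v[0] == "Seq":  return ("Seq", inj(r[1], c, v[1]), v[2])
        if v[0] == "Left": return ("Seq", inj(r[1], c, v[1][1]), v[1][2])
        return ("Seq", mkeps(r[1]), inj(r[2], c, v[1]))   # Right case
    # star: v = Seq(v1, Stars vs)
    return ("Stars", [inj(r[1], c, v[1])] + v[2][1])

def flex(r, f, s):
    """Stack up an injection for each character while taking derivatives."""
    if s == "": return f
    c, rest = s[0], s[1:]
    return flex(der(c, r), lambda v: f(inj(r, c, v)), rest)

def lexer(r, s):
    d = r
    for c in s:
        d = der(c, d)
    return flex(r, lambda v: v, s)(mkeps(d)) if nullable(d) else None
```

For instance, `lexer` applied to $a \cdot b$ and the string $ab$ first builds the stacked injection function via `flex` and then applies it to the `mkeps`-value of the final derivative, yielding the value $\Seq(\Char\,a)(\Char\,b)$.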
       
   897 
       
   898 %\noindent
       
   899 %$\flex$ focuses on the injections instead of the derivatives ,
       
   900 %compared to the original definition of $\lexer$, which puts equal
       
   901 %amount of emphasis on injection and derivative with respect to each
       
   902 %character:
       
   903 
       
   904 %\begin{center}
       
   905 %\begin{tabular}{lcl}
       
   906 %$\textit{lexer} \; r\; (c\!::\!s) $ & $\dn$ & $\textit{case} \; \lexer \; (r\backslash c) \;s \; %\textit{of}$ \\
       
   907 % & & $\textit{None} \; \Longrightarrow \; \textit{None}$\\
       
   908 %  & & $\textbar \; v \; \Longrightarrow \; \inj \; r\;c\;v$\\
       
   909 %$\textit{lexer} \; r\;  [\,] $ & $\dn$ & $\textit{if} \; \nullable (r) \; \textit{then} \; \mkeps% (r) \; \textit{else} \;None$
       
   910 %\end{tabular}
       
   911 %\end{center}
       
   912 
       
   913 \noindent
       
   914 The crux in the existing proof is how $\flex$ relates to injection, namely
       
   915 
       
   916 \begin{center}
       
   917 $\flex \; r \; id \; (s@[c]) \; v = \flex \;  r \; id \; s \; (inj \; (r\backslash s) \; c\; v)$.
       
   918 \end{center}
       
   919 
       
   920 \noindent
       
   921 This property allows one to rewrite an induction hypothesis like 
       
   922 
       
   923 \begin{center} 
       
   924 $ \flex \; r\; id\; (s@[c])\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$
       
   925 \end{center}
       
   926 
       
   927 
       
   928 \subsubsection{Retrieve Function}
       
The crucial point is to find the
$\textit{POSIX}$ information of a regular expression, and how it is modified,
augmented and propagated 
during simplification, in parallel with the derivatives of the regular
expression that has not been simplified.  To aid this,
we use the helper function $\textit{retrieve}$ described by Sulzmann and Lu:
       
   935 \begin{center}
       
   936 \begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
       
   937   $\textit{retrieve}\,(_{bs}\ONE)\,\Empty$ & $\dn$ & $bs$\\
       
  $\textit{retrieve}\,(_{bs}{\bf c})\,(\Char\,c)$ & $\dn$ & $bs$\\
       
   939   $\textit{retrieve}\,(_{bs}\oplus a::as)\,(\Left\,v)$ & $\dn$ &
       
   940      $bs \,@\, \textit{retrieve}\,a\,v$\\
       
   941   $\textit{retrieve}\,(_{bs}\oplus a::as)\,(\Right\,v)$ & $\dn$ &
       
   942   $\textit{bs} \,@\, \textit{retrieve}\,(_{[]}\oplus as)\,v$\\
       
   943   $\textit{retrieve}\,(_{bs}a_1\cdot a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ &
       
   944      $bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
       
   945   $\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,[])$ & $\dn$ &
       
   946      $bs \,@\, [0]$\\
       
   947   $\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
       
   948   \multicolumn{3}{l}{
       
   949      \hspace{3cm}$bs \,@\, [1] \,@\, \textit{retrieve}\,a\,v\,@\,
       
   950                     \textit{retrieve}\,(_{[]}a^*)\,(\Stars\,vs)$}\\
       
   951 \end{tabular}
       
   952 \end{center}
       
   953 %\comment{Did not read further}\\
       
   954 This function assembles the bitcode 
       
   955 %that corresponds to a lexical value for how
       
   956 %the current derivative matches the suffix of the string(the characters that
       
   957 %have not yet appeared, but will appear as the successive derivatives go on.
       
   958 %How do we get this "future" information? By the value $v$, which is
       
   959 %computed by a pass of the algorithm that uses
       
   960 %$inj$ as described in the previous section).  
       
   961 using information from both the derivative regular expression and the
       
value. Sulzmann and Lu proposed this function, but did not prove
       
   963 anything about it. Ausaf and Urban used it to connect the bitcoded
       
   964 algorithm to the older algorithm by the following equation:
       
   965  
       
 \begin{center} $\textit{inj} \;r\; c \; v = \textit{decode} \; (\textit{retrieve}\;
	 ((r^\uparrow)\backslash c)\,v)\,r$ 
 \end{center} 
       
   969 
       
   970 \noindent
       
whereby $r^\uparrow$ stands for the internalised version of $r$. Ausaf
and Urban used this fact to prove the correctness of the bitcoded
algorithm without simplification.  Our purpose in using it, however,
is to establish 
       
   975 
       
   976 \begin{center}
       
   977 $ \textit{retrieve} \;
       
   978 a \; v \;=\; \textit{retrieve}  \; (\textit{simp}\,a) \; v'.$
       
   979 \end{center}
       
The idea is that using $v'$, a simplified version of $v$ that has gone
through the same simplification steps as $\textit{simp}(a)$, we are able
to extract the bitcode that gives the same parsing information as the
unsimplified one. However, we noticed that constructing such a $v'$
from $v$ is not so straightforward; if it can be done, it might
finally allow us to bridge the gap.

\noindent
By using a property of $\retrieve$, the $\textit{RHS}$ of the above
induction hypothesis becomes
$\textit{decode}\, (\textit{retrieve}\, (r^\uparrow \backslash(s @ [c]))\, v)\, r$,
and this gives the main lemma result:
       
   991 
       
   992 \begin{center}
       
   993 $ \flex \;r\;  id \; (s@[c]) \; v =\textit{decode}(\textit{retrieve} (\rup \backslash (s@[c])) \;v) r$
       
   994 \end{center}
       
   995 
       
   996 
       
   997 
       
   998 
       
   999 \noindent
       
  1000 To use this lemma result for our 
       
  1001 correctness proof, simply replace the $v$ in the
       
  1002 $\textit{RHS}$ of the above equality with
       
  1003 $\mkeps\;(r\backslash (s@[c]))$, and apply the lemma that
       
  1004  
       
  1005 \begin{center}
       
$\textit{decode} \; (\bmkeps \; \rup) \; r = \textit{decode} \; (\textit{retrieve} \; \rup \; (\mkeps\; r)) \;r$
       
  1007 \end{center}
       
  1008 \noindent
       
  1009 We get the correctness of our bit-coded algorithm:
       
  1010 \begin{center}
       
$\flex \;r\;  id \; s \; (\mkeps \; (r\backslash s)) = \textit{decode} \; (\bmkeps \; (\rup\backslash s)) \; r$
       
  1012 \end{center}
       
  1013 \noindent
       
The bridge between the above chain of equalities
is the use of $\retrieve$.
If we want to use a similar technique for the 
simplified version of the algorithm,
we face the problem that in the above 
equalities
$\retrieve \; a \; v$ is not always defined.
For example,
$\retrieve \; _0(_1a+_0a) \; \Left(\Empty)$
is defined, but $\retrieve \; (_{01}a) \;\Left(\Empty)$ is not,
though we can extract the same POSIX
bits from the two annotated regular expressions.
The latter situation might occur when we try to retrieve from 
a simplified regular expression using the same value
as for the unsimplified one.
This is because $\Left(\Empty)$ corresponds to
the regular expression structure $\ONE+r_2$ rather than
$\ONE$.
       
That means, if we 
want to prove that 
\begin{center}
$\textit{decode} \; (\bmkeps \; (\rup\backslash s)) \; r = \textit{decode} \; (\bmkeps \; (\rup\backslash_{simp} s)) \; r$
\end{center}
\noindent
holds by using $\retrieve$,
we probably need to prove an equality like the one below:
       
  1040 \begin{center}
       
  1041 %$\retrieve \; \rup\backslash_{simp} s \; \mkeps(r\backslash_{simp} s)=\textit{retrieve} \; \rup\backslash s \; \mkeps(r\backslash s)$
       
  1042 $\retrieve \; \rup\backslash_{simp} s \; \mkeps(f(r\backslash s))=\textit{retrieve} \; \rup\backslash s \; \mkeps(r\backslash s)$
       
  1043 \end{center}
       
  1044 \noindent
       
Here $f$ rectifies $r\backslash s$ so that the value
$\mkeps(f(r\backslash s))$ becomes simpler, making the $\retrieve$
function defined.
       
  1048 \subsubsection{Ways to Rectify Value}
       
  1049 One way to do this is to prove the following:
       
  1050 \begin{center}
       
  1051 $\retrieve \; \rup\backslash_{simp} s \; \mkeps(\simp(r\backslash s))=\textit{retrieve} \; \rup\backslash s \; \mkeps(r\backslash s)$
       
  1052 \end{center}
       
  1053 \noindent
       
The reason we choose $\simp$ as $f$ is that
$\rup\backslash_{simp} \, s$ and $\simp(\rup\backslash \, s)$
have the same shape:
       
  1057 \begin{center}
       
  1058 $\erase (\rup\backslash_{simp} \, s) = \erase(\simp(\rup\backslash s))$
       
  1059 \end{center}
       
  1060 
       
  1061 \noindent
       
$\erase$ in the above equality removes the bit-codes
from an annotated regular expression and keeps only the original
regular expression (just like ``erasing'' the bits). Its definition is omitted.
$\rup\backslash_{simp} \, s$ and $\simp(\rup\backslash s)$
are very closely related, but not identical.
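The omitted $\erase$ function can be sketched in Python as follows (a sketch of ours; for simplicity we erase an $n$-ary $\oplus$ to an $n$-ary alternative rather than nesting it into binary $+$s):

```python
# Annotated regexes: ("ZERO",), ("ONE", bs), ("CHAR", bs, c),
# ("ALTS", bs, [a1, ...]), ("SEQ", bs, a1, a2), ("STAR", bs, a).

def erase(a):
    """Drop all bitcodes, keeping only the regular expression structure."""
    tag = a[0]
    if tag == "ZERO": return ("zero",)
    if tag == "ONE":  return ("one",)
    if tag == "CHAR": return ("char", a[2])
    if tag == "ALTS": return ("alts", [erase(x) for x in a[2]])
    if tag == "SEQ":  return ("seq", erase(a[2]), erase(a[3]))
    return ("star", erase(a[2]))
```

With this, two annotated regular expressions that differ only in where their bits sit, such as ${}_0(_0\ONE +_{11}a^*)$ and $(_{00}\ONE +_{011}a^*)$, erase to the same plain regular expression.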
       
  1067 \subsubsection{Example for $\rup\backslash_{simp} \, s \neq \simp(\rup\backslash s)$}
       
For example, let $r$ be the regular expression
$(a+b)(a+a^*)$ and $s$ be the string $aa$. Then
both $\erase (\rup\backslash_{simp} \, s)$ and $\erase (\simp (\rup\backslash s))$
are $\ONE + a^*$. However, without $\erase$ 
       
  1072 \begin{center}
       
  1073 $\rup\backslash_{simp} \, s$ is equal to $_0(_0\ONE +_{11}a^*)$
       
  1074 \end{center}
       
  1075 \noindent
       
  1076 whereas
       
  1077 \begin{center}
       
  1078 $\simp(\rup\backslash  s)$ is equal to $(_{00}\ONE +_{011}a^*)$
       
  1079 \end{center}
       
  1080 \noindent
       
  1081 (For the sake of visual simplicity, we use numbers to denote the bits
       
  1082 in bitcodes as we have previously defined for annotated 
       
  1083 regular expressions. $\S$ is replaced by 
       
  1084 subscript $_1$ and $\Z$ by $_0$.)
       
  1085 
       
  1086 What makes the difference?
       
  1087 
       
  1088 %Two "rules" might be inferred from the above example.
       
  1089 
       
  1090 %First, after erasing the bits the two regular expressions
       
  1091 %are exactly the same: both become $1+a^*$. Here the 
       
  1092 %function $\simp$ exhibits the "one in the end equals many times
       
  1093 %at the front"
       
  1094 %property: one simplification in the end causes the 
       
  1095 %same regular expression structure as
       
  1096 %successive simplifications done alongside derivatives.
       
  1097 %$\rup\backslash_{simp} \, s$ unfolds to 
       
  1098 %$\simp((\simp(r\backslash a))\backslash a)$
       
  1099 %and $\simp(\rup\backslash s)$ unfolds to 
       
  1100 %$\simp((r\backslash a)\backslash a)$. The one simplification
       
  1101 %in the latter causes the resulting regular expression to 
       
  1102 %become $1+a^*$, exactly the same as the former with
       
  1103 %two simplifications.
       
  1104 
       
  1105 %Second, the bit-codes are different, but they are essentially
       
  1106 %the same: if we push the outmost bits ${\bf_0}(_0\ONE +_{11}a^*)$ of $\rup\backslash_{simp} \, s$
       
  1107 %inside then we get $(_{00}\ONE +_{011}a^*)$, exactly the 
       
  1108 %same as that of $\rup\backslash \, s$. And this difference 
       
  1109 %does not matter when we try to apply $\bmkeps$ or $\retrieve$
       
  1110 %to it. This seems a good news if we want to use $\retrieve$
       
  1111 %to prove things.
       
  1112 
       
  1113 %If we look into the difference above, we could see that the
       
  1114 %difference is not fundamental: the bits are just being moved
       
  1115 %around in a way that does not hurt the correctness.
       
During the first derivative operation,
$\rup\backslash a=(_0\ONE  + \ZERO)(_0a  +  _1a^*)$ is
a sequence regular expression with
two components, the first
one, $_0\ONE + \ZERO$, being nullable.
Recall the definition of the simplification function:
\begin{center}
  \begin{tabular}{@{}lcl@{}}

  $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow  \textit{SEQ} \; bs \; a_1' \;  a_2'$ \\

  $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\
  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
   &&$\quad\textit{case} \;  as' \Rightarrow  \textit{ALTS}\;bs\;as'$\\

   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}

\noindent
and the definition of $\flatten$:
 \begin{center}
 \begin{tabular}{c c c}
 $\flatten \; []$ & $\dn$ & $[]$\\
 $\flatten \; \ZERO::rs$ & $\dn$ & $\flatten \; rs$\\
 $\flatten \;(_{\textit{bs}_1}\oplus \textit{rs}_1 ::rs)$ & $\dn$ & $(\map \, (\fuse \, \textit{bs}_1) \,\textit{rs}_1) ::: \flatten(rs)$\\
 $\flatten \; r :: rs$ & $\dn$ & $r::\flatten(rs)$
 \end{tabular}
 \end{center}
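To make the walk-through below concrete, here is a small executable sketch of $\simp$ and $\flatten$. The tuple-based encoding of annotated regular expressions is our own illustration, not the formalised data structure, and $\textit{distinct}$ deduplicates by plain equality rather than by erased regular expressions:

```python
# Annotated regexes as nested tuples (a hypothetical encoding; bits are
# tuples of 0s and 1s):
# ('ZERO',), ('ONE', bs), ('CHAR', bs, c), ('ALTS', bs, rs),
# ('SEQ', bs, a1, a2), ('STAR', bs, a)

def fuse(bs, a):
    # prepend the bits bs to the top-level bits of a (ZERO carries no bits)
    if a[0] == 'ZERO':
        return a
    return (a[0], tuple(bs) + a[1]) + a[2:]

def flatten(rs):
    # drop ZEROs and open up nested alternatives, fusing their bits inwards
    if not rs:
        return []
    r, rest = rs[0], list(rs[1:])
    if r[0] == 'ZERO':
        return flatten(rest)
    if r[0] == 'ALTS':
        return [fuse(r[1], x) for x in r[2]] + flatten(rest)
    return [r] + flatten(rest)

def distinct(rs):
    # the real algorithm deduplicates up to erasure; plain equality here
    out = []
    for r in rs:
        if r not in out:
            out.append(r)
    return out

def simp(a):
    if a[0] == 'SEQ':
        bs, s1, s2 = a[1], simp(a[2]), simp(a[3])
        if s1[0] == 'ZERO' or s2[0] == 'ZERO':
            return ('ZERO',)
        if s1[0] == 'ONE':                      # the ONE's own bits travel too,
            return fuse(tuple(bs) + s1[1], s2)  # as in the worked example
        if s2[0] == 'ONE':
            return fuse(tuple(bs) + s2[1], s1)
        return ('SEQ', tuple(bs), s1, s2)
    if a[0] == 'ALTS':
        rs = distinct(flatten([simp(r) for r in a[2]]))
        if not rs:
            return ('ZERO',)
        if len(rs) == 1:
            return fuse(a[1], rs[0])
        return ('ALTS', tuple(a[1]), tuple(rs))
    return a

# the worked example: (0ONE + ZERO) . (0a + 1a*)
a2 = ('ALTS', (), (('CHAR', (0,), 'a'), ('STAR', (1,), ('CHAR', (), 'a'))))
r1 = ('SEQ', (), ('ALTS', (), (('ONE', (0,)), ('ZERO',))), a2)
```

Running `simp(r1)` reproduces the result of the text: the $\ZERO$ disappears, the $_0\ONE$ is eliminated, and its bit $_0$ ends up on the front of the alternative, giving ${\bf_0}(_0a + _1a^*)$.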
       

\noindent
If we call $\simp$ on $\rup\backslash a$, just as $\backslash_{simp}$
requires, then we go through the third clause of
the sequence case: $\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$.
The $\ZERO$ in $(_0\ONE  + \ZERO)$ is thrown away
by $\flatten$ and
$_0\ONE$ is merged into $(_0a  +  _1a^*)$ by simply
putting its bits ($_0$) to the front of the second component, giving
${\bf_0}(_0a  +  _1a^*)$.
After a second derivative operation,
namely, $(_0(_0a  +  _1a^*))\backslash a$, we get
$_0(_0 \ONE  +  _1(_1\ONE \cdot a^*))$,
and this simplifies to $_0(_0 \ONE  +  _{11} a^*)$
by the third clause of the alternative case:
$\quad\textit{case} \;  as' \Rightarrow  \textit{ALTS}\;bs\;as'$.
The outermost bit $_0$ stays with
the outermost regular expression, rather than being fused to
its child regular expressions, as we will later see happen
to $\simp(\rup\backslash \, s)$.
If we choose not to simplify in the midst of derivative operations,
but only do it at the end after the string has been exhausted,
namely, computing $\simp(\rup\backslash \, s)=\simp((\rup\backslash a)\backslash a)$,
then at the {\bf second} derivative step
$(\rup\backslash a)\bf{\backslash a}$
we go through the following clause of $\backslash$:
\begin{center}
\begin{tabular}{lcl}
$(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
     $(\textit{when} \; \textit{bnullable}\,a_1)$\\
					       & &$\textit{ALTS}\,bs\,List(\;\;(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
					       & &$(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))\;\;)$\\
\end{tabular}
\end{center}
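The bitcoded derivative itself can also be sketched executably. The following minimal Python rendering of $\backslash$, $\textit{bnullable}$ and $\bmkeps$ (same hypothetical tuple encoding as before, not the formalised definitions) reproduces the first derivative of the worked example $\rup = (a+b)(a+a^*)$:

```python
# Bitcoded regexes as tuples (bits are tuples of 0/1):
# ('ZERO',), ('ONE', bs), ('CHAR', bs, c),
# ('ALTS', bs, rs), ('SEQ', bs, a1, a2), ('STAR', bs, a)

def fuse(bs, a):
    if a[0] == 'ZERO':
        return a
    return (a[0], tuple(bs) + a[1]) + a[2:]

def bnullable(a):
    t = a[0]
    if t in ('ONE', 'STAR'): return True
    if t in ('ZERO', 'CHAR'): return False
    if t == 'ALTS': return any(bnullable(r) for r in a[2])
    return bnullable(a[2]) and bnullable(a[3])           # SEQ

def bmkeps(a):
    # collect the bits of an empty-string match (assumes bnullable(a))
    t = a[0]
    if t == 'ONE': return a[1]
    if t == 'ALTS':                                      # first nullable branch
        r = next(r for r in a[2] if bnullable(r))
        return a[1] + bmkeps(r)
    if t == 'SEQ': return a[1] + bmkeps(a[2]) + bmkeps(a[3])
    return a[1] + (1,)                                   # STAR

def bder(c, a):
    t = a[0]
    if t in ('ZERO', 'ONE'): return ('ZERO',)
    if t == 'CHAR':
        return ('ONE', a[1]) if a[2] == c else ('ZERO',)
    if t == 'ALTS':
        return ('ALTS', a[1], tuple(bder(c, r) for r in a[2]))
    if t == 'SEQ':
        bs, a1, a2 = a[1], a[2], a[3]
        if bnullable(a1):                                # the clause shown above
            return ('ALTS', bs,
                    (('SEQ', (), bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
        return ('SEQ', bs, bder(c, a1), a2)
    return ('SEQ', a[1], fuse((0,), bder(c, a[2])), ('STAR', (), a[2]))  # STAR

# rup for r = (a+b)(a+a*)
a2  = ('ALTS', (), (('CHAR', (0,), 'a'), ('STAR', (1,), ('CHAR', (), 'a'))))
rup = ('SEQ', (), ('ALTS', (), (('CHAR', (0,), 'a'), ('CHAR', (1,), 'b'))), a2)
d1  = bder('a', rup)
```

Here `d1` is exactly $(_0\ONE + \ZERO)(_0a + _1a^*)$, whose unsimplified first component is nullable, so the second derivative takes the $\textit{bnullable}$ branch of the sequence clause.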
       

because
$\rup\backslash a = (_0\ONE  + \ZERO)(_0a  +  _1a^*)$
is a sequence
whose first component is nullable
(it is unsimplified, unlike in the first round of running $\backslash_{simp}$).
Therefore $((_0\ONE  + \ZERO)(_0a  +  _1a^*))\backslash a$ splits into
$([(\ZERO + \ZERO)\cdot(_0a  +  _1a^*)] + _0( _0\ONE  + _1[_1\ONE \cdot a^*]))$.
After these two successive derivatives without simplification,
we apply $\simp$ to this regular expression, which goes through
the alternative clause, and each component of
$([(\ZERO + \ZERO)\cdot(_0a  +  _1a^*)] + _0( _0\ONE  + _1[_1\ONE \cdot a^*]))$
is simplified, giving us the list $[\ZERO, _0(_0\ONE  + _{11}a^*)]$.
This list is then ``flattened''---the $\ZERO$ is
thrown away by $\textit{flatten}$, and $ _0(_0\ONE  + _{11}a^*)$
is opened up into a list of two separate elements,
$_{00}\ONE$ and $_{011}a^*$; note that $\flatten$
$\fuse$s the bit $_0$ to the front of $_0\ONE $ and $_{11}a^*$.
The order of simplification, which impacts the order in which alternatives
are opened up, causes
the bits to be moved differently.
       

 \subsubsection{A Failed Attempt To Remedy the Problem Above}
A simple class of regular expression and string
pairs $(r, s)$ that trigger the difference between
$\rup\backslash_{simp} \, s$
and  $\simp(\rup\backslash s)$ can be deduced from the above example:
\begin{center}
\begin{tabular}{lcl}
$D =\{ (r_1 \cdot r_2,\; [c_1c_2]) \mid $ & $\simp(r_2) = r_2, \simp(r_1 \backslash c_1) = \ONE,$\\
 $r_1 \; \textit{not} \; \nullable, c_2 \in L(r_2),$ & $\exists \textit{rs},\textit{bs}.\;  r_2 \backslash c_2 = \;_{bs}{\oplus rs},$\\
$\exists \textit{rs}_1. \; \simp(r_2 \backslash c_2) = \;_{bs}{\oplus \textit{rs}_1}$ &  $\textit{and} \;\simp(r_1 \backslash [c_1c_2]) = \ZERO\}$\\
\end{tabular}
\end{center}
       
We take a pair $(r, \;s)$ from the set $D$.

Now we compute ${\bf \rup \backslash_{simp} s}$:
\begin{center}
\begin{tabular}{lcl}
$(r_1\cdot r_2)\backslash_{simp} \, [c_1c_2]$ & $= \simp\left[ \big(\simp\left[ \left( r_1\cdot r_2 \right) \backslash c_1\right] \big)\backslash c_2\right]$\\
								      & $= \simp\left[ \big(\simp \left[  \left(r_1 \backslash c_1\right) \cdot r_2 \right] \big) \backslash c_2 \right]$\\
								      & $= \simp \left[  (\fuse \; \bmkeps(r_1\backslash c_1) \; \simp(r_2) ) \backslash c_2 \right]$\\
								      & $= \simp \left[  (\fuse \; \bmkeps(r_1\backslash c_1) \; r_2 ) \backslash c_2 \right]$
\end{tabular}
\end{center}
\noindent
From the definition of $D$ we know $r_1 \backslash c_1$ is nullable, and therefore
$\bmkeps(r_1\backslash c_1)$  returns a bitcode, which we shall call
 $\textit{bs}_2$.
The above term can be rewritten as
\begin{center}
$ \simp \left[  \fuse \; \textit{bs}_2\; r_2  \backslash c_2 \right]$,
\end{center}
which is equal to
\begin{center}
$\simp \left[ \fuse \; \textit{bs}_2 \; _{bs}{\oplus rs} \right]$\\
$=\simp \left[  \; _{bs_2++bs}{\oplus rs} \right]$\\
$=  \; _{bs_2++bs}{\oplus \textit{rs}_1} $
\end{center}
\noindent
by using the properties of the set $D$ again
and again. (The reason why we impose so many conditions
on the pair $(r,s)$ is that they allow us to
rewrite the terms easily and thereby construct the difference.)

       
Now we compute ${\bf \simp(\rup \backslash s)}$:
\begin{center}
$\simp \big[(r_1\cdot r_2) \backslash [c_1c_2] \big]= \simp \left[ ((r_1 \cdot r_2 )\backslash c_1) \backslash c_2 \right]$
\end{center}
\noindent
Again, using the properties above, we obtain
the following chain of equalities:
\begin{center}
$\simp(\rup \backslash s)= \simp \left[ ((r_1 \cdot r_2 )\backslash c_1) \backslash c_2 \right]= \simp\left[ \big( \left(r_1 \backslash c_1\right) \cdot r_2  \big) \backslash c_2 \right]$\\
$= \simp \left[ \oplus[\big( \left(r_1 \backslash c_1\right) \backslash c_2 \big) \cdot r_2 \; , \; \fuse \; (\bmkeps \;r_1\backslash c_1) \; r_2 \backslash c_2 ] \right]$.
\end{center}
\noindent
As before, we call the bitcode returned by
$\bmkeps(r_1\backslash c_1)$
$\textit{bs}_2$.
Also, $\simp(r_2 \backslash c_2)$ is
$_{bs}\oplus \textit{rs}_1$,
and $( \left(r_1 \backslash c_1\right) \backslash c_2  \cdot r_2)$
simplifies to $\ZERO$,
so the above term can be expanded as
\begin{center}
\begin{tabular}{l}
$\textit{distinct}(\flatten[\ZERO\;, \; _{\textit{bs}_2++\textit{bs}}\oplus \textit{rs}_1] ) \; \textit{match} $ \\
  $\textit{case} \; [] \Rightarrow  \ZERO$ \\
   $\textit{case} \; a :: [] \Rightarrow  \fuse \; \textit{bs} \; a$ \\
    $\textit{case} \;  as' \Rightarrow  _{[]}\oplus as'$\\
\end{tabular}
\end{center}
\noindent
Applying the definition of $\flatten$, we get
\begin{center}
$_{[]}\oplus (\textit{map} \; (\fuse \; (\textit{bs}_2 ++ \textit{bs})) \; \textit{rs}_1)$
\end{center}
\noindent
compared to the result
\begin{center}
$ \; _{bs_2++bs}{\oplus \textit{rs}_1} $
\end{center}
\noindent
Note how these two regular expressions only
differ in the position of the bits
$\textit{bs}_2++\textit{bs}$; they are otherwise the same.
What caused this difference?
The culprit is the $\flatten$ function, which spills
out the bitcodes of the inner alternatives when
there exists an outer alternative.
Note how the absence of simplification
caused $\simp(\rup \backslash s)$ to
generate the nested alternatives structure
\begin{center}
$  \oplus[\ZERO \;, \; _{bs}\oplus \textit{rs} ]$
\end{center}
and this will always trigger $\flatten$ to
spill out the inner alternative's bitcode $\textit{bs}$,
whereas when
simplification is done along the way,
the structure of nested alternatives is never created (we can
actually prove that the simplification function never allows nested
alternatives to arise; more on this later).
       

What if we do not allow the function $\simp$
to fuse out the bits when it is unnecessary?
For instance, for the above regular expression, we might
just delete the outer layer of the alternative
\begin{center}
\st{$ {\oplus[\ZERO \;,}$} $_{bs}\oplus \textit{rs}$ \st{$]$}
\end{center}
and get $_{bs}\oplus \textit{rs}$ instead, without
fusing the bits $\textit{bs}$ inside to every element
of $\textit{rs}$.
This idea can be realized by making the following
changes to the $\simp$-function:
\begin{center}
  \begin{tabular}{@{}lcl@{}}

  $\textit{simp}' \; (_{\textit{bs}}(a_1 \cdot a_2))$ & $\dn$ & $\textit{as} \; \simp \; \textit{was} \; \textit{before} $ \\

  $\textit{simp}' \; (_{bs}\oplus as)$ & $\dn$ & \st{$\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $} \\
  &&\st{$\quad\textit{case} \; [] \Rightarrow  \ZERO$} \\
   &&\st{$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$} \\
   &&\st{$\quad\textit{case} \;  as' \Rightarrow  \textit{ALTS}\;bs\;as'$}\\
   &&$\textit{if}(\textit{hollowAlternatives}( \textit{map \; simp \; as}))$\\
   &&$\textit{then} \; \fuse  \; \textit{bs}\; \textit{extractAlt}(\textit{map} \; \simp \; \textit{as} )$\\
   &&$\textit{else} \; \simp(_{bs} \oplus \textit{as})$\\


   $\textit{simp}' \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}

\noindent
given the definitions of $\textit{hollowAlternatives}$ and  $\textit{extractAlt}$:
\begin{center}
$\textit{hollowAlternatives}( \textit{rs}) \dn
\exists r = (_{\textit{bs}_1}\oplus \textit{rs}_1)  \in \textit{rs}.\;  \forall r' \in \textit{rs}, \;
\textit{either} \; r' = \ZERO \; \textit{or} \; r' = r $\\
$\textit{extractAlt}( \textit{rs}) \dn \textit{if}\big(
\exists r = (_{\textit{bs}_1}\oplus \textit{rs}_1)  \in \textit{rs}.\;  \forall r' \in \textit{rs}, \;
\textit{either} \; r' = \ZERO \; \textit{or} \; r' = r \big)\; \textit{then} \; \textit{return} \; r$
\end{center}
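These two helper functions are straightforward to render executably. The sketch below (same hypothetical tuple encoding as before) checks for the hollow shape and extracts the single inner alternative:

```python
# Tuple encoding as before: ('ZERO',), ('ONE', bs), ('CHAR', bs, c),
# ('ALTS', bs, rs), ('SEQ', bs, a1, a2), ('STAR', bs, a)

def hollow_alternatives(rs):
    # True iff rs contains an alternative r and every element of rs is
    # either ZERO or that same r -- the shape  +[ZERO, ..., bs+rs1, ..., ZERO]
    alts = [r for r in rs if r[0] == 'ALTS']
    if not alts:
        return False
    r = alts[0]
    return all(x == ('ZERO',) or x == r for x in rs)

def extract_alt(rs):
    # partial function: only meaningful when hollow_alternatives(rs) holds
    return next(r for r in rs if r[0] == 'ALTS')
```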
       
\noindent
Basically, $\textit{hollowAlternatives}$ captures the feature of
a list of regular expressions of the shape
\begin{center}
$  \oplus[\ZERO \;, \; _{bs}\oplus \textit{rs} ]$
\end{center}
and this means we can simply elevate the
inner regular expression $_{bs}\oplus \textit{rs}$
to the outermost level
and throw away the useless $\ZERO$s and
the outer $\oplus$ wrapper.
Using this new definition of $\simp$,
for the example where  $r$ is the regular expression
$(a+b)(a+a^*)$ and $s$  is the string $aa$,
the problem of $\rup\backslash_{simp} \, s \neq \simp(\rup\backslash s)$
is resolved.

Unfortunately this causes new problems.
Consider the counterexample where
\begin{center}
$r$ is the regular expression
$(ab+(a^*+aa))$ and $s$  is the string $aa$.
\end{center}

\noindent
Here $\rup\backslash_{simp} \, s$ is equal to
$ _1(_{011}a^* +  _1\ONE)  $ whereas
$ \simp(\rup\backslash s) = (_{1011}a^* +  _{11}\ONE)$.
This discrepancy did not appear for the old
version of $\simp$.
       
       
We have changed the algorithm to suppress the old
counterexample, but this gives rise to new counterexamples.
This dilemma means this amendment is not a successful
attempt to make $\rup\backslash_{simp} \, s = \simp(\rup\backslash s)$
hold under every possible regular expression and string.
       
\subsection{Properties of the Function $\simp$}

We have proved in Isabelle quite a few properties
of the $\simp$-function, which help the proof to go forward;
we list them here to aid comprehension.

To start, we need a bit of auxiliary notation,
which is quite basic and is only written here
for clarity.

$\textit{sub}(r)$ computes the set of
sub-regular expressions of $r$:
\begin{center}
$\textit{sub}(\ONE) \dn \{\ONE\}$\\
$\textit{sub}(r_1 \cdot r_2) \dn \textit{sub}(r_1) \cup \textit{sub}(r_2) \cup \{r_1 \cdot r_2\}$\\
$\textit{sub}(r_1 + r_2) \dn \textit{sub}(r_1) \cup \textit{sub}(r_2) \cup \{r_1+r_2\}$\\
\end{center}
$\textit{good}(r) \dn r \neq \ZERO \land
\forall r' \in \textit{sub}(r),\; \textit{if} \; r' = \;_{bs_1}\oplus(rs_1), \;
\textit{then} \; \nexists r'' \in \textit{rs}_1 \; s.t.\;
r'' = \;_{bs_2}\oplus \textit{rs}_2 $

The properties are mainly the ones below:
\begin{itemize}
\item
\begin{center}
$\simp(\simp(r)) = \simp(r)$
\end{center}
\item
\begin{center}
$\textit{if} \; r = \simp(r') \; \textit{then} \; \textit{good}(r) $
\end{center}
\end{itemize}
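These two properties can be spot-checked on small examples. The following standalone Python snippet uses our own compact tuple encoding of annotated regular expressions (a sketch, not the Isabelle formalisation); `good` here checks the no-nested-alternatives condition, and the samples are chosen so that $\simp$ does not collapse to $\ZERO$:

```python
def fuse(bs, a):
    return a if a[0] == 'ZERO' else (a[0], tuple(bs) + a[1]) + a[2:]

def flatten(rs):
    # drop ZEROs, open up nested alternatives and fuse their bits inwards
    out = []
    for r in rs:
        if r[0] == 'ZERO':
            continue
        if r[0] == 'ALTS':
            out.extend(fuse(r[1], x) for x in r[2])
        else:
            out.append(r)
    return out

def distinct(rs):
    out = []
    for r in rs:
        if r not in out:
            out.append(r)
    return out

def simp(a):
    if a[0] == 'SEQ':
        bs, s1, s2 = a[1], simp(a[2]), simp(a[3])
        if s1[0] == 'ZERO' or s2[0] == 'ZERO': return ('ZERO',)
        if s1[0] == 'ONE': return fuse(tuple(bs) + s1[1], s2)
        if s2[0] == 'ONE': return fuse(tuple(bs) + s2[1], s1)
        return ('SEQ', tuple(bs), s1, s2)
    if a[0] == 'ALTS':
        rs = distinct(flatten([simp(r) for r in a[2]]))
        if not rs: return ('ZERO',)
        if len(rs) == 1: return fuse(a[1], rs[0])
        return ('ALTS', tuple(a[1]), tuple(rs))
    return a

def good(a):
    # not ZERO, and no alternative directly contains another alternative
    if a[0] == 'ZERO': return False
    if a[0] == 'ALTS':
        return all(r[0] != 'ALTS' and good(r) for r in a[2])
    if a[0] == 'SEQ': return good(a[2]) and good(a[3])
    return True

# sample annotated regexes
a1 = ('ALTS', (), (('ZERO',), ('ALTS', (0,), (('CHAR', (), 'a'), ('ONE', (1,))))))
a2 = ('SEQ', (), ('ONE', (0,)), ('CHAR', (), 'b'))
a3 = ('ALTS', (), (('CHAR', (), 'a'), ('CHAR', (), 'a')))
```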
       
\subsection{The Contains Relation}
$\retrieve$ is too strong a relation in that
it extracts only one bitcode instead of a set of them.
Therefore we try to define another relation (a predicate)
to capture the fact that the regular expression has bits
being moved around but still has all the bits needed.
The contains symbol, written $\gg$, is a relation that
takes two arguments in infix form
and returns a truth value.


In other words, from the set of regular expression and
bitcode pairs
$\textit{RV} = \{(r, \textit{bs}) \mid r \; \text{is a regular expression},\; \textit{bs} \; \text{is a bitcode}\}$,
those that satisfy the following requirements are in the set
$\textit{RV}_{\textit{Contains}}$.
Unlike the $\retrieve$
function, which takes two arguments $r$ and $v$ and
produces a single answer $\textit{bs}$, it takes only
one argument $r$ and returns a set of bitcodes that
can be generated by $r$.
       
\begin{center}
\begin{tabular}{llclll}
& & & $_{bs}\ONE$ & $\gg$ & $\textit{bs}$\\
& & & $_{bs}{\bf c}$ & $\gg$ & $\textit{bs}$\\
$\textit{if} \; r_1 \gg \textit{bs}_1$ & $r_2 \; \gg \textit{bs}_2$
& $\textit{then}$  &
 $_{bs}{r_1 \cdot r_2}$ &
 $\gg$ &
 $\textit{bs} @ \textit{bs}_1 @ \textit{bs}_2$\\

 $\textit{if} \; r \gg \textit{bs}_1$ &  & $\textit{then}$  &
 $_{bs}{\oplus(r :: \textit{rs}})$ &
 $\gg$ &
 $\textit{bs} @ \textit{bs}_1 $\\

 $\textit{if} \; _{bs}(\oplus \textit{rs}) \gg \textit{bs} @ \textit{bs}_1$ &  & $\textit{then}$  &
 $_{bs}{\oplus(r :: \textit{rs}})$ &
 $\gg$ &
 $\textit{bs} @ \textit{bs}_1 $\\

 $\textit{if} \; r \gg \textit{bs}_1\; \textit{and}$ &  $_{bs}r^* \gg \textit{bs} @ \textit{bs}_2$ & $\textit{then}$  &
 $_{bs}r^* $ &
 $\gg$ &
 $\textit{bs} @ [0] @ \textit{bs}_1@ \textit{bs}_2 $\\

 & & & $_{bs}r^*$ & $\gg$ & $\textit{bs} @ [1]$\\
\end{tabular}
\end{center}
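The rules above can be read as a recursive enumeration of bitcodes. Since the star rule generates an infinite set, the following Python sketch (our own tuple encoding, with a `fuel` bound on the number of star unrollings) computes a depth-bounded fragment of the contained set:

```python
# Enumerate a depth-bounded fragment of the set of bitcodes contained in an
# annotated regex, following the contains-relation rules.  Tuple encoding:
# ('ZERO',), ('ONE', bs), ('CHAR', bs, c), ('ALTS', bs, rs),
# ('SEQ', bs, a1, a2), ('STAR', bs, a); bitcodes are tuples of 0/1.

def contains(a, fuel):
    t = a[0]
    if t == 'ZERO':
        return set()
    if t in ('ONE', 'CHAR'):
        return {a[1]}
    if t == 'SEQ':
        return {a[1] + b1 + b2
                for b1 in contains(a[2], fuel)
                for b2 in contains(a[3], fuel)}
    if t == 'ALTS':
        return {a[1] + b for r in a[2] for b in contains(r, fuel)}
    # STAR: [1] ends the iteration; [0] @ bs1 @ tail unrolls once more,
    # up to `fuel` unrollings
    tails = {(1,)}
    for _ in range(fuel):
        tails |= {(0,) + b1 + t2
                  for b1 in contains(a[2], fuel) for t2 in tails}
    return {a[1] + t2 for t2 in tails}

astar = ('STAR', (), ('CHAR', (), 'a'))
```

For $a^*$ with two unrollings this yields $\{[1],\, [0,1],\, [0,0,1]\}$, which are (we believe) the bitcodes a matcher would produce for the matches of $\varepsilon$, $a$ and $aa$.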
       
It is a predicate in the sense that given
a regular expression and a bitcode, it
returns true or false, depending on whether
or not the regular expression can actually produce that
bitcode. If the predicate returns true, then
we say that the regular expression $r$ contains
the bitcode $\textit{bs}$, written
$r \gg \textit{bs}$.
The $\gg$ operator with the
regular expression $r$ may also be seen as a
machine that takes the derivative of the regular expression
with respect to all strings simultaneously, collecting
the bits by going through the regular expression tree
 structure in a depth-first manner, regardless of whether
 the part being traversed is nullable or not.
 It puts all the bits that can be produced on such a traversal
 into a set.
 For example, if we are given the regular expression
$((a+b)(c+d))^*$, the tree structure may be written as
\begin{center}
\begin{tikzpicture}
\tikz[tree layout]\graph[nodes={draw, circle}] {
* ->
    {@-> {
    {+1 ->
        {a , b}
        },
    {+ ->
        {c , d }
        }
        }
    }
};
\end{tikzpicture}
\end{center}
       
\subsection{The $\textit{ders}_2$ Function}
If we want to prove the result
\begin{center}
	$ \textit{blexer}\_\textit{simp}(r, \; s) =  \textit{blexer}(r, \; s)$
\end{center}
inductively
on the structure of the regular expression,
then we need to induct on the case $r_1 \cdot r_2$.
It would be good if we could express $(r_1 \cdot r_2) \backslash s$
in terms of $r_1 \backslash s$ and $r_2 \backslash s$,
and this naturally induces the $\textit{ders}_2$ function,
which does a ``convolution'' on $r_1$ and $r_2$ using the string
$s$.
It is based on the observation that the derivative of $r_1 \cdot r_2$
with respect to a string $s$ can actually be written in an ``explicit form''
composed of derivatives of $r_1$ and derivatives of $r_2$.
This can be illustrated in the following procedure execution:
\begin{center}
	$ (r_1 \cdot r_2) \backslash [c_1c_2] =  (\textit{if} \; \nullable(r_1)\;  \textit{then} \; ((r_1 \backslash c_1) \cdot r_2 + r_2 \backslash c_1) \; \textit{else} \; (r_1\backslash c_1) \cdot r_2) \backslash c_2$
\end{center}
which can also be written in a ``convoluted sum''
format:
\begin{center}
	$ (r_1 \cdot r_2) \backslash [c_1c_2] =  \sum\limits_{s_i @ s_j = [c_1c_2]}{(r_1 \backslash s_i) \cdot (r_2 \backslash s_j)}$
\end{center}
Note that this differs from the previous form in the sense that
it calculates the results $r_1\backslash s_i , r_2 \backslash s_j$ first and then glues them together
through nested alternatives, whereas the $(r_1 \cdot r_2) \backslash s$ procedure,
used by the algorithm $\lexer$, can only produce each component of the
resulting alternatives regular expression altogether, rather than
generating each of the children nodes one by one
in a single recursive call that is only for generating that
very expression itself.
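The if-then-else expansion above can be checked on small instances with a plain (unannotated) derivative matcher. The following Python sketch is our own minimal implementation of $\nullable$, $\backslash$ and $\textit{ders}$, exercised on the running example $r = (a+b)(a+a^*)$:

```python
# Plain regexes as tuples:
# ('ZERO',), ('ONE',), ('CHAR', c), ('ALT', r1, r2), ('SEQ', r1, r2), ('STAR', r)

def nullable(r):
    t = r[0]
    if t in ('ONE', 'STAR'): return True
    if t in ('ZERO', 'CHAR'): return False
    if t == 'ALT': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])                  # SEQ

def der(c, r):
    t = r[0]
    if t in ('ZERO', 'ONE'): return ('ZERO',)
    if t == 'CHAR': return ('ONE',) if r[1] == c else ('ZERO',)
    if t == 'ALT': return ('ALT', der(c, r[1]), der(c, r[2]))
    if t == 'SEQ':
        r1, r2 = r[1], r[2]
        if nullable(r1):                 # the branching step shown in the text
            return ('ALT', ('SEQ', der(c, r1), r2), der(c, r2))
        return ('SEQ', der(c, r1), r2)
    return ('SEQ', der(c, r[1]), r)                           # STAR

def ders(s, r):
    for c in s:
        r = der(c, r)
    return r

def matches(r, s):
    return nullable(ders(s, r))

# r = (a+b)(a + a*)
r = ('SEQ', ('ALT', ('CHAR', 'a'), ('CHAR', 'b')),
            ('ALT', ('CHAR', 'a'), ('STAR', ('CHAR', 'a'))))
```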
       
\section{Conclusion}
Under the exhaustive tests we believe the main
result holds, yet a proof still seems elusive.
We have tried out different approaches, and
found a lot of properties of the function $\simp$.
The counterexamples where $\rup\backslash_{simp} \, s \neq \simp(\rup\backslash s)$
are also valuable in the sense that
we get to know better why they are not equal and what
the subtle differences are between a
nested simplified regular expression and a
regular expression that is simplified at the final moment.
We are almost there, but a last step is needed to make the proof work.
Hopefully in the next few weeks we will be able to find one.
       
  1654 %CONSTRUCTION SITE HERE
       
  1655 that is to say, despite the bits are being moved around on the regular expression
       
  1656 (difference in bits), the structure of the (unannotated)regular expression
       
  1657 after one simplification is exactly the same after the 
       
  1658 same sequence of derivative operations 
       
  1659 regardless of whether we did simplification
       
  1660 along the way.
       
  1661  One way would be to give a function that calls
       
  1662 
       
  1663 fuse is the culprit: it causes the order in which alternatives
       
  1664 are opened up to be remembered and finally the difference
       
  1665 appear in $\simp(\rup \backslash s)$ and $\rup \backslash{simp} \,s$.
       
  1666 but we have to use them as they are essential in the simplification:
       
  1667 flatten needs them.
       
  1668 
       
  1669 
       
  1670 
       
  1671 However, without erase the above equality does not hold:
       
  1672 for the regular expression  
       
  1673 $(a+b)(a+a*)$,
       
  1674 if we do derivative with respect to string $aa$,
       
  1675 we get 
       
  1676 
       
  1677 \subsection{Another Proof Strategy}
       
  1678 sdddddr does not equal sdsdsdsr sometimes.\\
       
  1679 For example,
       
  1680 
       
  1681 This equicalence class method might still have the potential of proving this,
       
  1682 but not yet
       
  1683 i parallelly tried another method of using retrieve\\
       
  1684 
       
  1685 
       
The \textit{vsimp} function, defined as follows, tries to simplify the value in
lockstep with the regular expression:\\
       
The problem here is that we used \textit{retrieve} for the key induction step:
\[
\textit{decode}\,(\textit{retrieve}\,(r\backslash (s @ [c]))\, v)\; r =
\textit{decode}\,(\textit{retrieve}\,(r\backslash s)\,(\textit{inj}\,(r\backslash s)\, c\, v))\; r
\]
Here, \textit{decode} recovers a value that corresponds to a (possibly partial)
match from bits, and the bits are extracted by \textit{retrieve};
the key value $v$ that guides \textit{retrieve} is
$\textit{mkeps}\,(r\backslash s)$, $\textit{inj}\,r\,c\,(\textit{mkeps}\,(r\backslash s))$,
$\textit{inj}\,(\textit{inj}\,(v))$, \ldots
       
The problem is that we need \textit{vsimp} to make a value that is suitable
for decoding:
\[
\textit{Some}\,(\textit{flex}\;r\;\textit{id}\;(s@[c])\;v) =
\textit{Some}\,(\textit{flex}\;r\;\textit{id}\;s\;(\textit{inj}\,(r\backslash s)\,c\,v))
\]
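Recall (modulo notation) the definition of $\textit{flex}$ from the POSIX lexing
work of Ausaf, Dyckhoff and Urban, which folds $\textit{inj}$ over the string:
\begin{center}
$\begin{array}{lcl}
\textit{flex}\;r\;f\;[]\;v & \dn & f\,v\\
\textit{flex}\;r\;f\;(c::s)\;v & \dn & \textit{flex}\;(r\backslash c)\;(f\circ(\textit{inj}\;r\;c))\;s\;v
\end{array}$
\end{center}
so that, whenever $r\backslash s$ is nullable, $\lexer\;r\;s$ can equivalently be
stated as $\textit{flex}\;r\;\textit{id}\;s\;(\textit{mkeps}\,(r\backslash s))$.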
       
Another way, which Christian came up with and which might circumvent the
problem of finding a suitable value, is to not state the \textit{vsimp}
function explicitly, but instead to include all possible values in a set that a
regular expression is able to produce, and to prove that both $r$ and $sr$ are
able to produce the bits that correspond to the POSIX value obtained by feeding
the same initial regular expression $r$ and string $s$ to the two functions
$\textit{ders}$ and $\textit{ders\_simp}$.
       
Namely, if $\textit{bmkeps}(r_1) = \textit{bmkeps}(r_2)$, then we can regard
$r_1$ and $r_2$ as equivalent for the purpose of lexing.
If we define the equivalence relation $\sim_{m\epsilon}$ between two regular
expressions $r_1$ and $r_2$ as follows:
$r_1 \sim_{m\epsilon} r_2  \iff \textit{bmkeps}(r_1) = \textit{bmkeps}(r_2)$
(in other words, $r_1$ and $r_2$ produce the same output under the function
$\textit{bmkeps}$), then the first goal might be restated as
$(r^\uparrow)\backslash_{simp}\, s  \sim_{m\epsilon} (r^\uparrow)\backslash s$.
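The restated goal can at least be probed on concrete cases. The following Python
sketch is illustrative only and not the Isabelle formalisation: the bit encoding
and the particular simplification rules (opening up nested alternatives, removing
zeros and duplicates) are assumptions in the style of Sulzmann and Lu's bitcoded
derivatives.

```python
# Illustrative sketch only, NOT the Isabelle formalisation: bitcoded
# regular expressions in the style of Sulzmann and Lu, as tuples
#   ('AZERO',), ('AONE', bs), ('ACHAR', bs, c),
#   ('AALT', bs, r1, r2), ('ASEQ', bs, r1, r2), ('ASTAR', bs, r)
# where bs is a list of bits (0 for left, 1 for right/stop).
Z, S = 0, 1

def fuse(bs, r):
    """Prepend the bits bs onto the top-level bits of r."""
    return r if r[0] == 'AZERO' else (r[0], bs + list(r[1])) + r[2:]

def internalise(r):
    """Annotate a plain regular expression (r^, written \\rup in the text)."""
    tag = r[0]
    if tag == 'ZERO': return ('AZERO',)
    if tag == 'ONE':  return ('AONE', [])
    if tag == 'CHAR': return ('ACHAR', [], r[1])
    if tag == 'ALT':  return ('AALT', [], fuse([Z], internalise(r[1])),
                                          fuse([S], internalise(r[2])))
    if tag == 'SEQ':  return ('ASEQ', [], internalise(r[1]), internalise(r[2]))
    return ('ASTAR', [], internalise(r[1]))            # STAR

def erase(r):
    """Drop all bits, giving back the plain regular expression."""
    tag = r[0]
    if tag == 'AZERO': return ('ZERO',)
    if tag == 'AONE':  return ('ONE',)
    if tag == 'ACHAR': return ('CHAR', r[2])
    if tag == 'ASTAR': return ('STAR', erase(r[2]))
    return (tag[1:], erase(r[2]), erase(r[3]))         # AALT, ASEQ

def bnullable(r):
    tag = r[0]
    if tag in ('AONE', 'ASTAR'): return True
    if tag == 'AALT': return bnullable(r[2]) or bnullable(r[3])
    if tag == 'ASEQ': return bnullable(r[2]) and bnullable(r[3])
    return False

def bmkeps(r):
    """Collect the bits of the POSIX empty-string value."""
    tag = r[0]
    if tag == 'AONE':  return list(r[1])
    if tag == 'ASTAR': return list(r[1]) + [S]
    if tag == 'ASEQ':  return list(r[1]) + bmkeps(r[2]) + bmkeps(r[3])
    return list(r[1]) + (bmkeps(r[2]) if bnullable(r[2]) else bmkeps(r[3]))

def bder(c, r):
    tag = r[0]
    if tag == 'ACHAR':
        return ('AONE', list(r[1])) if r[2] == c else ('AZERO',)
    if tag == 'AALT':
        return ('AALT', r[1], bder(c, r[2]), bder(c, r[3]))
    if tag == 'ASEQ':
        if bnullable(r[2]):
            return ('AALT', r[1], ('ASEQ', [], bder(c, r[2]), r[3]),
                    fuse(bmkeps(r[2]), bder(c, r[3])))
        return ('ASEQ', r[1], bder(c, r[2]), r[3])
    if tag == 'ASTAR':
        return ('ASEQ', r[1], fuse([Z], bder(c, r[2])), ('ASTAR', [], r[2]))
    return ('AZERO',)                                  # AZERO, AONE

def flatten(rs):
    """Open up nested alternatives, fusing their bits; drop AZEROs."""
    out = []
    for r in rs:
        if r[0] == 'AALT':
            out += flatten([fuse(r[1], r[2]), fuse(r[1], r[3])])
        elif r[0] != 'AZERO':
            out.append(r)
    return out

def bsimp(r):
    tag = r[0]
    if tag == 'ASEQ':
        r1, r2 = bsimp(r[2]), bsimp(r[3])
        if r1[0] == 'AZERO' or r2[0] == 'AZERO': return ('AZERO',)
        if r1[0] == 'AONE': return fuse(list(r[1]) + r1[1], r2)
        return ('ASEQ', r[1], r1, r2)
    if tag == 'AALT':
        seen, out = set(), []
        for x in flatten([bsimp(r[2]), bsimp(r[3])]):
            if erase(x) not in seen:                   # dedupe by erased shape
                seen.add(erase(x)); out.append(x)
        if not out: return ('AZERO',)
        acc = out[-1]
        for x in reversed(out[:-1]): acc = ('AALT', [], x, acc)
        return fuse(r[1], acc)
    return r

def bders(s, r):                                       # no simplification
    for c in s: r = bder(c, r)
    return r

def bders_simp(s, r):                                  # simp interleaved
    for c in s: r = bsimp(bder(c, r))
    return r

# Example: r = (a+b)(a+a*), derived by the string "aa"
a, b = ('CHAR', 'a'), ('CHAR', 'b')
r = ('SEQ', ('ALT', a, b), ('ALT', a, ('STAR', a)))
rup = internalise(r)
print(bmkeps(bders("aa", rup)), bmkeps(bders_simp("aa", rup)))  # [0, 0] [0, 0]
```

On this example the two bitsequences agree even though the annotated regular
expressions themselves differ, which is exactly the relation
$\sim_{m\epsilon}$ between the simplified and unsimplified derivatives.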
       
I tried to establish such an equivalence relation between regular expressions
like $dddr$ and $dddsr$, and so on, but right now I am only able to establish it
between $dsr$ and $dr$, using structural induction on $r$.
Those involving multiple derivative operations are harder to prove.
       
Two attempts have been made:\\
(1) Induction on the number of derivative operations (or, in other words, on the
length of the string $s$). The inductive hypothesis was initially specified as\\
``for an arbitrary regular expression $r$ and all strings $s$ in the language of
$r$ whose length does not exceed $n$, we have
$\textit{ders}\;s\;r \sim_{m\epsilon} \textit{ders\_simp}\;s\;r$'',\\
and the proof goal may be stated as\\
``for an arbitrary regular expression $r$ and all strings $s$ in the language of
$r$ whose length does not exceed $n+1$, we have
$\textit{ders}\;s\;r \sim_{m\epsilon} \textit{ders\_simp}\;s\;r$''.\\
The problem here is that although we can easily break a string $s$ of length
$n+1$ down into $s_1 @ [c]$, it is not that easy to use the induction hypothesis
as a stepping stone, because $s_1$ may well not be in the language $L(r)$.
This inhibits us from obtaining the fact that
$\textit{ders}\;s_1\;r \sim_{m\epsilon} \textit{ders\_simp}\;s_1\;r$.
Further exploration is needed to amend this hypothesis so that it also covers
the situation when $s_1$ is not in $L(r)$ (i.e.\ when $r\backslash s_1$ is not
nullable).
       
For example, what information (bits? values?) can be extracted from the regular
expression $\textit{ders}\;s_1\;r$ so that we can compute or predict the possible
result of $\textit{bmkeps}$ after another derivative operation? What function $f$
can be used to carry out this task? The possible direction of exploration can be
perceived more directly through the scheme below:
       
find a function $f$ such that\\
$f(\textit{bders}\;s_1\;r) = \textit{re}_1$\\
$f(\textit{bderss}\;s_1\;r) = \textit{re}_2$\\
$\textit{bmkeps}(\textit{bders}\;s\;r) = g(\textit{re}_1, c)$\\
$\textit{bmkeps}(\textit{bderssimp}\;s\;r) = g(\textit{re}_2, c)$\\
and $g(\textit{re}_1, c) = g(\textit{re}_2, c)$.\\
The inductive hypothesis would be\\
``for all strings $s_1$ of length $\leq n$,
$f(\textit{bders}\;s_1\;r) = \textit{re}_1$ and
$f(\textit{bderss}\;s_1\;r) = \textit{re}_2$''.\\
Proving this would be a lemma for the main proof, which would be\\
``$\textit{bmkeps}(\textit{bders}\;s\;r) = g(\textit{re}_1, c)$ and
$\textit{bmkeps}(\textit{bderssimp}\;s\;r) = g(\textit{re}_2, c)$
for $s = s_1 @ [c]$''.\\
For the lemma to be provable, $f$ needs to have a recursive property: it needs to
store not only the ``nullable after one character'' information, but also the
``nullable after two characters'' information, and so on, so that it is able to
predict what $f$ will compute after a derivative operation; in other words, it
needs to be ``infinitely recursive''.\\
       
To prove the lemma, in other words, to get\\
``for all strings $s_1$ of length $\leq n+1$,
$f(\textit{bders}\;s_1\;r) = \textit{re}_3$ and
$f(\textit{bderss}\;s_1\;r) = \textit{re}_4$''\\
from\\
``for all strings $s_1$ of length $\leq n$,
$f(\textit{bders}\;s_1\;r) = \textit{re}_1$ and
$f(\textit{bderss}\;s_1\;r) = \textit{re}_2$'',\\
it might be best to construct an auxiliary function $h$ such that\\
$h(\textit{re}_1, c) = \textit{re}_3$\\
$h(\textit{re}_2, c) = \textit{re}_4$\\
where $\textit{re}_3 = f(\textit{bder}\;c\;(\textit{bders}\;s_1\;r))$\\
and $\textit{re}_4 = f(\textit{simp}(\textit{bder}\;c\;(\textit{bderss}\;s_1\;r)))$.
       
The key point here is that we are not only interested in what
$\textit{bders}\;s\;r$ produces under $\textit{bmkeps}$, but also in how it
behaves after one derivative operation and then $\textit{bmkeps}$, after two
derivative operations, and so on. In essence, we are preserving the regular
expression itself under the function $f$, though in a less compact way than the
regular expression itself: we are not just recording but also interpreting what
the regular expression matches. In other words, we need to prove properties of
$\textit{bderss}\;s\;r$ beyond the $\textit{bmkeps}$ result, i.e.\ not just for
the nullable ones, but also for those still containing remaining characters.\\
       
(2) We observed the fact that\\
$\erase\,(sdddddr) = \erase\,(sdsdsdsr)$;\\
that is to say, despite the bits being moved around on the regular expression
(a difference in bits), the structure of the (unannotated) regular expression
after one final simplification is exactly the same after the same sequence of
derivative operations, regardless of whether we did simplification along the way.
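This erased-level observation can be reproduced with an even smaller sketch on
plain (unannotated) regular expressions. Again this is illustrative only: the
concrete simplification rules below (dropping zeros and duplicates in
alternatives, and a leading one in sequences) are assumptions about what
$\simp$ does.

```python
# Plain (unannotated) regular expressions as tuples; illustrative only.
def nullable(r):
    tag = r[0]
    if tag in ('ONE', 'STAR'): return True
    if tag == 'ALT': return nullable(r[1]) or nullable(r[2])
    if tag == 'SEQ': return nullable(r[1]) and nullable(r[2])
    return False                                   # ZERO, CHAR

def der(c, r):
    tag = r[0]
    if tag == 'CHAR': return ('ONE',) if r[1] == c else ('ZERO',)
    if tag == 'ALT':  return ('ALT', der(c, r[1]), der(c, r[2]))
    if tag == 'SEQ':
        d = ('SEQ', der(c, r[1]), r[2])
        return ('ALT', d, der(c, r[2])) if nullable(r[1]) else d
    if tag == 'STAR': return ('SEQ', der(c, r[1]), r)
    return ('ZERO',)                               # ZERO, ONE

def alts(r):
    """Flatten nested alternatives into a list."""
    return alts(r[1]) + alts(r[2]) if r[0] == 'ALT' else [r]

def simp(r):
    """Assumed rules: drop ZEROs/duplicates in +, a leading ONE in a sequence."""
    tag = r[0]
    if tag == 'SEQ':
        r1, r2 = simp(r[1]), simp(r[2])
        if ('ZERO',) in (r1, r2): return ('ZERO',)
        if r1 == ('ONE',): return r2
        return ('SEQ', r1, r2)
    if tag == 'ALT':
        out = []
        for x in alts(('ALT', simp(r[1]), simp(r[2]))):
            if x != ('ZERO',) and x not in out: out.append(x)
        if not out: return ('ZERO',)
        acc = out[-1]
        for x in reversed(out[:-1]): acc = ('ALT', x, acc)
        return acc
    return r

def ders(s, r):
    for c in s: r = der(c, r)
    return r

def ders_simp(s, r):
    for c in s: r = simp(der(c, r))
    return r

# (a+b)(a+a*) derived by "aa": simplifying once at the end and simplifying
# along the way yield the same unannotated regular expression.
a, b = ('CHAR', 'a'), ('CHAR', 'b')
r = ('SEQ', ('ALT', a, b), ('ALT', a, ('STAR', a)))
print(simp(ders("aa", r)) == ders_simp("aa", r))   # True
```

Here both routes arrive at the erased form $\ONE + a^*$, while the intermediate
unsimplified derivative is a much larger alternative.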
       
However, without $\erase$ the above equality does not hold:
for the regular expression
$(a+b)(a+a^*)$,
if we take the derivative with respect to the string $aa$,
we get
%TODO
       
       
\noindent\rule[0.5ex]{\linewidth}{1pt}
       
       
\bibliographystyle{plain}
\bibliography{root}

\end{document}