lexing: ChengsongTanPhdThesis/Chapters/Cubic.tex@cc54ce075db5


% Chapter Template

\chapter{A Better Bound and Other Extensions} % Main chapter title

\label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
%in Chapter 4 to a polynomial one, and demonstrate how one can extend the
%algorithm to include constructs such as bounded repetitions and negations.
 
%----------------------------------------------------------------------------------------
%	SECTION strongsimp
%----------------------------------------------------------------------------------------
\section{A Stronger Version of Simplification}
%TODO: search for isabelle proofs of algorithms that check equivalence 
Two alternative (distinct) sub-matches for the (sub-)string and (sub-)regex pair
are always expressed in the "derivative regular expression" as two
disjoint alternative terms at the current (sub-)regex level. The fact that they
are parallel tells us when we retrieve the information from this (sub-)regex 
there will always be a choice of which alternative term to take.

Here is an example for this.

Assume we have the derivative regex 
\[(a^* + (aa)^* + (aaa)^*)\cdot(a^* + (aa)^* + (aaa)^*)^* + 
(a^* + a\cdot(aa)^* + (aa)\cdot (aaa)^*)\cdot(a^* + (aa)^* + (aaa)^*)^*
\]

occurring in an intermediate step in successive
derivatives of an input string $\underbrace{aa\ldots a}_{\text{n \; a's}}$.
In this expression, there will be 6 "parallel" terms if we break up regexes 
of shape $(a+b)\cdot c$ using the distributivity law (into $ac + bc$).
\begin{align}
(\nonumber \\
a^* + &    \label{term:1} \\
(aa)^* + &  \label{term:2} \\
(aaa)^* &  \label{term:3} \\
&	)\cdot(a^* + (aa)^* + (aaa)^*)^* \; + \; (\nonumber \\
a^* + & \label{term:4} \\
a \cdot (aa)^* + & \label{term:5} \\
aa \cdot (aaa)^* \label{term:6}\\
&      )\cdot(a^* + (aa)^* + (aaa)^*)^*\nonumber
\end{align}


They have been labelled, and each label denotes  
one term, for example, \ref{term:1} denotes
\[
a^*\cdot(a^* + (aa)^* + (aaa)^*)^* 
\]
\noindent
and \ref{term:3} denotes
\[
(aaa)^*\cdot(a^* + (aa)^* + (aaa)^*)^*.
\]
These terms can be interpreted in terms of 
their current input position's most recent 
partial match.
Term \ref{term:1}, \ref{term:2}, and \ref{term:3}
denote the partial-match ending at the current position:
\mybox{previous input $\ldots$}\mybox{$aaa$ }\mybox{rest of input $\ldots$}



%----------------------------------------------------------------------------------------
%	SECTION 1
%----------------------------------------------------------------------------------------

\section{Adding Support for the Negation Construct, and its Correctness Proof}
We now add support for the negation regular expression:
\[			r ::=   \ZERO \mid  \ONE
			 \mid  c  
			 \mid  r_1 \cdot r_2
			 \mid  r_1 + r_2   
			 \mid r^*    
			 \mid \sim r
\]
The $\textit{nullable}$ function's clause for it would be 
\[
\textit{nullable}(~r) = \neg \nullable(r)
\]
The derivative would be
\[
~r \backslash c = ~ (r \backslash c)
\]
 
The most tricky part of lexing for the $~r$ regex
 is creating a value for it.
 For other regular expressions, the value aligns with the 
 structure of the regex:
 \[
 \vdash \Seq(\Char(a), \Char(b)) : a \cdot b
 \]
But for the $~r$ regex, $s$ is a member of it if and only if
$s$ does not belong to $L(r)$. 
That means when there
is a match for the not regex, it is not possible to generate how the string $s$ matched
with $r$. 
What we can do is preserve the information of how $s$ was not matched by $r$,
and there are a number of options to do this.

We could give a partial value when there is a partial match for the regex inside
the $\mathbf{not}$ construct.
For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$,
A value for it could be 
 \[
 \vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c)
 \]
 The above example demonstrates what value to construct 
 when the string $s$ is at most a real prefix
 of the strings in $L(r)$. When $s$ instead is not a prefix of any strings
 in $L(r)$, it becomes unclear what to return as a value inside the $\textit{Not}$
 constructor.
 
 Another option would be to either store the string $s$ that resulted in 
 a mis-match for $r$ or a dummy value as a placeholder:
 \[
 \vdash \textit{Not}(abcd) : ~(a^*)
 \]
or
 \[
 \vdash \textit{Not}(\textit{Dummy}) : ~(a^*)
 \] 
 We choose to implement this as it is most straightforward:
 \[
 \mkeps(~(r))  = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy})
 \]
 
%----------------------------------------------------------------------------------------
%	SECTION 2
%----------------------------------------------------------------------------------------

\section{Bounded Repetitions}
author	Chengsong
	Fri, 03 Jun 2022 16:45:30 +0100
changeset 532	cc54ce075db5
child 533	6acbc939af6a
permissions	-rwxr-xr-x