lexing: ChengsongTanPhdThesis/Chapters/Cubic.tex@6acbc939af6a


% Chapter Template

\chapter{A Better Bound and Other Extensions} % Main chapter title

\label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
%in Chapter 4 to a polynomial one, and demonstrate how one can extend the
%algorithm to include constructs such as bounded repetitions and negations.
 
%----------------------------------------------------------------------------------------
%	SECTION strongsimp
%----------------------------------------------------------------------------------------
\section{A Stronger Version of Simplification}
%TODO: search for isabelle proofs of algorithms that check equivalence 
In our bit-coded lexing algorithm, with or without simplification, 
two alternative (distinct) sub-matches for the (sub-)string and (sub-)regex pair
are always expressed in the "derivative regular expression" as two
disjoint alternative terms at the current (sub-)regex level. The fact that they
are parallel tells us when we retrieve the information from this (sub-)regex 
there will always be a choice of which alternative term to take.
As an example, the regular expression $aa \cdot a^*+ a \cdot a^*$ (omitting bit-codes)
expresses two possibilities it will match future input.
It will either match 2 $a$'s, then 0 or more $a$'s, in other words, at least 2 more $a$'s
\begin{figure}\label{string_3parts1}
\begin{center}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=3]
        {Consumed Input
        \nodepart{two} Expects $aa$
         \nodepart{three} Then expects $a^*$};

\end{tikzpicture}
\end{center}
\end{figure}
\noindent
Or it will match at least 1 more $a$:
\begin{figure}\label{string_3parts2}
\begin{center}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=3]
        {Consumed
        \nodepart{two} Expects $a$
         \nodepart{three} Then expects $a^*$};

\end{tikzpicture}
\end{center}
\end{figure}
If these two possibilities are identical, we can eliminate
the second one as we know the second one corresponds to 
a match that is not $\POSIX$.


If two identical regexes 
happen to be grouped into different sequences, one cannot use
the $a + a \rightsquigarrow a$ rule to eliminate them, even if they 
are "parallel" to each other:
\[
(a+b) \cdot r_1 + (a+c) \cdot r_2 \rightsquigarrow (a+b) \cdot r_1 + (c) \cdot r_2
\]
and that's because they are followed by possibly 
different "suffixes" $r_1$ and $r_2$, and if 
they do turn out to be different then doing 
\[
(a+b) \cdot r_1 + (a+c) \cdot r_2 \rightsquigarrow (a+b) \cdot r_1 + (c) \cdot r_2
\]
might cause a possible match where the string is in $L(a \cdot r_2)$ to be lost.

Here is an example for this.
Assume we have the derivative regex 
\[( r_1  +  r_2  +  r_3)\cdot r+ 
( r_1  +  r_5  +  r_6)\cdot( r_1  +  r_2  +  r_3)^*
\]

occurring in an intermediate step in our bit-coded lexing algorithm.

In this expression, there will be 6 "parallel" terms if we break up regexes 
of shape $(a+b)\cdot c$ using the distributivity law (into $ac + bc$).
\begin{align}
(\nonumber \\
 r_1  + &    \label{term:1} \\
 r_2  + &  \label{term:2} \\
 r_3 &  \label{term:3} \\
&	)\cdot r\; + \; (\nonumber \\
 r_1  + & \label{term:4} \\
 r_5  + & \label{term:5} \\
 r_6 \label{term:6}\\
&      )\cdot r\nonumber
\end{align}

They have been labelled, and each label denotes  
one term, for example, \ref{term:1} denotes
\[
 r_1 \cdot r
\]
\noindent
and \ref{term:3} denotes
\[
 r_3\cdot r.
\]
From a lexer's point of view, \ref{term:4} will never 
be picked up in later phases of matching because there
is a term \ref{term:1} giving identical matching information.
The first term \ref{term:1} will match a string in $L(r_1 \cdot  r)$, 
and so on for term \ref{term:2}, \ref{term:3}, etc.

\mybox{previous input $\ldots$}\mybox{$aaa$ }\mybox{rest of input $\ldots$}
\begin{center}\label{string_2parts}

\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
        {$r_{x1}$
         \nodepart{two} $r_1\cdot r$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}

\end{center}
For term 1 \ref{term:1}, whatever was before the current
input position was fully matched and the regex corresponding
to it reduced to $\ONE$, 
and in the next input token, it will start on $r_1\cdot r$.
The resulting value will be something of a similar configuration:
\begin{center}\label{value_2parts}
%\caption{term 1 \ref{term:1}'s matching configuration}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
        {$v_{x1}$
         \nodepart{two} $v_{r_1\cdot r}$ };
\end{tikzpicture}
\end{center}
For term 2 \ref{term:2} we have a similar value partition:
\begin{center}\label{value_2parts2}
%\caption{term 1 \ref{term:1}'s matching configuration}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
        {$v_{x2}$
         \nodepart{two} $v_{r_2\cdot r}$ };
\end{tikzpicture}
\end{center}
\noindent
and so on.
We note that for term 4 \ref{term:4} its result value 
after any position beyond  the current one will always
be the same:
\begin{center}\label{value_2parts4}
%\caption{term 1 \ref{term:1}'s matching configuration}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
        {$v_{x4}$
         \nodepart{two} $v_{r_1\cdot r}$ };
\end{tikzpicture}
\end{center}
And $\POSIX$ rules says that we can eliminate at least one of them.
Our algorithm always puts the regex with the longest initial sub-match at the left of the 
alternatives, so we safely throw away \ref{term:4}.
The fact that term 1 and 4 are not immediately in the same alternative
expression does not prevent them from being two duplicate matches at
the current point of view.
To implement this idea into an algorithm, we define a "pruning function"
$\textit{prune}$, that takes away parts of all terms in $r$ that belongs to
$\textit{acc}$, which acts as an element
is a stronger version of $\textit{distinctBy}$.
Here is a concise version of it (in the style of Scala):
\begin{verbatim}
def distinctByPlus(rs: List[ARexp], acc: Set[Rexp] = Set()) : 
List[ARexp] =  rs match {
  case Nil => Nil
  case r :: rs =>
    if(acc.contains(erase(r)))
      distinctByPlus(rs, acc)
    else 
      prune(r, acc) :: distinctByPlus(rs, prune(r, acc) +: acc)
}

\end{verbatim}
But for the function $\textit{prune}$ there is a difficulty:
how could one define the $L$ function in a "computable" way,
so that they generate a (lazy infinite) set of strings in $L(r)$.
We also need a function that tests whether $L(r_1) \subseteq L(r_2)$
is true.
For the moment we cut corners and do not define these two functions
as they are not used by Antimirov (and will probably not contribute
to a bound better than Antimirov's cubic bound).
Rather, we do not try to eliminate in every instance of regular expressions
that have "language duplicates", but only eliminate the "exact duplicates".
For this we need to break up terms $(a+b)\cdot c$ into two
terms $a\cdot c$ and $b\cdot c$ before we add them to the accumulator:
\begin{verbatim}
def distinctWith(rs: List[ARexp], 
                pruneFunction: (ARexp, Set[Rexp]) => ARexp, 
                acc: Set[Rexp] = Set()) : List[ARexp] = 
  rs match{
    case Nil => Nil
    case r :: rs => 
      if(acc(erase(r)))
        distinctWith(rs, pruneFunction, acc)
      else {
        val pruned_r = pruneFunction(r, acc)
        pruned_r :: 
        distinctWith(rs, 
          pruneFunction, 
          turnIntoTerms(erase(pruned_r)) ++: acc
        )
      }
  }
\end{verbatim}
\noindent
This process is done by the function $\textit{turnIntoTerms}$:
\begin{verbatim}
def turnIntoTerms(r: Rexp): List[Rexp] = r match {
  case SEQ(r1, r2)  => if(isOne(r1)) 
                          turnIntoTerms(r2) 
                       else 
                          turnIntoTerms(r1).map(r11 => SEQ(r11, r2))
  case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2)
  case ZERO => Nil
  case _ => r :: Nil
}
\end{verbatim}
\noindent
The pruning function can be defined recursively:
\begin{verbatim}
def prune7(r: ARexp, acc: Set[Rexp]) : ARexp = r match{
  case AALTS(bs, rs) => rs.map(r => prune7(r, acc)).filter(_ != ZERO) match
  {
    case Nil => AZERO
    case r::Nil => fuse(bs, r)
    case rs1 => AALTS(bs, rs1)
  }
  case ASEQ(bs, r1, r2) => prune7(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match {
    case AZERO => AZERO
    case r1p if(isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2)
    case r1p => ASEQ(bs, r1p, r2)
  }
  case r => if(acc(erase(r))) AZERO else r
}
\end{verbatim}

\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={size},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=30,
    ymax=800,
    ytick={0,200,...,800},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={$bsimpStrong$ size growth},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={size},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=30,
    ymax=42000,
    ytick={0,10000,...,40000},
    scaled ticks=true,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={$bsimp$ size growth},  
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{c}{Graphs: Runtime for matching $((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings 
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}    
\caption{aaaaaStarStar} \label{fig:aaaaaStarStar}
\end{figure}



%----------------------------------------------------------------------------------------
%	SECTION 1
%----------------------------------------------------------------------------------------

\section{Adding Support for the Negation Construct, and its Correctness Proof}
We now add support for the negation regular expression:
\[			r ::=   \ZERO \mid  \ONE
			 \mid  c  
			 \mid  r_1 \cdot r_2
			 \mid  r_1 + r_2   
			 \mid r^*    
			 \mid \sim r
\]
The $\textit{nullable}$ function's clause for it would be 
\[
\textit{nullable}(~r) = \neg \nullable(r)
\]
The derivative would be
\[
~r \backslash c = ~ (r \backslash c)
\]
 
The most tricky part of lexing for the $~r$ regex
 is creating a value for it.
 For other regular expressions, the value aligns with the 
 structure of the regex:
 \[
 \vdash \Seq(\Char(a), \Char(b)) : a \cdot b
 \]
But for the $~r$ regex, $s$ is a member of it if and only if
$s$ does not belong to $L(r)$. 
That means when there
is a match for the not regex, it is not possible to generate how the string $s$ matched
with $r$. 
What we can do is preserve the information of how $s$ was not matched by $r$,
and there are a number of options to do this.

We could give a partial value when there is a partial match for the regex inside
the $\mathbf{not}$ construct.
For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$,
A value for it could be 
 \[
 \vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c)
 \]
 The above example demonstrates what value to construct 
 when the string $s$ is at most a real prefix
 of the strings in $L(r)$. When $s$ instead is not a prefix of any strings
 in $L(r)$, it becomes unclear what to return as a value inside the $\textit{Not}$
 constructor.
 
 Another option would be to either store the string $s$ that resulted in 
 a mis-match for $r$ or a dummy value as a placeholder:
 \[
 \vdash \textit{Not}(abcd) : ~( r_1 )
 \]
or
 \[
 \vdash \textit{Not}(\textit{Dummy}) : ~( r_1 )
 \] 
 We choose to implement this as it is most straightforward:
 \[
 \mkeps(~(r))  = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy})
 \]
 
%----------------------------------------------------------------------------------------
%	SECTION 2
%----------------------------------------------------------------------------------------

\section{Bounded Repetitions}
author	Chengsong
	Mon, 06 Jun 2022 03:05:31 +0100
changeset 533	6acbc939af6a
parent 532	cc54ce075db5
child 535	ce91c29d2885
permissions	-rwxr-xr-x