ChengsongTanPhdThesis/Chapters/Cubic.tex
author Chengsong
Wed, 31 Aug 2022 23:57:42 +0100
changeset 590 988e92a70704
parent 538 8016a2480704
child 591 b2d0de6aee18
permissions -rwxr-xr-x
more chap5 and chap6 bsimp_idem
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     1
% Chapter Template
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     2
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     3
\chapter{A Better Bound and Other Extensions} % Main chapter title
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     4
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     5
\label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     6
%in Chapter 4 to a polynomial one, and demonstrate how one can extend the
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     7
%algorithm to include constructs such as bounded repetitions and negations.
590
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
     8
\lstset{style=myScalastyle}
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
     9
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    10
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    11
This chapter is a ``miscellaneous''
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    12
chapter which records various
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    13
extensions to our $\blexersimp$'s formalisations.\\
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    14
Firstly we present further improvements
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    15
made to our lexer algorithm $\blexersimp$.
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    16
We devise a stronger simplification algorithm,
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    17
called $\bsimpStrong$, which can prune away
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    18
similar components in two regular expressions at the same 
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    19
alternative level,
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    20
even if these regular expressions are not exactly the same.
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    21
We call the lexer that uses this stronger simplification function
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    22
$\blexerStrong$.
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    23
We conjecture that both
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    24
\begin{center}
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    25
	$\blexerStrong \;r \; s = \blexer\; r\;s$
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    26
\end{center}
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    27
and
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    28
\begin{center}
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    29
	$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    30
\end{center}
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    31
hold, but formalising
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    32
them is still work in progress.
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    33
We give reasons why the correctness and cubic size bound proofs
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    34
can be achieved,
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    35
by exploring the connection between the internal
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    36
data structure of our $\blexerStrong$ and
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    37
Animirov's partial derivatives.\\
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    38
Secondly, we extend our $\blexersimp$
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    39
to support bounded repetitions ($r^{\{n\}}$).
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    40
We update our formalisation of 
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    41
the correctness and finiteness properties to
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    42
include this new construct. With bounded repetitions
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    43
we are able to out-compete other verified lexers such as
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    44
Verbatim++ on regular expressions which involve a lot of
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    45
repetitions. We also present the idempotency property proof
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    46
of $\bsimp$, which leverages the idempotency proof of $\rsimp$.
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    47
This reinforces our claim that the fixpoint construction
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    48
originally required by Sulzmann and Lu can be removed in $\blexersimp$.
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    49
\\
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    50
Last but not least, we present our efforts and challenges we met
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    51
in further improving the algorithm by data
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    52
structures such as zippers.
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    53
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    54
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    55
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    56
%----------------------------------------------------------------------------------------
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    57
%	SECTION strongsimp
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    58
%----------------------------------------------------------------------------------------
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    59
\section{A Stronger Version of Simplification}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    60
%TODO: search for isabelle proofs of algorithms that check equivalence 
590
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    61
In our bitcoded lexing algorithm, (sub)terms represent (sub)matches.
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    62
For example, the regular expression $aa \cdot a^*+ a \cdot a^*$ contains two terms, 
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
    63
which expresses two possibilities it will match future input.
533
Chengsong
parents: 532
diff changeset
    64
It will either match 2 $a$'s, then 0 or more $a$'s, in other words, at least 2 more $a$'s
Chengsong
parents: 532
diff changeset
    65
\begin{figure}\label{string_3parts1}
Chengsong
parents: 532
diff changeset
    66
\begin{center}
Chengsong
parents: 532
diff changeset
    67
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
Chengsong
parents: 532
diff changeset
    68
    \node [rectangle split, rectangle split horizontal, rectangle split parts=3]
Chengsong
parents: 532
diff changeset
    69
        {Consumed Input
Chengsong
parents: 532
diff changeset
    70
        \nodepart{two} Expects $aa$
Chengsong
parents: 532
diff changeset
    71
         \nodepart{three} Then expects $a^*$};
Chengsong
parents: 532
diff changeset
    72
Chengsong
parents: 532
diff changeset
    73
\end{tikzpicture}
Chengsong
parents: 532
diff changeset
    74
\end{center}
Chengsong
parents: 532
diff changeset
    75
\end{figure}
Chengsong
parents: 532
diff changeset
    76
\noindent
Chengsong
parents: 532
diff changeset
    77
Or it will match at least 1 more $a$:
Chengsong
parents: 532
diff changeset
    78
\begin{figure}\label{string_3parts2}
Chengsong
parents: 532
diff changeset
    79
\begin{center}
Chengsong
parents: 532
diff changeset
    80
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
Chengsong
parents: 532
diff changeset
    81
    \node [rectangle split, rectangle split horizontal, rectangle split parts=3]
Chengsong
parents: 532
diff changeset
    82
        {Consumed
Chengsong
parents: 532
diff changeset
    83
        \nodepart{two} Expects $a$
Chengsong
parents: 532
diff changeset
    84
         \nodepart{three} Then expects $a^*$};
Chengsong
parents: 532
diff changeset
    85
Chengsong
parents: 532
diff changeset
    86
\end{tikzpicture}
Chengsong
parents: 532
diff changeset
    87
\end{center}
Chengsong
parents: 532
diff changeset
    88
\end{figure}
Chengsong
parents: 532
diff changeset
    89
If these two possibilities are identical, we can eliminate
Chengsong
parents: 532
diff changeset
    90
the second one as we know the second one corresponds to 
Chengsong
parents: 532
diff changeset
    91
a match that is not $\POSIX$.
Chengsong
parents: 532
diff changeset
    92
Chengsong
parents: 532
diff changeset
    93
Chengsong
parents: 532
diff changeset
    94
If two identical regexes 
Chengsong
parents: 532
diff changeset
    95
happen to be grouped into different sequences, one cannot use
Chengsong
parents: 532
diff changeset
    96
the $a + a \rightsquigarrow a$ rule to eliminate them, even if they 
Chengsong
parents: 532
diff changeset
    97
are "parallel" to each other:
Chengsong
parents: 532
diff changeset
    98
\[
Chengsong
parents: 532
diff changeset
    99
(a+b) \cdot r_1 + (a+c) \cdot r_2 \rightsquigarrow (a+b) \cdot r_1 + (c) \cdot r_2
Chengsong
parents: 532
diff changeset
   100
\]
Chengsong
parents: 532
diff changeset
   101
and that's because they are followed by possibly 
Chengsong
parents: 532
diff changeset
   102
different "suffixes" $r_1$ and $r_2$, and if 
Chengsong
parents: 532
diff changeset
   103
they do turn out to be different then doing 
Chengsong
parents: 532
diff changeset
   104
\[
Chengsong
parents: 532
diff changeset
   105
(a+b) \cdot r_1 + (a+c) \cdot r_2 \rightsquigarrow (a+b) \cdot r_1 + (c) \cdot r_2
Chengsong
parents: 532
diff changeset
   106
\]
Chengsong
parents: 532
diff changeset
   107
might cause a possible match where the string is in $L(a \cdot r_2)$ to be lost.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   108
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   109
Here is an example for this.
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   110
Assume we have the derivative regex 
533
Chengsong
parents: 532
diff changeset
   111
\[( r_1  +  r_2  +  r_3)\cdot r+ 
Chengsong
parents: 532
diff changeset
   112
( r_1  +  r_5  +  r_6)\cdot( r_1  +  r_2  +  r_3)^*
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   113
\]
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   114
533
Chengsong
parents: 532
diff changeset
   115
occurring in an intermediate step in our bit-coded lexing algorithm.
Chengsong
parents: 532
diff changeset
   116
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   117
In this expression, there will be 6 "parallel" terms if we break up regexes 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   118
of shape $(a+b)\cdot c$ using the distributivity law (into $ac + bc$).
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   119
\begin{align}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   120
(\nonumber \\
533
Chengsong
parents: 532
diff changeset
   121
 r_1  + &    \label{term:1} \\
Chengsong
parents: 532
diff changeset
   122
 r_2  + &  \label{term:2} \\
Chengsong
parents: 532
diff changeset
   123
 r_3 &  \label{term:3} \\
Chengsong
parents: 532
diff changeset
   124
&	)\cdot r\; + \; (\nonumber \\
Chengsong
parents: 532
diff changeset
   125
 r_1  + & \label{term:4} \\
Chengsong
parents: 532
diff changeset
   126
 r_5  + & \label{term:5} \\
Chengsong
parents: 532
diff changeset
   127
 r_6 \label{term:6}\\
Chengsong
parents: 532
diff changeset
   128
&      )\cdot r\nonumber
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   129
\end{align}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   130
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   131
They have been labelled, and each label denotes  
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   132
one term, for example, \ref{term:1} denotes
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   133
\[
533
Chengsong
parents: 532
diff changeset
   134
 r_1 \cdot r
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   135
\]
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   136
\noindent
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   137
and \ref{term:3} denotes
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   138
\[
533
Chengsong
parents: 532
diff changeset
   139
 r_3\cdot r.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   140
\]
533
Chengsong
parents: 532
diff changeset
   141
From a lexer's point of view, \ref{term:4} will never 
Chengsong
parents: 532
diff changeset
   142
be picked up in later phases of matching because there
Chengsong
parents: 532
diff changeset
   143
is a term \ref{term:1} giving identical matching information.
Chengsong
parents: 532
diff changeset
   144
The first term \ref{term:1} will match a string in $L(r_1 \cdot  r)$, 
Chengsong
parents: 532
diff changeset
   145
and so on for term \ref{term:2}, \ref{term:3}, etc.
Chengsong
parents: 532
diff changeset
   146
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   147
\mybox{previous input $\ldots$}\mybox{$aaa$ }\mybox{rest of input $\ldots$}
533
Chengsong
parents: 532
diff changeset
   148
\begin{center}\label{string_2parts}
Chengsong
parents: 532
diff changeset
   149
Chengsong
parents: 532
diff changeset
   150
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
Chengsong
parents: 532
diff changeset
   151
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
Chengsong
parents: 532
diff changeset
   152
        {$r_{x1}$
Chengsong
parents: 532
diff changeset
   153
         \nodepart{two} $r_1\cdot r$ };
Chengsong
parents: 532
diff changeset
   154
%\caption{term 1 \ref{term:1}'s matching configuration}
Chengsong
parents: 532
diff changeset
   155
\end{tikzpicture}
Chengsong
parents: 532
diff changeset
   156
Chengsong
parents: 532
diff changeset
   157
\end{center}
Chengsong
parents: 532
diff changeset
   158
For term 1 \ref{term:1}, whatever was before the current
Chengsong
parents: 532
diff changeset
   159
input position was fully matched and the regex corresponding
Chengsong
parents: 532
diff changeset
   160
to it reduced to $\ONE$, 
Chengsong
parents: 532
diff changeset
   161
and in the next input token, it will start on $r_1\cdot r$.
Chengsong
parents: 532
diff changeset
   162
The resulting value will be something of a similar configuration:
Chengsong
parents: 532
diff changeset
   163
\begin{center}\label{value_2parts}
Chengsong
parents: 532
diff changeset
   164
%\caption{term 1 \ref{term:1}'s matching configuration}
Chengsong
parents: 532
diff changeset
   165
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
Chengsong
parents: 532
diff changeset
   166
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
Chengsong
parents: 532
diff changeset
   167
        {$v_{x1}$
Chengsong
parents: 532
diff changeset
   168
         \nodepart{two} $v_{r_1\cdot r}$ };
Chengsong
parents: 532
diff changeset
   169
\end{tikzpicture}
Chengsong
parents: 532
diff changeset
   170
\end{center}
Chengsong
parents: 532
diff changeset
   171
For term 2 \ref{term:2} we have a similar value partition:
Chengsong
parents: 532
diff changeset
   172
\begin{center}\label{value_2parts2}
Chengsong
parents: 532
diff changeset
   173
%\caption{term 1 \ref{term:1}'s matching configuration}
Chengsong
parents: 532
diff changeset
   174
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
Chengsong
parents: 532
diff changeset
   175
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
Chengsong
parents: 532
diff changeset
   176
        {$v_{x2}$
Chengsong
parents: 532
diff changeset
   177
         \nodepart{two} $v_{r_2\cdot r}$ };
Chengsong
parents: 532
diff changeset
   178
\end{tikzpicture}
Chengsong
parents: 532
diff changeset
   179
\end{center}
Chengsong
parents: 532
diff changeset
   180
\noindent
Chengsong
parents: 532
diff changeset
   181
and so on.
Chengsong
parents: 532
diff changeset
   182
We note that for term 4 \ref{term:4} its result value 
Chengsong
parents: 532
diff changeset
   183
after any position beyond  the current one will always
Chengsong
parents: 532
diff changeset
   184
be the same:
Chengsong
parents: 532
diff changeset
   185
\begin{center}\label{value_2parts4}
Chengsong
parents: 532
diff changeset
   186
%\caption{term 1 \ref{term:1}'s matching configuration}
Chengsong
parents: 532
diff changeset
   187
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
Chengsong
parents: 532
diff changeset
   188
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
Chengsong
parents: 532
diff changeset
   189
        {$v_{x4}$
Chengsong
parents: 532
diff changeset
   190
         \nodepart{two} $v_{r_1\cdot r}$ };
Chengsong
parents: 532
diff changeset
   191
\end{tikzpicture}
Chengsong
parents: 532
diff changeset
   192
\end{center}
Chengsong
parents: 532
diff changeset
   193
And $\POSIX$ rules says that we can eliminate at least one of them.
Chengsong
parents: 532
diff changeset
   194
Our algorithm always puts the regex with the longest initial sub-match at the left of the 
Chengsong
parents: 532
diff changeset
   195
alternatives, so we safely throw away \ref{term:4}.
Chengsong
parents: 532
diff changeset
   196
The fact that term 1 and 4 are not immediately in the same alternative
Chengsong
parents: 532
diff changeset
   197
expression does not prevent them from being two duplicate matches at
Chengsong
parents: 532
diff changeset
   198
the current point of view.
Chengsong
parents: 532
diff changeset
   199
To implement this idea into an algorithm, we define a "pruning function"
Chengsong
parents: 532
diff changeset
   200
$\textit{prune}$, that takes away parts of all terms in $r$ that belongs to
Chengsong
parents: 532
diff changeset
   201
$\textit{acc}$, which acts as an element
Chengsong
parents: 532
diff changeset
   202
is a stronger version of $\textit{distinctBy}$.
Chengsong
parents: 532
diff changeset
   203
Here is a concise version of it (in the style of Scala):
590
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
   204
\begin{figure}[H]
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
   205
\begin{lstlisting}
533
Chengsong
parents: 532
diff changeset
   206
def distinctByPlus(rs: List[ARexp], acc: Set[Rexp] = Set()) : 
Chengsong
parents: 532
diff changeset
   207
List[ARexp] =  rs match {
Chengsong
parents: 532
diff changeset
   208
  case Nil => Nil
Chengsong
parents: 532
diff changeset
   209
  case r :: rs =>
Chengsong
parents: 532
diff changeset
   210
    if(acc.contains(erase(r)))
Chengsong
parents: 532
diff changeset
   211
      distinctByPlus(rs, acc)
Chengsong
parents: 532
diff changeset
   212
    else 
Chengsong
parents: 532
diff changeset
   213
      prune(r, acc) :: distinctByPlus(rs, prune(r, acc) +: acc)
Chengsong
parents: 532
diff changeset
   214
}
Chengsong
parents: 532
diff changeset
   215
590
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
   216
\end{lstlisting}
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
   217
\caption{the distinctByPlus function}
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
   218
\end{figure}
533
Chengsong
parents: 532
diff changeset
   219
But for the function $\textit{prune}$ there is a difficulty:
Chengsong
parents: 532
diff changeset
   220
how could one define the $L$ function in a "computable" way,
Chengsong
parents: 532
diff changeset
   221
so that they generate a (lazy infinite) set of strings in $L(r)$.
Chengsong
parents: 532
diff changeset
   222
We also need a function that tests whether $L(r_1) \subseteq L(r_2)$
Chengsong
parents: 532
diff changeset
   223
is true.
Chengsong
parents: 532
diff changeset
   224
For the moment we cut corners and do not define these two functions
Chengsong
parents: 532
diff changeset
   225
as they are not used by Antimirov (and will probably not contribute
Chengsong
parents: 532
diff changeset
   226
to a bound better than Antimirov's cubic bound).
Chengsong
parents: 532
diff changeset
   227
Rather, we do not try to eliminate in every instance of regular expressions
Chengsong
parents: 532
diff changeset
   228
that have "language duplicates", but only eliminate the "exact duplicates".
Chengsong
parents: 532
diff changeset
   229
For this we need to break up terms $(a+b)\cdot c$ into two
Chengsong
parents: 532
diff changeset
   230
terms $a\cdot c$ and $b\cdot c$ before we add them to the accumulator:
590
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
   231
\begin{figure}[H]
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
   232
\begin{lstlisting}
533
Chengsong
parents: 532
diff changeset
   233
def distinctWith(rs: List[ARexp], 
Chengsong
parents: 532
diff changeset
   234
                pruneFunction: (ARexp, Set[Rexp]) => ARexp, 
Chengsong
parents: 532
diff changeset
   235
                acc: Set[Rexp] = Set()) : List[ARexp] = 
Chengsong
parents: 532
diff changeset
   236
  rs match{
Chengsong
parents: 532
diff changeset
   237
    case Nil => Nil
Chengsong
parents: 532
diff changeset
   238
    case r :: rs => 
Chengsong
parents: 532
diff changeset
   239
      if(acc(erase(r)))
Chengsong
parents: 532
diff changeset
   240
        distinctWith(rs, pruneFunction, acc)
Chengsong
parents: 532
diff changeset
   241
      else {
Chengsong
parents: 532
diff changeset
   242
        val pruned_r = pruneFunction(r, acc)
Chengsong
parents: 532
diff changeset
   243
        pruned_r :: 
Chengsong
parents: 532
diff changeset
   244
        distinctWith(rs, 
Chengsong
parents: 532
diff changeset
   245
          pruneFunction, 
Chengsong
parents: 532
diff changeset
   246
          turnIntoTerms(erase(pruned_r)) ++: acc
Chengsong
parents: 532
diff changeset
   247
        )
Chengsong
parents: 532
diff changeset
   248
      }
Chengsong
parents: 532
diff changeset
   249
  }
590
988e92a70704 more chap5 and chap6 bsimp_idem
Chengsong
parents: 538
diff changeset
   250
\end{lstlisting}
535
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   251
\caption{A Stronger Version of $\textit{distinctBy}$}
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   252
\end{figure}
533
Chengsong
parents: 532
diff changeset
   253
\noindent
Chengsong
parents: 532
diff changeset
   254
This process is done by the function $\textit{turnIntoTerms}$:
535
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   255
\begin{figure}
533
Chengsong
parents: 532
diff changeset
   256
\begin{verbatim}
Chengsong
parents: 532
diff changeset
   257
def turnIntoTerms(r: Rexp): List[Rexp] = r match {
Chengsong
parents: 532
diff changeset
   258
  case SEQ(r1, r2)  => if(isOne(r1)) 
Chengsong
parents: 532
diff changeset
   259
                          turnIntoTerms(r2) 
Chengsong
parents: 532
diff changeset
   260
                       else 
Chengsong
parents: 532
diff changeset
   261
                          turnIntoTerms(r1).map(r11 => SEQ(r11, r2))
Chengsong
parents: 532
diff changeset
   262
  case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2)
Chengsong
parents: 532
diff changeset
   263
  case ZERO => Nil
Chengsong
parents: 532
diff changeset
   264
  case _ => r :: Nil
Chengsong
parents: 532
diff changeset
   265
}
Chengsong
parents: 532
diff changeset
   266
\end{verbatim}
535
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   267
\caption{function that breaks up regular expression into "atomic" terms}
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   268
\end{figure}
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   269
533
Chengsong
parents: 532
diff changeset
   270
\noindent
Chengsong
parents: 532
diff changeset
   271
The pruning function can be defined recursively:
535
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   272
\begin{figure}
533
Chengsong
parents: 532
diff changeset
   273
\begin{verbatim}
Chengsong
parents: 532
diff changeset
   274
def prune7(r: ARexp, acc: Set[Rexp]) : ARexp = r match{
Chengsong
parents: 532
diff changeset
   275
  case AALTS(bs, rs) => rs.map(r => prune7(r, acc)).filter(_ != ZERO) match
Chengsong
parents: 532
diff changeset
   276
  {
Chengsong
parents: 532
diff changeset
   277
    case Nil => AZERO
Chengsong
parents: 532
diff changeset
   278
    case r::Nil => fuse(bs, r)
Chengsong
parents: 532
diff changeset
   279
    case rs1 => AALTS(bs, rs1)
Chengsong
parents: 532
diff changeset
   280
  }
Chengsong
parents: 532
diff changeset
   281
  case ASEQ(bs, r1, r2) => prune7(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match {
Chengsong
parents: 532
diff changeset
   282
    case AZERO => AZERO
Chengsong
parents: 532
diff changeset
   283
    case r1p if(isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2)
Chengsong
parents: 532
diff changeset
   284
    case r1p => ASEQ(bs, r1p, r2)
Chengsong
parents: 532
diff changeset
   285
  }
Chengsong
parents: 532
diff changeset
   286
  case r => if(acc(erase(r))) AZERO else r
Chengsong
parents: 532
diff changeset
   287
}
Chengsong
parents: 532
diff changeset
   288
\end{verbatim}
535
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   289
\caption{pruning function}
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   290
\end{figure}
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   291
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   292
533
Chengsong
parents: 532
diff changeset
   293
Chengsong
parents: 532
diff changeset
   294
\begin{figure}
Chengsong
parents: 532
diff changeset
   295
\centering
Chengsong
parents: 532
diff changeset
   296
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
Chengsong
parents: 532
diff changeset
   297
\begin{tikzpicture}
Chengsong
parents: 532
diff changeset
   298
\begin{axis}[
535
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   299
    %xlabel={$n$},
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   300
    myplotstyle,
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   301
    xlabel={input length},
533
Chengsong
parents: 532
diff changeset
   302
    ylabel={size},
535
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   303
    ]
533
Chengsong
parents: 532
diff changeset
   304
\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data};
Chengsong
parents: 532
diff changeset
   305
\end{axis}
Chengsong
parents: 532
diff changeset
   306
\end{tikzpicture}
Chengsong
parents: 532
diff changeset
   307
  &
Chengsong
parents: 532
diff changeset
   308
\begin{tikzpicture}
Chengsong
parents: 532
diff changeset
   309
\begin{axis}[
535
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   310
    %xlabel={$n$},
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   311
    myplotstyle,
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   312
    xlabel={input length},
533
Chengsong
parents: 532
diff changeset
   313
    ylabel={size},
535
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   314
    ]
533
Chengsong
parents: 532
diff changeset
   315
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
Chengsong
parents: 532
diff changeset
   316
\end{axis}
Chengsong
parents: 532
diff changeset
   317
\end{tikzpicture}\\
535
ce91c29d2885 fixed plotting error 6.1
Chengsong
parents: 533
diff changeset
   318
\multicolumn{2}{l}{Graphs: Runtime for matching $((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings 
533
Chengsong
parents: 532
diff changeset
   319
           of the form $\underbrace{aa..a}_{n}$.}
Chengsong
parents: 532
diff changeset
   320
\end{tabular}    
Chengsong
parents: 532
diff changeset
   321
\caption{aaaaaStarStar} \label{fig:aaaaaStarStar}
Chengsong
parents: 532
diff changeset
   322
\end{figure}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   323
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   324
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   325
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   326
%----------------------------------------------------------------------------------------
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   327
%	SECTION 1
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   328
%----------------------------------------------------------------------------------------
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   329
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   330
\section{Adding Support for the Negation Construct, and its Correctness Proof}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   331
We now add support for the negation regular expression:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   332
\[			r ::=   \ZERO \mid  \ONE
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   333
			 \mid  c  
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   334
			 \mid  r_1 \cdot r_2
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   335
			 \mid  r_1 + r_2   
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   336
			 \mid r^*    
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   337
			 \mid \sim r
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   338
\]
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   339
The $\textit{nullable}$ function's clause for it would be 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   340
\[
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   341
\textit{nullable}(~r) = \neg \nullable(r)
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   342
\]
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   343
The derivative would be
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   344
\[
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   345
~r \backslash c = ~ (r \backslash c)
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   346
\]
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   347
 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   348
The most tricky part of lexing for the $~r$ regex
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   349
 is creating a value for it.
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   350
 For other regular expressions, the value aligns with the 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   351
 structure of the regex:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   352
 \[
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   353
 \vdash \Seq(\Char(a), \Char(b)) : a \cdot b
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   354
 \]
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   355
But for the $~r$ regex, $s$ is a member of it if and only if
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   356
$s$ does not belong to $L(r)$. 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   357
That means when there
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   358
is a match for the not regex, it is not possible to generate how the string $s$ matched
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   359
with $r$. 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   360
What we can do is preserve the information of how $s$ was not matched by $r$,
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   361
and there are a number of options to do this.
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   362
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   363
We could give a partial value when there is a partial match for the regex inside
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   364
the $\mathbf{not}$ construct.
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   365
For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$,
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   366
A value for it could be 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   367
 \[
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   368
 \vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c)
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   369
 \]
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   370
 The above example demonstrates what value to construct 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   371
 when the string $s$ is at most a real prefix
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   372
 of the strings in $L(r)$. When $s$ instead is not a prefix of any strings
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   373
 in $L(r)$, it becomes unclear what to return as a value inside the $\textit{Not}$
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   374
 constructor.
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   375
 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   376
 Another option would be to either store the string $s$ that resulted in 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   377
 a mis-match for $r$ or a dummy value as a placeholder:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   378
 \[
533
Chengsong
parents: 532
diff changeset
   379
 \vdash \textit{Not}(abcd) : ~( r_1 )
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   380
 \]
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   381
or
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   382
 \[
533
Chengsong
parents: 532
diff changeset
   383
 \vdash \textit{Not}(\textit{Dummy}) : ~( r_1 )
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   384
 \] 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   385
 We choose to implement this as it is most straightforward:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   386
 \[
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   387
 \mkeps(~(r))  = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy})
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   388
 \]
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   389
 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   390
%----------------------------------------------------------------------------------------
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   391
%	SECTION 2
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   392
%----------------------------------------------------------------------------------------
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   393
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   394
\section{Bounded Repetitions}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   395
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   396