532
|
1 |
% Chapter Template
|
|
2 |
|
|
3 |
\chapter{A Better Bound and Other Extensions} % Main chapter title
|
|
4 |
|
|
5 |
\label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
|
|
6 |
%in Chapter 4 to a polynomial one, and demonstrate how one can extend the
|
|
7 |
%algorithm to include constructs such as bounded repetitions and negations.
|
590
|
8 |
\lstset{style=myScalastyle}
|
|
9 |
|
|
10 |
|
|
11 |
This chapter is a ``miscellaneous''
|
|
12 |
chapter which records various
|
|
13 |
extensions to our $\blexersimp$'s formalisations.\\
|
|
14 |
Firstly we present further improvements
|
|
15 |
made to our lexer algorithm $\blexersimp$.
|
|
16 |
We devise a stronger simplification algorithm,
|
|
17 |
called $\bsimpStrong$, which can prune away
|
|
18 |
similar components in two regular expressions at the same
|
|
19 |
alternative level,
|
|
20 |
even if these regular expressions are not exactly the same.
|
|
21 |
We call the lexer that uses this stronger simplification function
|
|
22 |
$\blexerStrong$.
|
|
23 |
We conjecture that both
|
|
24 |
\begin{center}
|
|
25 |
$\blexerStrong \;r \; s = \blexer\; r\;s$
|
|
26 |
\end{center}
|
|
27 |
and
|
|
28 |
\begin{center}
|
|
29 |
$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$
|
|
30 |
\end{center}
|
|
31 |
hold, but formalising
|
|
32 |
them is still work in progress.
|
|
33 |
We give reasons why the correctness and cubic size bound proofs
|
|
34 |
can be achieved,
|
|
35 |
by exploring the connection between the internal
|
|
36 |
data structure of our $\blexerStrong$ and
|
|
37 |
Animirov's partial derivatives.\\
|
|
38 |
Secondly, we extend our $\blexersimp$
|
|
39 |
to support bounded repetitions ($r^{\{n\}}$).
|
|
40 |
We update our formalisation of
|
|
41 |
the correctness and finiteness properties to
|
|
42 |
include this new construct. With bounded repetitions
|
|
43 |
we are able to out-compete other verified lexers such as
|
|
44 |
Verbatim++ on regular expressions which involve a lot of
|
|
45 |
repetitions. We also present the idempotency property proof
|
|
46 |
of $\bsimp$, which leverages the idempotency proof of $\rsimp$.
|
|
47 |
This reinforces our claim that the fixpoint construction
|
|
48 |
originally required by Sulzmann and Lu can be removed in $\blexersimp$.
|
|
49 |
\\
|
|
50 |
Last but not least, we present our efforts and challenges we met
|
|
51 |
in further improving the algorithm by data
|
|
52 |
structures such as zippers.
|
|
53 |
|
|
54 |
|
|
55 |
|
532
|
56 |
%----------------------------------------------------------------------------------------
|
|
57 |
% SECTION strongsimp
|
|
58 |
%----------------------------------------------------------------------------------------
|
|
59 |
\section{A Stronger Version of Simplification}
|
|
60 |
%TODO: search for isabelle proofs of algorithms that check equivalence
|
590
|
61 |
In our bitcoded lexing algorithm, (sub)terms represent (sub)matches.
|
|
62 |
For example, the regular expression $aa \cdot a^*+ a \cdot a^*$ contains two terms,
|
|
63 |
which expresses two possibilities it will match future input.
|
533
|
64 |
It will either match 2 $a$'s, then 0 or more $a$'s, in other words, at least 2 more $a$'s
|
|
65 |
\begin{figure}\label{string_3parts1}
|
|
66 |
\begin{center}
|
|
67 |
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
|
|
68 |
\node [rectangle split, rectangle split horizontal, rectangle split parts=3]
|
|
69 |
{Consumed Input
|
|
70 |
\nodepart{two} Expects $aa$
|
|
71 |
\nodepart{three} Then expects $a^*$};
|
|
72 |
|
|
73 |
\end{tikzpicture}
|
|
74 |
\end{center}
|
|
75 |
\end{figure}
|
|
76 |
\noindent
|
|
77 |
Or it will match at least 1 more $a$:
|
|
78 |
\begin{figure}\label{string_3parts2}
|
|
79 |
\begin{center}
|
|
80 |
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
|
|
81 |
\node [rectangle split, rectangle split horizontal, rectangle split parts=3]
|
|
82 |
{Consumed
|
|
83 |
\nodepart{two} Expects $a$
|
|
84 |
\nodepart{three} Then expects $a^*$};
|
|
85 |
|
|
86 |
\end{tikzpicture}
|
|
87 |
\end{center}
|
|
88 |
\end{figure}
|
|
89 |
If these two possibilities are identical, we can eliminate
|
|
90 |
the second one as we know the second one corresponds to
|
|
91 |
a match that is not $\POSIX$.
|
|
92 |
|
|
93 |
|
|
94 |
If two identical regexes
|
|
95 |
happen to be grouped into different sequences, one cannot use
|
|
96 |
the $a + a \rightsquigarrow a$ rule to eliminate them, even if they
|
|
97 |
are "parallel" to each other:
|
|
98 |
\[
|
|
99 |
(a+b) \cdot r_1 + (a+c) \cdot r_2 \rightsquigarrow (a+b) \cdot r_1 + (c) \cdot r_2
|
|
100 |
\]
|
|
101 |
and that's because they are followed by possibly
|
|
102 |
different "suffixes" $r_1$ and $r_2$, and if
|
|
103 |
they do turn out to be different then doing
|
|
104 |
\[
|
|
105 |
(a+b) \cdot r_1 + (a+c) \cdot r_2 \rightsquigarrow (a+b) \cdot r_1 + (c) \cdot r_2
|
|
106 |
\]
|
|
107 |
might cause a possible match where the string is in $L(a \cdot r_2)$ to be lost.
|
532
|
108 |
|
|
109 |
Here is an example for this.
|
|
110 |
Assume we have the derivative regex
|
533
|
111 |
\[( r_1 + r_2 + r_3)\cdot r+
|
|
112 |
( r_1 + r_5 + r_6)\cdot( r_1 + r_2 + r_3)^*
|
532
|
113 |
\]
|
|
114 |
|
533
|
115 |
occurring in an intermediate step in our bit-coded lexing algorithm.
|
|
116 |
|
532
|
117 |
In this expression, there will be 6 "parallel" terms if we break up regexes
|
|
118 |
of shape $(a+b)\cdot c$ using the distributivity law (into $ac + bc$).
|
|
119 |
\begin{align}
|
|
120 |
(\nonumber \\
|
533
|
121 |
r_1 + & \label{term:1} \\
|
|
122 |
r_2 + & \label{term:2} \\
|
|
123 |
r_3 & \label{term:3} \\
|
|
124 |
& )\cdot r\; + \; (\nonumber \\
|
|
125 |
r_1 + & \label{term:4} \\
|
|
126 |
r_5 + & \label{term:5} \\
|
|
127 |
r_6 \label{term:6}\\
|
|
128 |
& )\cdot r\nonumber
|
532
|
129 |
\end{align}
|
|
130 |
|
|
131 |
They have been labelled, and each label denotes
|
|
132 |
one term, for example, \ref{term:1} denotes
|
|
133 |
\[
|
533
|
134 |
r_1 \cdot r
|
532
|
135 |
\]
|
|
136 |
\noindent
|
|
137 |
and \ref{term:3} denotes
|
|
138 |
\[
|
533
|
139 |
r_3\cdot r.
|
532
|
140 |
\]
|
533
|
141 |
From a lexer's point of view, \ref{term:4} will never
|
|
142 |
be picked up in later phases of matching because there
|
|
143 |
is a term \ref{term:1} giving identical matching information.
|
|
144 |
The first term \ref{term:1} will match a string in $L(r_1 \cdot r)$,
|
|
145 |
and so on for term \ref{term:2}, \ref{term:3}, etc.
|
|
146 |
|
532
|
147 |
\mybox{previous input $\ldots$}\mybox{$aaa$ }\mybox{rest of input $\ldots$}
|
533
|
148 |
\begin{center}\label{string_2parts}
|
|
149 |
|
|
150 |
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
|
|
151 |
\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
|
|
152 |
{$r_{x1}$
|
|
153 |
\nodepart{two} $r_1\cdot r$ };
|
|
154 |
%\caption{term 1 \ref{term:1}'s matching configuration}
|
|
155 |
\end{tikzpicture}
|
|
156 |
|
|
157 |
\end{center}
|
|
158 |
For term 1 \ref{term:1}, whatever was before the current
|
|
159 |
input position was fully matched and the regex corresponding
|
|
160 |
to it reduced to $\ONE$,
|
|
161 |
and in the next input token, it will start on $r_1\cdot r$.
|
|
162 |
The resulting value will be something of a similar configuration:
|
|
163 |
\begin{center}\label{value_2parts}
|
|
164 |
%\caption{term 1 \ref{term:1}'s matching configuration}
|
|
165 |
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
|
|
166 |
\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
|
|
167 |
{$v_{x1}$
|
|
168 |
\nodepart{two} $v_{r_1\cdot r}$ };
|
|
169 |
\end{tikzpicture}
|
|
170 |
\end{center}
|
|
171 |
For term 2 \ref{term:2} we have a similar value partition:
|
|
172 |
\begin{center}\label{value_2parts2}
|
|
173 |
%\caption{term 1 \ref{term:1}'s matching configuration}
|
|
174 |
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
|
|
175 |
\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
|
|
176 |
{$v_{x2}$
|
|
177 |
\nodepart{two} $v_{r_2\cdot r}$ };
|
|
178 |
\end{tikzpicture}
|
|
179 |
\end{center}
|
|
180 |
\noindent
|
|
181 |
and so on.
|
|
182 |
We note that for term 4 \ref{term:4} its result value
|
|
183 |
after any position beyond the current one will always
|
|
184 |
be the same:
|
|
185 |
\begin{center}\label{value_2parts4}
|
|
186 |
%\caption{term 1 \ref{term:1}'s matching configuration}
|
|
187 |
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
|
|
188 |
\node [rectangle split, rectangle split horizontal, rectangle split parts=2]
|
|
189 |
{$v_{x4}$
|
|
190 |
\nodepart{two} $v_{r_1\cdot r}$ };
|
|
191 |
\end{tikzpicture}
|
|
192 |
\end{center}
|
|
193 |
And $\POSIX$ rules says that we can eliminate at least one of them.
|
|
194 |
Our algorithm always puts the regex with the longest initial sub-match at the left of the
|
|
195 |
alternatives, so we safely throw away \ref{term:4}.
|
|
196 |
The fact that term 1 and 4 are not immediately in the same alternative
|
|
197 |
expression does not prevent them from being two duplicate matches at
|
|
198 |
the current point of view.
|
|
199 |
To implement this idea into an algorithm, we define a "pruning function"
|
|
200 |
$\textit{prune}$, that takes away parts of all terms in $r$ that belongs to
|
|
201 |
$\textit{acc}$, which acts as an element
|
|
202 |
is a stronger version of $\textit{distinctBy}$.
|
|
203 |
Here is a concise version of it (in the style of Scala):
|
590
|
204 |
\begin{figure}[H]
|
|
205 |
\begin{lstlisting}
|
533
|
206 |
def distinctByPlus(rs: List[ARexp], acc: Set[Rexp] = Set()) :
|
|
207 |
List[ARexp] = rs match {
|
|
208 |
case Nil => Nil
|
|
209 |
case r :: rs =>
|
|
210 |
if(acc.contains(erase(r)))
|
|
211 |
distinctByPlus(rs, acc)
|
|
212 |
else
|
|
213 |
prune(r, acc) :: distinctByPlus(rs, prune(r, acc) +: acc)
|
|
214 |
}
|
|
215 |
|
590
|
216 |
\end{lstlisting}
|
|
217 |
\caption{the distinctByPlus function}
|
|
218 |
\end{figure}
|
533
|
219 |
But for the function $\textit{prune}$ there is a difficulty:
|
|
220 |
how could one define the $L$ function in a "computable" way,
|
|
221 |
so that they generate a (lazy infinite) set of strings in $L(r)$.
|
|
222 |
We also need a function that tests whether $L(r_1) \subseteq L(r_2)$
|
|
223 |
is true.
|
|
224 |
For the moment we cut corners and do not define these two functions
|
|
225 |
as they are not used by Antimirov (and will probably not contribute
|
|
226 |
to a bound better than Antimirov's cubic bound).
|
|
227 |
Rather, we do not try to eliminate in every instance of regular expressions
|
|
228 |
that have "language duplicates", but only eliminate the "exact duplicates".
|
|
229 |
For this we need to break up terms $(a+b)\cdot c$ into two
|
|
230 |
terms $a\cdot c$ and $b\cdot c$ before we add them to the accumulator:
|
590
|
231 |
\begin{figure}[H]
|
|
232 |
\begin{lstlisting}
|
533
|
233 |
def distinctWith(rs: List[ARexp],
|
|
234 |
pruneFunction: (ARexp, Set[Rexp]) => ARexp,
|
|
235 |
acc: Set[Rexp] = Set()) : List[ARexp] =
|
|
236 |
rs match{
|
|
237 |
case Nil => Nil
|
|
238 |
case r :: rs =>
|
|
239 |
if(acc(erase(r)))
|
|
240 |
distinctWith(rs, pruneFunction, acc)
|
|
241 |
else {
|
|
242 |
val pruned_r = pruneFunction(r, acc)
|
|
243 |
pruned_r ::
|
|
244 |
distinctWith(rs,
|
|
245 |
pruneFunction,
|
|
246 |
turnIntoTerms(erase(pruned_r)) ++: acc
|
|
247 |
)
|
|
248 |
}
|
|
249 |
}
|
590
|
250 |
\end{lstlisting}
|
535
|
251 |
\caption{A Stronger Version of $\textit{distinctBy}$}
|
|
252 |
\end{figure}
|
533
|
253 |
\noindent
|
|
254 |
This process is done by the function $\textit{turnIntoTerms}$:
|
535
|
255 |
\begin{figure}
|
533
|
256 |
\begin{verbatim}
|
|
257 |
def turnIntoTerms(r: Rexp): List[Rexp] = r match {
|
|
258 |
case SEQ(r1, r2) => if(isOne(r1))
|
|
259 |
turnIntoTerms(r2)
|
|
260 |
else
|
|
261 |
turnIntoTerms(r1).map(r11 => SEQ(r11, r2))
|
|
262 |
case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2)
|
|
263 |
case ZERO => Nil
|
|
264 |
case _ => r :: Nil
|
|
265 |
}
|
|
266 |
\end{verbatim}
|
535
|
267 |
\caption{function that breaks up regular expression into "atomic" terms}
|
|
268 |
\end{figure}
|
|
269 |
|
533
|
270 |
\noindent
|
|
271 |
The pruning function can be defined recursively:
|
535
|
272 |
\begin{figure}
|
533
|
273 |
\begin{verbatim}
|
|
274 |
def prune7(r: ARexp, acc: Set[Rexp]) : ARexp = r match{
|
|
275 |
case AALTS(bs, rs) => rs.map(r => prune7(r, acc)).filter(_ != ZERO) match
|
|
276 |
{
|
|
277 |
case Nil => AZERO
|
|
278 |
case r::Nil => fuse(bs, r)
|
|
279 |
case rs1 => AALTS(bs, rs1)
|
|
280 |
}
|
|
281 |
case ASEQ(bs, r1, r2) => prune7(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match {
|
|
282 |
case AZERO => AZERO
|
|
283 |
case r1p if(isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2)
|
|
284 |
case r1p => ASEQ(bs, r1p, r2)
|
|
285 |
}
|
|
286 |
case r => if(acc(erase(r))) AZERO else r
|
|
287 |
}
|
|
288 |
\end{verbatim}
|
535
|
289 |
\caption{pruning function}
|
|
290 |
\end{figure}
|
|
291 |
|
|
292 |
|
533
|
293 |
|
|
294 |
\begin{figure}
|
|
295 |
\centering
|
|
296 |
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
|
|
297 |
\begin{tikzpicture}
|
|
298 |
\begin{axis}[
|
535
|
299 |
%xlabel={$n$},
|
|
300 |
myplotstyle,
|
|
301 |
xlabel={input length},
|
533
|
302 |
ylabel={size},
|
535
|
303 |
]
|
533
|
304 |
\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data};
|
|
305 |
\end{axis}
|
|
306 |
\end{tikzpicture}
|
|
307 |
&
|
|
308 |
\begin{tikzpicture}
|
|
309 |
\begin{axis}[
|
535
|
310 |
%xlabel={$n$},
|
|
311 |
myplotstyle,
|
|
312 |
xlabel={input length},
|
533
|
313 |
ylabel={size},
|
535
|
314 |
]
|
533
|
315 |
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
|
|
316 |
\end{axis}
|
|
317 |
\end{tikzpicture}\\
|
535
|
318 |
\multicolumn{2}{l}{Graphs: Runtime for matching $((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings
|
533
|
319 |
of the form $\underbrace{aa..a}_{n}$.}
|
|
320 |
\end{tabular}
|
|
321 |
\caption{aaaaaStarStar} \label{fig:aaaaaStarStar}
|
|
322 |
\end{figure}
|
532
|
323 |
|
|
324 |
|
|
325 |
|
|
326 |
%----------------------------------------------------------------------------------------
|
|
327 |
% SECTION 1
|
|
328 |
%----------------------------------------------------------------------------------------
|
|
329 |
|
|
330 |
\section{Adding Support for the Negation Construct, and its Correctness Proof}
|
|
331 |
We now add support for the negation regular expression:
|
|
332 |
\[ r ::= \ZERO \mid \ONE
|
|
333 |
\mid c
|
|
334 |
\mid r_1 \cdot r_2
|
|
335 |
\mid r_1 + r_2
|
|
336 |
\mid r^*
|
|
337 |
\mid \sim r
|
|
338 |
\]
|
|
339 |
The $\textit{nullable}$ function's clause for it would be
|
|
340 |
\[
|
|
341 |
\textit{nullable}(~r) = \neg \nullable(r)
|
|
342 |
\]
|
|
343 |
The derivative would be
|
|
344 |
\[
|
|
345 |
~r \backslash c = ~ (r \backslash c)
|
|
346 |
\]
|
|
347 |
|
|
348 |
The most tricky part of lexing for the $~r$ regex
|
|
349 |
is creating a value for it.
|
|
350 |
For other regular expressions, the value aligns with the
|
|
351 |
structure of the regex:
|
|
352 |
\[
|
|
353 |
\vdash \Seq(\Char(a), \Char(b)) : a \cdot b
|
|
354 |
\]
|
|
355 |
But for the $~r$ regex, $s$ is a member of it if and only if
|
|
356 |
$s$ does not belong to $L(r)$.
|
|
357 |
That means when there
|
|
358 |
is a match for the not regex, it is not possible to generate how the string $s$ matched
|
|
359 |
with $r$.
|
|
360 |
What we can do is preserve the information of how $s$ was not matched by $r$,
|
|
361 |
and there are a number of options to do this.
|
|
362 |
|
|
363 |
We could give a partial value when there is a partial match for the regex inside
|
|
364 |
the $\mathbf{not}$ construct.
|
|
365 |
For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$,
|
|
366 |
A value for it could be
|
|
367 |
\[
|
|
368 |
\vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c)
|
|
369 |
\]
|
|
370 |
The above example demonstrates what value to construct
|
|
371 |
when the string $s$ is at most a real prefix
|
|
372 |
of the strings in $L(r)$. When $s$ instead is not a prefix of any strings
|
|
373 |
in $L(r)$, it becomes unclear what to return as a value inside the $\textit{Not}$
|
|
374 |
constructor.
|
|
375 |
|
|
376 |
Another option would be to either store the string $s$ that resulted in
|
|
377 |
a mis-match for $r$ or a dummy value as a placeholder:
|
|
378 |
\[
|
533
|
379 |
\vdash \textit{Not}(abcd) : ~( r_1 )
|
532
|
380 |
\]
|
|
381 |
or
|
|
382 |
\[
|
533
|
383 |
\vdash \textit{Not}(\textit{Dummy}) : ~( r_1 )
|
532
|
384 |
\]
|
|
385 |
We choose to implement this as it is most straightforward:
|
|
386 |
\[
|
|
387 |
\mkeps(~(r)) = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy})
|
|
388 |
\]
|
|
389 |
|
|
390 |
%----------------------------------------------------------------------------------------
|
|
391 |
% SECTION 2
|
|
392 |
%----------------------------------------------------------------------------------------
|
|
393 |
|
|
394 |
\section{Bounded Repetitions}
|
|
395 |
|
|
396 |
|