ChengsongTanPhdThesis/Chapters/Future.tex
author Chengsong
Sat, 24 Dec 2022 11:55:04 +0000
changeset 637 e3752aac8ec2
child 640 bd1354127574
permissions -rw-r--r--
more
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
637
Chengsong
parents:
diff changeset
     1
% Chapter Template
Chengsong
parents:
diff changeset
     2
Chengsong
parents:
diff changeset
     3
\chapter{Future Work and Conclusion} % Main chapter title
Chengsong
parents:
diff changeset
     4
Chengsong
parents:
diff changeset
     5
\label{Future} 
Chengsong
parents:
diff changeset
     6
Chengsong
parents:
diff changeset
     7
In this thesis, in order to solve the ReDoS attacks
Chengsong
parents:
diff changeset
     8
once and for all, we have set out to formalise the correctness
Chengsong
parents:
diff changeset
     9
proof of Sulzmann and Lu's
Chengsong
parents:
diff changeset
    10
lexing algorithm with simplifications \cite{Sulzmann2014}
Chengsong
parents:
diff changeset
    11
in Isabelle/HOL. 
Chengsong
parents:
diff changeset
    12
We have fixed inefficiencies in  the original simplification
Chengsong
parents:
diff changeset
    13
algorithm, and established the correctness by proving
Chengsong
parents:
diff changeset
    14
that our algorithm outputs the same result as the original bit-coded
Chengsong
parents:
diff changeset
    15
lexer without simplifications (whose correctness was established in 
Chengsong
parents:
diff changeset
    16
previous work by Ausaf et al. in
Chengsong
parents:
diff changeset
    17
\cite{Ausaf}  and \cite{AusafDyckhoffUrban2016}).
Chengsong
parents:
diff changeset
    18
The proof technique used in \cite{Ausaf} does not work in our case
Chengsong
parents:
diff changeset
    19
because the simplification function messes with the structure of
Chengsong
parents:
diff changeset
    20
regular expressions.
Chengsong
parents:
diff changeset
    21
Despite having to try out several workarounds and being
Chengsong
parents:
diff changeset
    22
stuck for months or even years looking for proofs, 
Chengsong
parents:
diff changeset
    23
we were delighted to have discovered the simple yet
Chengsong
parents:
diff changeset
    24
effective proof method
Chengsong
parents:
diff changeset
    25
by modelling individual
Chengsong
parents:
diff changeset
    26
simplification steps as small-step rewriting rules and proving
Chengsong
parents:
diff changeset
    27
equivalence between terms linked by these rules.
Chengsong
parents:
diff changeset
    28
Chengsong
parents:
diff changeset
    29
In addition, we have proved a formal size bound on the 
Chengsong
parents:
diff changeset
    30
regular expressions. The technique was by establishing
Chengsong
parents:
diff changeset
    31
a ``closed form'' informally
Chengsong
parents:
diff changeset
    32
described by
Chengsong
parents:
diff changeset
    33
Murugesan and Shanmuga Sundaram \cite{Murugesan2014} 
Chengsong
parents:
diff changeset
    34
for compound regular expressions' derivatives
Chengsong
parents:
diff changeset
    35
and using those closed forms to control the size.
Chengsong
parents:
diff changeset
    36
The Isabelle/HOL code for our formalisation can be 
Chengsong
parents:
diff changeset
    37
found at
Chengsong
parents:
diff changeset
    38
\begin{center}
Chengsong
parents:
diff changeset
    39
\url{https://github.com/hellotommmy/posix}
Chengsong
parents:
diff changeset
    40
\end{center}
Chengsong
parents:
diff changeset
    41
\noindent
Chengsong
parents:
diff changeset
    42
Thanks to our theorem-prover-friendly approach,
Chengsong
parents:
diff changeset
    43
we believe that 
Chengsong
parents:
diff changeset
    44
this finiteness bound can be improved to a bound
Chengsong
parents:
diff changeset
    45
linear to input and
Chengsong
parents:
diff changeset
    46
cubic to the regular expression size using a technique by
Chengsong
parents:
diff changeset
    47
Antimirov\cite{Antimirov95}.
Chengsong
parents:
diff changeset
    48
Once formalised, this would be a guarantee for the absence of all super-linear behavious.
Chengsong
parents:
diff changeset
    49
We are working out the
Chengsong
parents:
diff changeset
    50
details.
Chengsong
parents:
diff changeset
    51
Chengsong
parents:
diff changeset
    52
Chengsong
parents:
diff changeset
    53
To our best knowledge, no lexing libraries using Brzozowski derivatives
Chengsong
parents:
diff changeset
    54
have similar complexity-related bounds, 
Chengsong
parents:
diff changeset
    55
and claims about running time are usually speculative and backed by empirical
Chengsong
parents:
diff changeset
    56
evidence on a few test cases.
Chengsong
parents:
diff changeset
    57
If a matching or lexing algorithm
Chengsong
parents:
diff changeset
    58
does not come with certain basic complexity related 
Chengsong
parents:
diff changeset
    59
guarantees (for examaple the internal data structure size
Chengsong
parents:
diff changeset
    60
does not grow indefinitely), 
Chengsong
parents:
diff changeset
    61
then one cannot claim with confidence having solved the problem
Chengsong
parents:
diff changeset
    62
of catastrophic backtracking.
Chengsong
parents:
diff changeset
    63
Chengsong
parents:
diff changeset
    64
We believe our proof method is not specific to this
Chengsong
parents:
diff changeset
    65
algorithm, and intend to extend this approach
Chengsong
parents:
diff changeset
    66
to prove the correctness of a faster version
Chengsong
parents:
diff changeset
    67
of the lexer. The closed
Chengsong
parents:
diff changeset
    68
forms can then be re-used as well. Together
Chengsong
parents:
diff changeset
    69
with the idea from Antimirov \cite{Antimirov95}
Chengsong
parents:
diff changeset
    70
we plan to reduce the size bound to become polynomial
Chengsong
parents:
diff changeset
    71
with respect to the size of the regular expression.
Chengsong
parents:
diff changeset
    72
%We have tried out a variety of other optimisations 
Chengsong
parents:
diff changeset
    73
%to make lexing/parsing with derivatives faster,
Chengsong
parents:
diff changeset
    74
%for example in \cite{Edelmann2021}, \cite{Darragh2020}
Chengsong
parents:
diff changeset
    75
%and \cite{Verbatimpp}.
Chengsong
parents:
diff changeset
    76
Chengsong
parents:
diff changeset
    77
We have learned quite a few lessons in this process.
Chengsong
parents:
diff changeset
    78
As simple as the end result may seem,
Chengsong
parents:
diff changeset
    79
coming up with the initial proof idea with rewriting relations
Chengsong
parents:
diff changeset
    80
was the hardest for us, as it requires a deep understanding of what information
Chengsong
parents:
diff changeset
    81
is preserved during the simplification process.
Chengsong
parents:
diff changeset
    82
Having the first formalisation gives us much more confidence----it took
Chengsong
parents:
diff changeset
    83
us much less time to obtain the closed forms and size bound, although they
Chengsong
parents:
diff changeset
    84
are more involved than the correctness proof.
Chengsong
parents:
diff changeset
    85
As having been tested numerous times, 
Chengsong
parents:
diff changeset
    86
the usefulness of formal approaches cannot be overstated:
Chengsong
parents:
diff changeset
    87
they not only allow us to find bugs in Sulzmann and Lu's simplification functions,
Chengsong
parents:
diff changeset
    88
but also helped us set up realistic expectations about performance
Chengsong
parents:
diff changeset
    89
of algorithms. We believed in the beginning that
Chengsong
parents:
diff changeset
    90
the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
Chengsong
parents:
diff changeset
    91
was already able to achieve the linear bound claimed by
Chengsong
parents:
diff changeset
    92
Sulzmann and Lu.
Chengsong
parents:
diff changeset
    93
We then attempted to prove that $\llbracket \blexersimp\; r \; s\rrbracket$ 
Chengsong
parents:
diff changeset
    94
is $O(\llbracket r\rrbracket^c\cdot |s|)$,
Chengsong
parents:
diff changeset
    95
where $c$ is a small constant number, making
Chengsong
parents:
diff changeset
    96
$\llbracket r\rrbracket^c$ a small constant factor.
Chengsong
parents:
diff changeset
    97
We were quite surprised that this did not go through despite a lot of efforts.
Chengsong
parents:
diff changeset
    98
This led us to discover the example where $\llbracket r\rrbracket^c$
Chengsong
parents:
diff changeset
    99
can in fact be exponentially large in chapter \ref{Finite}.
Chengsong
parents:
diff changeset
   100
We therefore learned the necessity to introduce the stronger simplifications
Chengsong
parents:
diff changeset
   101
in chapter \ref{Cubic}.
Chengsong
parents:
diff changeset
   102
Without formal proofs we could not have found out this so soon.
Chengsong
parents:
diff changeset
   103
Chengsong
parents:
diff changeset
   104
Chengsong
parents:
diff changeset
   105
In the future, we plan to incorporate many extensions
Chengsong
parents:
diff changeset
   106
to this work.
Chengsong
parents:
diff changeset
   107
Chengsong
parents:
diff changeset
   108
\section{Future Work}
Chengsong
parents:
diff changeset
   109
The most obvious next-step is to implement the cubic bound and
Chengsong
parents:
diff changeset
   110
correctness of $\blexerStrong$
Chengsong
parents:
diff changeset
   111
in chapter \ref{Cubic} formally.
Chengsong
parents:
diff changeset
   112
A cubic bound ($O(\llbracket r\rrbracket^c\cdot |s|)$) with respect to 
Chengsong
parents:
diff changeset
   113
regular expression size will fulfill
Chengsong
parents:
diff changeset
   114
the linear complexity claim made by Sulzmann and Lu.
Chengsong
parents:
diff changeset
   115
Chengsong
parents:
diff changeset
   116
With a linear size bound theoretically, the next challenge would be
Chengsong
parents:
diff changeset
   117
to generate code that are competitive with respect to commercial
Chengsong
parents:
diff changeset
   118
matchers. For that a lot of optimisations are needed.
Chengsong
parents:
diff changeset
   119
We aim to  integrate the zipper data structure into our lexer.
Chengsong
parents:
diff changeset
   120
The idea is very simple: using a zipper with multiple focuses
Chengsong
parents:
diff changeset
   121
just like Darragh \cite{Darragh2020} did, we could represent
Chengsong
parents:
diff changeset
   122
\[
Chengsong
parents:
diff changeset
   123
	x\cdot r + y \cdot r + \ldots
Chengsong
parents:
diff changeset
   124
\]
Chengsong
parents:
diff changeset
   125
as 
Chengsong
parents:
diff changeset
   126
\[
Chengsong
parents:
diff changeset
   127
	(x+y + \ldots) \cdot r.
Chengsong
parents:
diff changeset
   128
\]
Chengsong
parents:
diff changeset
   129
This would greatly reduce the amount of space needed
Chengsong
parents:
diff changeset
   130
when we have many terms like $x\cdot r$.
Chengsong
parents:
diff changeset
   131
Some initial results have been obtained, but 
Chengsong
parents:
diff changeset
   132
significant changes are needed to make sure 
Chengsong
parents:
diff changeset
   133
that the lexer output conforms to the POSIX standard. 
Chengsong
parents:
diff changeset
   134
We aim to make use of Okui and Suzuki's labelling
Chengsong
parents:
diff changeset
   135
system \cite{Okui10} to ensure regular expressions represented as zippers
Chengsong
parents:
diff changeset
   136
always maintain the POSIX orderings.
Chengsong
parents:
diff changeset
   137
%Formalising correctness of 
Chengsong
parents:
diff changeset
   138
%such a lexer using zippers will probably require using an imperative
Chengsong
parents:
diff changeset
   139
%framework with separation logic as it involves
Chengsong
parents:
diff changeset
   140
%mutable data structures, which is also interesting to look into.  
Chengsong
parents:
diff changeset
   141
Chengsong
parents:
diff changeset
   142
To further optimise the algorithm, 
Chengsong
parents:
diff changeset
   143
we plan to add a deduplicate function that tells 
Chengsong
parents:
diff changeset
   144
whether two regular expressions denote the same 
Chengsong
parents:
diff changeset
   145
language using
Chengsong
parents:
diff changeset
   146
an efficient and verified equivalence checker like \cite{Krauss2011}.
Chengsong
parents:
diff changeset
   147
In this way, the internal data structures can be pruned more aggressively, 
Chengsong
parents:
diff changeset
   148
leading to better simplifications and ultimately faster algorithms.
Chengsong
parents:
diff changeset
   149
Chengsong
parents:
diff changeset
   150
Traditional automaton approaches can be sped up by compiling
Chengsong
parents:
diff changeset
   151
multiple rules into the same automaton. This has been done
Chengsong
parents:
diff changeset
   152
in \cite{Kumar2006} and \cite{Becchi08}, for example.
Chengsong
parents:
diff changeset
   153
Currently our lexer only deals with a single regular expression each time.
Chengsong
parents:
diff changeset
   154
Putting multiple regular expressions might open up more
Chengsong
parents:
diff changeset
   155
possibilities of simplifications.
Chengsong
parents:
diff changeset
   156
Chengsong
parents:
diff changeset
   157
As already mentioned in chapter \ref{RelatedWork},
Chengsong
parents:
diff changeset
   158
reducing the number of memory accesses can accelerate the 
Chengsong
parents:
diff changeset
   159
matching speed. It would be interesting to see the memory
Chengsong
parents:
diff changeset
   160
bandwidth of our derivative-based matching and improve accordingly.
Chengsong
parents:
diff changeset
   161
Chengsong
parents:
diff changeset
   162
Memoization has been used frequently in lexing and parsing to achieve
Chengsong
parents:
diff changeset
   163
better complexity results, for example in \cite{Reps1998},
Chengsong
parents:
diff changeset
   164
\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
Chengsong
parents:
diff changeset
   165
We plan to explore the performance enhancements by memoization in our algorithm
Chengsong
parents:
diff changeset
   166
in a correctness-preserving way.
Chengsong
parents:
diff changeset
   167
The Monadic data refinement technique that Lammich and Tuerk used
Chengsong
parents:
diff changeset
   168
in \cite{Lammich2012} to optimise the Hopcroft automaton minimisation
Chengsong
parents:
diff changeset
   169
algorithm seem quite interesting.
Chengsong
parents:
diff changeset
   170
We aim to learn from their refinement framework
Chengsong
parents:
diff changeset
   171
which generates high performance code with proofs
Chengsong
parents:
diff changeset
   172
that can be broken down into uncluttered small steps.
Chengsong
parents:
diff changeset
   173
Chengsong
parents:
diff changeset
   174
Chengsong
parents:
diff changeset
   175
Chengsong
parents:
diff changeset
   176
Chengsong
parents:
diff changeset
   177
%We establish the correctness 
Chengsong
parents:
diff changeset
   178
%claim made by them about the 
Chengsong
parents:
diff changeset
   179
%, which were only proven with pencil and paper.
Chengsong
parents:
diff changeset
   180
Chengsong
parents:
diff changeset
   181
Chengsong
parents:
diff changeset
   182
%\section{Work on Back-References}
Chengsong
parents:
diff changeset
   183
%We introduced regular expressions with back-references
Chengsong
parents:
diff changeset
   184
%in chapter \ref{Introduction}.
Chengsong
parents:
diff changeset
   185
%We adopt the common practice of calling them rewbrs (Regular Expressions
Chengsong
parents:
diff changeset
   186
%With Back References) in this section for brevity.
Chengsong
parents:
diff changeset
   187
%It has been shown by Aho \cite{Aho1990}
Chengsong
parents:
diff changeset
   188
%that the k-vertex cover problem can be reduced
Chengsong
parents:
diff changeset
   189
%to the problem of rewbrs matching, and is therefore NP-complete.
Chengsong
parents:
diff changeset
   190
%Given the depth of the problem, the progress made at the full generality
Chengsong
parents:
diff changeset
   191
%of arbitrary rewbrs matching has been slow, with
Chengsong
parents:
diff changeset
   192
%theoretical work on them being 
Chengsong
parents:
diff changeset
   193
%fairly recent. 
Chengsong
parents:
diff changeset
   194
%
Chengsong
parents:
diff changeset
   195
%Campaneu et al. studied regexes with back-references
Chengsong
parents:
diff changeset
   196
%in the context of formal languages theory in 
Chengsong
parents:
diff changeset
   197
%their 2003 work \cite{campeanu2003formal}.
Chengsong
parents:
diff changeset
   198
%They devised a pumping lemma for Perl-like regexes, 
Chengsong
parents:
diff changeset
   199
%proving that the langugages denoted by them
Chengsong
parents:
diff changeset
   200
%is  properly  contained in context-sensitive
Chengsong
parents:
diff changeset
   201
%languages.
Chengsong
parents:
diff changeset
   202
%More interesting questions such as 
Chengsong
parents:
diff changeset
   203
%whether the language denoted by Perl-like regexes
Chengsong
parents:
diff changeset
   204
%can express palindromes ($\{w \mid w = w^R\}$)
Chengsong
parents:
diff changeset
   205
%are discussed in \cite{campeanu2009patterns} 
Chengsong
parents:
diff changeset
   206
%and \cite{CAMPEANU2009Intersect}.
Chengsong
parents:
diff changeset
   207
%Freydenberger \cite{Frey2013} investigated the 
Chengsong
parents:
diff changeset
   208
%expressive power of back-references. He showed several 
Chengsong
parents:
diff changeset
   209
%undecidability and decriptional complexity results 
Chengsong
parents:
diff changeset
   210
%of back-references, concluding that they add
Chengsong
parents:
diff changeset
   211
%great power to certain programming tasks but are difficult to harness.
Chengsong
parents:
diff changeset
   212
%An interesting question would then be
Chengsong
parents:
diff changeset
   213
%whether we can add restrictions to them, 
Chengsong
parents:
diff changeset
   214
%so that they become expressive enough for practical use such
Chengsong
parents:
diff changeset
   215
%as html files, but not too powerful.
Chengsong
parents:
diff changeset
   216
%Freydenberger and Schmid \cite{FREYDENBERGER20191}
Chengsong
parents:
diff changeset
   217
%introduced the notion of deterministic
Chengsong
parents:
diff changeset
   218
%regular expressions with back-references to achieve
Chengsong
parents:
diff changeset
   219
%a better balance between expressiveness and tractability.
Chengsong
parents:
diff changeset
   220
%
Chengsong
parents:
diff changeset
   221
%
Chengsong
parents:
diff changeset
   222
%Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
Chengsong
parents:
diff changeset
   223
%investigated the time complexity of different variants
Chengsong
parents:
diff changeset
   224
%of back-references.
Chengsong
parents:
diff changeset
   225
%We are not aware of any work that uses derivatives on back-references.
Chengsong
parents:
diff changeset
   226
%
Chengsong
parents:
diff changeset
   227
%See \cite{BERGLUND2022} for a survey
Chengsong
parents:
diff changeset
   228
%of these works and comparison between different
Chengsong
parents:
diff changeset
   229
%flavours  of back-references syntax (e.g. whether references can be circular, 
Chengsong
parents:
diff changeset
   230
%can labels be repeated etc.).
Chengsong
parents:
diff changeset
   231
%
Chengsong
parents:
diff changeset
   232
%
Chengsong
parents:
diff changeset
   233
%\subsection{Matchers and Lexers with Mechanised Proofs}
Chengsong
parents:
diff changeset
   234
%We are aware
Chengsong
parents:
diff changeset
   235
%of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Chengsong
parents:
diff changeset
   236
%Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
Chengsong
parents:
diff changeset
   237
%of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
Chengsong
parents:
diff changeset
   238
%in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Chengsong
parents:
diff changeset
   239
%Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
Chengsong
parents:
diff changeset
   240
% 
Chengsong
parents:
diff changeset
   241
%\subsection{Static Analysis of Evil Regex Patterns} 
Chengsong
parents:
diff changeset
   242
%When a regular expression does not behave as intended,
Chengsong
parents:
diff changeset
   243
%people usually try to rewrite the regex to some equivalent form
Chengsong
parents:
diff changeset
   244
%or they try to avoid the possibly problematic patterns completely,
Chengsong
parents:
diff changeset
   245
%for which many false positives exist\parencite{Davis18}.
Chengsong
parents:
diff changeset
   246
%Animated tools to "debug" regular expressions such as
Chengsong
parents:
diff changeset
   247
% \parencite{regexploit2021} \parencite{regex101} are also popular.
Chengsong
parents:
diff changeset
   248
%We are also aware of static analysis work on regular expressions that
Chengsong
parents:
diff changeset
   249
%aims to detect potentially expoential regex patterns. Rathnayake and Thielecke 
Chengsong
parents:
diff changeset
   250
%\parencite{Rathnayake2014StaticAF} proposed an algorithm
Chengsong
parents:
diff changeset
   251
%that detects regular expressions triggering exponential
Chengsong
parents:
diff changeset
   252
%behavious on backtracking matchers.
Chengsong
parents:
diff changeset
   253
%Weideman \parencite{Weideman2017Static} came up with 
Chengsong
parents:
diff changeset
   254
%non-linear polynomial worst-time estimates
Chengsong
parents:
diff changeset
   255
%for regexes, attack string that exploit the worst-time 
Chengsong
parents:
diff changeset
   256
%scenario, and "attack automata" that generates
Chengsong
parents:
diff changeset
   257
%attack strings.
Chengsong
parents:
diff changeset
   258
%
Chengsong
parents:
diff changeset
   259
%
Chengsong
parents:
diff changeset
   260
%
Chengsong
parents:
diff changeset
   261
%
Chengsong
parents:
diff changeset
   262
%
Chengsong
parents:
diff changeset
   263
%
Chengsong
parents:
diff changeset
   264
%Thanks to our theorem-prover-friendly approach,
Chengsong
parents:
diff changeset
   265
%we believe that 
Chengsong
parents:
diff changeset
   266
%this finiteness bound can be improved to a bound
Chengsong
parents:
diff changeset
   267
%linear to input and
Chengsong
parents:
diff changeset
   268
%cubic to the regular expression size using a technique by
Chengsong
parents:
diff changeset
   269
%Antimirov\cite{Antimirov95}.
Chengsong
parents:
diff changeset
   270
%Once formalised, this would be a guarantee for the absence of all super-linear behavious.
Chengsong
parents:
diff changeset
   271
%We are working out the
Chengsong
parents:
diff changeset
   272
%details.
Chengsong
parents:
diff changeset
   273
%
Chengsong
parents:
diff changeset
   274
% 
Chengsong
parents:
diff changeset
   275
%To our best knowledge, no lexing libraries using Brzozowski derivatives
Chengsong
parents:
diff changeset
   276
%have similar complexity-related bounds, 
Chengsong
parents:
diff changeset
   277
%and claims about running time are usually speculative and backed by empirical
Chengsong
parents:
diff changeset
   278
%evidence on a few test cases.
Chengsong
parents:
diff changeset
   279
%If a matching or lexing algorithm
Chengsong
parents:
diff changeset
   280
%does not come with certain basic complexity related 
Chengsong
parents:
diff changeset
   281
%guarantees (for examaple the internal data structure size
Chengsong
parents:
diff changeset
   282
%does not grow indefinitely), 
Chengsong
parents:
diff changeset
   283
%then they cannot claim with confidence having solved the problem
Chengsong
parents:
diff changeset
   284
%of catastrophic backtracking.
Chengsong
parents:
diff changeset
   285
%
Chengsong
parents:
diff changeset
   286
%
Chengsong
parents:
diff changeset
   287
%\section{Optimisations}
Chengsong
parents:
diff changeset
   288
%Perhaps the biggest problem that prevents derivative-based lexing 
Chengsong
parents:
diff changeset
   289
%from being more widely adopted
Chengsong
parents:
diff changeset
   290
%is that they are still not blindingly fast in practice, unable to 
Chengsong
parents:
diff changeset
   291
%reach throughputs like gigabytes per second, which the commercial
Chengsong
parents:
diff changeset
   292
%regular expression matchers such as Snort and Bro are able to achieve.
Chengsong
parents:
diff changeset
   293
%For our algorithm to be more attractive for practical use, we
Chengsong
parents:
diff changeset
   294
%need more correctness-preserving optimisations.
Chengsong
parents:
diff changeset
   295
%%So far our focus has been mainly on the bit-coded algorithm,
Chengsong
parents:
diff changeset
   296
%%but the injection-based lexing algorithm 
Chengsong
parents:
diff changeset
   297
%%could also be sped up in various ways:
Chengsong
parents:
diff changeset
   298
%%
Chengsong
parents:
diff changeset
   299
%One such optimisation is defining string derivatives
Chengsong
parents:
diff changeset
   300
%directly, without recursively decomposing them into 
Chengsong
parents:
diff changeset
   301
%character-by-character derivatives. For example, instead of
Chengsong
parents:
diff changeset
   302
%replacing $(r_1 + r_2)\backslash (c::cs)$ by
Chengsong
parents:
diff changeset
   303
%$((r_1 + r_2)\backslash c)\backslash cs$, we rather
Chengsong
parents:
diff changeset
   304
%calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
Chengsong
parents:
diff changeset
   305
%
Chengsong
parents:
diff changeset
   306
%
Chengsong
parents:
diff changeset
   307
%\subsection{Derivatives and Zippers}
Chengsong
parents:
diff changeset
   308
%Zipper is a data structure designed to focus on 
Chengsong
parents:
diff changeset
   309
%and navigate between local parts of a tree.
Chengsong
parents:
diff changeset
   310
%The idea is that often operations on a large tree only deal with
Chengsong
parents:
diff changeset
   311
%local regions each time.
Chengsong
parents:
diff changeset
   312
%It was first formally described by Huet \cite{HuetZipper}.
Chengsong
parents:
diff changeset
   313
%Typical applications of zippers involve text editor buffers
Chengsong
parents:
diff changeset
   314
%and proof system databases.
Chengsong
parents:
diff changeset
   315
%In our setting, the idea is to compactify the representation
Chengsong
parents:
diff changeset
   316
%of derivatives with zippers, thereby making our algorithm faster.
Chengsong
parents:
diff changeset
   317
%We draw our inspirations from several works on parsing, derivatives
Chengsong
parents:
diff changeset
   318
%and zippers.
Chengsong
parents:
diff changeset
   319
%
Chengsong
parents:
diff changeset
   320
%Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
Chengsong
parents:
diff changeset
   321
%\cite{Zippy2020}.
Chengsong
parents:
diff changeset
   322
%They adopted zippers to improve the speed, and argued that the runtime
Chengsong
parents:
diff changeset
   323
%complexity of their algorithm was linear with respect to the input string.
Chengsong
parents:
diff changeset
   324
%
Chengsong
parents:
diff changeset
   325
%The idea of using Brzozowski derivatives on general context-free 
Chengsong
parents:
diff changeset
   326
%parsing was first implemented
Chengsong
parents:
diff changeset
   327
%by Might et al. \ref{Might2011}. 
Chengsong
parents:
diff changeset
   328
%They used memoization and fixpoint construction to eliminate infinite recursion,
Chengsong
parents:
diff changeset
   329
%which is a major problem for using derivatives on context-free grammars.
Chengsong
parents:
diff changeset
   330
%The initial version was quite slow----exponential on the size of the input.
Chengsong
parents:
diff changeset
   331
%Adams et al. \ref{Adams2016} improved the speed and argued that their version
Chengsong
parents:
diff changeset
   332
%was cubic with respect to the input.
Chengsong
parents:
diff changeset
   333
%Darragh and Adams \cite{10.1145/3408990} further improved the performance
Chengsong
parents:
diff changeset
   334
%by using zippers in an innovative way--their zippers had multiple focuses
Chengsong
parents:
diff changeset
   335
%instead of just one in a traditional zipper to handle ambiguity.
Chengsong
parents:
diff changeset
   336
%Their algorithm was not formalised, though.
Chengsong
parents:
diff changeset
   337
%
Chengsong
parents:
diff changeset
   338
%We aim to  integrate the zipper data structure into our lexer.
Chengsong
parents:
diff changeset
   339
%The idea is very simple: using a zipper with multiple focuses
Chengsong
parents:
diff changeset
   340
%just like Darragh did, we could represent
Chengsong
parents:
diff changeset
   341
%\[
Chengsong
parents:
diff changeset
   342
%	x\cdot r + y \cdot r + \ldots
Chengsong
parents:
diff changeset
   343
%\]
Chengsong
parents:
diff changeset
   344
%as 
Chengsong
parents:
diff changeset
   345
%\[
Chengsong
parents:
diff changeset
   346
%	(x+y + \ldots) \cdot r.
Chengsong
parents:
diff changeset
   347
%\]
Chengsong
parents:
diff changeset
   348
%This would greatly reduce the amount of space needed
Chengsong
parents:
diff changeset
   349
%when we have many terms like $x\cdot r$.
Chengsong
parents:
diff changeset
   350
%
Chengsong
parents:
diff changeset
   351
%into the current lexer. 
Chengsong
parents:
diff changeset
   352
%This allows the lexer to not traverse the entire 
Chengsong
parents:
diff changeset
   353
%derivative tree generated
Chengsong
parents:
diff changeset
   354
%in every intermediate step, but only a small segment 
Chengsong
parents:
diff changeset
   355
%that is currently active. 
Chengsong
parents:
diff changeset
   356
%
Chengsong
parents:
diff changeset
   357
%Some initial results have been obtained, but more care needs to be taken to make sure 
Chengsong
parents:
diff changeset
   358
%that the lexer output conforms to the POSIX standard. Formalising correctness of 
Chengsong
parents:
diff changeset
   359
%such a lexer using zippers will probably require using an imperative
Chengsong
parents:
diff changeset
   360
%framework with separation logic as it involves
Chengsong
parents:
diff changeset
   361
%mutable data structures, which is also interesting to look into.  
Chengsong
parents:
diff changeset
   362
%
Chengsong
parents:
diff changeset
   363
%To further optimise the algorithm, 
Chengsong
parents:
diff changeset
   364
%I plan to add a ``deduplicate'' function that tells 
Chengsong
parents:
diff changeset
   365
%whether two regular expressions denote the same 
Chengsong
parents:
diff changeset
   366
%language using
Chengsong
parents:
diff changeset
   367
%an efficient and verified equivalence checker.
Chengsong
parents:
diff changeset
   368
%In this way, the internal data structures can be pruned more aggressively, 
Chengsong
parents:
diff changeset
   369
%leading to better simplifications and ultimately faster algorithms.
Chengsong
parents:
diff changeset
   370
%
Chengsong
parents:
diff changeset
   371
%
Chengsong
parents:
diff changeset
   372
%Some initial results 
Chengsong
parents:
diff changeset
   373
%We first give a brief introduction to what zippers are,
Chengsong
parents:
diff changeset
   374
%and other works
Chengsong
parents:
diff changeset
   375
%that apply zippers to derivatives
Chengsong
parents:
diff changeset
   376
%When dealing with large trees, it would be a waste to 
Chengsong
parents:
diff changeset
   377
%traverse the entire tree if
Chengsong
parents:
diff changeset
   378
%the operation only
Chengsong
parents:
diff changeset
   379
%involves a small fraction of it.
Chengsong
parents:
diff changeset
   380
%The idea is to put the focus on that subtree, turning other parts
Chengsong
parents:
diff changeset
   381
%of the tree into a context
Chengsong
parents:
diff changeset
   382
%
Chengsong
parents:
diff changeset
   383
%
Chengsong
parents:
diff changeset
   384
%One observation about our derivative-based lexing algorithm is that
Chengsong
parents:
diff changeset
   385
%the derivative operation sometimes traverses the entire regular expression
Chengsong
parents:
diff changeset
   386
%unnecessarily:
Chengsong
parents:
diff changeset
   387