ChengsongTanPhdThesis/Chapters/Future.tex
author Chengsong
Wed, 23 Aug 2023 03:02:31 +0100
changeset 668 3831621d7b14
parent 666 6da4516ea87d
permissions -rw-r--r--
added technical Overview section, almost done introduction
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
637
Chengsong
parents:
diff changeset
     1
% Chapter Template
Chengsong
parents:
diff changeset
     2
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
     3
\chapter{Conclusion and Future Work} % Main chapter title
637
Chengsong
parents:
diff changeset
     4
Chengsong
parents:
diff changeset
     5
\label{Future} 
Chengsong
parents:
diff changeset
     6
Chengsong
parents:
diff changeset
     7
In this thesis, in order to solve the ReDoS attacks
Chengsong
parents:
diff changeset
     8
once and for all, we have set out to formalise the correctness
Chengsong
parents:
diff changeset
     9
proof of Sulzmann and Lu's
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    10
lexing algorithm with aggressive simplifications \cite{Sulzmann2014}.
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    11
We formalised our proofs in the Isabelle/HOL theorem prover.
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    12
We have fixed some inefficiencies and a bug in  their original simplification
637
Chengsong
parents:
diff changeset
    13
algorithm, and established the correctness by proving
Chengsong
parents:
diff changeset
    14
that our algorithm outputs the same result as the original bit-coded
Chengsong
parents:
diff changeset
    15
lexer without simplifications (whose correctness was established in 
Chengsong
parents:
diff changeset
    16
previous work by Ausaf et al. in
Chengsong
parents:
diff changeset
    17
\cite{Ausaf}  and \cite{AusafDyckhoffUrban2016}).
Chengsong
parents:
diff changeset
    18
The proof technique used in \cite{Ausaf} does not work in our case
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    19
because the simplification function messes with the structure of simplified
637
Chengsong
parents:
diff changeset
    20
regular expressions.
Chengsong
parents:
diff changeset
    21
Despite having to try out several workarounds and being
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    22
stuck for months looking for proofs, 
637
Chengsong
parents:
diff changeset
    23
we were delighted to have discovered the simple yet
Chengsong
parents:
diff changeset
    24
effective proof method
Chengsong
parents:
diff changeset
    25
by modelling individual
Chengsong
parents:
diff changeset
    26
simplification steps as small-step rewriting rules and proving
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    27
equivalence between terms linked by these rewrite rules.
637
Chengsong
parents:
diff changeset
    28
Chengsong
parents:
diff changeset
    29
In addition, we have proved a formal size bound on the 
Chengsong
parents:
diff changeset
    30
regular expressions. The technique was by establishing
Chengsong
parents:
diff changeset
    31
a ``closed form'' informally
Chengsong
parents:
diff changeset
    32
described by
Chengsong
parents:
diff changeset
    33
Murugesan and Shanmuga Sundaram \cite{Murugesan2014} 
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    34
for compound derivatives
637
Chengsong
parents:
diff changeset
    35
and using those closed forms to control the size.
Chengsong
parents:
diff changeset
    36
The Isabelle/HOL code for our formalisation can be 
Chengsong
parents:
diff changeset
    37
found at
Chengsong
parents:
diff changeset
    38
\begin{center}
Chengsong
parents:
diff changeset
    39
\url{https://github.com/hellotommmy/posix}
Chengsong
parents:
diff changeset
    40
\end{center}
Chengsong
parents:
diff changeset
    41
\noindent
Chengsong
parents:
diff changeset
    42
Thanks to our theorem-prover-friendly approach,
Chengsong
parents:
diff changeset
    43
we believe that 
Chengsong
parents:
diff changeset
    44
this finiteness bound can be improved to a bound
Chengsong
parents:
diff changeset
    45
linear to input and
Chengsong
parents:
diff changeset
    46
cubic to the regular expression size using a technique by
Chengsong
parents:
diff changeset
    47
Antimirov\cite{Antimirov95}.
Chengsong
parents:
diff changeset
    48
Once formalised, this would be a guarantee for the absence of all super-linear behavious.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    49
We are yet to work out the
637
Chengsong
parents:
diff changeset
    50
details.
Chengsong
parents:
diff changeset
    51
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    52
Our formalisation is approximately 7500 lines of code. Slightly more than half of this concerns the finiteness bound obtained in Chapter 5. This slight "bloat" is because we had to repeat many definitions for the rrexp datatype. However, we think we would not have been able to formalise the quite intricate reasoning involving rrexps with annotated regular expressions because we would have to carry around the bit-sequences (that are of course necessary in the algorithm) but which do not contribute to the size bound of the calculated derivatives.
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    53
637
Chengsong
parents:
diff changeset
    54
Chengsong
parents:
diff changeset
    55
To our best knowledge, no lexing libraries using Brzozowski derivatives
Chengsong
parents:
diff changeset
    56
have similar complexity-related bounds, 
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    57
and claims about running time are usually speculative and backed up only by empirical
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    58
evidence on some test cases.
637
Chengsong
parents:
diff changeset
    59
If a matching or lexing algorithm
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    60
does not come with complexity related 
637
Chengsong
parents:
diff changeset
    61
guarantees (for examaple the internal data structure size
Chengsong
parents:
diff changeset
    62
does not grow indefinitely), 
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    63
then one cannot claim with confidence of having solved the problem
637
Chengsong
parents:
diff changeset
    64
of catastrophic backtracking.
Chengsong
parents:
diff changeset
    65
Chengsong
parents:
diff changeset
    66
We believe our proof method is not specific to this
Chengsong
parents:
diff changeset
    67
algorithm, and intend to extend this approach
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    68
to prove the correctness of the faster version
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    69
of the lexer proposed in chapter \cite{Cubic}. The closed
637
Chengsong
parents:
diff changeset
    70
forms can then be re-used as well. Together
Chengsong
parents:
diff changeset
    71
with the idea from Antimirov \cite{Antimirov95}
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    72
we plan to reduce the size bound to be just polynomial
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    73
with respect to the size of the regular expressions.
637
Chengsong
parents:
diff changeset
    74
%We have tried out a variety of other optimisations 
Chengsong
parents:
diff changeset
    75
%to make lexing/parsing with derivatives faster,
Chengsong
parents:
diff changeset
    76
%for example in \cite{Edelmann2021}, \cite{Darragh2020}
Chengsong
parents:
diff changeset
    77
%and \cite{Verbatimpp}.
Chengsong
parents:
diff changeset
    78
Chengsong
parents:
diff changeset
    79
We have learned quite a few lessons in this process.
Chengsong
parents:
diff changeset
    80
As simple as the end result may seem,
Chengsong
parents:
diff changeset
    81
coming up with the initial proof idea with rewriting relations
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    82
was the hardest for us, as it required a deep understanding of what information
637
Chengsong
parents:
diff changeset
    83
is preserved during the simplification process.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    84
There the ideas given by Sulzmann and Lu and Ausaf et al. were of no use.
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    85
%Having the first formalisation gives us much more confidence----it took
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    86
%us much less time to obtain the closed forms and size bound, although they
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    87
%are more involved than the correctness proof.
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    88
Of course this has already been shown many times,
637
Chengsong
parents:
diff changeset
    89
the usefulness of formal approaches cannot be overstated:
Chengsong
parents:
diff changeset
    90
they not only allow us to find bugs in Sulzmann and Lu's simplification functions,
Chengsong
parents:
diff changeset
    91
but also helped us set up realistic expectations about performance
Chengsong
parents:
diff changeset
    92
of algorithms. We believed in the beginning that
Chengsong
parents:
diff changeset
    93
the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
Chengsong
parents:
diff changeset
    94
was already able to achieve the linear bound claimed by
Chengsong
parents:
diff changeset
    95
Sulzmann and Lu.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    96
We then attempted to prove that the size $\llbracket \blexersimp\; r \; s\rrbracket$ 
637
Chengsong
parents:
diff changeset
    97
is $O(\llbracket r\rrbracket^c\cdot |s|)$,
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
    98
where $c$ is a small constant, making
637
Chengsong
parents:
diff changeset
    99
$\llbracket r\rrbracket^c$ a small constant factor.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   100
We were then quite surprised that this did not go through despite a lot of effort.
637
Chengsong
parents:
diff changeset
   101
This led us to discover the example where $\llbracket r\rrbracket^c$
Chengsong
parents:
diff changeset
   102
can in fact be exponentially large in chapter \ref{Finite}.
Chengsong
parents:
diff changeset
   103
We therefore learned the necessity to introduce the stronger simplifications
Chengsong
parents:
diff changeset
   104
in chapter \ref{Cubic}.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   105
Without formal proofs we would not have found out this so soon,
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   106
if at all.
637
Chengsong
parents:
diff changeset
   107
Chengsong
parents:
diff changeset
   108
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   109
%In the future, we plan to incorporate many extensions
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   110
%to this work.
637
Chengsong
parents:
diff changeset
   111
Chengsong
parents:
diff changeset
   112
\section{Future Work}
Chengsong
parents:
diff changeset
   113
The most obvious next-step is to implement the cubic bound and
Chengsong
parents:
diff changeset
   114
correctness of $\blexerStrong$
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   115
in chapter \ref{Cubic}.
637
Chengsong
parents:
diff changeset
   116
A cubic bound ($O(\llbracket r\rrbracket^c\cdot |s|)$) with respect to 
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   117
regular expression size will get us one step closer to fulfilling
637
Chengsong
parents:
diff changeset
   118
the linear complexity claim made by Sulzmann and Lu.
Chengsong
parents:
diff changeset
   119
Chengsong
parents:
diff changeset
   120
With a linear size bound theoretically, the next challenge would be
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   121
to generate code that is competitive with respect to 
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   122
matchers based on DFAs or NFAs. For that a lot of optimisations are needed.
637
Chengsong
parents:
diff changeset
   123
We aim to  integrate the zipper data structure into our lexer.
Chengsong
parents:
diff changeset
   124
The idea is very simple: using a zipper with multiple focuses
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   125
just like Darragh \cite{Darragh2020} did in their parsing algorithm, we could represent
637
Chengsong
parents:
diff changeset
   126
\[
Chengsong
parents:
diff changeset
   127
	x\cdot r + y \cdot r + \ldots
Chengsong
parents:
diff changeset
   128
\]
Chengsong
parents:
diff changeset
   129
as 
Chengsong
parents:
diff changeset
   130
\[
Chengsong
parents:
diff changeset
   131
	(x+y + \ldots) \cdot r.
Chengsong
parents:
diff changeset
   132
\]
Chengsong
parents:
diff changeset
   133
This would greatly reduce the amount of space needed
Chengsong
parents:
diff changeset
   134
when we have many terms like $x\cdot r$.
Chengsong
parents:
diff changeset
   135
Some initial results have been obtained, but 
Chengsong
parents:
diff changeset
   136
significant changes are needed to make sure 
Chengsong
parents:
diff changeset
   137
that the lexer output conforms to the POSIX standard. 
Chengsong
parents:
diff changeset
   138
We aim to make use of Okui and Suzuki's labelling
Chengsong
parents:
diff changeset
   139
system \cite{Okui10} to ensure regular expressions represented as zippers
Chengsong
parents:
diff changeset
   140
always maintain the POSIX orderings.
Chengsong
parents:
diff changeset
   141
%Formalising correctness of 
Chengsong
parents:
diff changeset
   142
%such a lexer using zippers will probably require using an imperative
Chengsong
parents:
diff changeset
   143
%framework with separation logic as it involves
Chengsong
parents:
diff changeset
   144
%mutable data structures, which is also interesting to look into.  
Chengsong
parents:
diff changeset
   145
Chengsong
parents:
diff changeset
   146
To further optimise the algorithm, 
Chengsong
parents:
diff changeset
   147
we plan to add a deduplicate function that tells 
Chengsong
parents:
diff changeset
   148
whether two regular expressions denote the same 
Chengsong
parents:
diff changeset
   149
language using
Chengsong
parents:
diff changeset
   150
an efficient and verified equivalence checker like \cite{Krauss2011}.
Chengsong
parents:
diff changeset
   151
In this way, the internal data structures can be pruned more aggressively, 
Chengsong
parents:
diff changeset
   152
leading to better simplifications and ultimately faster algorithms.
Chengsong
parents:
diff changeset
   153
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   154
Traditional automata approaches can be sped up by compiling
637
Chengsong
parents:
diff changeset
   155
multiple rules into the same automaton. This has been done
Chengsong
parents:
diff changeset
   156
in \cite{Kumar2006} and \cite{Becchi08}, for example.
Chengsong
parents:
diff changeset
   157
Currently our lexer only deals with a single regular expression each time.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   158
Extending this to multiple regular expressions might open up more
637
Chengsong
parents:
diff changeset
   159
possibilities of simplifications.
Chengsong
parents:
diff changeset
   160
Chengsong
parents:
diff changeset
   161
As already mentioned in chapter \ref{RelatedWork},
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   162
reducing the number of memory accesses can also accelerate the 
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   163
matching speed. It would be interesting to study the memory
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   164
bandwidth of our derivative-based matching algorithm and improve accordingly.
637
Chengsong
parents:
diff changeset
   165
Chengsong
parents:
diff changeset
   166
Memoization has been used frequently in lexing and parsing to achieve
Chengsong
parents:
diff changeset
   167
better complexity results, for example in \cite{Reps1998},
Chengsong
parents:
diff changeset
   168
\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
Chengsong
parents:
diff changeset
   169
We plan to explore the performance enhancements by memoization in our algorithm
Chengsong
parents:
diff changeset
   170
in a correctness-preserving way.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   171
The monadic data refinement technique that Lammich and Tuerk used
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   172
in \cite{Lammich2012} to optimise Hopcroft's automaton minimisation
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   173
algorithm seems also quite relevant for such an enterprise.
637
Chengsong
parents:
diff changeset
   174
We aim to learn from their refinement framework
Chengsong
parents:
diff changeset
   175
which generates high performance code with proofs
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 637
diff changeset
   176
that can be broken down into small steps.
637
Chengsong
parents:
diff changeset
   177
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 640
diff changeset
   178
Extending the Sulzmann and Lu's algorithm to parse
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 640
diff changeset
   179
more pcre regex constructs like lookahead, capture groups
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 640
diff changeset
   180
and back-references with proofs on basic properties like correctness
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 640
diff changeset
   181
seems useful, despite it cannot be made efficient. 
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 640
diff changeset
   182
Creating a correct and easy-to-prove version of $\blexersimp$
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 640
diff changeset
   183
seems an appealing next-step for both practice and theory.
637
Chengsong
parents:
diff changeset
   184
Chengsong
parents:
diff changeset
   185
Chengsong
parents:
diff changeset
   186
Chengsong
parents:
diff changeset
   187
%We establish the correctness 
Chengsong
parents:
diff changeset
   188
%claim made by them about the 
Chengsong
parents:
diff changeset
   189
%, which were only proven with pencil and paper.
Chengsong
parents:
diff changeset
   190
Chengsong
parents:
diff changeset
   191
Chengsong
parents:
diff changeset
   192
%\section{Work on Back-References}
Chengsong
parents:
diff changeset
   193
%We introduced regular expressions with back-references
Chengsong
parents:
diff changeset
   194
%in chapter \ref{Introduction}.
Chengsong
parents:
diff changeset
   195
%We adopt the common practice of calling them rewbrs (Regular Expressions
Chengsong
parents:
diff changeset
   196
%With Back References) in this section for brevity.
Chengsong
parents:
diff changeset
   197
%It has been shown by Aho \cite{Aho1990}
Chengsong
parents:
diff changeset
   198
%that the k-vertex cover problem can be reduced
Chengsong
parents:
diff changeset
   199
%to the problem of rewbrs matching, and is therefore NP-complete.
Chengsong
parents:
diff changeset
   200
%Given the depth of the problem, the progress made at the full generality
Chengsong
parents:
diff changeset
   201
%of arbitrary rewbrs matching has been slow, with
Chengsong
parents:
diff changeset
   202
%theoretical work on them being 
Chengsong
parents:
diff changeset
   203
%fairly recent. 
Chengsong
parents:
diff changeset
   204
%
Chengsong
parents:
diff changeset
   205
%Campaneu et al. studied regexes with back-references
Chengsong
parents:
diff changeset
   206
%in the context of formal languages theory in 
Chengsong
parents:
diff changeset
   207
%their 2003 work \cite{campeanu2003formal}.
Chengsong
parents:
diff changeset
   208
%They devised a pumping lemma for Perl-like regexes, 
Chengsong
parents:
diff changeset
   209
%proving that the langugages denoted by them
Chengsong
parents:
diff changeset
   210
%is  properly  contained in context-sensitive
Chengsong
parents:
diff changeset
   211
%languages.
Chengsong
parents:
diff changeset
   212
%More interesting questions such as 
Chengsong
parents:
diff changeset
   213
%whether the language denoted by Perl-like regexes
Chengsong
parents:
diff changeset
   214
%can express palindromes ($\{w \mid w = w^R\}$)
Chengsong
parents:
diff changeset
   215
%are discussed in \cite{campeanu2009patterns} 
Chengsong
parents:
diff changeset
   216
%and \cite{CAMPEANU2009Intersect}.
Chengsong
parents:
diff changeset
   217
%Freydenberger \cite{Frey2013} investigated the 
Chengsong
parents:
diff changeset
   218
%expressive power of back-references. He showed several 
Chengsong
parents:
diff changeset
   219
%undecidability and decriptional complexity results 
Chengsong
parents:
diff changeset
   220
%of back-references, concluding that they add
Chengsong
parents:
diff changeset
   221
%great power to certain programming tasks but are difficult to harness.
Chengsong
parents:
diff changeset
   222
%An interesting question would then be
Chengsong
parents:
diff changeset
   223
%whether we can add restrictions to them, 
Chengsong
parents:
diff changeset
   224
%so that they become expressive enough for practical use such
Chengsong
parents:
diff changeset
   225
%as html files, but not too powerful.
Chengsong
parents:
diff changeset
   226
%Freydenberger and Schmid \cite{FREYDENBERGER20191}
Chengsong
parents:
diff changeset
   227
%introduced the notion of deterministic
Chengsong
parents:
diff changeset
   228
%regular expressions with back-references to achieve
Chengsong
parents:
diff changeset
   229
%a better balance between expressiveness and tractability.
Chengsong
parents:
diff changeset
   230
%
Chengsong
parents:
diff changeset
   231
%
Chengsong
parents:
diff changeset
   232
%Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
Chengsong
parents:
diff changeset
   233
%investigated the time complexity of different variants
Chengsong
parents:
diff changeset
   234
%of back-references.
Chengsong
parents:
diff changeset
   235
%We are not aware of any work that uses derivatives on back-references.
Chengsong
parents:
diff changeset
   236
%
Chengsong
parents:
diff changeset
   237
%See \cite{BERGLUND2022} for a survey
Chengsong
parents:
diff changeset
   238
%of these works and comparison between different
Chengsong
parents:
diff changeset
   239
%flavours  of back-references syntax (e.g. whether references can be circular, 
Chengsong
parents:
diff changeset
   240
%can labels be repeated etc.).
Chengsong
parents:
diff changeset
   241
%
Chengsong
parents:
diff changeset
   242
%
Chengsong
parents:
diff changeset
   243
%\subsection{Matchers and Lexers with Mechanised Proofs}
Chengsong
parents:
diff changeset
   244
%We are aware
Chengsong
parents:
diff changeset
   245
%of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Chengsong
parents:
diff changeset
   246
%Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
Chengsong
parents:
diff changeset
   247
%of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
Chengsong
parents:
diff changeset
   248
%in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Chengsong
parents:
diff changeset
   249
%Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
Chengsong
parents:
diff changeset
   250
% 
Chengsong
parents:
diff changeset
   251
%\subsection{Static Analysis of Evil Regex Patterns} 
Chengsong
parents:
diff changeset
   252
%When a regular expression does not behave as intended,
Chengsong
parents:
diff changeset
   253
%people usually try to rewrite the regex to some equivalent form
Chengsong
parents:
diff changeset
   254
%or they try to avoid the possibly problematic patterns completely,
Chengsong
parents:
diff changeset
   255
%for which many false positives exist\parencite{Davis18}.
Chengsong
parents:
diff changeset
   256
%Animated tools to "debug" regular expressions such as
Chengsong
parents:
diff changeset
   257
% \parencite{regexploit2021} \parencite{regex101} are also popular.
Chengsong
parents:
diff changeset
   258
%We are also aware of static analysis work on regular expressions that
Chengsong
parents:
diff changeset
   259
%aims to detect potentially expoential regex patterns. Rathnayake and Thielecke 
Chengsong
parents:
diff changeset
   260
%\parencite{Rathnayake2014StaticAF} proposed an algorithm
Chengsong
parents:
diff changeset
   261
%that detects regular expressions triggering exponential
Chengsong
parents:
diff changeset
   262
%behavious on backtracking matchers.
Chengsong
parents:
diff changeset
   263
%Weideman \parencite{Weideman2017Static} came up with 
Chengsong
parents:
diff changeset
   264
%non-linear polynomial worst-time estimates
Chengsong
parents:
diff changeset
   265
%for regexes, attack string that exploit the worst-time 
Chengsong
parents:
diff changeset
   266
%scenario, and "attack automata" that generates
Chengsong
parents:
diff changeset
   267
%attack strings.
Chengsong
parents:
diff changeset
   268
%
Chengsong
parents:
diff changeset
   269
%
Chengsong
parents:
diff changeset
   270
%
Chengsong
parents:
diff changeset
   271
%
Chengsong
parents:
diff changeset
   272
%
Chengsong
parents:
diff changeset
   273
%
Chengsong
parents:
diff changeset
   274
%Thanks to our theorem-prover-friendly approach,
Chengsong
parents:
diff changeset
   275
%we believe that 
Chengsong
parents:
diff changeset
   276
%this finiteness bound can be improved to a bound
Chengsong
parents:
diff changeset
   277
%linear to input and
Chengsong
parents:
diff changeset
   278
%cubic to the regular expression size using a technique by
Chengsong
parents:
diff changeset
   279
%Antimirov\cite{Antimirov95}.
Chengsong
parents:
diff changeset
   280
%Once formalised, this would be a guarantee for the absence of all super-linear behavious.
Chengsong
parents:
diff changeset
   281
%We are working out the
Chengsong
parents:
diff changeset
   282
%details.
Chengsong
parents:
diff changeset
   283
%
Chengsong
parents:
diff changeset
   284
% 
Chengsong
parents:
diff changeset
   285
%To our best knowledge, no lexing libraries using Brzozowski derivatives
Chengsong
parents:
diff changeset
   286
%have similar complexity-related bounds, 
Chengsong
parents:
diff changeset
   287
%and claims about running time are usually speculative and backed by empirical
Chengsong
parents:
diff changeset
   288
%evidence on a few test cases.
Chengsong
parents:
diff changeset
   289
%If a matching or lexing algorithm
Chengsong
parents:
diff changeset
   290
%does not come with certain basic complexity related 
Chengsong
parents:
diff changeset
   291
%guarantees (for examaple the internal data structure size
Chengsong
parents:
diff changeset
   292
%does not grow indefinitely), 
Chengsong
parents:
diff changeset
   293
%then they cannot claim with confidence having solved the problem
Chengsong
parents:
diff changeset
   294
%of catastrophic backtracking.
Chengsong
parents:
diff changeset
   295
%
Chengsong
parents:
diff changeset
   296
%
Chengsong
parents:
diff changeset
   297
%\section{Optimisations}
Chengsong
parents:
diff changeset
   298
%Perhaps the biggest problem that prevents derivative-based lexing 
Chengsong
parents:
diff changeset
   299
%from being more widely adopted
Chengsong
parents:
diff changeset
   300
%is that they are still not blindingly fast in practice, unable to 
Chengsong
parents:
diff changeset
   301
%reach throughputs like gigabytes per second, which the commercial
Chengsong
parents:
diff changeset
   302
%regular expression matchers such as Snort and Bro are able to achieve.
Chengsong
parents:
diff changeset
   303
%For our algorithm to be more attractive for practical use, we
Chengsong
parents:
diff changeset
   304
%need more correctness-preserving optimisations.
Chengsong
parents:
diff changeset
   305
%%So far our focus has been mainly on the bit-coded algorithm,
Chengsong
parents:
diff changeset
   306
%%but the injection-based lexing algorithm 
Chengsong
parents:
diff changeset
   307
%%could also be sped up in various ways:
Chengsong
parents:
diff changeset
   308
%%
Chengsong
parents:
diff changeset
   309
%One such optimisation is defining string derivatives
Chengsong
parents:
diff changeset
   310
%directly, without recursively decomposing them into 
Chengsong
parents:
diff changeset
   311
%character-by-character derivatives. For example, instead of
Chengsong
parents:
diff changeset
   312
%replacing $(r_1 + r_2)\backslash (c::cs)$ by
Chengsong
parents:
diff changeset
   313
%$((r_1 + r_2)\backslash c)\backslash cs$, we rather
Chengsong
parents:
diff changeset
   314
%calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
Chengsong
parents:
diff changeset
   315
%
Chengsong
parents:
diff changeset
   316
%
Chengsong
parents:
diff changeset
   317
%\subsection{Derivatives and Zippers}
Chengsong
parents:
diff changeset
   318
%Zipper is a data structure designed to focus on 
Chengsong
parents:
diff changeset
   319
%and navigate between local parts of a tree.
Chengsong
parents:
diff changeset
   320
%The idea is that often operations on a large tree only deal with
Chengsong
parents:
diff changeset
   321
%local regions each time.
Chengsong
parents:
diff changeset
   322
%It was first formally described by Huet \cite{HuetZipper}.
Chengsong
parents:
diff changeset
   323
%Typical applications of zippers involve text editor buffers
Chengsong
parents:
diff changeset
   324
%and proof system databases.
Chengsong
parents:
diff changeset
   325
%In our setting, the idea is to compactify the representation
Chengsong
parents:
diff changeset
   326
%of derivatives with zippers, thereby making our algorithm faster.
Chengsong
parents:
diff changeset
   327
%We draw our inspirations from several works on parsing, derivatives
Chengsong
parents:
diff changeset
   328
%and zippers.
Chengsong
parents:
diff changeset
   329
%
Chengsong
parents:
diff changeset
   330
%Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
Chengsong
parents:
diff changeset
   331
%\cite{Zippy2020}.
Chengsong
parents:
diff changeset
   332
%They adopted zippers to improve the speed, and argued that the runtime
Chengsong
parents:
diff changeset
   333
%complexity of their algorithm was linear with respect to the input string.
Chengsong
parents:
diff changeset
   334
%
Chengsong
parents:
diff changeset
   335
%The idea of using Brzozowski derivatives on general context-free 
Chengsong
parents:
diff changeset
   336
%parsing was first implemented
Chengsong
parents:
diff changeset
   337
%by Might et al. \ref{Might2011}. 
Chengsong
parents:
diff changeset
   338
%They used memoization and fixpoint construction to eliminate infinite recursion,
Chengsong
parents:
diff changeset
   339
%which is a major problem for using derivatives on context-free grammars.
Chengsong
parents:
diff changeset
   340
%The initial version was quite slow----exponential on the size of the input.
Chengsong
parents:
diff changeset
   341
%Adams et al. \ref{Adams2016} improved the speed and argued that their version
Chengsong
parents:
diff changeset
   342
%was cubic with respect to the input.
Chengsong
parents:
diff changeset
   343
%Darragh and Adams \cite{10.1145/3408990} further improved the performance
Chengsong
parents:
diff changeset
   344
%by using zippers in an innovative way--their zippers had multiple focuses
Chengsong
parents:
diff changeset
   345
%instead of just one in a traditional zipper to handle ambiguity.
Chengsong
parents:
diff changeset
   346
%Their algorithm was not formalised, though.
Chengsong
parents:
diff changeset
   347
%
Chengsong
parents:
diff changeset
   348
%We aim to  integrate the zipper data structure into our lexer.
Chengsong
parents:
diff changeset
   349
%The idea is very simple: using a zipper with multiple focuses
Chengsong
parents:
diff changeset
   350
%just like Darragh did, we could represent
Chengsong
parents:
diff changeset
   351
%\[
Chengsong
parents:
diff changeset
   352
%	x\cdot r + y \cdot r + \ldots
Chengsong
parents:
diff changeset
   353
%\]
Chengsong
parents:
diff changeset
   354
%as 
Chengsong
parents:
diff changeset
   355
%\[
Chengsong
parents:
diff changeset
   356
%	(x+y + \ldots) \cdot r.
Chengsong
parents:
diff changeset
   357
%\]
Chengsong
parents:
diff changeset
   358
%This would greatly reduce the amount of space needed
Chengsong
parents:
diff changeset
   359
%when we have many terms like $x\cdot r$.
Chengsong
parents:
diff changeset
   360
%
Chengsong
parents:
diff changeset
   361
%into the current lexer. 
Chengsong
parents:
diff changeset
   362
%This allows the lexer to not traverse the entire 
Chengsong
parents:
diff changeset
   363
%derivative tree generated
Chengsong
parents:
diff changeset
   364
%in every intermediate step, but only a small segment 
Chengsong
parents:
diff changeset
   365
%that is currently active. 
Chengsong
parents:
diff changeset
   366
%
Chengsong
parents:
diff changeset
   367
%Some initial results have been obtained, but more care needs to be taken to make sure 
Chengsong
parents:
diff changeset
   368
%that the lexer output conforms to the POSIX standard. Formalising correctness of 
Chengsong
parents:
diff changeset
   369
%such a lexer using zippers will probably require using an imperative
Chengsong
parents:
diff changeset
   370
%framework with separation logic as it involves
Chengsong
parents:
diff changeset
   371
%mutable data structures, which is also interesting to look into.  
Chengsong
parents:
diff changeset
   372
%
Chengsong
parents:
diff changeset
   373
%To further optimise the algorithm, 
Chengsong
parents:
diff changeset
   374
%I plan to add a ``deduplicate'' function that tells 
Chengsong
parents:
diff changeset
   375
%whether two regular expressions denote the same 
Chengsong
parents:
diff changeset
   376
%language using
Chengsong
parents:
diff changeset
   377
%an efficient and verified equivalence checker.
Chengsong
parents:
diff changeset
   378
%In this way, the internal data structures can be pruned more aggressively, 
Chengsong
parents:
diff changeset
   379
%leading to better simplifications and ultimately faster algorithms.
Chengsong
parents:
diff changeset
   380
%
Chengsong
parents:
diff changeset
   381
%
Chengsong
parents:
diff changeset
   382
%Some initial results 
Chengsong
parents:
diff changeset
   383
%We first give a brief introduction to what zippers are,
Chengsong
parents:
diff changeset
   384
%and other works
Chengsong
parents:
diff changeset
   385
%that apply zippers to derivatives
Chengsong
parents:
diff changeset
   386
%When dealing with large trees, it would be a waste to 
Chengsong
parents:
diff changeset
   387
%traverse the entire tree if
Chengsong
parents:
diff changeset
   388
%the operation only
Chengsong
parents:
diff changeset
   389
%involves a small fraction of it.
Chengsong
parents:
diff changeset
   390
%The idea is to put the focus on that subtree, turning other parts
Chengsong
parents:
diff changeset
   391
%of the tree into a context
Chengsong
parents:
diff changeset
   392
%
Chengsong
parents:
diff changeset
   393
%
Chengsong
parents:
diff changeset
   394
%One observation about our derivative-based lexing algorithm is that
Chengsong
parents:
diff changeset
   395
%the derivative operation sometimes traverses the entire regular expression
Chengsong
parents:
diff changeset
   396
%unnecessarily:
Chengsong
parents:
diff changeset
   397