% Chapter Template

\chapter{Future Work and Conclusion} % Main chapter title

\label{Future}
       
In this thesis, in order to eliminate ReDoS attacks
once and for all, we have set out to formalise in Isabelle/HOL
the correctness proof of Sulzmann and Lu's
lexing algorithm with simplifications \cite{Sulzmann2014}.
We have fixed inefficiencies in the original simplification
algorithm, and we have established correctness by proving
that our algorithm outputs the same result as the original bit-coded
lexer without simplifications (whose correctness was established in
previous work by Ausaf et al. in
\cite{Ausaf} and \cite{AusafDyckhoffUrban2016}).
The proof technique used in \cite{Ausaf} does not work in our case
because the simplification function alters the structure of
regular expressions.
Despite having to try several workarounds and being
stuck for months or even years looking for proofs,
we were delighted to eventually discover a simple yet
effective proof method: modelling individual
simplification steps as small-step rewriting rules and proving
that terms linked by these rules are equivalent.
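To give a flavour of this technique (the rules in our formalisation operate
on bitcoded regular expressions and therefore also have to preserve the
lexing information), individual simplification steps can be written
schematically as rewrite rules such as
\[
r + \mathbf{0} \rightsquigarrow r
\qquad\quad
\mathbf{0} \cdot r \rightsquigarrow \mathbf{0}
\qquad\quad
r + r \rightsquigarrow r
\]
where $\mathbf{0}$ stands for the regular expression matching no string.
The correctness argument then shows that whenever two terms are linked by
such a rule, the lexer computes the same result for both.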
       
In addition, we have proved a formal bound on the size of the
regular expressions arising during lexing. The technique was to establish
``closed forms'', informally described by
Murugesan and Shanmuga Sundaram \cite{Murugesan2014},
for the derivatives of compound regular expressions,
and to use these closed forms to control the size.
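As an example of such a closed form, a derivative of a sequence by a string
$s$ can be flattened, modulo simplification, into one big alternative that is
roughly of the shape
\[
(r_1 \cdot r_2)\backslash s \;\approx\;
(r_1\backslash s)\cdot r_2 \;+\;
\sum \{\, r_2\backslash s_2 \mid s_1 @ s_2 = s,\; s_2 \neq [],\; s_1 \in L(r_1) \,\}
\]
where the sum has at most $|s|$ summands; bounding the size of each summand
then gives a bound for the size of the whole derivative.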
       
The Isabelle/HOL code for our formalisation can be
found at
\begin{center}
\url{https://github.com/hellotommmy/posix}
\end{center}
\noindent
Thanks to our theorem-prover-friendly approach,
we believe that
this finiteness bound can be improved to a bound that is
linear in the length of the input and
cubic in the size of the regular expression, using a technique by
Antimirov~\cite{Antimirov95}.
Once formalised, this would be a guarantee for the absence of all
super-linear behaviour.
We are working out the details.
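The key property we hope to exploit is that the set of Antimirov's partial
derivatives of a regular expression, taken over all strings, is linear in the
number of character occurrences in that regular expression. The following
Scala code is only a sketch of the idea (a simplified, hypothetical datatype
without bitcodes, not the code of our formalisation):
\begin{verbatim}
// A sketch of Antimirov-style partial derivatives on a simplified,
// hypothetical regular-expression datatype (no bitcodes).
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// partial derivative w.r.t. a character: a *set* of regular expressions
def pder(c: Char, r: Rexp): Set[Rexp] = r match {
  case ZERO | ONE => Set()
  case CHR(d) => if (c == d) Set(ONE) else Set()
  case ALT(r1, r2) => pder(c, r1) ++ pder(c, r2)
  case SEQ(r1, r2) =>
    val fromLeft = pder(c, r1).map(SEQ(_, r2))
    if (nullable(r1)) fromLeft ++ pder(c, r2) else fromLeft
  case STAR(r1) => pder(c, r1).map(SEQ(_, STAR(r1)))
}
\end{verbatim}
\noindent
The challenge is to transfer this linearity argument to our bitcoded setting,
where the lexing information has to be maintained alongside the regular
expressions.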
       
To the best of our knowledge, no lexing libraries using Brzozowski derivatives
come with similar complexity-related bounds;
claims about running time are usually speculative and backed by empirical
evidence on a few test cases.
If a matching or lexing algorithm
does not come with certain basic complexity-related
guarantees (for example that the internal data structures
do not grow indefinitely),
then one cannot claim with confidence to have solved the problem
of catastrophic backtracking.

We believe our proof method is not specific to this
algorithm, and we intend to extend the approach
to prove the correctness of a faster version
of the lexer. The closed
forms can then be re-used as well. Together
with the idea from Antimirov \cite{Antimirov95},
we plan to reduce the size bound to one that is polynomial
in the size of the regular expression.
%We have tried out a variety of other optimisations 
%to make lexing/parsing with derivatives faster,
%for example in \cite{Edelmann2021}, \cite{Darragh2020}
%and \cite{Verbatimpp}.
       
We have learned quite a few lessons in this process.
As simple as the end result may seem,
coming up with the initial proof idea based on rewriting relations
was the hardest part for us, as it requires a deep understanding of what
information is preserved during the simplification process.
Having the first formalisation gave us much more confidence: it took
us much less time to obtain the closed forms and the size bound, even though
they are more involved than the correctness proof.
As has been demonstrated numerous times,
the usefulness of formal approaches cannot be overstated:
they not only allowed us to find bugs in Sulzmann and Lu's simplification
functions, but also helped us set realistic expectations about the
performance of algorithms. In the beginning we believed that
the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
was already able to achieve the linear bound claimed by
Sulzmann and Lu.
We then attempted to prove that $\llbracket \blexersimp\; r \; s\rrbracket$
is $O(\llbracket r\rrbracket^c\cdot |s|)$,
where $c$ is a small constant, making
$\llbracket r\rrbracket^c$ a constant factor for any fixed regular expression.
We were quite surprised that this did not go through despite a lot of effort.
This led us to discover the example in chapter \ref{Finite} where this
factor can in fact be exponentially large.
We therefore learned the necessity of introducing the stronger simplifications
in chapter \ref{Cubic}.
Without formal proofs we would not have discovered this so soon.


In the future, we plan to incorporate many extensions
into this work.
       
\section{Future Work}
The most obvious next step is to formalise the cubic bound and
the correctness of $\blexerStrong$
from chapter \ref{Cubic}.
A cubic bound ($O(\llbracket r\rrbracket^3\cdot |s|)$) with respect to
the size of the regular expression would fulfil
the linear complexity claim (linear in the length of the input string)
made by Sulzmann and Lu.
       
With such a size bound established in theory, the next challenge would be
to generate code that is competitive with commercial
matchers. For that, a lot of optimisations are needed.
We aim to integrate the zipper data structure into our lexer.
This would allow the lexer to not traverse the entire derivative tree
in every intermediate step, but only a small segment that is currently active.
The idea is very simple: using a zipper with multiple focuses,
just like Darragh \cite{Darragh2020} did, we could represent
\[
	x\cdot r + y \cdot r + \ldots
\]
as
\[
	(x+y + \ldots) \cdot r.
\]
This would greatly reduce the amount of space needed
when we have many terms like $x\cdot r$.
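A first sketch of the kind of sharing we have in mind is the following
(hypothetical Scala code on a simplified, list-based representation of
alternatives, not our actual bitcoded datatype):
\begin{verbatim}
// A sketch only: simplified datatype without bitcodes; ALTS is list-based.
sealed trait Rexp
case class CHR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

// Share a common right factor among the summands of an alternative:
//   x.r + y.r + ...   ==>   (x + y + ...).r
def factorRight(summands: List[SEQ]): List[Rexp] =
  summands
    .groupBy(_.r2)   // group the summands by their right factor r
    .toList
    .map { case (r, seqs) => SEQ(ALTS(seqs.map(_.r1)), r) }
\end{verbatim}
\noindent
Note that naive grouping like this loses the relative order of the
alternatives, which matters for POSIX lexing.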
       
Some initial results have been obtained, but
significant changes are needed to make sure
that the lexer output conforms to the POSIX standard.
We aim to make use of Okui and Suzuki's labelling
system \cite{Okui10} to ensure that regular expressions represented as zippers
always maintain the POSIX ordering.
%Formalising correctness of 
%such a lexer using zippers will probably require using an imperative
%framework with separation logic as it involves
%mutable data structures, which is also interesting to look into.  
       
To further optimise the algorithm,
we plan to add a deduplication function that removes
regular expressions denoting the same
language, using
an efficient and verified equivalence checker such as the one by
Krauss and Nipkow \cite{Krauss2011}.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
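Once such a checker is available, the pruning itself is simple; the following
generic sketch illustrates the idea (the parameter \texttt{equiv} stands for a
call into a verified equivalence checker and is an assumption of this example):
\begin{verbatim}
// Keep one representative from each class of language-equivalent
// regular expressions; `equiv` is assumed to be supplied by a
// verified equivalence checker.
def dedup[R](rs: List[R], equiv: (R, R) => Boolean): List[R] =
  rs.foldLeft(List.empty[R]) { (kept, r) =>
    if (kept.exists(equiv(_, r))) kept else kept :+ r
  }
\end{verbatim}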
       
Traditional automata-based approaches can be sped up by compiling
multiple rules into the same automaton. This has been done
in \cite{Kumar2006} and \cite{Becchi08}, for example.
Currently our lexer deals with only a single regular expression at a time.
Handling multiple regular expressions at once might open up more
possibilities for simplification.

As already mentioned in chapter \ref{RelatedWork},
reducing the number of memory accesses can accelerate
matching speed. It would be interesting to measure the memory
bandwidth of our derivative-based matcher and improve it accordingly.
       
Memoization has been used frequently in lexing and parsing to achieve
better complexity results, for example in \cite{Reps1998},
\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
We plan to explore the performance enhancements offered by memoization
in our algorithm in a correctness-preserving way.
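As a minimal, generic illustration (not tied to our datatypes), a derivative
function could be wrapped in a cache as follows; the difficult part is doing
this without invalidating the correctness proofs:
\begin{verbatim}
import scala.collection.mutable

// Generic memoization: cache the results of an expensive function.
// For a lexer one could instantiate f with the derivative function,
// keyed on pairs of (character, regular expression).
def memoize[A, B](f: A => B): A => B = {
  val cache = mutable.Map.empty[A, B]
  a => cache.getOrElseUpdate(a, f(a))
}
\end{verbatim}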
       
The monadic data refinement technique that Lammich and Tuerk used
in \cite{Lammich2012} to optimise Hopcroft's automaton minimisation
algorithm also seems quite interesting.
We aim to learn from their refinement framework,
which generates high-performance code with proofs
that can be broken down into uncluttered small steps.




%We establish the correctness 
%claim made by them about the 
%, which were only proven with pencil and paper.

       
%\section{Work on Back-References}
%We introduced regular expressions with back-references
%in chapter \ref{Introduction}.
%We adopt the common practice of calling them rewbrs (Regular Expressions
%With Back References) in this section for brevity.
%It has been shown by Aho \cite{Aho1990}
%that the k-vertex cover problem can be reduced
%to the problem of rewbrs matching, which is therefore NP-complete.
%Given the depth of the problem, the progress made on matching
%arbitrary rewbrs in full generality has been slow, with
%theoretical work on them being
%fairly recent.
%
%Campeanu et al. studied regexes with back-references
%in the context of formal language theory in
%their 2003 work \cite{campeanu2003formal}.
%They devised a pumping lemma for Perl-like regexes,
%proving that the languages denoted by them
%are properly contained in the context-sensitive
%languages.
%More interesting questions such as
%whether the language denoted by Perl-like regexes
%can express palindromes ($\{w \mid w = w^R\}$)
%are discussed in \cite{campeanu2009patterns}
%and \cite{CAMPEANU2009Intersect}.
%Freydenberger \cite{Frey2013} investigated the
%expressive power of back-references. He showed several
%undecidability and descriptional complexity results
%for back-references, concluding that they add
%great power to certain programming tasks but are difficult to harness.
%An interesting question would then be
%whether we can add restrictions to them,
%so that they become expressive enough for practical use such
%as html files, but not too powerful.
%Freydenberger and Schmid \cite{FREYDENBERGER20191}
%introduced the notion of deterministic
%regular expressions with back-references to achieve
%a better balance between expressiveness and tractability.
%
%
%Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
%investigated the time complexity of different variants
%of back-references.
%We are not aware of any work that uses derivatives on back-references.
%
%See \cite{BERGLUND2022} for a survey
%of these works and a comparison between different
%flavours of back-reference syntax (e.g.\ whether references can be circular,
%whether labels can be repeated, etc.).
%
%
%\subsection{Matchers and Lexers with Mechanised Proofs}
%We are aware
%of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
%Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
%of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
%in Coq is given by Coquand and Siles \parencite{Coquand2012}.
%Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
%
%\subsection{Static Analysis of Evil Regex Patterns}
%When a regular expression does not behave as intended,
%people usually try to rewrite the regex to some equivalent form
%or they try to avoid the possibly problematic patterns completely,
%for which many false positives exist \parencite{Davis18}.
%Animated tools to ``debug'' regular expressions such as
%\parencite{regexploit2021} and \parencite{regex101} are also popular.
%We are also aware of static analysis work on regular expressions that
%aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
%\parencite{Rathnayake2014StaticAF} proposed an algorithm
%that detects regular expressions triggering exponential
%behaviour on backtracking matchers.
%Weideman \parencite{Weideman2017Static} came up with
%non-linear polynomial worst-time estimates
%for regexes, attack strings that exploit the worst-time
%scenario, and ``attack automata'' that generate
%attack strings.
%
%
       
       
%\section{Optimisations}
%Perhaps the biggest problem that prevents derivative-based lexing
%from being more widely adopted
%is that it is still not blindingly fast in practice, unable to
%reach throughputs like gigabytes per second, which commercial
%regular expression matchers such as Snort and Bro are able to achieve.
%For our algorithm to be more attractive for practical use, we
%need more correctness-preserving optimisations.
%%So far our focus has been mainly on the bit-coded algorithm,
%%but the injection-based lexing algorithm 
%%could also be sped up in various ways:
%%
%One such optimisation is defining string derivatives
%directly, without recursively decomposing them into
%character-by-character derivatives. For example, instead of
%replacing $(r_1 + r_2)\backslash (c::cs)$ by
%$((r_1 + r_2)\backslash c)\backslash cs$, we rather
%calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
%
%
%\subsection{Derivatives and Zippers}
%A zipper is a data structure designed to focus on
%and navigate between local parts of a tree.
%The idea is that often operations on a large tree only deal with
%local regions each time.
%It was first formally described by Huet \cite{HuetZipper}.
%Typical applications of zippers involve text editor buffers
%and proof system databases.
%In our setting, the idea is to compactify the representation
%of derivatives with zippers, thereby making our algorithm faster.
%We draw our inspiration from several works on parsing, derivatives
%and zippers.
%
%Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
%\cite{Zippy2020}.
%They adopted zippers to improve the speed, and argued that the runtime
%complexity of their algorithm was linear with respect to the input string.
%
%The idea of using Brzozowski derivatives for general context-free
%parsing was first implemented
%by Might et al. \cite{Might2011}.
%They used memoization and a fixpoint construction to eliminate infinite recursion,
%which is a major problem for using derivatives on context-free grammars.
%The initial version was quite slow: exponential in the size of the input.
%Adams et al. \cite{Adams2016} improved the speed and argued that their version
%was cubic with respect to the input.
%Darragh and Adams \cite{10.1145/3408990} further improved the performance
%by using zippers in an innovative way: their zippers had multiple focuses
%instead of just one as in a traditional zipper, in order to handle ambiguity.
%Their algorithm was not formalised, though.
%
       
       
%
%
%Some initial results 
%We first give a brief introduction to what zippers are,
%and other works
%that apply zippers to derivatives
%When dealing with large trees, it would be a waste to 
%traverse the entire tree if
%the operation only
%involves a small fraction of it.
%The idea is to put the focus on that subtree, turning other parts
%of the tree into a context
%
%
%One observation about our derivative-based lexing algorithm is that
%the derivative operation sometimes traverses the entire regular expression
%unnecessarily: