ChengsongTanPhdThesis/Chapters/RelatedWork.tex
\chapter{Related Work} % Main chapter title

\label{RelatedWork}

In this chapter, we introduce work relevant to this thesis.

\section{Regular Expressions, Derivatives and POSIX Lexing}

%\subsection{Formalisations of Automata, Regular Expressions, and Matching Algorithms}
Regular expressions were introduced by Kleene in the 1950s \cite{Kleene1956}.
Since then they have become a fundamental concept in
formal languages and automata theory \cite{Sakarovitch2009}.
Brzozowski defined derivatives on regular expressions in his PhD thesis
in 1964 \cite{Brzozowski1964}, in which he proved that the number of
derivatives of a regular expression is finite modulo the ACI-axioms.
It is worth pointing out that this result does not directly
translate to our finiteness proof,
and our proof does not depend on it to be valid.
The key observation is that our version of Sulzmann and Lu's algorithm
\cite{Sulzmann2014} represents
derivative terms in a way that allows efficient de-duplication,
and we do not make use of an equivalence checker that identifies
ACI-equivalent terms.
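To illustrate the kind of de-duplication we have in mind, here is a small
Scala sketch: duplicate alternative clauses are removed by comparing their
erased (un-annotated) forms rather than by checking ACI-equivalence. The
function and the names mentioned in the final comment (\texttt{ALTS},
\texttt{bs}, \texttt{erase}) are meant purely as an illustration, not as the
exact definitions used later in this thesis.
\begin{verbatim}
// Sketch: keep only the first alternative clause for each erased form,
// instead of checking full ACI-equivalence between clauses.
def distinctBy[A, B](xs: List[A], f: A => B,
                     seen: Set[B] = Set[B]()): List[A] = xs match {
  case Nil => Nil
  case x :: rest =>
    val key = f(x)
    if (seen.contains(key)) distinctBy(rest, f, seen)   // drop duplicate
    else x :: distinctBy(rest, f, seen + key)           // keep first copy
}

// Intended use on the clauses rs of an (annotated) alternative:
//   ALTS(bs, distinctBy(rs, erase))
// syntactically different but erased-equal clauses collapse into one.
\end{verbatim}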
       
Central to this thesis is the work by Sulzmann and Lu \cite{Sulzmann2014}.
They first introduced the elegant and simple idea of injection-based lexing
and bit-coded lexing.
In a follow-up work \cite{Sulzmann2014b}, Sulzmann and Steenhoven
incorporated these ideas
into a tool called \emph{dreml}.
The pencil-and-paper proofs in \cite{Sulzmann2014}, which are based on ideas
by Frisch and Cardelli \cite{Frisch2004}, were later
found to have unfillable gaps by Ausaf \cite{Ausaf},
who came up with an alternative proof inspired by
Vansummeren \cite{Vansummeren2006}.
Sulzmann and Thiemann extended Brzozowski derivatives to
shuffling regular expressions \cite{Sulzmann2015},
which are a very succinct way of describing basic regular expressions.

Regular expressions and lexers have been a popular topic in
the theorem proving and functional programming communities.
In the next section we give a list of lexers
and matchers that come with a machine-checked correctness proof.
       
\subsection{Matchers and Lexers with Mechanised Proofs}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}.  Another one
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Also Ribeiro and Du Bois gave one in Agda \parencite{RibeiroAgda2017}.
The most recent works, to the best of our knowledge, are the Verbatim \cite{Verbatim}
and Verbatim++ \cite{Verbatim} lexers.
The Verbatim++ lexer adds many correctness-preserving
optimisations to the Verbatim lexer,
and is therefore quite fast on many inputs.
The problem is that they chose to use DFAs to speed things up,
and for DFAs bounded repetitions are a bottleneck.
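A classic example of this bottleneck is the regular expression
$(a + b)^* \cdot a \cdot (a + b)^{\{n\}}$: the expression itself is only of
size linear in $n$, but any DFA recognising the same language needs at least
$2^{n}$ states, since it must remember which of the last $n$ characters were
$a$s.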
       
This thesis builds on the formal specifications of POSIX
rules and the formal proofs by Ausaf et al. \cite{AusafDyckhoffUrban2016}.
The work on bounded repetitions is a continuation of the work by Ausaf \cite{Ausaf}.

Automata formalisations are in general harder and more cumbersome to deal
with for theorem provers \cite{Nipkow1998}.
One way to represent automata is to use graphs, but graphs are not inductive datatypes.
Having to base the induction principle on the number of nodes
in a graph makes formal reasoning non-intuitive and convoluted.
When combining two graphs, one also needs to make sure that the nodes in
both graphs are distinct;
if they are not, the nodes have to be renamed.
Using Coq, which provides dependent types, can potentially make things slightly easier
\cite{Doczkal2013}.
Another representation for automata is matrices,
but the induction principles for them are not straightforward either.
Both approaches have been used in the past and resulted in huge formalisations.
There are more clever representations, for example one by Paulson
using hereditarily finite sets.
There the problem with combining graphs can be handled more gracefully,
but we believe that such clever tricks are not very obvious to
the average Isabelle user.
       
\subsection{Different Definitions of POSIX Rules}
There are different ways to formalise values and POSIX matching.
Frisch and Cardelli \cite{Frisch2004} have developed a notion of
\emph{non-problematic values}, which is a slight variation
of the values defined by the inhabitation relation in \ref{fig:inhab}.
They then defined an ordering between values and showed that
the maximal element of those values corresponds to the output
of their GREEDY lexing algorithm.

Okui and Suzuki \cite{Okui10} allow iterations of values to
flatten to the empty
string in the star inhabitation rule of
\ref{fig:inhab}.
They refer to the more restrictive version used
in this thesis (which was defined by Ausaf et al.
\cite{AusafDyckhoffUrban2016}) as \emph{canonical values}.
The very interesting link between the work by Ausaf et al.
and Okui and Suzuki is that they have distinct formalisations
of POSIX values, and yet they are provably equivalent! See
\cite{AusafDyckhoffUrban2016} for details of the
alternative definitions given by Okui and Suzuki and the formalisation.
%TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.

Sulzmann and Lu themselves have come up with POSIX definitions \cite{Sulzmann2014}.
In their paper they defined an ordering between values with respect
to regular expressions, and tried to establish by a pencil-and-paper proof that their
algorithm outputs the minimal element.
But because the ordering relation takes regular expressions as parameters,
the transitivity of their ordering does not go through.
       
\section{Optimisations}
Perhaps the biggest problem that prevents derivative-based lexing
from being more widely adopted
is that such lexers tend not to be blindingly fast in practice, unable to
reach throughputs of gigabytes per second, which was the
application we had in mind when we initially started looking at the topic.
Network intrusion detection systems
such as Snort \cite{Snort1999} and Bro \cite{Bro}
are capable of inspecting payloads
at line rates (which can be up to a few gigabits per second) against
thousands of rules \cite{communityRules}.
For our algorithm to be more attractive for practical use, we
need more correctness-preserving optimisations.
%So far our focus has been mainly on the bit-coded algorithm,
%but the injection-based lexing algorithm
%could also be sped up in various ways:
%
One such optimisation is defining string derivatives
directly, without recursively decomposing them into
character-by-character derivatives. For example, instead of
replacing $(r_1 + r_2)\backslash (c::cs)$ by
$((r_1 + r_2)\backslash c)\backslash cs$ (as in definition \ref{table:der}), we rather
calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
This has the potential to speed up matching because the input is
processed at a larger granularity.
One interesting direction is to explore whether this can be done
for $\inj$ as well, so that we can generate a lexical value
rather than simply obtain a matcher.
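As a rough illustration of this idea, the following Scala sketch uses a
simplified regular-expression datatype (\texttt{Rexp}, \texttt{der} and
\texttt{sders} are illustrative only, not the annotated definitions used in
this thesis) and computes a string derivative that pushes the whole remaining
string into both branches of an alternative, falling back to
character-by-character steps elsewhere.
\begin{verbatim}
// Illustrative only: a simplified regular-expression datatype.
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// ordinary character-by-character derivative
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

// string derivative: the whole remaining string is pushed into both
// branches of an alternative; elsewhere we take single-character steps
def sders(s: List[Char], r: Rexp): Rexp = (s, r) match {
  case (Nil, _)         => r
  case (_, ALT(r1, r2)) => ALT(sders(s, r1), sders(s, r2))
  case (c :: cs, _)     => sders(cs, der(c, r))
}
\end{verbatim}
Whether a corresponding string-based version of $\inj$ can be defined is
precisely the open question mentioned above.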
       
\subsection{Derivatives and Zippers}
A zipper is a data structure designed to focus on
and navigate between local parts of a tree.
The idea is that operations on a large tree often only deal with
a local region at a time.
Therefore it would be a waste to
traverse the entire tree if
the operation only
involves a small fraction of it.
Zippers were first formally described by Huet \cite{HuetZipper}.
Typical applications of zippers involve text editor buffers
and proof system databases.
In our setting, the idea is to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
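To make the zipper idea concrete, here is a minimal Huet-style zipper for a
plain binary tree, given only as a sketch (the datatypes are not those of our
lexer): the focus can be moved up and down the tree and the focused subtree
can be modified without rebuilding the parts that are untouched.
\begin{verbatim}
// A minimal Huet-style zipper over a binary tree (illustration only).
sealed trait Tree
case class Leaf(v: Int) extends Tree
case class Node(l: Tree, r: Tree) extends Tree

// The context records the path from the root to the focus,
// keeping the siblings passed on the way down.
sealed trait Ctx
case object Top extends Ctx
case class LeftOf(right: Tree, up: Ctx) extends Ctx   // we went down-left
case class RightOf(left: Tree, up: Ctx) extends Ctx   // we went down-right

case class Zipper(focus: Tree, ctx: Ctx) {
  def downLeft: Zipper = focus match {
    case Node(l, r) => Zipper(l, LeftOf(r, ctx))
    case _          => this
  }
  def downRight: Zipper = focus match {
    case Node(l, r) => Zipper(r, RightOf(l, ctx))
    case _          => this
  }
  def up: Zipper = ctx match {
    case LeftOf(r, c)  => Zipper(Node(focus, r), c)
    case RightOf(l, c) => Zipper(Node(l, focus), c)
    case Top           => this
  }
  // replace the focused subtree; the rest of the tree is shared, not copied
  def modify(f: Tree => Tree): Zipper = copy(focus = f(focus))
}
\end{verbatim}
Moving the focus and later rebuilding the tree costs only as much as the path
that was traversed, not the size of the whole tree.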
       
Below we introduce several works on parsing, derivatives
and zippers.

Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
\cite{Zippy2020}.
They adopted zippers to improve the speed, and argued that the runtime
complexity of their algorithm was linear with respect to the input string.

The idea of using Brzozowski derivatives for general context-free
parsing was first implemented
by Might et al. \cite{Might2011}.
They used memoization and a fixpoint construction to eliminate infinite recursion,
which is a major problem when using derivatives on context-free grammars.
The initial version was quite slow: exponential in the size of the input.
Adams et al. \cite{Adams2016} improved the speed and argued that their version
is cubic with respect to the input.
Darragh and Adams \cite{Darragh2020} further improved the performance
by using zippers in an innovative way: their zippers have multiple focuses,
instead of just one as in a traditional zipper, in order to handle ambiguity.
Their algorithm was not formalised, though.
       
Some initial results have been obtained, but more care needs to be taken to make sure
that the lexer output conforms to the POSIX standard. Formalising the correctness of
such a zipper-based lexer will probably require an imperative
framework with separation logic, as it involves
mutable data structures, which is also interesting to look into.

To further optimise the algorithm,
we plan to add a ``deduplicate'' function that decides
whether two regular expressions denote the same
language, using
an efficient and verified equivalence checker.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
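A sketch of how such pruning might look, reusing the simplified \texttt{Rexp}
datatype from the earlier sketch and assuming a verified equivalence checker
\texttt{equiv} (a hypothetical function here): among the clauses of an
alternative, any clause whose language is already covered by an earlier clause
is dropped.
\begin{verbatim}
// Sketch only: prune the clauses of an alternative with a language
// equivalence checker. `equiv` stands for the verified checker and is
// assumed, not defined here.
def pruneAlternatives(rs: List[Rexp],
                      equiv: (Rexp, Rexp) => Boolean): List[Rexp] =
  rs.foldLeft(List[Rexp]()) { (kept, r) =>
    // drop r if an already-kept clause denotes the same language
    if (kept.exists(k => equiv(k, r))) kept else kept :+ r
  }
\end{verbatim}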
       
Returning to zippers, the idea is to put the focus on the subtree
that is currently being processed, turning the other parts
of the tree into a context.
One observation about our derivative-based lexing algorithm is that
the derivative operation sometimes traverses the entire regular expression
unnecessarily, even though only a small part of it actually changes.
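For instance, for a left-nested sequence $(\ldots(r_1 \cdot r_2) \cdot \ldots) \cdot r_n$
in which $r_1$ is not nullable, we have
\[
((\ldots(r_1 \cdot r_2) \cdot \ldots) \cdot r_n)\backslash c
\;=\;
(\ldots((r_1\backslash c) \cdot r_2) \cdot \ldots) \cdot r_n,
\]
so only $r_1$ changes, yet the derivative operation visits and rebuilds every
sequence node on the spine (and re-checks their nullability) at each step.
A zipper could keep the unchanged part of the spine as a context and avoid
this repeated work.
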
\section{Work on Back-References}
We introduced regular expressions with back-references
in chapter \ref{Introduction}.
We adopt the common practice of calling them rewbrs (Regular Expressions
With Back References) in this section for brevity.
It has been shown by Aho \cite{Aho1990}
that the k-vertex cover problem can be reduced
to the rewbr matching problem, which is therefore NP-complete.
Given the depth of the problem, progress on matching
arbitrary rewbrs in full generality has been slow, with
theoretical work on them being
fairly recent.

Campeanu et al. studied regexes with back-references
in the context of formal language theory in
their 2003 work \cite{campeanu2003formal}.

of these works and comparison between different
flavours of back-references syntax (e.g. whether references can be circular,
whether labels can be repeated, etc.).

\subsection{Static Analysis of Evil Regex Patterns}
When a regular expression does not behave as intended,
people usually try to rewrite the regex to some equivalent form
or they try to avoid the possibly problematic patterns completely,

does not grow indefinitely),
then they cannot claim with confidence to have solved the problem
of catastrophic backtracking.