ChengsongTanPhdThesis/Chapters/RelatedWork.tex
changeset 627 94db2636a296
parent 626 1c8525061545
child 628 7af4e2420a8c

Automata formalisations are in general harder and more cumbersome
for theorem provers to deal with \cite{Nipkow1998}.
One way to represent automata is as graphs, but graphs are not an inductive datatype.
Having to base the induction principle on the number of nodes
in a graph makes formal reasoning non-intuitive and convoluted,
resulting in large formalisations \cite{Lammich2012}. 
When combining two graphs, one also needs to make sure that the nodes in
both graphs are distinct. 
If they are not distinct, then renaming of the nodes is needed.
Using Coq, which provides dependent types, can potentially make things slightly easier
\cite{Doczkal2013}.
Another representation for automata is matrices. 
But the induction for them is not as straightforward either.
Both approaches have been used in the past and resulted in huge formalisations.
There are some cleverer representations, for example one by Paulson 
using hereditarily finite sets \cite{Paulson2015}. 
There the problem with combining graphs can be solved better. 
%but we believe that such clever tricks are not very obvious for 
%the John-average-Isabelle-user.

\subsection{Different Definitions of POSIX Rules}
There are different ways to formalise values and POSIX matching.
Frisch and Cardelli \cite{Frisch2004} have developed a notion of
\emph{non-problematic values}, which is a slight variation
% ...
algorithm outputs the minimal element by a pencil-and-paper proof.
But since their ordering relation takes regular expressions as parameters,
the transitivity of their ordering does not go through.

       
\subsection{Static Analysis of Evil Regex Patterns} 
When a regular expression does not behave as intended,
people usually try to rewrite the regex into some equivalent form,
or they try to avoid the possibly problematic patterns completely,
a practice for which many false positives exist \parencite{Davis18}.
Animated tools to ``debug'' regular expressions such as
\parencite{regexploit2021} and \parencite{regex101} are also popular.
We are also aware of static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke 
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour on backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with 
non-linear polynomial worst-case time estimates
for regexes, attack strings that exploit the worst-case 
scenario, and ``attack automata'' that generate
attack strings.

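To see concretely what these analyses detect, the pattern $(a^+)^+\cdot b$ is a
standard evil example: on inputs $a^n$ with no trailing $b$, a backtracking
matcher tries exponentially many ways of splitting the $a$'s between the two
plusses. The following is a minimal sketch (our own illustration, not code
from the cited works) that makes the blow-up visible on Java's backtracking
matcher:

\begin{lstlisting}[language=Scala]
// Evil pattern (a+)+b: backtracking explores exponentially many
// splittings of a^n before concluding there is no match.
import java.util.regex.Pattern

object EvilRegexDemo {
  def main(args: Array[String]): Unit = {
    val evil = Pattern.compile("(a+)+b")
    for (n <- 10 to 26 by 2) {
      val input = "a" * n           // a^n, deliberately without the final b
      val t0 = System.nanoTime()
      evil.matcher(input).matches() // always false, but slower and slower
      val ms = (System.nanoTime() - t0) / 1e6
      println(f"n = $n%2d: $ms%9.2f ms")  // roughly doubles with each step
    }
  }
}
\end{lstlisting}
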
\section{Optimisations}
Perhaps the biggest problem that prevents derivative-based lexing 
from being more widely adopted
is that it tends not to be blindingly fast in practice, unable to 
reach throughputs like gigabytes per second, which is the
% ...
are capable of inspecting payloads
at line rates (which can be up to a few gigabits per second) against 
thousands of rules \cite{communityRules}.
For our algorithm to be more attractive for practical use, we
need more correctness-preserving optimisations.
       

FPGA-based implementations such as \cite{Sidhu2001} 
have the advantage of being
reconfigurable and parallel, but suffer from lower clock frequencies
and limited scalability.
Traditional automaton approaches that use a DFA instead of an NFA
benefit from the fact that only a single transition is needed 
for each input character \cite{Becchi08}; the lower memory bandwidth
requirement leads to faster performance.
However, they suffer from exponential blow-ups on bounded repetitions.
Compression techniques are used, such as those in \cite{Kumar2006} and
\cite{Becchi2007}.
Variations of pure NFAs or DFAs, like the counting-set automata
of \cite{Turonova2020}, have been
proposed to better deal with bounded repetitions.
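To make the single-transition-per-character point concrete, here is a minimal
sketch of a table-driven DFA inner loop (the automaton for $a^*b$; the table
layout and names are our own, not taken from the cited implementations):

\begin{lstlisting}[language=Scala]
// Table-driven DFA for a*b: exactly one array lookup per input character.
// States: 0 = reading a's, 1 = accept (just read the final b), 2 = sink.
object DfaDemo {
  val table: Array[Array[Int]] = {
    val t = Array.fill(3, 256)(2)   // default transition: dead state
    t(0)('a') = 0                   // stay while reading a's
    t(0)('b') = 1                   // the final b accepts
    t
  }
  def matches(s: String): Boolean = {
    var state = 0
    var i = 0
    while (i < s.length) { state = table(state)(s.charAt(i)); i += 1 }
    state == 1
  }
  def main(args: Array[String]): Unit =
    println((matches("aaab"), matches("abab")))  // (true,false)
}
\end{lstlisting}

The price is the transition table: with bounded repetitions the DFA must
record the counter value in its state, and several counters multiply, which is
where the blow-ups mentioned above come from.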
%So far our focus has been mainly on the bit-coded algorithm,
%but the injection-based lexing algorithm 
%could also be sped up in various ways:
%
Another direction of optimisation for derivative-based approaches
is defining string derivatives
directly, without recursively decomposing them into 
character-by-character derivatives. For example, instead of
replacing $(r_1 + r_2)\backslash (c::cs)$ by
$((r_1 + r_2)\backslash c)\backslash cs$ (as in definition \ref{table:der}), we rather
calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
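As a rough sketch (our own illustration over a simplified datatype, not the
bit-coded version), string derivatives pushed directly into alternatives look
as follows:

\begin{lstlisting}[language=Scala]
// Simplified regular expressions and their derivatives.
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
}

def der(r: Rexp, c: Char): Rexp = r match {
  case ZERO | ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(r1, c), der(r2, c))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(r1, c), r2), der(r2, c))
    else SEQ(der(r1, c), r2)
}

// String derivative: alternatives take the whole string in one step,
// instead of rebuilding the ALT node after every single character.
def ders(r: Rexp, s: List[Char]): Rexp = (r, s) match {
  case (_, Nil) => r
  case (ALT(r1, r2), _) => ALT(ders(r1, s), ders(r2, s))
  case (_, c :: cs) => ders(der(r, c), cs)
}
\end{lstlisting}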
% ...
local regions each time.
Therefore it would be a waste to 
traverse the entire tree if
the operation only
involves a small fraction of it.
The zipper was first formally described by Huet \cite{Huet1997}.
Typical applications of zippers involve text-editor buffers
and proof-system databases.
In our setting, the idea is to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
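For intuition, a Huet-style zipper pairs the subtree in focus with the
one-hole context around it, so an edit touches only the focus while the rest
of the tree is shared. A minimal sketch over binary trees (our own
illustration, not our regular-expression zipper):

\begin{lstlisting}[language=Scala]
// Huet-style zipper: focused subtree plus the path back to the root.
sealed trait Tree
case class Leaf(v: Int) extends Tree
case class Node(l: Tree, r: Tree) extends Tree

sealed trait Ctx
case object Top extends Ctx
case class GoneLeft(up: Ctx, right: Tree) extends Ctx
case class GoneRight(left: Tree, up: Ctx) extends Ctx

case class Zipper(focus: Tree, ctx: Ctx) {
  def left: Zipper = focus match {
    case Node(l, r) => Zipper(l, GoneLeft(ctx, r))
    case _ => this
  }
  def right: Zipper = focus match {
    case Node(l, r) => Zipper(r, GoneRight(l, ctx))
    case _ => this
  }
  def up: Zipper = ctx match {
    case GoneLeft(c, r)  => Zipper(Node(focus, r), c)
    case GoneRight(l, c) => Zipper(Node(l, focus), c)
    case Top => this
  }
  // Edits are local: only the focus changes, the context is reused.
  def modify(f: Tree => Tree): Zipper = Zipper(f(focus), ctx)
}
\end{lstlisting}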
We introduce several works on parsing, derivatives
% ...
They adopted zippers to improve the speed, and argued that the runtime
complexity of their algorithm was linear with respect to the input string.

The idea of using Brzozowski derivatives for general context-free 
parsing was first implemented
by Might et al. \cite{Might2011}. 
They used memoization and a fixpoint construction to eliminate infinite recursion,
which is a major problem for using derivatives on context-free grammars.
The initial version was quite slow: exponential in the size of the input.
Adams et al. \cite{Adams2016} improved the speed and argued that their version
was cubic with respect to the input.
Darragh and Adams \cite{Darragh2020} further improved the performance
by using zippers in an innovative way: their zippers had multiple focuses
instead of just one as in a traditional zipper, in order to handle ambiguity.
Their algorithm was not formalised, though.
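To see where the infinite recursion comes from, consider a left-recursive rule
such as $E ::= E + T \mid T$: the natural recursive definitions of nullability
and derivatives of $E$ immediately call themselves on $E$. Below is a minimal
sketch (the grammar, names and code are our own, not Might et al.'s) of the
iterative fixpoint that replaces the looping recursion for nullability:

\begin{lstlisting}[language=Scala]
// Nullability of a CFG by fixpoint iteration: start from "nothing is
// nullable" and grow the set until it stabilises. A naive recursive
// definition would loop on the left-recursive rule E ::= E + T.
sealed trait Sym
case class Tm(c: Char) extends Sym        // terminal
case class Nt(name: String) extends Sym   // nonterminal

val grammar: Map[String, List[List[Sym]]] = Map(
  "E" -> List(List(Nt("E"), Tm('+'), Nt("T")), List(Nt("T"))),
  "T" -> List(List(Tm('a')), Nil))        // T ::= a | epsilon

def nullables(g: Map[String, List[List[Sym]]]): Set[String] = {
  var nul = Set.empty[String]
  var changed = true
  while (changed) {
    changed = false
    for ((a, alts) <- g if !nul(a))
      if (alts.exists(_.forall { case Tm(_) => false; case Nt(b) => nul(b) })) {
        nul += a
        changed = true
      }
  }
  nul
}
// nullables(grammar) == Set("T", "E")
\end{lstlisting}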



\subsection{Back-References}
We introduced regular expressions with back-references
in chapter \ref{Introduction}.
We adopt the common practice of calling them rewbrs (Regular Expressions
With Back-References) for brevity.
It has been shown by Aho \cite{Aho1990}
that the k-vertex-cover problem can be reduced
to the problem of rewbrs matching, which is therefore NP-complete.
Given the depth of the problem, progress at the full generality
of arbitrary rewbrs matching has been slow, with
% ...
flavours of back-reference syntax (e.g.\ whether references can be circular,
whether labels can be repeated, etc.).

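To get a feel for why matching rewbrs is so much harder than matching pure
regular expressions, recall the standard example (not specific to any of the
cited works) that a single back-reference already captures a
non-context-free language:
\[
L\big(\,((a|b)^*)\,\backslash 1\,\big) \;=\; \{\, ww \mid w \in \{a,b\}^* \,\},
\]
where the group captures some string $w$ and the back-reference $\backslash 1$
then demands a second copy of exactly the same $w$.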