in 1964 \cite{Brzozowski1964}, in which he proved the finiteness of the number
of regular expression derivatives modulo the ACI-axioms.
It is worth pointing out that this result does not directly
translate to our
finiteness proof,
and our proof does not depend on it.
The key observation is that our version of Sulzmann and Lu's algorithm
\cite{Sulzmann2014} represents
derivative terms in a way that allows efficient de-duplication,
and we do not make use of an equivalence checker that exploits
ACI-equivalent terms.
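To illustrate the kind of de-duplication meant here, consider the
following sketch in Scala (the names \texttt{ALTS}, \texttt{flts} and
\texttt{simp} are ours and only illustrative, not the exact definitions
used later in this thesis): alternatives are kept as flattened lists,
so associativity is handled by flattening and idempotence by a simple
distinctness pass, without ever invoking an ACI-equivalence checker on
whole terms.
\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp                 // matches no string
case object ONE extends Rexp                  // matches the empty string
case class CHR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp  // n-ary alternative

// flatten nested alternatives (associativity) and drop ZEROs
def flts(rs: List[Rexp]): List[Rexp] = rs.flatMap {
  case ALTS(rs1) => flts(rs1)
  case ZERO      => Nil
  case r         => List(r)
}

// remove duplicate alternatives (idempotence: r + r = r)
def simp(r: Rexp): Rexp = r match {
  case ALTS(rs) => flts(rs.map(simp)).distinct match {
    case Nil      => ZERO
    case List(r1) => r1
    case rs1      => ALTS(rs1)
  }
  case r1 => r1
}
\end{verbatim}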

In a follow-up work \cite{Sulzmann2014b}, Sulzmann and Steenhoven
incorporated these ideas
into a tool called \emph{dreml}.
The pencil-and-paper proofs in \cite{Sulzmann2014}, based on the ideas
by Frisch and Cardelli \cite{Frisch2004}, were later
found to have unfillable gaps by Ausaf et al. \cite{AusafDyckhoffUrban2016},
who came up with an alternative proof inspired by
Vansummeren \cite{Vansummeren2006}.
Sulzmann and Thiemann extended Brzozowski derivatives to
shuffling regular expressions \cite{Sulzmann2015},
which are a very succinct way of describing regular expressions.

Regular expressions and lexers have been a popular topic among
the theorem proving and functional programming community.
One example is the mechanised correctness proof of Brzozowski's
derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}. Another one
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Also Ribeiro and Du Bois gave one in Agda \parencite{RibeiroAgda2017}.
The most recent works on verified lexers, to the best of our knowledge,
are the Verbatim \cite{Verbatim}
and Verbatim++ \cite{Verbatimpp} lexers by Egolf et al.
Verbatim is based on derivatives but does not simplify them.
Therefore it can be very slow on evil regular expressions.
The Verbatim++ lexer adds many correctness-preserving
optimisations to the Verbatim lexer,
and is therefore quite fast on many inputs.
The problem is that Egolf et al. chose to use traditional DFAs to speed up lexing,
but then dealing with bounded repetitions becomes a real bottleneck.
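A standard example of this bottleneck (our illustration, not one taken
from Egolf et al.) is the pattern
$(a+b)^* \cdot a \cdot (a+b)^{\{n\}}$: any DFA for it has to remember
which of the last $n+1$ characters read were $a$'s, and therefore needs
at least $2^n$ states, even though the pattern itself is tiny.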

This thesis builds on the formal specifications of POSIX
rules and formal proofs by Ausaf et al. \cite{AusafDyckhoffUrban2016}.
The work on bounded repetitions presented here is a continuation of the work by Ausaf \cite{Ausaf}.

Automata formalisations are in general harder and more cumbersome to deal
with in theorem provers \cite{Nipkow1998}.
One way to represent them is to use graphs, but graphs are not inductive datatypes.
Having to define the induction principle on the number of nodes
is rather cumbersome,
resulting in large formalisations \cite{Lammich2012}.
When combining two graphs, one also needs to make sure that the nodes in
both graphs are distinct;
if they are not distinct, then the nodes need to be renamed.
Using Coq, which provides dependent types, can potentially make things slightly easier
\cite{Doczkal2013}.

Another representation for automata is matrices,
but induction over them is not very straightforward either.
Both approaches have been used in the past and resulted in huge formalisations.
There are some more clever representations, for example one by Paulson
using hereditarily finite sets \cite{Paulson2015},
where the problem of combining graphs can be handled better.
%but we believe that such clever tricks are not very obvious for
%the John-average-Isabelle-user.
Because of these problems with automata, we prefer regular expressions
and derivatives, which can be implemented in theorem provers and
functional programming languages with ease.

\subsection{Different Definitions of POSIX Rules}
There are different ways to formalise values and POSIX matching.
Cardelli and Frisch \cite{Frisch2004} have developed a notion of
\emph{non-problematic values}, which is a slight variation
of the values used in this thesis.
They refer to the more restrictive version as used
in this thesis (which was defined by Ausaf et al.
\cite{AusafDyckhoffUrban2016}) as \emph{canonical values}.
The very interesting link between the work by Ausaf et al.
and Okui and Suzuki is that they have distinct formalisations
of POSIX values, and yet they define the same notion. See
\cite{AusafDyckhoffUrban2016} for details of the
alternative definitions given by Okui and Suzuki and the formalisation
described in this thesis.
%TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.

Sulzmann and Lu themselves have come up with a POSIX definition \cite{Sulzmann2014}.
In their paper they defined an ordering between values with respect
to regular expressions, and tried to establish that their
algorithm outputs the minimal element by a pencil-and-paper proof.
But because the ordering relation takes regular expressions as parameters,
the transitivity of their ordering does not go through.

\subsection{Static Analysis of Evil Regex Patterns}
When faced with catastrophic backtracking,
programmers sometimes
have to rewrite their regexes in an ad hoc fashion.
Knowing which patterns should be avoided
can be helpful.
Tools for this
often over-approximate evil regular expression patterns,
and there can be many false positives \cite{Davis18}.
Animated tools to ``debug'' regular expressions such as
\parencite{regexploit2021} and \parencite{regex101} are also popular.
We are also aware of static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour with backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with
non-linear polynomial worst-time estimates
for regexes, ``attack strings'' that exploit the worst-time
scenario, and an ``attack automaton'' that generates
attack strings.

\section{Optimisations}
Perhaps the biggest problem that prevents derivative-based lexing
from being more widely adopted
is that it tends not to be very fast in practice, unable to
reach throughputs of gigabytes per second, which is the kind of
performance we have in mind when working on this topic.
Commercial
regular expression matchers such as Snort \cite{Snort1999} and Bro \cite{Bro}
are capable of inspecting payloads
at line rates (which can be up to a few gigabits per second) against
thousands of regex rules \cite{communityRules}.
For our algorithm to be more attractive for practical use, we
need more correctness-preserving optimisations.

FPGA-based implementations such as \cite{Sidhu2001}
have the advantage of being
reconfigurable and parallel, but suffer from lower clock frequencies
and limited scalability.
Traditional automaton approaches that use DFAs instead of NFAs
benefit from the fact that only a single transition is needed
for each input character \cite{Becchi08}; the lower memory bandwidth leads
to faster performance.
However, they suffer from exponential memory requirements on bounded repetitions.
Compression techniques are used, such as those in \cite{Kumar2006} and
\cite{Becchi2007}.
Variations of pure NFAs or DFAs, like the counting-set automata of \cite{Turonova2020},
have been
proposed to better deal with bounded repetitions,
but they are usually not used in formalised proofs.

%So far our focus has been mainly on the bit-coded algorithm,
%but the injection-based lexing algorithm
%could also be sped up in various ways:
%
One possible optimisation is to take derivatives with respect to
several characters at a time, instead of
character-by-character derivatives. For example, instead of
replacing $(r_1 + r_2)\backslash (c::cs)$ by
$((r_1 + r_2)\backslash c)\backslash cs$ (as in definition \ref{table:der}), we rather
calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
This has the potential to speed up matching because input is
processed at a larger granularity.
One interesting point to explore is whether this can be done
for $\inj$ as well, so that we can generate a lexical value
rather than simply obtain a matcher.
It is also not yet clear how this idea can be extended to sequences and stars.
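For alternatives alone, the idea can be sketched as follows in Scala
(the datatype and function names are ours, for illustration only); the
alternative case pushes the whole remaining string into both branches,
while all other cases fall back to single-character steps:
\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp                   // matches no string
case object ONE extends Rexp                    // matches the empty string
case class CHR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// the usual single-character derivative
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// string derivative: distribute the whole string over alternatives,
// otherwise peel off one character at a time
def ders(s: List[Char], r: Rexp): Rexp = (s, r) match {
  case (Nil, r1) => r1
  case (_, ALT(r1, r2)) => ALT(ders(s, r1), ders(s, r2))
  case (c :: cs, r1) => ders(cs, der(c, r1))
}
\end{verbatim}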
\subsection{Derivatives and Zippers}
Zippers are a data structure designed to focus on
and navigate between local parts of a tree.
The idea is that operations on large trees often only deal with
local regions at a time.
Therefore it would be a waste to
traverse the entire tree if
the operation only
involves a small fraction of it.
Zippers were first formally described by Huet \cite{Huet1997}.
Typical applications of zippers involve text editor buffers
and proof system databases.
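The core of the idea can be seen in the following minimal Huet-style
zipper for binary trees (our sketch in Scala; Huet's original is in
OCaml and more general):
\begin{verbatim}
sealed trait Tree
case class Leaf(v: Int) extends Tree
case class Node(l: Tree, r: Tree) extends Tree

// the context records the path from the focus back to the root,
// keeping the siblings that were passed on the way down
sealed trait Ctx
case object Top extends Ctx
case class WentLeft(parent: Ctx, right: Tree) extends Ctx
case class WentRight(left: Tree, parent: Ctx) extends Ctx

case class Zipper(focus: Tree, ctx: Ctx) {
  def downLeft: Zipper = focus match {
    case Node(l, r) => Zipper(l, WentLeft(ctx, r))
    case _          => this
  }
  def up: Zipper = ctx match {
    case WentLeft(parent, right) => Zipper(Node(focus, right), parent)
    case WentRight(left, parent) => Zipper(Node(left, focus), parent)
    case Top                     => this
  }
  // editing at the focus is constant-time; the rest of the tree is shared
  def replace(t: Tree): Zipper = Zipper(t, ctx)
}
\end{verbatim}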

In our setting, the idea is to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
Below we describe several works on parsing, derivatives
and zippers.

Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
\cite{Zippy2020}.
They adopted zippers to improve the speed, and argued that the runtime
complexity of their algorithm is linear with respect to the input string,
although they did not provide a formal proof of this claim.

The idea of using Brzozowski derivatives for general context-free
parsing was first implemented
by Might et al. \cite{Might2011}.
They used memoization and fixpoint constructions to eliminate infinite recursion,
which is a major problem when using derivatives with context-free grammars.
Their initial version was quite slow: exponential in the size of the input.
Adams et al. \cite{Adams2016} improved the speed and argued that their version
was cubic with respect to the input.
Darragh and Adams \cite{Darragh2020} further improved the performance
by using zippers in an innovative way: their zippers have multiple focuses
instead of the single one of a traditional zipper, in order to handle ambiguity.
Their algorithm was not formalised, though.

\subsection{Back-References}
We have already introduced back-references
in chapter \ref{Introduction}.
We adopt the common practice of calling them rewbrs (Regular Expressions
With Back References) for brevity.
It has been shown by Aho \cite{Aho1990}
that the k-vertex cover problem can be reduced
to the problem of rewbrs matching; matching with rewbrs is therefore NP-complete.
Given the depth of the problem, progress on matching
arbitrary rewbrs in full generality has been slow, with
theoretical work on them being
fairly recent.

Campeanu et al. studied rewbrs
in the context of formal language theory in
\cite{campeanu2003formal}.
They devised a pumping lemma for Perl-like regexes,
proving that the languages denoted by them
are properly contained in the context-sensitive
languages.
More interesting questions, such as
whether the languages denoted by Perl-like regexes
can express palindromes ($\{w \mid w = w^R\}$),
are discussed in \cite{campeanu2009patterns}
and \cite{CAMPEANU2009Intersect}.
Freydenberger \cite{Frey2013} investigated the
expressive power of back-references. He showed several
undecidability and descriptional complexity results
for back-references, concluding that they add
great power to certain programming tasks but are difficult to comprehend.
An interesting question would then be
whether we can add restrictions to them,
so that they become expressive enough for practical use, such
as matching html files, but not too powerful.
Freydenberger and Schmid \cite{FREYDENBERGER20191}
introduced the notion of deterministic
regular expressions with back-references to achieve
a better balance between expressiveness and tractability.

Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
investigated the time complexity of different variants
of back-references.
We are not aware of any work that uses derivatives with back-references.

%See \cite{BERGLUND2022} for a survey
%of these works and a comparison between different
%flavours of back-references syntax (e.g. whether references can be circular,
%whether labels can be repeated etc.).