In this chapter, we introduce
the work relevant to this thesis.

\section{Work on Back-References}
We introduced regular expressions with back-references
in chapter \ref{Introduction}.
We adopt the common practice of calling them rewbrs (Regular Expressions
With Back References) in this section for brevity.
It has been shown by Aho \cite{Aho1990} that the matching problem
for rewbrs is NP-complete, by a reduction from the k-vertex-cover problem.
Given the difficulty of the problem, theoretical work on rewbrs is
fairly recent.

Câmpeanu et al. studied regexes with back-references
in the context of formal language theory in
their 2003 work \cite{campeanu2003formal}.
They devised a pumping lemma for Perl-like regexes,

undecidability and descriptional complexity results
of back-references, concluding that they add
great power to certain programming tasks but are difficult to harness.
An interesting question would then be
whether we can add restrictions to them,
so that they become expressive enough for practical use cases, such
as matching HTML files, but not too powerful.
Freydenberger and Schmid \cite{FREYDENBERGER20191}
introduced the notion of deterministic
regular expressions with back-references to achieve
a better balance between expressiveness and tractability.

Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
investigated the time complexity of different variants
of back-references.
We are not aware of any work that uses derivatives on back-references.

See \cite{BERGLUND2022} for a survey
of these works and a comparison between different
flavours of back-reference syntax (e.g.\ whether references can be circular,
whether labels can be repeated, etc.).
of the work by Krauss and Nipkow \parencite{Krauss2011}.
Another one in Coq is given by Coquand and Siles \parencite{Coquand2012},
and Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.

\subsection{Static Analysis of Evil Regex Patterns}
When a regular expression does not behave as intended,
people usually try to rewrite the regex to some equivalent form,
or they try to avoid the possibly problematic patterns completely,
an approach for which many false positives exist \parencite{Davis18}.
Animated tools to ``debug'' regular expressions, such as
\parencite{regexploit2021} and \parencite{regex101}, are also popular.

does not come with certain basic complexity-related
guarantees (for example that the internal data structure size
does not grow indefinitely),
then they cannot claim with confidence to have solved the problem
of catastrophic backtracking.

       
\section{Derivatives and Zippers}
The zipper is a data structure designed to focus on
and navigate between local parts of a tree.
The idea is that operations on a large tree often only deal with
local regions at a time.
It was first formally described by Huet \cite{HuetZipper}.
Typical applications of zippers involve text editor buffers
and proof system databases.
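To give a concrete feel for the idea, the following is a minimal sketch of a
Huet-style zipper for binary trees, written in Scala; the datatype and
function names are our own choice and are not taken from any of the
formalisations cited in this chapter.

\begin{verbatim}
sealed trait Tree
case class Leaf(n: Int) extends Tree
case class Node(left: Tree, right: Tree) extends Tree

// The context records everything of the tree that is not in focus:
// the path back to the root together with the siblings along the way.
sealed trait Ctx
case object Top extends Ctx
case class WentLeft(parent: Ctx, right: Tree) extends Ctx
case class WentRight(left: Tree, parent: Ctx) extends Ctx

// A zipper pairs the subtree in focus with its context.
case class Zipper(focus: Tree, ctx: Ctx) {
  def downLeft: Zipper = focus match {
    case Node(l, r) => Zipper(l, WentLeft(ctx, r))
    case _          => this
  }
  def up: Zipper = ctx match {
    case WentLeft(parent, r)  => Zipper(Node(focus, r), parent)
    case WentRight(l, parent) => Zipper(Node(l, focus), parent)
    case Top                  => this
  }
  // A local edit only touches the focus; the context is left untouched.
  def modify(f: Tree => Tree): Zipper = Zipper(f(focus), ctx)
}
\end{verbatim}

Navigating back up with \texttt{up} rebuilds only the nodes on the path from
the focus to the root, which is what makes local edits cheap.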
       
In our setting, the idea is to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
We draw our inspiration from several works on parsing, derivatives
and zippers.
       
Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
\cite{Zippy2020}.
They adopted zippers to improve the speed, and argued that the runtime
complexity of their algorithm is linear with respect to the input string.

The idea of using Brzozowski derivatives for general context-free
parsing was first implemented
by Might et al. \cite{Might2011}.
They used memoization and a fixpoint construction to eliminate infinite recursion,
which is a major problem when using derivatives on context-free grammars.
The initial version was quite slow: exponential in the size of the input.
Adams et al. \cite{Adams2016} improved the speed and argued that their version
is cubic with respect to the input.
Darragh and Adams \cite{10.1145/3408990} further improved the performance
by using zippers in an innovative way: their zippers have multiple focuses,
instead of the single focus of a traditional zipper, in order to handle ambiguity.
Their algorithm was not formalised, though.
       
We aim to integrate the zipper data structure into our lexer.
The idea is very simple: using a zipper with multiple focuses
just like Darragh and Adams did, we could represent
\[
	x\cdot r + y \cdot r + \ldots
\]
as
\[
	(x+y + \ldots) \cdot r.
\]
This would greatly reduce the amount of space needed
when we have many terms like $x\cdot r$.
       
Integrating zippers into the current lexer also allows it to avoid
traversing the entire derivative tree generated
in every intermediate step, and to work only on the small segment
that is currently active.
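To illustrate the kind of sharing we have in mind, the following Scala sketch
factors sequence terms with a common tail into a single alternative under
that tail; the datatype is a hypothetical simplification and not the lexer's
actual (annotated) regular expressions.

\begin{verbatim}
sealed trait Rexp
case object ZERO extends Rexp
case class CHAR(c: Char) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp

// Rewrite x.r + y.r + ... into (x + y + ...).r when all terms share the
// same tail r, so that r is stored only once.
def factorTail(terms: List[SEQ]): Rexp = terms match {
  case Nil => ZERO
  case SEQ(_, r) :: _ if terms.forall(_.r2 == r) =>
    SEQ(ALTS(terms.map(_.r1)), r)
  case _ => ALTS(terms)
}
\end{verbatim}

In the actual lexer such a rewrite would additionally have to preserve the
value information attached to each alternative in order to keep the POSIX
result intact, which is why we regard this only as a sketch.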
       
Some initial results have been obtained, but more care needs to be taken to make sure
that the lexer output conforms to the POSIX standard. Formalising the correctness of
such a lexer using zippers will probably require an imperative
framework with separation logic, as it involves
mutable data structures, which is also interesting to look into.
       
To further optimise the algorithm,
we plan to add a ``deduplicate'' function that tells
whether two regular expressions denote the same
language, using
an efficient and verified equivalence checker.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
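A minimal sketch of how such a pruning step might look is given below in
Scala; the function is generic and the parameter \texttt{equiv} stands for
the assumed verified equivalence checker.

\begin{verbatim}
// Keep one representative of each group of language-equivalent
// expressions; `equiv` is the (assumed) verified equivalence checker.
def dedup[A](rs: List[A], equiv: (A, A) => Boolean): List[A] =
  rs.foldLeft(List.empty[A]) { (kept, r) =>
    if (kept.exists(equiv(_, r))) kept else kept :+ r
  }
\end{verbatim}

Since deciding the equivalence of regular expressions is PSPACE-complete in
general, such a check would have to be applied sparingly or only to small
candidate sets.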
       
To recap the idea behind zippers: when dealing with a large tree,
it would be a waste to
traverse the entire tree if
an operation only
involves a small fraction of it.
A zipper puts the focus on that subtree, turning the other parts
of the tree into a context.
       

One observation about our derivative-based lexing algorithm is that
the derivative operation sometimes traverses the entire regular expression
unnecessarily:
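consider, as an illustration of our own (not necessarily the example intended
here), a left-nested sequence whose first component $r_1$ is not nullable.
Then
\[
	\big((\ldots((r_1 \cdot r_2)\cdot r_3)\ldots)\cdot r_n\big)\backslash c
	\;=\;
	(\ldots(((r_1\backslash c) \cdot r_2)\cdot r_3)\ldots)\cdot r_n,
\]
so only the innermost $r_1$ changes, yet the derivative walks down and
rebuilds the entire spine of sequence nodes. A zipper could keep this
unchanged spine as a context and let the derivative work on the focused
$r_1$ alone.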
       