In this chapter, we introduce
the work relevant to this thesis.


\section{Work on Back-References}
We introduced regular expressions with back-references
in chapter \ref{Introduction}.
We adopt the common practice of calling them rewbrs (Regular Expressions
With Back References) in this section for brevity.
It has been shown by Aho \cite{Aho1990}
that the k-vertex cover problem can be reduced
to the rewbr matching problem, which is therefore NP-complete.
Given the depth of the problem, the
theoretical work on them is
fairly recent.
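To illustrate the power that back-references add, consider a standard example
(ours, not taken from the cited works): under full-match semantics, the rewbr
\[
(\,(a+b)^{*}\,)\backslash 1,
\]
where $\backslash 1$ matches a second copy of whatever the bracketed group
captured, accepts exactly the copy language $\{ww \mid w \in \{a,b\}^{*}\}$,
which is not even context-free.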


Campeanu et al. studied regexes with back-references
in the context of formal language theory in
their 2003 work \cite{campeanu2003formal}.
They devised a pumping lemma for Perl-like regexes,
as well as undecidability and descriptional complexity results
for back-references, concluding that they add
great power to certain programming tasks but are difficult to harness.
An interesting question would then be
whether we can add restrictions to them,
so that they become expressive enough for practical tasks such
as processing HTML files, but not too powerful.
Freydenberger and Schmid \cite{FREYDENBERGER20191}
introduced the notion of deterministic
regular expressions with back-references to achieve
a better balance between expressiveness and tractability.


Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
investigated the time complexity of different variants
of back-references.
|
We are not aware of any work that uses derivatives on back-references.

See \cite{BERGLUND2022} for a survey
of these works and a comparison between different
flavours of back-reference syntax (e.g.\ whether references can be circular,
whether labels can be repeated, etc.).
of the work by Krauss and Nipkow \parencite{Krauss2011}.
Another formalisation in Coq is given by Coquand and Siles \parencite{Coquand2012},
and Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.


\subsection{Static Analysis of Evil Regex Patterns}
When a regular expression does not behave as intended,
people usually try to rewrite the regex to some equivalent form,
or they try to avoid the possibly problematic patterns completely,
an approach for which many false positives exist \parencite{Davis18}.
Animated tools to ``debug'' regular expressions such as
\parencite{regexploit2021, regex101} are also popular.
does not come with certain basic complexity-related
guarantees (for example, that the internal data structure size
does not grow indefinitely),
then they cannot claim with confidence to have solved the problem
of catastrophic backtracking.


\section{Derivatives and Zippers}
|
A zipper is a data structure designed to focus on
and navigate between local parts of a tree.
The idea is that operations on a large tree often
only deal with local regions at a time.
It was first formally described by Huet \cite{HuetZipper}.
Typical applications of zippers include text editor buffers
and proof system databases.
In our setting, the idea is to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
We draw our inspiration from several works on parsing, derivatives
and zippers.
|

Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
\cite{Zippy2020}.
They adopted zippers to improve the speed, and argued that the runtime
complexity of their algorithm was linear with respect to the input string.

|
The idea of using Brzozowski derivatives for general context-free
parsing was first implemented
by Might et al. \cite{Might2011}.
They used memoization and a fixpoint construction to eliminate infinite recursion,
which is a major problem for using derivatives on context-free grammars.
The initial version was quite slow: exponential in the size of the input.
Adams et al. \cite{Adams2016} improved the speed and argued that their version
was cubic with respect to the input.
Darragh and Adams \cite{10.1145/3408990} further improved the performance
by using zippers in an innovative way: their zippers had multiple focuses,
instead of the single focus of a traditional zipper, to handle ambiguity.
Their algorithm was not formalised, though.

|
We aim to integrate the zipper data structure into our lexer.
The idea is very simple: using a zipper with multiple focuses,
just as Darragh and Adams did, we could represent
\[
x\cdot r + y \cdot r + \ldots
\]
as
\[
(x+y + \ldots) \cdot r.
\]
This would greatly reduce the amount of space needed
when we have many terms of the form $x\cdot r$.

|
This allows the lexer to traverse, at each intermediate step,
not the entire derivative tree generated so far,
but only the small segment that is currently active.

|
Some initial results have been obtained, but more care needs to be taken to make sure
that the lexer output conforms to the POSIX standard. Formalising the correctness of
such a lexer using zippers will probably require an imperative
framework with separation logic, as it involves
mutable data structures; this is also interesting to look into.

|
To further optimise the algorithm,
I plan to add a ``deduplicate'' function that tells
whether two regular expressions denote the same
language, using
an efficient and verified equivalence checker.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.


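As a rough illustration of how such a checker can operate (a sketch of ours, not the verified implementation mentioned above), two regular expressions denote the same language precisely when no pair of their simultaneous derivatives disagrees on nullability; with enough simplification, only finitely many derivative pairs arise:

```python
# Sketch of a derivative-based equivalence check (our own illustration,
# not the verified checker referred to above).  Regexes are nested tuples.

ZERO, ONE = ("zero",), ("one",)

def char(c):
    return ("chr", c)

def alt(r, s):
    # Smart constructor: basic simplifications keep derivatives small.
    if r == ZERO: return s
    if s == ZERO: return r
    if r == s: return r
    return ("alt", r, s)

def seq(r, s):
    if r == ZERO or s == ZERO: return ZERO
    if r == ONE: return s
    if s == ONE: return r
    return ("seq", r, s)

def star(r):
    return ("star", r)

def nullable(r):
    tag = r[0]
    if tag in ("one", "star"): return True
    if tag in ("zero", "chr"): return False
    if tag == "alt": return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # seq

def der(c, r):
    tag = r[0]
    if tag in ("zero", "one"): return ZERO
    if tag == "chr": return ONE if r[1] == c else ZERO
    if tag == "alt": return alt(der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        d = seq(der(c, r[1]), r[2])
        return alt(d, der(c, r[2])) if nullable(r[1]) else d
    return seq(der(c, r[1]), r)  # star

def equiv(r, s, alphabet):
    # r and s are equivalent iff no reachable pair of simultaneous
    # derivatives disagrees on nullability.  Termination here relies on
    # the simplifications above; a verified checker would normalise more
    # aggressively (e.g. up to associativity/commutativity/idempotence).
    seen, todo = set(), [(r, s)]
    while todo:
        p, q = todo.pop()
        if (p, q) in seen:
            continue
        seen.add((p, q))
        if nullable(p) != nullable(q):
            return False
        todo.extend((der(c, p), der(c, q)) for c in alphabet)
    return True

a = char("a")
# a* is equivalent to 1 + a . a*, but not to the single character a:
same = equiv(star(a), alt(ONE, seq(a, star(a))), "a")   # True
diff = equiv(star(a), a, "a")                           # False
```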
|
We first give a brief introduction to what zippers are,
and to other works
that apply zippers to derivatives.
When dealing with large trees, it would be a waste to
traverse the entire tree if
the operation only
involves a small fraction of it.
The idea is to put the focus on that subtree, turning the other parts
of the tree into a context.
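The idea can be made concrete with a small sketch (our own code, with our own names, following Huet's presentation only loosely): a location pairs the subtree in focus with the context needed to rebuild the rest of the tree, so local edits never touch the parts outside the focus.

```python
# A minimal tree zipper (our own illustrative code, loosely following
# Huet's presentation).  A location pairs the subtree in focus with the
# context, i.e. the path of ancestors and the siblings left behind.

class Leaf:
    def __init__(self, value):
        self.value = value

class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right

class Location:
    def __init__(self, focus, path):
        self.focus = focus  # the subtree currently in focus
        self.path = path    # list of (side, sibling) pairs, innermost first

    def down_left(self):
        return Location(self.focus.left, [("L", self.focus.right)] + self.path)

    def down_right(self):
        return Location(self.focus.right, [("R", self.focus.left)] + self.path)

    def up(self):
        (side, sibling), rest = self.path[0], self.path[1:]
        if side == "L":
            return Location(Node(self.focus, sibling), rest)
        return Location(Node(sibling, self.focus), rest)

    def replace(self, subtree):
        # Edit the focused subtree in O(1); the context is untouched.
        return Location(subtree, self.path)

def rebuild(loc):
    # Zip the tree back up from the focus.
    while loc.path:
        loc = loc.up()
    return loc.focus

tree = Node(Node(Leaf(1), Leaf(2)), Leaf(3))
loc = Location(tree, []).down_left().down_right()   # focus on Leaf(2)
edited = rebuild(loc.replace(Leaf(42)))             # tree with 2 replaced by 42
```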
|


One observation about our derivative-based lexing algorithm is that
the derivative operation sometimes traverses the entire regular expression
unnecessarily:
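A tiny experiment (our own sketch) makes this observation concrete: deriving an $n$-way alternation visits every alternative, even when the next character matches none of them.

```python
# Our own sketch of the observation above: taking the derivative of an
# n-way alternation c1 + c2 + ... + cn walks the whole expression, even
# when the character matches none of the alternatives.

def alts(s):
    # Right-nested alternation of the characters of s.
    r = ("chr", s[-1])
    for c in reversed(s[:-1]):
        r = ("alt", ("chr", c), r)
    return r

def der(c, r, count):
    count[0] += 1  # one tick per node the derivative touches
    if r[0] == "chr":
        return ("one",) if r[1] == c else ("zero",)
    return ("alt", der(c, r[1], count), der(c, r[2], count))

count = [0]
d = der("z", alts("abcdefghij"), count)   # 10 alternatives, none match "z"
# count[0] == 19: all 9 alternation nodes and all 10 character nodes visited
```

A zipper would let us rebuild only the focused part of such an expression, which is precisely the saving we are after.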