Regular expressions
have been extensively studied and
implemented since their invention in the 1940s.
They are primarily used in lexing, where an unstructured input string is broken
down into a tree of tokens whose construction is guided by the shape of the regular expression.
Being able to delimit different subsections of a string and potentially
extract information from inputs, regular expression
matchers and lexers serve as an indispensable component in the text processing tasks of modern software systems
such as compilers, IDEs, and firewalls.
Research on them is far from over due to the well-known issue of catastrophic backtracking.
This stems from the ambiguities of lexing: when matching a multi-character string with a regular
expression that includes several sub-expressions, there might be different positions to set
the border between the sub-strings corresponding to the sub-expressions.
For example, when matching the string $aaa$ against the regular expression
$(a+aa)^*$, the border between the initial match and the second iteration
could be between the first and second character ($a \mid aa$)
or between the second and third character ($aa \mid a$).
As the size of the input string and the structural complexity
of the regular expression increase,
|
the number of different combinations of setting delimiters can grow exponentially, and
algorithms that explore these possibilities unwisely will also see exponential complexity.
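The exponential growth can be made concrete by counting the decompositions of $a^n$ under $(a+aa)^*$. The following is a small Python sketch (the function name \texttt{splits} is ours, not from the literature): each split of $a^n$ peels off either one or two characters, so the count obeys the Fibonacci recurrence and grows exponentially in $n$.

```python
from functools import lru_cache

# Count the distinct ways a^n can be split into blocks matching the
# alternatives 'a' and 'aa' of (a + aa)*: each split peels off either
# one 'a' or two, giving the Fibonacci recurrence.
@lru_cache(maxsize=None)
def splits(n: int) -> int:
    if n <= 1:
        return 1  # the empty string and a single 'a' each split one way
    return splits(n - 1) + splits(n - 2)

print(splits(3))   # 3 ways: a|a|a, a|aa, aa|a
print(splits(30))  # 1346269 -- already over a million
```

A matcher that explores these splits one by one therefore does exponential work on such inputs.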
|

Catastrophic backtracking allows a class of computationally inexpensive attacks called
Regular expression Denial of Service attacks (ReDoS).
These attacks, whether deliberate or not, have caused real-world systems to go down (see
the practical overview section in chapter \ref{Introduction} for more details).
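The effect is easy to reproduce with a mainstream backtracking engine. The sketch below uses Python's \texttt{re} module (a backtracking matcher) on the pathological pattern $(a^*)^*\,b$ against inputs $a^n$ that cannot match; the running time roughly doubles with each extra character.

```python
import re
import time

# (a*)*b on a^n (no 'b') forces a backtracking matcher to try every way
# of distributing the a's between the inner and outer star before it
# can report failure.
pattern = re.compile(r'(a*)*b')

for n in range(16, 24, 2):
    start = time.perf_counter()
    result = pattern.match('a' * n)
    elapsed = time.perf_counter() - start
    print(f"n={n}: match={result} elapsed={elapsed:.3f}s")
```

On a matching input such as \texttt{"aab"} the same pattern succeeds immediately; only the failing inputs trigger the exponential search.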
|
There are different disambiguation strategies to select sub-matches, most notably Greedy and POSIX.
The widely adopted strategy in practice is POSIX, which always goes for the longest sub-match possible.
The POSIX rule has so far mostly been given in prose, for example
by Kuklewicz \cite{KuklewiczHaskell} (section 1, last paragraph):
|
\begin{quote}
\begin{itemize}
\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there
earlier subpatterns have leftmost-longest priority over later subpatterns\\
\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
\item
REs have right associative concatenation which can be changed with parenthesis\\
\item
parenthesized subexpressions return the match from their last usage\\
\item
text of component subexpressions must be contained in the text of the
higher-level subexpressions\\
\item
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
\item
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
\end{itemize}
\end{quote}
|
However, the author noted that lexers are rarely correct according to this standard.
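The gap between the POSIX rule and what common engines do can be observed directly. Python's \texttt{re} module is a greedy backtracking matcher, not a POSIX one: it tries alternatives left to right and commits to the first overall match. On the example below (chosen by us for contrast, not taken from Kuklewicz) it returns a different sub-match split than the leftmost-longest rule would.

```python
import re

# A greedy engine commits to the first alternative that lets the whole
# match succeed: group 1 gets the short alternative 'a'.
m = re.fullmatch(r'(a|ab)(c|bcd)(d*)', 'abcd')
print(m.groups())  # ('a', 'bcd', '')

# A POSIX (leftmost-longest) engine would instead maximise the earlier
# subpattern first and report ('ab', 'c', 'd').
```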
|

Formal proofs, such as those written in Isabelle/HOL, are a powerful means
for computer scientists to be certain about the correctness of their algorithms.
This is done by automatically checking that the end goal is derived solely from a list of axioms, definitions
and proof rules that are known to be correct.
Regular expression lexing and catastrophic backtracking are suitable problems for such
methods, as their implementations and complexity analyses tend to be error-prone.
|
Consider for example the regular expression $(a^*)^*\,b$ and
%as a DFA.


%----------------------------------------------------------------------------------------
\section{Contribution}
|
In this thesis,
we propose a solution to catastrophic
backtracking and error-prone matchers: a formally verified
regular expression lexing algorithm
that is both fast
and correct.
|
In particular, the main problem we solved on top of previous work was
coming up with a formally verified algorithm called $\blexersimp$ based
on Brzozowski derivatives. It calculates a POSIX
lexical value from a string and a regular expression. This algorithm was originally proposed
by Sulzmann and Lu \cite{Sulzmann2014}, but we made the key observation that its $\textit{nub}$
function does not really simplify intermediate results where it needs to, and we improved the
algorithm accordingly.
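For readers unfamiliar with Brzozowski derivatives, the core idea can be sketched in a few lines of Python (our own illustration: it is only the plain matcher, without the bitcodes, POSIX value construction, or simplification that $\blexersimp$ adds):

```python
from dataclasses import dataclass

class Rexp:
    pass

@dataclass(frozen=True)
class Zero(Rexp):   # matches no string
    pass

@dataclass(frozen=True)
class One(Rexp):    # matches only the empty string
    pass

@dataclass(frozen=True)
class Chr(Rexp):
    c: str

@dataclass(frozen=True)
class Alt(Rexp):
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Seq(Rexp):
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Star(Rexp):
    r: Rexp

def nullable(r: Rexp) -> bool:
    """Can r match the empty string?"""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    return False

def der(c: str, r: Rexp) -> Rexp:
    """The derivative of r w.r.t. c: what remains of r after reading c."""
    if isinstance(r, Chr):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = Seq(der(c, r.r1), r.r2)
        return Alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return Seq(der(c, r.r), r)
    return Zero()   # Zero and One derive to Zero

def matches(r: Rexp, s: str) -> bool:
    for c in s:
        r = der(c, r)
    return nullable(r)

# (a*)* b matches a^n b for any n, by taking derivatives character by character
aab = Seq(Star(Star(Chr('a'))), Chr('b'))
print(matches(aab, 'aaab'))  # True
print(matches(aab, 'aaa'))   # False
```

Without simplification, the derivatives produced by the \texttt{Seq} and \texttt{Star} cases grow quickly; keeping them small is exactly what the simplification phase of $\blexersimp$ is for.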
|
We have proven that $\blexersimp$'s internal data structure does not grow beyond a bound $N_r$
depending only on the input regular expression $r$, thanks to the aggressive simplifications of $\blexersimp$:
\begin{theorem}
$|\blexersimp \; r \; s | \leq N_r$
\end{theorem}
|
The simplifications applied in each step of $\blexersimp$
keep the derivatives small, but present a
challenge in
establishing a correctness theorem of the following form:
\begin{theorem}
If the POSIX lexical value of matching the regular expression $r$ with the string $s$ is $v$,
then $\blexersimp\; r \; s = \Some \; v$. Otherwise
$\blexersimp \; r \; s = \None$.
\end{theorem}
|
The result is %a regular expression lexing algorithm that comes with
\begin{itemize}
\item
an improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives with simplifications,