ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 646 56057198e4f5
parent 638 dd9dde2d902b
child 648 d15a0b7d6d90
equal deleted inserted replaced
640:bd1354127574 646:56057198e4f5
   205 
   205 
   206 
   206 
   207 
   207 
   208 
   208 
   209 
   209 
   210 
   210 Regular expressions 
   211 Regular expressions are widely used in computer science: 
   211 have been extensively studied and
   212 be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
   212 implemented since their invention in the 1940s.
   213 command-line tools like $\mathit{grep}$ that facilitate easy 
   213 They are primarily used in lexing, where an unstructured input string is broken
   214 text-processing \cite{grep}; network intrusion
   214 down into a tree of tokens whose construction is guided by the shape of the regular expression.
   215 detection systems that inspect suspicious traffic; or compiler
   215 Because they can delimit different subsections of a string and potentially 
   216 front ends.
   216 extract information from inputs, regular expression 
   217 Given their usefulness and ubiquity, one would assume that
   217 matchers and lexers serve as indispensable components in the text processing done by modern software systems
   218 modern regular expression matching implementations
   218 such as compilers, IDEs, and firewalls.
   219 are mature and fully studied.
   219 Research on them is far from over due to the well-known issue of catastrophic backtracking.
   220 Indeed, in many popular programming languages' regular expression engines, 
   220 This stems from the ambiguity of lexing: when matching a multi-character string against a regular 
   221 one can supply a regular expression and a string, and in most cases get
   221 expression that includes several sub-expressions, there might be different positions at which to set 
   222 get the matching information in a very short time.
   222 the border between sub-expressions' corresponding sub-strings.
   223 Those engines can sometimes be blindingly fast--some 
   223 For example, when matching the string $aaa$ against the regular expression
   224 network intrusion detection systems
   224 $(a+aa)^*$, the border between the initial match and the second iteration
   225 use regular expression engines that are able to process 
   225 could be between the first and second character ($a | aa$) 
   226 hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
   226 or between the second and third character ($aa | a$).
   227 However, those engines can sometimes also exhibit a surprising security vulnerability
   227 As the size of the input string and the structural complexity 
   228 under a certain class of inputs.
   228 of the regular expression increase,
       
   229 the number of different combinations of setting delimiters can grow exponentially, and
       
   230 algorithms that explore these possibilities naively will also exhibit exponential complexity.
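As a toy illustration of this growth (our own back-of-the-envelope sketch, not code from any matching engine), the number of distinct parses of the string $a^n$ under $(a+aa)^*$ obeys a Fibonacci-style recurrence, so a backtracking matcher that blindly enumerates them does exponential work:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def num_parses(n):
    # Count the ways to split a string of n 'a's into blocks "a" and "aa",
    # i.e. the number of distinct parses of a^n under (a + aa)*.
    if n <= 1:
        return 1
    # The first block is either "a" (n-1 characters left) or "aa" (n-2 left).
    return num_parses(n - 1) + num_parses(n - 2)
```

For $n = 3$ this gives the three parses discussed above; by $n = 30$ the count already exceeds one million.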
       
   231 
       
   232 Catastrophic backtracking allows a class of computationally inexpensive attacks called
       
   233 Regular expression Denial of Service attacks (ReDoS). 
       
   234 These attacks, whether deliberate or not, have caused real-world systems to go down (see more 
       
   235 details of this in the practical overview section in chapter \ref{Introduction}).
       
   236 There are different disambiguation strategies to select sub-matches, most notably Greedy and POSIX.
       
   237 The widely adopted strategy in practice is POSIX, which always goes for the longest sub-match possible.
       
   238 There have been prose definitions of this standard, such as the one
       
   239 by Kuklewicz \cite{KuklewiczHaskell}, who 
       
   240 described the POSIX rule as follows (section 1, last paragraph):
       
   241 \begin{quote}
       
   242 	\begin{itemize}
       
   243 		\item
       
   244 regular expressions (REs) take the leftmost starting match, and the longest match starting there
       
   245 earlier subpatterns have leftmost-longest priority over later subpatterns\\
       
   246 \item
       
   247 higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
       
   248 \item
       
   249 REs have right associative concatenation which can be changed with parenthesis\\
       
   250 \item
       
   251 parenthesized subexpressions return the match from their last usage\\
       
   252 \item
       
   253 text of component subexpressions must be contained in the text of the 
       
   254 higher-level subexpressions\\
       
   255 \item
       
   256 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
       
   257 \item
       
   258 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
       
   259 \end{itemize}
       
   260 \end{quote}
       
   261 But the author noted that lexers are rarely correct according to this standard.
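The contrast between Greedy and POSIX can be made concrete on the $(a+aa)^*$ example from earlier. The following sketch (our own illustration, not a production matcher) enumerates every parse of $aaa$ and then picks the greedy one, which prefers the first alternative at each step, and the POSIX one, which prefers the longest sub-match at each step:

```python
def parses(s):
    # All ways to split s into pieces "a" and "aa",
    # i.e. all parses of s under (a + aa)*.
    if s == "":
        return [[]]
    result = []
    for piece in ("a", "aa"):
        if s.startswith(piece):
            result += [[piece] + rest for rest in parses(s[len(piece):])]
    return result

all_parses = parses("aaa")   # three parses in total
# Greedy: lexicographically prefer the first alternative ("a") at each step.
greedy = min(all_parses, key=lambda p: [("a", "aa").index(x) for x in p])
# POSIX: prefer the longest piece at each step.
posix = max(all_parses, key=lambda p: [len(x) for x in p])
# greedy == ["a", "a", "a"], posix == ["aa", "a"]
```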
       
   262 
       
   263 Formal proofs, such as those written in Isabelle/HOL, are a powerful means 
       
   264 for computer scientists to be certain about the correctness properties of their algorithms.
       
   265 This is done by automatically checking that the end goal is derived solely from a list of axioms, definitions
       
   266 and proof rules that are known to be correct.
       
   267 The problems of regular expression lexing and catastrophic backtracking are suitable for such
       
   268 methods, as their implementations and complexity analyses tend to be error-prone.
       
   269 
       
   270 
       
   271 
       
   272 
       
   273 
       
   274 
       
   275 %Regular expressions are widely used in computer science: 
       
   276 %be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
       
   277 %command-line tools like $\mathit{grep}$ that facilitate easy 
       
   278 %text-processing \cite{grep}; network intrusion
       
   279 %detection systems that inspect suspicious traffic; or compiler
       
   280 %front ends.
       
   281 %Given their usefulness and ubiquity, one would assume that
       
   282 %modern regular expression matching implementations
       
   283 %are mature and fully studied.
       
   284 %Indeed, in many popular programming languages' regular expression engines, 
       
   285 %one can supply a regular expression and a string, and in most cases get
       
   286 %get the matching information in a very short time.
       
   287 %Those engines can sometimes be blindingly fast--some 
       
   288 %network intrusion detection systems
       
   289 %use regular expression engines that are able to process 
       
   290 %hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
       
   291 %However, those engines can sometimes also exhibit a surprising security vulnerability
       
   292 %under a certain class of inputs.
   229 %However, , this is not the case for $\mathbf{all}$ inputs.
   293 %However, , this is not the case for $\mathbf{all}$ inputs.
   230 %TODO: get source for SNORT/BRO's regex matching engine/speed
   294 %TODO: get source for SNORT/BRO's regex matching engine/speed
   231 
   295 
   232 
   296 
   233 Consider for example the regular expression $(a^*)^*\,b$ and 
   297 Consider for example the regular expression $(a^*)^*\,b$ and 
   970 	``The POSIX strategy is more complicated than the 
  1034 	``The POSIX strategy is more complicated than the 
   971 	greedy because of the dependence on information about 
  1035 	greedy because of the dependence on information about 
   972 	the length of matched strings in the various subexpressions.''
  1036 	the length of matched strings in the various subexpressions.''
   973 \end{quote}
  1037 \end{quote}
   974 %\noindent
  1038 %\noindent
   975 We think the implementation complexity of POSIX rules also come from
  1039 People have recognised that the implementation complexity of POSIX rules also comes from
   976 the specification being not very precise.
  1040 the specification being not very precise.
   977 The developers of the regexp package of Go 
  1041 The developers of the regexp package of Go 
   978 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
  1042 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
   979 commented that
  1043 commented that
   980 \begin{quote}\it
  1044 \begin{quote}\it
  1008 \end{itemize}
  1072 \end{itemize}
  1009 \end{quote}
  1073 \end{quote}
  1010 %The text above 
  1074 %The text above 
  1011 %is trying to capture something very precise,
  1075 %is trying to capture something very precise,
  1012 %and is crying out for formalising.
  1076 %and is crying out for formalising.
       
  1077 Ribeiro and Du Bois \cite{RibeiroAgda2017} have 
       
  1078 formalised the notion of bit-coded regular expressions
       
  1079 and proved their relations with simple regular expressions in
       
  1080 the dependently-typed proof assistant Agda.
       
  1081 They also proved the soundness and completeness of a matching algorithm
       
  1082 based on the bit-coded regular expressions.
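The idea behind such bit-codings can be sketched as follows (a simplified rendering with our own ad-hoc value representation; the exact bit conventions vary between presentations): a lexical value records how a string matched a regular expression, and its bit-code retains only the disambiguation choices, from which the full value can be reconstructed when the regular expression is known.

```python
def code(v):
    # Compress a lexical value (nested tuples) into its disambiguation bits.
    tag = v[0]
    if tag == "Empty":
        return []                       # match of the empty-string regex: no choice
    if tag == "Char":
        return []                       # a literal match involves no choice
    if tag == "Left":
        return [0] + code(v[1])         # took the left alternative
    if tag == "Right":
        return [1] + code(v[1])         # took the right alternative
    if tag == "Seq":
        return code(v[1]) + code(v[2])
    if tag == "Stars":                  # bit 0 per iteration, bit 1 to stop
        return [b for u in v[1] for b in [0] + code(u)] + [1]

# The POSIX parse "aa | a" of the string aaa under (a + aa)*,
# with the piece aa represented as a sequence of two literals:
v = ("Stars", [("Right", ("Seq", ("Char", "a"), ("Char", "a"))),
               ("Left", ("Char", "a"))])
# code(v) == [0, 1, 0, 0, 1]
```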
  1013 Ausaf et al. \cite{AusafDyckhoffUrban2016}
  1083 Ausaf et al. \cite{AusafDyckhoffUrban2016}
  1014 are the first to
  1084 are the first to
  1015 give a quite simple formalised POSIX
  1085 give a quite simple formalised POSIX
  1016 specification in Isabelle/HOL, and also prove 
  1086 specification in Isabelle/HOL, and also prove 
  1017 that their specification coincides with an earlier (unformalised) 
  1087 that their specification coincides with an earlier (unformalised) 
  1022 They also found that the informal POSIX
  1092 They also found that the informal POSIX
  1023 specification by Sulzmann and Lu needs to be substantially 
  1093 specification by Sulzmann and Lu needs to be substantially 
  1024 revised in order for the correctness proof to go through.
  1094 revised in order for the correctness proof to go through.
  1025 Their original specification and proof were unfixable
  1095 Their original specification and proof were unfixable
  1026 according to Ausaf et al.
  1096 according to Ausaf et al.
       
  1097 
  1027 
  1098 
  1028 In the next section, we will briefly
  1099 In the next section, we will briefly
  1029 introduce Brzozowski derivatives and Sulzmann
  1100 introduce Brzozowski derivatives and Sulzmann
  1030 and Lu's algorithm, which the main point of this thesis builds on.
  1101 and Lu's algorithm, on which the main results of this thesis build.
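As a small foretaste (a standard textbook rendering of Brzozowski derivatives and the plain matcher built from them, not the bit-coded lexer developed in this thesis), the whole construction fits in a few lines:

```python
# Regular expressions as nested tuples: 0, 1, characters, +, concatenation, star.
ZERO, ONE = ("0",), ("1",)
def CHR(c):    return ("c", c)
def ALT(r, s): return ("+", r, s)
def SEQ(r, s): return (".", r, s)
def STAR(r):   return ("*", r)

def nullable(r):
    # Does r match the empty string?
    tag = r[0]
    if tag == "0": return False
    if tag == "1": return True
    if tag == "c": return False
    if tag == "+": return nullable(r[1]) or nullable(r[2])
    if tag == ".": return nullable(r[1]) and nullable(r[2])
    if tag == "*": return True

def der(c, r):
    # The Brzozowski derivative of r with respect to the character c.
    tag = r[0]
    if tag in ("0", "1"):
        return ZERO
    if tag == "c":
        return ONE if r[1] == c else ZERO
    if tag == "+":
        return ALT(der(c, r[1]), der(c, r[2]))
    if tag == ".":
        d = SEQ(der(c, r[1]), r[2])
        return ALT(d, der(c, r[2])) if nullable(r[1]) else d
    if tag == "*":
        return SEQ(der(c, r[1]), r)

def matches(r, s):
    # r matches s iff the derivative of r by all of s is nullable.
    for c in s:
        r = der(c, r)
    return nullable(r)

# Example: (a*)* b matches "aaab" but not "aaa".
```

Without simplification the intermediate derivatives can grow quickly, which is exactly the issue addressed by the simplifying algorithm studied in this thesis.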
  1031 %We give a taste of what they 
  1102 %We give a taste of what they 
  1260 %as a DFA.
  1331 %as a DFA.
  1261 
  1332 
  1262 
  1333 
  1263 %----------------------------------------------------------------------------------------
  1334 %----------------------------------------------------------------------------------------
  1264 \section{Contribution}
  1335 \section{Contribution}
       
  1336 {\color{red} \rule{\linewidth}{0.5mm}}
       
  1337 \textbf{issue with this part: way too short, not enough details of what I have done.}
       
  1338 {\color{red} \rule{\linewidth}{0.5mm}}
  1265 
  1339 
  1266 In this thesis,
  1340 In this thesis,
  1267 we propose a solution to catastrophic
  1341 we propose a solution to catastrophic
  1268 backtracking and error-prone matchers: a formally verified
  1342 backtracking and error-prone matchers: a formally verified
  1269 regular expression lexing algorithm
  1343 regular expression lexing algorithm
  1270 that is both fast
  1344 that is both fast
  1271 and correct. 
  1345 and correct. 
       
  1346 {\color{red} \rule{\linewidth}{0.5mm}}
       
  1347 \HandRight Added content:
       
  1348 %Package \verb`pifont`: \ding{43}
       
  1349 %Package \verb`utfsym`: \usym{261E} 
       
  1350 %Package \verb`dingbat`: \leftpointright 
       
  1351 %Package \verb`dingbat`: \rightpointright 
       
  1352 
       
  1353 In particular, the main problem we solved on top of previous work was
       
  1354 coming up with a formally verified algorithm called $\blexersimp$ based
       
  1355 on Brzozowski derivatives. It calculates a POSIX
       
  1356 lexical value from a string and a regular expression. This algorithm was originally proposed
       
  1357 by Sulzmann and Lu \cite{Sulzmann2014}, but we made the key observation that its $\textit{nub}$
       
  1358 function does not simplify intermediate results where it needs to, and we improved the
       
  1359 algorithm accordingly.
       
  1360 We have proven that our $\blexersimp$'s internal data structure does not grow beyond a constant $N_r$
       
  1361 depending on the input regular expression $r$, thanks to the aggressive simplifications of $\blexersimp$:
       
  1362 \begin{theorem}
       
  1363 $|\blexersimp \; r \; s | \leq N_r$
       
  1364 \end{theorem}
       
  1365 The simplifications applied in each step of $\blexersimp$ 
        
  1371 keep the derivatives small, but present a 
        
  1372 challenge in
        
  1375 establishing a correctness theorem of the following form:
       
  1376 %TODO: change this to "theorem to prove"
       
  1377 \begin{theorem}
       
  1378 	If the POSIX lexical value of matching regular expression $r$ with string $s$ is $v$, 
       
  1379 	then $\blexersimp\; r \; s = \Some \; v$. Otherwise 
       
  1380 	$\blexersimp \; r \; s = \None$.
       
  1381 \end{theorem}
       
  1382 
       
  1383 
       
  1384 
       
  1385 
  1272 The result is %a regular expression lexing algorithm that comes with 
  1386 The result is %a regular expression lexing algorithm that comes with 
  1273 \begin{itemize}
  1387 \begin{itemize}
  1274 \item
  1388 \item
  1275 an improved version of  Sulzmann and Lu's bit-coded algorithm using 
  1389 an improved version of  Sulzmann and Lu's bit-coded algorithm using 
  1276 derivatives with simplifications, 
  1390 derivatives with simplifications, 
  1285 and an extension to
  1399 and an extension to
  1286 the bounded repetition constructs with the correctness and finiteness property
  1400 the bounded repetition constructs with the correctness and finiteness property
  1287 maintained.
  1401 maintained.
  1288 \end{itemize}
  1402 \end{itemize}
  1289 \noindent
  1403 \noindent
       
  1404 {\color{red} \rule{\linewidth}{0.5mm}}
  1290 With a formal finiteness bound in place,
  1405 With a formal finiteness bound in place,
  1291 we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
  1406 we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
  1292 The Isabelle/HOL code for our formalisation can be 
  1407 The Isabelle/HOL code for our formalisation can be 
  1293 found at
  1408 found at
  1294 \begin{center}
  1409 \begin{center}