ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 648 d15a0b7d6d90
parent 646 56057198e4f5
child 649 ef2b8abcbc55
%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like


Regular expressions, since their inception in the 1940s,
have been subject to extensive study and implementation.
Their primary application lies in lexing,
a process whereby an unstructured input string is disassembled
into a tree of tokens under their guidance.
This is often used to match strings that consist of numerous fields,
where certain fields may recur or be omitted.
For example, the regular expression for recognising email addresses is
\begin{center}
\verb|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$|.
\end{center}
Using this, regular expression matchers and lexers are able to extract,
for example, the domain names by the use of \verb|[a-zA-Z0-9.-]+|.
Consequently, they are an indispensable component in the text processing tools
of software systems such as compilers, IDEs, and firewalls.
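
As a concrete illustration, here is a minimal sketch (my own, not code from
the thesis, and assuming Python's \verb|re| module) in which wrapping the
domain sub-expression in a capture group lets a matcher delimit and extract
that field:
\begin{verbatim}
import re

# The email pattern from above, with the domain sub-expression
# [a-zA-Z0-9.-]+\.[a-zA-Z]{2,} wrapped in a capture group.
EMAIL = re.compile(r"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})$")

m = EMAIL.match("alice@example.co.uk")
if m:
    print(m.group(1))   # prints: example.co.uk
\end{verbatim}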

The study of regular expressions is ongoing due to an
issue known as catastrophic backtracking.
This phenomenon refers to scenarios in which the regular expression
matching or lexing engine exhibits a disproportionately long
runtime despite the simplicity of the input and expression.

The origin of catastrophic backtracking is rooted in the
ambiguities of lexing.
In the process of matching a multi-character string with
a regular expression that encompasses several sub-expressions,
different positions can be designated to mark
the boundaries of corresponding substrings of the sub-expressions.
For instance, in matching the string $aaa$ with the
regular expression $(a+aa)^*$, the divide between
the initial match and the subsequent iteration could either be
set between the first and second characters ($a | aa$)
or between the second and third characters ($aa | a$).
As both the length of the input string and the structural complexity
of the regular expression increase, the number of potential delimiter
combinations can grow exponentially, leading to a corresponding increase
in complexity for algorithms that do not optimally navigate these
possibilities.
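
To see this growth concretely, the following sketch (my own illustration in
Python, not code from the thesis) enumerates every way a string of $a$'s can
be split into blocks matching the two alternatives of $(a+aa)^*$:
\begin{verbatim}
def splits(s):
    # All decompositions of s into pieces drawn from {"a", "aa"};
    # each decomposition is one possible parse of s under (a+aa)*.
    if s == "":
        return [[]]
    results = []
    for piece in ("a", "aa"):
        if s.startswith(piece):
            for rest in splits(s[len(piece):]):
                results.append([piece] + rest)
    return results

print(splits("aaa"))
# [['a', 'a', 'a'], ['a', 'aa'], ['aa', 'a']]

for n in range(5, 26, 5):
    print(n, len(splits("a" * n)))
# 5 8, 10 89, 15 987, 20 10946, 25 121393: Fibonacci-like growth
\end{verbatim}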

Catastrophic backtracking facilitates a category of computationally
inexpensive attacks known as Regular Expression Denial of Service (ReDoS)
attacks. Here, an attacker merely dispatches a small attack string to a
server, provoking high-complexity behaviours in the server's regular
expression engine. Such attacks, whether intentional or accidental, have led
to the failure of real-world systems (additional details can be found in the
practical overview section in chapter \ref{Introduction}).
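
This exponential number of possibilities translates directly into runtime for
a backtracking matcher. The sketch below (again my own illustration, assuming
CPython's backtracking \verb|re| engine) times the pattern \verb!(a|aa)*c!
(written $(a+aa)^{*}c$ in the thesis notation) against strings of $a$'s that
cannot match; the measured time grows exponentially with the input length:
\begin{verbatim}
import re
import time

for n in range(20, 31, 2):
    s = "a" * n                    # contains no 'c', so matching must fail
    start = time.perf_counter()
    re.match(r"(a|aa)*c", s)       # explores every split before giving up
    print(n, round(time.perf_counter() - start, 3))
\end{verbatim}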
       
Various disambiguation strategies are employed to select sub-matches, notably
including the Greedy and POSIX strategies. POSIX, the strategy most widely
adopted in practice, invariably opts for the longest possible sub-match.
Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive
definition of the POSIX rule in section 1, last paragraph:
\begin{quote}
	\begin{itemize}
		\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
		\item
earlier subpatterns have leftmost-longest priority over later subpatterns\\
		\item
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
		\item
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
	\end{itemize}
\end{quote}
However, the author noted that lexers are rarely correct according to this standard.
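
Indeed, the deviation is easy to observe in practice. In the following sketch
(my own illustration, not from the thesis), CPython's backtracking \verb|re|
engine commits to the first alternative that succeeds, whereas the POSIX rule
demands the longest overall match:
\begin{verbatim}
import re

# POSIX longest-match semantics for the pattern a|ab on the input "ab"
# require the answer "ab"; a backtracking engine answers "a" instead.
print(re.match(r"a|ab", "ab").group())   # prints: a
\end{verbatim}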

Being able to formally define and capture the idea of the POSIX rules and of
matching/lexing algorithms, and to prove the correctness of such algorithms
against the POSIX semantics, would therefore be valuable.

Formal proofs are machine-checked programs, such as the ones written in
Isabelle/HOL; they are a powerful means for computer scientists to be certain
about the correctness of their algorithms and correctness properties.
This is done by automatically checking that the end goal is derived solely
from a list of axioms, definitions and proof rules that are known to be
correct.
The problems of regular expression lexing and catastrophic backtracking are
suitable for such methods, as their implementations and complexity analyses
tend to be error-prone.

There have been many interesting steps taken in the direction of formal
proofs about regular expression lexing and matching.
       
%Regular expressions are widely used in computer science: 
%be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
%command-line tools like $\mathit{grep}$ that facilitate easy 
%text-processing \cite{grep}; network intrusion