% ChengsongTanPhdThesis/Chapters/Introduction.tex
\def\Left{\textit{Left}}
\def\Right{\textit{Right}}
\def\Stars{\textit{Stars}}
\def\Char{\textit{Char}}
\def\Seq{\textit{Seq}}
\def\Alt{\textit{Alt}}
\def\Der{\textit{Der}}
\def\Ders{\textit{Ders}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
%TODO: look up snort rules to use here--give readers idea of what regexes look like


Regular expressions, since their inception in the 1940s, 
have been subject to extensive study and implementation. 
Their primary application lies in text processing--finding
matches and identifying patterns in a string.
This is often used to match strings that consist of numerous fields, 
where certain fields may recur or be omitted. 
For example, a simple regular expression that tries 
to recognise email addresses is
\marginpar{rephrased from "the regex for recognising" to "a simple regex that tries to match email"}
\begin{center}
$[a-z0-9.\_]^+@[a-z0-9.-]^+\backslash.[a-z]\{2,6\}$
\end{center}
\marginpar{Simplified example, but the distinction between . and escaped . is correct
and therefore left unchanged.}

%Using this, regular expression matchers and lexers are able to extract 
%the domain names by the use of \verb|[a-zA-Z0-9.-]+|. 
\marginpar{Rewrote explanation for the expression.}
The bracketed sub-expressions are used to extract specific parts of an email address.
The local part is recognised by the expression enclosed in 
the first pair of brackets, $[a-z0-9.\_]$; after the ``@'' sign
comes the part that recognises the domain, where $[a-z]\{2,6\}$ specifically
matches the top-level domain.
%Consequently, they are an indispensible components in text processing tools 
%of software systems such as compilers, IDEs, and firewalls.
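To make the role of the three bracketed groups concrete, here is a small
Scala sketch (written only for this discussion, not part of the lexer
developed in this thesis) that uses the standard
\texttt{scala.util.matching.Regex} class from the Scala library (which
wraps \texttt{java.util.regex}) to pull the local part, domain and
top-level domain out of a made-up address:
\begin{verbatim}
import scala.util.matching.Regex

object EmailParts {
  // Three capture groups, mirroring the bracketed sub-expressions above:
  // local part, domain, and top-level domain (Java/Scala regex syntax).
  val email: Regex = """([a-z0-9._]+)@([a-z0-9.-]+)\.([a-z]{2,6})""".r

  def main(args: Array[String]): Unit =
    "jane.doe@example.ac.uk" match {        // hypothetical example input
      case email(local, domain, tld) =>
        println(s"local part: $local")      // jane.doe
        println(s"domain    : $domain")     // example.ac
        println(s"TLD       : $tld")        // uk
      case _ =>
        println("not recognised as an email address")
    }
}
\end{verbatim}
A production-grade email regular expression is of course far more
involved; the point here is only how each bracketed group lines up with
one field of the address.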

The study of regular expressions is ongoing due to an
issue known as catastrophic backtracking. 
This phenomenon refers to scenarios in which the regular expression 
matching or lexing engine exhibits a disproportionately long 
runtime despite the simplicity of the input and expression.

One cause of catastrophic backtracking lies within the 
ambiguities of lexing.\marginpar{rephrased "the origin of catastrophic ...} 
In the process of matching a multi-character string with 
a regular expression that encompasses several sub-expressions, 
different positions can be designated to mark 
the boundaries of corresponding substrings of the sub-expressions. 
For instance, in matching the string $aaa$ with the 
regular expression $(a|aa)^*$, the boundary between 
the initial match and the subsequent iteration could either be 
set between the first and second characters ($a | aa$) or between the second and third characters ($aa | a$). As both the length of the input string and the structural complexity of the regular expression increase, the number of potential delimiter combinations can grow exponentially, leading to a corresponding increase in complexity for algorithms that do not optimally navigate these possibilities.
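To get a feel for this growth, the following Scala sketch (an
illustration written for this discussion, not code from the verified
lexer) counts in how many ways a string $a^n$ can be decomposed into
iterations that each match $a$ or $aa$; already for $aaa$ there are
three such decompositions ($a|a|a$, $a|aa$ and $aa|a$), and the count
follows the Fibonacci numbers, i.e.\ it grows exponentially:
\begin{verbatim}
object AmbiguityCount {
  // ways(n) = number of ways to split a^n into blocks "a" or "aa",
  // i.e. the number of distinct boundary placements a matcher could
  // in principle explore for (a|aa)* on a^n.  The recurrence
  // ways(n) = ways(n-1) + ways(n-2) is the Fibonacci recurrence.
  def ways(n: Int): BigInt = {
    val w = Array.fill(n + 1)(BigInt(0))
    w(0) = 1
    for (k <- 1 to n)
      w(k) = w(k - 1) + (if (k >= 2) w(k - 2) else BigInt(0))
    w(n)
  }

  def main(args: Array[String]): Unit =
    for (n <- List(3, 10, 20, 40, 80))
      println(s"a^$n has ${ways(n)} decompositions under (a|aa)*")
}
\end{verbatim}
A matcher that is not careful about sharing work between these
alternatives may end up exploring a substantial fraction of them.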

Catastrophic backtracking facilitates a category of computationally inexpensive attacks known as Regular Expression Denial of Service (ReDoS) attacks. Here, an attacker merely dispatches a small attack string to a server, provoking high-complexity behaviours in the server's regular expression engine. Such attacks, whether intentional or accidental, have led to the failure of real-world systems (additional details can be found in the practical overview section in chapter \ref{Introduction}).
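The effect is easy to reproduce with a backtracking engine. The sketch
below (again only an illustration, using \texttt{java.util.regex} from
the JVM standard library) times the textbook ReDoS-prone pattern
\verb|(a+)+b| against inputs that consist solely of $a$'s and therefore
almost, but not quite, match; each additional character roughly doubles
the running time:
\begin{verbatim}
import java.util.regex.Pattern

object ReDoSDemo {
  // (a+)+b is a classic catastrophic-backtracking pattern: on a string
  // of n a's with no final 'b', a backtracking matcher retries the
  // exponentially many ways of grouping the a's before giving up.
  val evil: Pattern = Pattern.compile("(a+)+b")

  def timeMatch(n: Int): Double = {
    val input = "a" * n
    val start = System.nanoTime()
    evil.matcher(input).matches()       // always false, but only after
                                        // an exponential backtracking search
    (System.nanoTime() - start) / 1e6   // elapsed time in milliseconds
  }

  def main(args: Array[String]): Unit =
    for (n <- 18 to 26 by 2)            // keep n small: times explode quickly
      println(f"n = $n%2d  ->  ${timeMatch(n)}%.1f ms")
}
\end{verbatim}
On a typical JVM the running time grows visibly with each step, and
lengthening the input by a handful of characters pushes it beyond any
practical limit, which is precisely what a ReDoS attack exploits.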

Various disambiguation strategies are 
employed to select sub-matches, notably including Greedy and POSIX strategies. POSIX, the strategy most widely adopted in practice, invariably opts for the longest possible sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive definition of the POSIX rule in section 1, last paragraph:


%Regular expressions 
%have been extensively studied and
%implemented since their invention in 1940s.
%There are different disambiguation strategies to select sub-matches, most notably Greedy and POSIX.
%The widely adopted strategy in practice is POSIX, which always go for the longest sub-match possible.
%There have been prose definitions like the following
%by Kuklewicz \cite{KuklewiczHaskell}: 
%described the POSIX rule as (section 1, last paragraph):
\marginpar{\em Deleted some peripheral specifications.}
\begin{quote}
	\begin{itemize}
		\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
\item
earlier subpatterns have leftmost-longest priority over later subpatterns\\
\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
\item
	$\ldots$
%REs have right associative concatenation which can be changed with parenthesis\\
%\item
%parenthesized subexpressions return the match from their last usage\\
%\item
%text of component subexpressions must be contained in the text of the 
%higher-level subexpressions\\
%\item
%if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
%\item
%if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
\end{itemize}
\end{quote}
However, the author noted that various lexers that claim to be POSIX 
are rarely correct according to this standard.
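The divergence is easy to see on a tiny example. The sketch below (a
hand-rolled enumeration written for this discussion, not a real matching
engine) lists the ways the string \verb|ab| can be split between the two
sub-expressions of $(a|ab)\cdot(b|\epsilon)$ and then applies both
policies: POSIX awards the earlier sub-expression the leftmost-longest
share, whereas a greedy engine that prefers the first alternative of
$a|ab$ commits to $a$:
\begin{verbatim}
object PosixVsGreedy {
  // Strings matched by the two sub-expressions, (a|ab) and
  // (b or the empty string), spelled out explicitly for this example.
  val first  = Set("a", "ab")
  val second = Set("b", "")

  def main(args: Array[String]): Unit = {
    val input = "ab"

    // All ways of splitting the input so that both halves match.
    val splits =
      for {
        i <- 0 to input.length
        (s1, s2) = input.splitAt(i)
        if first(s1) && second(s2)
      } yield (s1, s2)

    println(s"possible sub-matches: $splits")  // Vector((a,b), (ab,))

    // POSIX: earlier sub-expressions get the longest possible share.
    println(s"POSIX chooses : ${splits.maxBy(_._1.length)}")  // prints (ab,)

    // A greedy engine tries the alternatives of (a|ab) left to right,
    // commits to "a" because the remainder still matches, and stops.
    println(s"greedy chooses: ${splits.minBy(_._1.length)}")  // prints (a,b)
  }
}
\end{verbatim}
Both choices describe a match of the whole string; they differ only in
how the sub-matches are assigned, and it is exactly this kind of
assignment that self-described POSIX implementations often get wrong.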
There are numerous occasions where programmers realised the subtlety and