204 %TODO: look up snort rules to use here--give readers idea of what regexes look like |
205 %TODO: look up snort rules to use here--give readers idea of what regexes look like |
205 |
206 |
206 |
207 |
207 Regular expressions, since their inception in the 1940s, |
208 Regular expressions, since their inception in the 1940s, |
208 have been subject to extensive study and implementation. |
209 have been subject to extensive study and implementation. |
209 Their primary application lies in lexing, |
210 Their primary application lies in text processing--finding |
210 a process whereby an unstructured input string is disassembled into a tree of tokens |
211 matches and identifying patterns in a string. |
211 with their guidance. |
212 %It is often used to match strings that comprises of numerous fields, |
212 This is often used to match strings that comprises of numerous fields, |
213 %where certain fields may recur or be omitted. |
213 where certain fields may recur or be omitted. |
214 For example, a simple regular expression that tries |
214 For example, the regular expression for recognising email addresses is |
215 to recognise email addresses is |
|
216 \marginpar{rephrased from "the regex for recognising" to "a simple regex that tries to match email"} |
215 \begin{center} |
217 \begin{center} |
216 \verb|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$|. |
218 $[a-z0-9.\_]^\backslash+@[a-z0-9.-]^\backslash+\.\{a-z\}\{2,6\}$ |
|
219 %$[a-z0-9._]^+@[a-z0-9.-]^+\.[a-z]{2,6}$. |
217 \end{center} |
220 \end{center} |
218 Using this, regular expression matchers and lexers are able to extract |
221 \marginpar{Simplified example, but the distinction between . and escaped . is correct |
219 the domain names by the use of \verb|[a-zA-Z0-9.-]+|. |
222 and therefore left unchanged.} |
220 Consequently, they are an indispensible components in text processing tools |
223 |
221 of software systems such as compilers, IDEs, and firewalls. |
224 %Using this, regular expression matchers and lexers are able to extract |
|
225 %the domain names by the use of \verb|[a-zA-Z0-9.-]+|. |
|
226 \marginpar{Rewrote explanation for the expression.} |
|
227 The bracketed sub-expressions are used to extract specific parts of an email address. |
|
228 The local part is recognised by the expression enclosed in |
|
229 the first pair of brackets: $[a-z0-9._]$, and after the ``@'' sign |
|
230 is the part that recognises the domain, where $[a-z]{2, 6}$ specifically |
|
231 matches the top-level domain. |
|
232 %Consequently, they are an indispensible components in text processing tools |
|
233 %of software systems such as compilers, IDEs, and firewalls. |
222 |
234 |
223 The study of regular expressions is ongoing due to an |
235 The study of regular expressions is ongoing due to an |
224 issue known as catastrophic backtracking. |
236 issue known as catastrophic backtracking. |
225 This phenomenon refers to scenarios in which the regular expression |
237 This phenomenon refers to scenarios in which the regular expression |
226 matching or lexing engine exhibits a disproportionately long |
238 matching or lexing engine exhibits a disproportionately long |
227 runtime despite the simplicity of the input and expression. |
239 runtime despite the simplicity of the input and expression. |
228 |
240 |
229 The origin of catastrophic backtracking is rooted in the |
241 One cause of catastrophic backtracking lies within the |
230 ambiguities of lexing. |
242 ambiguities of lexing.\marginpar{rephrased "the origin of catastrophic ...} |
231 In the process of matching a multi-character string with |
243 In the process of matching a multi-character string with |
232 a regular expression that encompasses several sub-expressions, |
244 a regular expression that encompasses several sub-expressions, |
233 different positions can be designated to mark |
245 different positions can be designated to mark |
234 the boundaries of corresponding substrings of the sub-expressions. |
246 the boundaries of corresponding substrings of the sub-expressions. |
235 For instance, in matching the string $aaa$ with the |
247 For instance, in matching the string $aaa$ with the |
237 the initial match and the subsequent iteration could either be |
249 the initial match and the subsequent iteration could either be |
238 set between the first and second characters ($a | aa$) or between the second and third characters ($aa | a$). As both the length of the input string and the structural complexity of the regular expression increase, the number of potential delimiter combinations can grow exponentially, leading to a corresponding increase in complexity for algorithms that do not optimally navigate these possibilities. |
250 set between the first and second characters ($a | aa$) or between the second and third characters ($aa | a$). As both the length of the input string and the structural complexity of the regular expression increase, the number of potential delimiter combinations can grow exponentially, leading to a corresponding increase in complexity for algorithms that do not optimally navigate these possibilities. |
239 |
251 |
240 Catastrophic backtracking facilitates a category of computationally inexpensive attacks known as Regular Expression Denial of Service (ReDoS) attacks. Here, an attacker merely dispatches a small attack string to a server, provoking high-complexity behaviours in the server's regular expression engine. Such attacks, whether intentional or accidental, have led to the failure of real-world systems (additional details can be found in the practical overview section in chapter \ref{Introduction}). |
252 Catastrophic backtracking facilitates a category of computationally inexpensive attacks known as Regular Expression Denial of Service (ReDoS) attacks. Here, an attacker merely dispatches a small attack string to a server, provoking high-complexity behaviours in the server's regular expression engine. Such attacks, whether intentional or accidental, have led to the failure of real-world systems (additional details can be found in the practical overview section in chapter \ref{Introduction}). |
241 |
253 |
242 Various disambiguation strategies are employed to select sub-matches, notably including Greedy and POSIX strategies. POSIX, the strategy most widely adopted in practice, invariably opts for the longest possible sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive definition of the POSIX rule in section 1, last paragraph: |
254 Various disambiguation strategies are |
|
255 employed to select sub-matches, notably including Greedy and POSIX strategies. POSIX, the strategy most widely adopted in practice, invariably opts for the longest possible sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive definition of the POSIX rule in section 1, last paragraph: |
243 |
256 |
244 |
257 |
245 %Regular expressions |
258 %Regular expressions |
246 %have been extensively studied and |
259 %have been extensively studied and |
247 %implemented since their invention in 1940s. |
260 %implemented since their invention in 1940s. |
281 %There are different disambiguation strategies to select sub-matches, most notably Greedy and POSIX. |
294 %There are different disambiguation strategies to select sub-matches, most notably Greedy and POSIX. |
282 %The widely adopted strategy in practice is POSIX, which always go for the longest sub-match possible. |
295 %The widely adopted strategy in practice is POSIX, which always go for the longest sub-match possible. |
283 %There have been prose definitions like the following |
296 %There have been prose definitions like the following |
284 %by Kuklewicz \cite{KuklewiczHaskell}: |
297 %by Kuklewicz \cite{KuklewiczHaskell}: |
285 %described the POSIX rule as (section 1, last paragraph): |
298 %described the POSIX rule as (section 1, last paragraph): |
|
299 \marginpar{\em Deleted some peripheral specifications.} |
286 \begin{quote} |
300 \begin{quote} |
287 \begin{itemize} |
301 \begin{itemize} |
288 \item |
302 \item |
289 regular expressions (REs) take the leftmost starting match, and the longest match starting there |
303 regular expressions (REs) take the leftmost starting match, and the longest match starting there |
290 earlier subpatterns have leftmost-longest priority over later subpatterns\\ |
304 earlier subpatterns have leftmost-longest priority over later subpatterns\\ |
291 \item |
305 \item |
292 higher-level subpatterns have leftmost-longest priority over their component subpatterns\\ |
306 higher-level subpatterns have leftmost-longest priority over their component subpatterns\\ |
293 \item |
307 \item |
294 REs have right associative concatenation which can be changed with parenthesis\\ |
308 $\ldots$ |
295 \item |
309 %REs have right associative concatenation which can be changed with parenthesis\\ |
296 parenthesized subexpressions return the match from their last usage\\ |
310 %\item |
297 \item |
311 %parenthesized subexpressions return the match from their last usage\\ |
298 text of component subexpressions must be contained in the text of the |
312 %\item |
299 higher-level subexpressions\\ |
313 %text of component subexpressions must be contained in the text of the |
300 \item |
314 %higher-level subexpressions\\ |
301 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\ |
315 %\item |
302 \item |
316 %if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\ |
303 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\ |
317 %\item |
|
318 %if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\ |
304 \end{itemize} |
319 \end{itemize} |
305 \end{quote} |
320 \end{quote} |
306 However, the author noted that various lexers that claim to be POSIX |
321 However, the author noted that various lexers that claim to be POSIX |
307 are rarely correct according to this standard. |
322 are rarely correct according to this standard. |
308 There are numerous occasions where programmers realised the subtlety and |
323 There are numerous occasions where programmers realised the subtlety and |