%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like

Regular expressions, since their inception in the 1940s,
have been subject to extensive study and implementation.
Their primary application lies in lexing,
a process whereby an unstructured input string is disassembled into a tree of tokens
under the guidance of a regular expression.
This is often used to match strings that comprise numerous fields,
where certain fields may recur or be omitted.
For example, the regular expression for recognising email addresses is
\begin{center}
\verb|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$|.
\end{center}
Using this, regular expression matchers and lexers are able to extract,
for example, the domain names by the use of \verb|[a-zA-Z0-9.-]+|.
Consequently, they are an indispensable component in text processing tools
of software systems such as compilers, IDEs, and firewalls.
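
To make the extraction step concrete, the following sketch uses Scala's
standard \verb|scala.util.matching.Regex| API to pull the domain out of an
email address via a capture group; the object and method names are
illustrative only, not part of any formal development discussed later.
\begin{verbatim}
import scala.util.matching.Regex

object EmailDomain {
  // Same shape as the pattern above, with the local part and the
  // domain each wrapped in a capture group.
  val email: Regex =
    """^([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})$""".r

  def domain(s: String): Option[String] =
    email.findFirstMatchIn(s).map(_.group(2))

  def main(args: Array[String]): Unit =
    println(domain("alice@example.com"))   // Some(example.com)
}
\end{verbatim}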

The study of regular expressions is ongoing due to an
issue known as catastrophic backtracking.
This phenomenon refers to scenarios in which the regular expression
matching or lexing engine exhibits a disproportionately long
runtime despite the simplicity of the input and expression.

The origin of catastrophic backtracking is rooted in the
ambiguities of lexing.
In the process of matching a multi-character string with
a regular expression that encompasses several sub-expressions,
different positions can be designated to mark
the boundaries of corresponding substrings of the sub-expressions.
For instance, in matching the string $aaa$ with the
regular expression $(a+aa)^*$, the divide between
the initial match and the subsequent iteration could either be
set between the first and second characters ($a | aa$)
or between the second and third characters ($aa | a$).
As both the length of the input string and the structural complexity
of the regular expression increase,
the number of potential delimiter combinations can grow exponentially,
leading to a corresponding increase in complexity for algorithms
that do not optimally navigate these possibilities.
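
This blow-up can be observed directly.
The sketch below (plain Scala; all names are illustrative) enumerates every
way $(a+aa)^*$ can split the string $a^n$, one block of length one or two per
iteration of the star; the count obeys the Fibonacci recurrence and therefore
grows exponentially in $n$.
\begin{verbatim}
object Ambiguity {
  // All ways to split a^n into blocks "a" or "aa" -- one block per
  // iteration of (a+aa)*, hence one entry per possible parse.
  def splits(n: Int): List[List[String]] =
    if (n == 0) List(Nil)
    else if (n == 1) List(List("a"))
    else splits(n - 1).map("a" :: _) ++ splits(n - 2).map("aa" :: _)

  // Counting only: the Fibonacci recurrence in iterative form.
  def count(n: Int): BigInt =
    (2 to n).foldLeft((BigInt(1), BigInt(1))) {
      case ((a, b), _) => (b, a + b)
    }._2

  def main(args: Array[String]): Unit = {
    splits(3).foreach(s => println(s.mkString("|")))  // a|a|a, a|aa, aa|a
    println(count(50))   // 20365011074 parses for a string of fifty a's
  }
}
\end{verbatim}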

Catastrophic backtracking facilitates a category of computationally
inexpensive attacks known as Regular Expression Denial of Service (ReDoS)
attacks. Here, an attacker merely dispatches a small attack string to a
server, provoking high-complexity behaviours in the server's regular
expression engine. Such attacks, whether intentional or accidental, have led
to the failure of real-world systems (additional details can be found in the
practical overview section in chapter \ref{Introduction}).
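
The effect is easy to reproduce with a backtracking engine.
The sketch below drives Java's \verb|java.util.regex| from Scala on the
classic ReDoS expression \verb|(a+)+b| (the pattern and input sizes are
chosen purely for illustration); each added character roughly doubles the
running time.
\begin{verbatim}
import java.util.regex.Pattern

object ReDoS {
  def main(args: Array[String]): Unit = {
    // Nested repetition followed by a forced failure: the engine
    // backtracks through exponentially many ways to split the a's.
    val p = Pattern.compile("(a+)+b")
    for (n <- 18 to 28 by 2) {
      val input = "a" * n
      val t0 = System.nanoTime()
      p.matcher(input).matches()          // always false
      println(f"n = $n%2d: ${(System.nanoTime() - t0) / 1e6}%8.1f ms")
    }
  }
}
\end{verbatim}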

Various disambiguation strategies are employed to select sub-matches,
notably including Greedy and POSIX strategies. POSIX, the strategy most
widely adopted in practice, invariably opts for the longest possible
sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a
descriptive definition of the POSIX rule in section 1, last paragraph:
\begin{quote}
\begin{itemize}
\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
\item
earlier subpatterns have leftmost-longest priority over later subpatterns\\
\item
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
\item
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
\end{itemize}
\end{quote}
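
For the earlier example this means that, when matching $aaa$ against
$(a+aa)^*$, a POSIX lexer must let the first iteration consume as much as
possible, producing the split $aa \,|\, a$, whereas a Greedy strategy that
prefers the left alternative of $a+aa$ produces $a \,|\, a \,|\, a$.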
However, the author noted that lexers are rarely correct according to this standard.
Being able to formally define and capture the idea of POSIX rules and
matching/lexing algorithms, and to prove the correctness of such algorithms
against the POSIX semantics definitions, would therefore be valuable.

Formal proofs are machine-checked programs, such as the ones written in
Isabelle/HOL, and they are a powerful means for computer scientists to be
certain about the correctness of their algorithms and related properties.
This is done by automatically checking that the end goal is derived solely
from a list of axioms, definitions and proof rules that are known to be
correct.
The problems of regular expression lexing and catastrophic backtracking are
suitable targets for such methods, as their implementation and complexity
analysis tend to be error-prone.
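
One classic algorithm that lends itself to this style of verification is
matching via Brzozowski derivatives. The Scala sketch below is the standard
textbook construction, given here for illustration rather than as the
specific verified development discussed in this thesis.
\begin{verbatim}
sealed trait Rexp
case object ZERO extends Rexp                      // matches nothing
case object ONE extends Rexp                       // matches ""
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp    // alternative r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp    // sequence
case class STAR(r: Rexp) extends Rexp              // iteration

object Derivatives {
  // Can r match the empty string?
  def nullable(r: Rexp): Boolean = r match {
    case ZERO        => false
    case ONE         => true
    case CHAR(_)     => false
    case ALT(r1, r2) => nullable(r1) || nullable(r2)
    case SEQ(r1, r2) => nullable(r1) && nullable(r2)
    case STAR(_)     => true
  }

  // The derivative of r w.r.t. c: what is left to match after
  // consuming the character c.
  def der(c: Char, r: Rexp): Rexp = r match {
    case ZERO | ONE  => ZERO
    case CHAR(d)     => if (c == d) ONE else ZERO
    case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
    case SEQ(r1, r2) =>
      if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
      else SEQ(der(c, r1), r2)
    case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
  }

  def matches(r: Rexp, s: String): Boolean =
    nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
}
\end{verbatim}
With $(a+aa)^*$ encoded as
\verb|STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))|, the call
\verb|matches(r, "aaa")| returns true after taking one derivative per input
character, with no backtracking involved.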

There have been many interesting steps taken in the direction of formally
verified regular expression lexing and matching.

%Regular expressions are widely used in computer science:
%be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
%command-line tools like $\mathit{grep}$ that facilitate easy
%text-processing \cite{grep}; network intrusion