lexing: comparison ChengsongTanPhdThesis/Chapters/Introduction.tex

equal deleted inserted replaced

-:a4d692a9a289
+:bc5571c38d1f
 \def\Left{\textit{Left}}
 \def\Right{\textit{Right}}
 \def\Stars{\textit{Stars}}
 \def\Char{\textit{Char}}
 \def\Seq{\textit{Seq}}
+\def\Alt{\textit{Alt}}
 \def\Der{\textit{Der}}
 \def\Ders{\textit{Ders}}
 \def\nullable{\mathit{nullable}}
 \def\Z{\mathit{Z}}
 \def\S{\mathit{S}}
 %TODO: look up snort rules to use here--give readers idea of what regexes look like
 Regular expressions, since their inception in the 1940s,
 have been subject to extensive study and implementation.
-Their primary application lies in lexing,
+Their primary application lies in text processing--finding
-a process whereby an unstructured input string is disassembled into a tree of tokens
+matches and identifying patterns in a string.
-with their guidance.
+%It is often used to match strings that comprises of numerous fields,
-This is often used to match strings that comprises of numerous fields,
+%where certain fields may recur or be omitted.
-where certain fields may recur or be omitted.
+For example, a simple regular expression that tries
-For example, the regular expression for recognising email addresses is
+to recognise email addresses is
+\marginpar{rephrased from "the regex for recognising" to "a simple regex that tries to match email"}
 \begin{center}
-\verb|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$|.
+$[a-z0-9.\_]^\backslash+@[a-z0-9.-]^\backslash+\.\{a-z\}\{2,6\}$
+%$[a-z0-9._]^+@[a-z0-9.-]^+\.[a-z]{2,6}$.
 \end{center}
-Using this, regular expression matchers and lexers are able to extract
+\marginpar{Simplified example, but the distinction between . and escaped . is correct
-the domain names by the use of \verb|[a-zA-Z0-9.-]+|.
+and therefore left unchanged.}
-Consequently, they are an indispensible components in text processing tools
-of software systems such as compilers, IDEs, and firewalls.
+%Using this, regular expression matchers and lexers are able to extract
+%the domain names by the use of \verb|[a-zA-Z0-9.-]+|.
+\marginpar{Rewrote explanation for the expression.}
+The bracketed sub-expressions are used to extract specific parts of an email address.
+The local part is recognised by the expression enclosed in
+the first pair of brackets: $[a-z0-9._]$, and after the ``@'' sign
+is the part that recognises the domain, where $[a-z]{2, 6}$ specifically
+matches the top-level domain.
+%Consequently, they are an indispensible components in text processing tools
+%of software systems such as compilers, IDEs, and firewalls.
 The study of regular expressions is ongoing due to an
 issue known as catastrophic backtracking.
 This phenomenon refers to scenarios in which the regular expression
 matching or lexing engine exhibits a disproportionately long
 runtime despite the simplicity of the input and expression.
-The origin of catastrophic backtracking is rooted in the
+One cause of catastrophic backtracking lies within the
-ambiguities of lexing.
+ambiguities of lexing.\marginpar{rephrased "the origin of catastrophic ...}
 In the process of matching a multi-character string with
 a regular expression that encompasses several sub-expressions,
 different positions can be designated to mark
 the boundaries of corresponding substrings of the sub-expressions.
 For instance, in matching the string $aaa$ with the
 the initial match and the subsequent iteration could either be
 set between the first and second characters ($a | aa$) or between the second and third characters ($aa | a$). As both the length of the input string and the structural complexity of the regular expression increase, the number of potential delimiter combinations can grow exponentially, leading to a corresponding increase in complexity for algorithms that do not optimally navigate these possibilities.
 Catastrophic backtracking facilitates a category of computationally inexpensive attacks known as Regular Expression Denial of Service (ReDoS) attacks. Here, an attacker merely dispatches a small attack string to a server, provoking high-complexity behaviours in the server's regular expression engine. Such attacks, whether intentional or accidental, have led to the failure of real-world systems (additional details can be found in the practical overview section in chapter \ref{Introduction}).
-Various disambiguation strategies are employed to select sub-matches, notably including Greedy and POSIX strategies. POSIX, the strategy most widely adopted in practice, invariably opts for the longest possible sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive definition of the POSIX rule in section 1, last paragraph:
+Various disambiguation strategies are
+employed to select sub-matches, notably including Greedy and POSIX strategies. POSIX, the strategy most widely adopted in practice, invariably opts for the longest possible sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive definition of the POSIX rule in section 1, last paragraph:
 %Regular expressions
 %have been extensively studied and
 %implemented since their invention in 1940s.
 %There are different disambiguation strategies to select sub-matches, most notably Greedy and POSIX.
 %The widely adopted strategy in practice is POSIX, which always go for the longest sub-match possible.
 %There have been prose definitions like the following
 %by Kuklewicz \cite{KuklewiczHaskell}:
 %described the POSIX rule as (section 1, last paragraph):
+\marginpar{\em Deleted some peripheral specifications.}
 \begin{quote}
 	\begin{itemize}
 		\item
 regular expressions (REs) take the leftmost starting match, and the longest match starting there
 earlier subpatterns have leftmost-longest priority over later subpatterns\\
 \item
 higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
 \item
-REs have right associative concatenation which can be changed with parenthesis\\
+	$\ldots$
-\item
+%REs have right associative concatenation which can be changed with parenthesis\\
-parenthesized subexpressions return the match from their last usage\\
+%\item
-\item
+%parenthesized subexpressions return the match from their last usage\\
-text of component subexpressions must be contained in the text of the
+%\item
-higher-level subexpressions\\
+%text of component subexpressions must be contained in the text of the
-\item
+%higher-level subexpressions\\
-if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
+%\item
-\item
+%if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
-if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
+%\item
+%if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
 \end{itemize}
 \end{quote}
 However, the author noted that various lexers that claim to be POSIX
 are rarely correct according to this standard.
 There are numerous occasions where programmers realised the subtlety and

changeset 653	bc5571c38d1f
parent 652	a4d692a9a289
child 654	2ad20ba5b178