lexing: comparison ChengsongTanPhdThesis/Chapters/Introduction.tex

equal deleted inserted replaced

-:ed53ce26ecb6
+:99b530103464
 more than half of the
 XSDs they found have bounded regular expressions in them.
 Often the counters are quite large, the largest up to ten million.
 An example XSD they gave
 was:
-\begin{verbatim}
+%\begin{verbatim}
-<sequence minOccurs="0" maxOccurs="65535">
+%<sequence minOccurs="0" maxOccurs="65535">
-<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+%  <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
-<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+%  <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
-</sequence>
+%</sequence>
-\end{verbatim}
+%\end{verbatim}
 This can be seen as the expression
 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
 regular expressions
 satisfying certain constraints (such as
 satisfying the floating point number format).
 for bounded regular expressions:
 For example, in the regular expression matching library in the Go
 language the regular expression $a^{1001}$ is not permitted, because no counter
 can be above 1000, and in the built-in Rust regular expression library
 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
-for being too big. These problems can of course be solved in matching algorithms where
+for being too big.
+As Becchi and Crawley\cite{Becchi08}  have pointed out,
+the reason for these restrictions
+are that they simulate a non-deterministic finite
+automata (NFA) with a breadth-first search.
+This way the number of active states could
+be equal to the counter number.
+When the counters are large,
+the memory requirement could become
+infeasible, and the pattern needs to be rejected straight away.
+\begin{figure}[H]
+\begin{center}
+\begin{tikzpicture} [node distance = 2cm, on grid, auto]
+	\node (q0) [state, initial] {$0$};
+	\node (q1) [state, right = of q0] {$1$};
+	\node (q2) [state, right = of q1] {$2$};
+	\node (qdots) [right = of q2] {$\ldots$};
+	\node (qn) [state, right = of qdots] {$n$};
+	\node (qn1) [state, right = of qn] {$n+1$};
+	\node (qn2) [state, right = of qn1] {$n+2$};
+	\node (qn3) [state, accepting, right = of qn2] {$n+3$};
+\path [-stealth, thick]
+	(q0) edge [loop above] node {a} ()
+(q0) edge node {a}   (q1)
+(q1) edge node {.}   (q2)
+(q2) edge node {.}   (qdots)
+(qdots) edge node {.} (qn)
+(qn) edge node {.} (qn1)
+(qn1) edge node {b} (qn2)
+(qn2) edge node {$c$} (qn3);
+\end{tikzpicture}
+%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
+%   \node[state,initial] (q_0)   {$0$};
+%   \node[state, ] (q_1) [right=of q_0] {$1$};
+%   \node[state, ] (q_2) [right=of q_1] {$2$};
+%   \node[state,
+%   \node[state, accepting, ](q_3) [right=of q_2] {$3$};
+%    \path[->]
+%    (q_0) edge  node {a} (q_1)
+%    	  edge [loop below] node {a,b} ()
+%    (q_1) edge  node  {a,b} (q_2)
+%    (q_2) edge  node  {a,b} (q_3);
+%\end{tikzpicture}
+\end{center}
+\caption{The example given by Becchi and Crawley
+	that NFA simulation can consume large
+	amounts of memory: $.^*a.^{\{n\}}bc$ matching
+	strings of the form $aaa\ldots aaaabc$.
+	When traversing in a breadth-first manner,
+all states from 0 till $n+1$ will become active.}
+\end{figure}
+%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
+%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
+%in terms of input string length.
+%TODO:try out these lexers
+These problems can of course be solved in matching algorithms where
 automata go beyond the classic notion and for instance include explicit
 counters \cite{Turo_ov__2020}.
+These solutions can be quite effective,
+with the ability to process
+gigabytes of string input per second
+even with large counters \cite{Becchi08}.
+But formally reasoning about these automata can be challenging
+and un-intuitive.
+Therefore, correctness and runtime claims made about these solutions need to be
+taken with a grain of salt.
 In the work reported in \cite{CSL2022} and here,
 we add better support using derivatives
 for bounded regular expressions $r^{\{n\}}$.
 The results
 %recursive functions and inductive datatypes.
 Finally, bounded regular expressions do not destroy our finite
 boundedness property, which we shall prove later on.
 \subsection{Back-References}
-%Some constructs can even raise the expressive
+The other way to simulate an $\mathit{NFA}$ for matching is choosing
-%power to the non-regular realm, for example
+a single transition each time, keeping all the other options in
-%the back-references.
+a queue or stack, and backtracking if that choice eventually
-%bounded repetitions, as we have discussed in the
+fails. This method, often called a  "depth-first-search",
-%previous section.
+is efficient in a lot of cases, but could end up
-%This super-linear behaviour of the
+with exponential run time.
-%regex matching engines we have?
+The backtracking method is employed in regex libraries
+that support \emph{back-references}, for example
+in Java and Python.
 The syntax and expressive power of regexes
-can go quickly beyond the regular language hierarchy
+go beyond the regular language hierarchy
 with back-references.
 %\section{Back-references and The Terminology Regex}
 %When one constructs an $\NFA$ out of a regular expression
 %there is often very little to be done in the first phase, one simply
 %one by keeping track of all active states after consuming
 %a character, and update that set of states iteratively.
 %This can be viewed as a breadth-first-search of the $\mathit{NFA}$
 %for a path terminating
 %at an accepting state.
-Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
-type of $\mathit{NFA}$ simulation and guarantees a linear runtime
-in terms of input string length.
-%TODO:try out these lexers
+Given a regular expression like this (the sequence
-The other way to use $\mathit{NFA}$ for matching is choosing
-a single transition each time, keeping all the other options in
-a queue or stack, and backtracking if that choice eventually
-fails. This method, often called a  "depth-first-search",
-is efficient in a lot of cases, but could end up
-with exponential run time.\\
-%TODO:COMPARE java python lexer speed with Rust and Go
-The reason behind backtracking algorithms in languages like
-Java and Python is that they support back-references.
-\subsubsection{Back References}
-If we have a regular expression like this (the sequence
 operator is omitted for brevity):
 \begin{center}
-	$r_1(r_2(r_3r_4))$
+	$r_1r_2r_3r_4$
 \end{center}
-We could label sub-expressions of interest
+one could label sub-expressions of interest
 by parenthesizing them and giving
 them a number by the order in which their opening parentheses appear.
 One possible way of parenthesizing and labelling is given below:
 \begin{center}
 	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
 \end{center}
-$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$, $r_4$ are labelled
+The sub-expressions
-by 1 to 4. $1$ would refer to the entire expression
+$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
-$(r_1(r_2(r_3)(r_4)))$, $2$ referring to $r_2(r_3)(r_4)$, etc.
+by 1 to 4, and can be ``referred back'' by their respective numbers.
-These sub-expressions are called "capturing groups".
+%These sub-expressions are called "capturing groups".
-We can use the following syntax to denote that we want a string just matched by a
+To do so, we use the syntax $\backslash i$
-sub-expression (capturing group) to appear at a certain location again,
+to denote that we want the sub-string
-exactly as it was:
+of the input just matched by the i-th
+sub-expression to appear again,
+exactly the same as it first appeared:
 \begin{center}
 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
 \underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
 \end{center}
-The backslash and number $i$ are used to denote such
+%The backslash and number $i$ are the
-so-called "back-references".
+%so-called "back-references".
-Let $e$ be an expression made of regular expressions
+%Let $e$ be an expression made of regular expressions
-and back-references. $e$ contains the expression $e_i$
+%and back-references. $e$ contains the expression $e_i$
-as its $i$-th capturing group.
+%as its $i$-th capturing group.
-The semantics of back-reference can be recursively
+%The semantics of back-reference can be recursively
-written as:
+%written as:
+%\begin{center}
+%	\begin{tabular}{c}
+%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
+%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
+%	\end{tabular}
+%\end{center}
+A concrete example
+for back-references would be
 \begin{center}
-	\begin{tabular}{c}
+$((a|b|c|\ldots|z)^*)\backslash 1$,
-		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
-		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
-	\end{tabular}
 \end{center}
-The concrete example
+which would match
-$((a|b|c|\ldots|z)^*)\backslash 1$
+strings that can be split into two identical halves,
-would match the string like $\mathit{bobo}$, $\mathit{weewee}$ and etc.\\
+for example $\mathit{bobo}$, $\mathit{weewee}$ and etc.
-Back-reference is a construct in the "regex" standard
+Back-reference is a regex construct
 that programmers found useful, but not exactly
 regular any more.
 In fact, that allows the regex construct to express
 languages that cannot be contained in context-free
 languages either.

changeset 606	99b530103464
parent 605	ed53ce26ecb6
child 607	e6fc9b72c0e3