can occur
and (ii) why we choose our
approach (Brzozowski derivatives and formal proofs).


\section{Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
Theoretical results in automata theory say
that basic regular expression matching should be linear
w.r.t.\ the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
deterministic finite automaton (DFA) before matching,
which could incur an exponential blow-up in size~\cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only characters, alternatives,
sequences, and Kleene stars:
\[
r ::= \ZERO | \ONE | c | r_1 + r_2 | r_1 \cdot r_2 | r^*
\]
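Since the whole approach of this thesis is based on Brzozowski derivatives, it may help to see what a matcher for this basic syntax looks like in executable form. The following Python sketch is our own minimal illustration (not the verified code developed later in the thesis):

```python
# Minimal sketch of a Brzozowski-derivative matcher for the basic
# regular expressions ZERO, ONE, c, r1 + r2, r1 . r2 and r*.

ZERO = ('zero',)                 # matches no string
ONE  = ('one',)                  # matches only the empty string

def CHAR(c):     return ('char', c)
def ALT(r1, r2): return ('alt', r1, r2)
def SEQ(r1, r2): return ('seq', r1, r2)
def STAR(r):     return ('star', r)

def nullable(r):
    """Can r match the empty string?"""
    tag = r[0]
    if tag == 'zero': return False
    if tag == 'one':  return True
    if tag == 'char': return False
    if tag == 'alt':  return nullable(r[1]) or nullable(r[2])
    if tag == 'seq':  return nullable(r[1]) and nullable(r[2])
    if tag == 'star': return True

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ('zero', 'one'):
        return ZERO
    if tag == 'char':
        return ONE if r[1] == c else ZERO
    if tag == 'alt':
        return ALT(der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        left = SEQ(der(c, r[1]), r[2])
        return ALT(left, der(c, r[2])) if nullable(r[1]) else left
    if tag == 'star':
        return SEQ(der(c, r[1]), r)

def matches(r, s):
    """r matches s iff the iterated derivative is nullable."""
    for c in s:
        r = der(c, r)
    return nullable(r)
```

Matching proceeds by repeatedly differentiating the regular expression with respect to the input characters and checking nullability at the end; no automaton is ever built, although without simplification the derivatives can grow quickly.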
Modern regular expression matchers used by programmers,
however,
support richer constructs such as bounded repetitions
and back-references.
To differentiate, people use the word \emph{regex} to refer
to those expressions with richer constructs while reserving the
term \emph{regular expression}
for the more traditional meaning in formal language theory.
We follow this convention
in this thesis.
In the future, we aim to support all the popular features of regexes,
but for this work we mainly look at regular expressions.


%Most modern regex libraries
%the so-called PCRE standard (Perl Compatible Regular Expressions)
%has the back-references
Regexes come with a lot of constructs
beyond the basic ones
that make it more convenient for
programmers to write regular expressions.
Depending on the type of these constructs,
the task of matching and lexing with them
becomes more or less complex.
Some of those constructs are syntactic sugar:
simply shorthand notations
that save the programmers a few keystrokes.
These will not cause trouble for regex libraries.

\noindent
For example, the
non-binary alternative involving three or more choices:
\[
(a | b | c) \stackrel{means}{=} ((a + b) + c)
\]
the range operator $-$ used to express the alternative
of all characters between its operands in a concise way:
\[
[0-9] \stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all digits)}
\]
and the
wildcard character $.$ used to refer to any single character:
\[
. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]

\noindent
\subsection{Bounded Repetitions}
Some of those constructs do make the expressions much
more compact.
For example, the bounded regular expressions
(where $n$ and $m$ are constant natural numbers)
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$,
defined as
\begin{center}
\begin{tabular}{lcl}
$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
\end{tabular}
\end{center}
are exponentially smaller compared with
their unfolded form: for example $r^{\{n\}}$
as opposed to
\[
\underbrace{r\ldots r}_\text{n copies of r}.
\]
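These language definitions can be checked directly on small finite examples; the sketch below (our own illustration, with languages represented as finite sets of strings) computes $(L\,r)^n$ and the bounded unions from the table above:

```python
# Checking L r^{n} = (L r)^n and L r^{..m} on a tiny example,
# with languages represented as finite sets of strings.

def lang_pow(L, n):
    """(L)^n: all concatenations of n strings drawn from L."""
    result = {""}
    for _ in range(n):
        result = {u + v for u in result for v in L}
    return result

def lang_upto(L, m):
    """Union of (L)^i for 0 <= i <= m, i.e. the language of r^{..m}."""
    result = set()
    for i in range(m + 1):
        result |= lang_pow(L, i)
    return result

L_r = {"a", "b"}                  # L r for r = a + b
print(sorted(lang_pow(L_r, 2)))   # ['aa', 'ab', 'ba', 'bb']
print(sorted(lang_upto(L_r, 1)))  # ['', 'a', 'b']
```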
%Therefore, a naive algorithm that simply unfolds
%them into their desugared forms
%will suffer from at least an exponential runtime increase.
The problem here is that tools based on the classic notion of
automata need to expand $r^{n}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
either become excruciatingly slow
(for example Verbatim++\cite{Verbatimpp}) or get
out-of-memory errors (for example $\mathit{LEX}$ and
$\mathit{JFLEX}$\footnote{which are lexer generators
in C and JAVA that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations to them, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
When given the above ``countdown'' regular expression $r^{n}$,
a small $n$ (a few dozen) would result in a
determinised automaton
with millions of states.}) under large counters.
A classic example is the regular expression $(a+b)^* a (a+b)^{n}$
where the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
an $\mathit{NFA}$ describing it would look like:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
   \node[state,initial] (q_0)   {$q_0$};
   \node[state, red] (q_1) [right=of q_0] {$q_1$};
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalizing this to regular expressions with larger
bounded repetition numbers, we have that
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if $r$ itself contains
more than 1 string.
This is to represent all the different
scenarios in which ``countdown'' states are active.}
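The $2^{n+1}$ blow-up can be observed experimentally. The following Python sketch (our own code) encodes the NFA for $(a+b)^*\,a\,(a+b)^{\{n\}}$ directly and counts the states produced by the textbook subset construction:

```python
# Counting the DFA states obtained by the subset construction from the
# NFA of (a+b)* a (a+b)^{n}.  NFA states: 0 is the looping start state,
# state i (1 <= i <= n+1) is reached i steps after the distinguished
# 'a', and state n+1 is accepting.

def dfa_states(n):
    def step(subset, c):
        nxt = {0}                       # 0 --a,b--> 0
        if c == 'a':
            nxt.add(1)                  # 0 --a--> 1
        for q in subset:
            if 1 <= q <= n:             # i --a,b--> i+1 (the countdown)
                nxt.add(q + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:                         # explore all reachable subsets
        s = todo.pop()
        for c in 'ab':
            t = step(s, c)
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return len(seen)

for n in range(1, 6):
    print(n, dfa_states(n))
```

For $n = 1, \ldots, 5$ this yields $4, 8, 16, 32, 64$ reachable subsets, i.e. exactly $2^{n+1}$ states.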

One of the most recent works in the context of lexing
%with this issue
is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
This is relevant work and we will compare it later on
with the derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++\cite{Verbatimpp}, which does not use derivatives,
but deterministic finite automata instead.
%An example that gives problem to automaton approaches would be
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
%It requires at least $2^{n+1}$ states to represent
%as a DFA.


Bounded repetitions are very important because they
tend to occur a lot in practical use.
For example they appear in the regex library RegExLib,
in the rules library of Snort, as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al \cite{xml2015},
more than half of the
XSDs they found have bounded regular expressions in them.
Often the counters are quite large, the largest up to ten million.
An example XSD they gave
was:
\begin{verbatim}
<sequence minOccurs="0" maxOccurs="65535">
 <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
 <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{verbatim}
This can be seen as the expression
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
satisfying the floating point number format).

It is therefore quite unsatisfying that
some regular expression matching libraries
impose ad hoc limits
for bounded regular expressions:
for example, in the regular expression matching library of the Go
language the regular expression $a^{1001}$ is not permitted, because no counter
can be above 1000, and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big. These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.

In the work reported in \cite{CSL2022} and here,
we add better support using derivatives
for bounded regular expressions $r^{\{n\}}$.
The results
extend straightforwardly to
repetitions with an interval such as
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives
on this problem is that
they can be naturally extended to support bounded repetitions.
Moreover these extensions are still made up of only
inductive datatypes and recursive functions,
making them handy to deal with in theorem provers.
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
%straightforwardly extended to deal with bounded regular expressions
%and moreover the resulting code still consists of only simple
%recursive functions and inductive datatypes.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.
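To make this concrete, the standard derivative clauses for $r^{\{n\}}$ can be sketched in a few lines (our own self-contained Python illustration, not the verified Isabelle development): the counter is decremented lazily, so the expression is never unfolded into $n$ copies up front.

```python
# Derivative clauses extended with bounded repetitions r^{n}:
# der c (r^{n}) = (der c r) . r^{n-1}, with r^{0} nullable.

def nullable(r):
    tag = r[0]
    if tag == 'zero':   return False
    if tag == 'one':    return True
    if tag == 'char':   return False
    if tag == 'alt':    return nullable(r[1]) or nullable(r[2])
    if tag == 'seq':    return nullable(r[1]) and nullable(r[2])
    if tag == 'ntimes': return r[2] == 0 or nullable(r[1])

def der(c, r):
    tag = r[0]
    if tag in ('zero', 'one'):
        return ('zero',)
    if tag == 'char':
        return ('one',) if r[1] == c else ('zero',)
    if tag == 'alt':
        return ('alt', der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        left = ('seq', der(c, r[1]), r[2])
        return ('alt', left, der(c, r[2])) if nullable(r[1]) else left
    if tag == 'ntimes':
        if r[2] == 0:
            return ('zero',)
        # decrement the counter only when a character is consumed
        return ('seq', der(c, r[1]), ('ntimes', r[1], r[2] - 1))

def matches(r, s):
    for c in s:
        r = der(c, r)
    return nullable(r)

r = ('ntimes', ('char', 'a'), 100)     # a^{100}, never unfolded
print(matches(r, 'a' * 100))           # True
print(matches(r, 'a' * 99))            # False
```

Note that the whole extension is again just a recursive function over an inductive datatype, which is what makes it pleasant to reason about in a theorem prover.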


\subsection{Back-References}
%Some constructs can even raise the expressive
%power to the non-regular realm, for example
%the back-references.
%bounded repetitions, as we have discussed in the
%previous section.
%This super-linear behaviour of the
%regex matching engines we have?
The syntax and expressive power of regexes
can go quickly beyond the regular language hierarchy
with back-references.
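A classic witness is the copy language $\{ww \mid w \in \Sigma^+\}$, which is not regular (not even context-free), yet is matched by a single back-reference, as this small Python illustration shows:

```python
import re

# The pattern (.+)\1 uses a back-reference to require that the input
# consists of some non-empty word w followed by an identical copy of w.
# The language { ww | w a non-empty word } is not regular.
copy_lang = re.compile(r'(.+)\1')

print(bool(copy_lang.fullmatch('abcabc')))   # True  (w = abc)
print(bool(copy_lang.fullmatch('abcab')))    # False
```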
%\section{Back-references and The Terminology Regex}

%When one constructs an $\NFA$ out of a regular expression
%there is often very little to be done in the first phase, one simply
%construct the $\NFA$ states based on the structure of the input regular expression.

%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
%one by keeping track of all active states after consuming
%a character, and update that set of states iteratively.
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
%for a path terminating
%at an accepting state.
Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
type of $\mathit{NFA}$ simulation and guarantee a linear runtime
in terms of input string length.
%TODO:try out these lexers
The other way to use $\mathit{NFA}$ for matching is choosing
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps



%The time cost of regex matching algorithms in general
%involve two different phases, and different things can go differently wrong on
%these phases.
%$\DFA$s usually have problems in the first (construction) phase
%, whereas $\NFA$s usually run into trouble
%on the second phase.


\section{Error-prone POSIX Implementations}
The problems with practical implementations
of regexes are not limited to slowness on certain
cases.
Another problem with these libraries is that there
is no correctness guarantee.
In some cases, they either fail to generate a lexing result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
A concrete example would be the regex
\begin{center}
$(aba + ab + a)^*$ and the string $ababa$
\end{center}
The correct $\POSIX$ match for the above would be
the entire string $ababa$,
split into two Kleene star iterations, namely $[ab]$, $[aba]$, at positions
$[0, 2), [2, 5)$
respectively.
But trying this out in regex101\parencite{regex101}
with different language engines would yield
the same two fragmented matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
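This fragmented behaviour can be reproduced with, for example, Python's backtracking re engine (one of the engines offered by regex101):

```python
import re

# POSIX longest-match lexing would split the whole of "ababa" into the
# two iterations [ab][aba].  A backtracking engine scanning for
# non-overlapping leftmost matches instead returns two fragments,
# leaving the 'b' at position 3 unmatched.
fragments = re.findall(r'aba|ab|a', 'ababa')
print(fragments)            # ['aba', 'a']
```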

Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.

As Grathwohl\parencite{grathwohl2014crash} wrote,
\begin{quote}
The POSIX strategy is more complicated than the
greedy because of the dependence on information about
the length of matched strings in the various subexpressions.
\end{quote}
%\noindent
To summarise the above, regular expressions are important.
They are popular and programming languages' library functions
for them are very fast on non-catastrophic cases.
But there are problems with current practical implementations.
The first problem is that the running time might blow up.
The second problem is that they might be error-prone on certain
very simple cases.
In the next part of the chapter, we will look into reasons why
certain regex engines are running horribly slow on the ``catastrophic''
cases and propose a solution that addresses both of these problems
based on Brzozowski and Sulzmann and Lu's work.


\subsection{Different Phases of a Matching/Lexing Algorithm}


Most lexing algorithms can be roughly divided into
two phases during their run.
The first phase is the ``construction'' phase,
in which the algorithm builds some
suitable data structure from the input regex $r$, so that
it can be easily operated on later.
We denote
the time cost of this phase by $P_1(r)$.
The second phase is the lexing phase, when the input string
$s$ is read and the data structure
representing the regex $r$ is operated on.
We represent the time
it takes by $P_2(r, s)$.\\
For a $\mathit{DFA}$,
we have $P_2(r, s) = O(|s|)$,
because we take at most $|s|$ steps,
and each step takes
at most one transition, since
a deterministic finite automaton
by definition has at most one state active and at most one
transition upon receiving an input symbol.
But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later.
For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
expressions like $r^n$ into
\[
\underbrace{r \cdots r}_{\text{n copies of r}}.
\]
The cost $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
to $|r| \cdot 2^{|s|}$.
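The exponential behaviour of backtracking can be observed on a toy matcher. The sketch below (our own instrumented Python code, not any production engine) counts the recursive calls made when matching $(a + aa)^*$ against strings $a^n b$, which can never match:

```python
# A naive backtracking matcher over the basic syntax, instrumented to
# count recursive calls.  On (a + aa)* against the non-matching strings
# a^n b, the count grows exponentially in n, because every way of
# splitting a^n into single and double a's is explored.

def match(r, s, i, k, count):
    count[0] += 1
    tag = r[0]
    if tag == 'char':
        return i < len(s) and s[i] == r[1] and k(i + 1)
    if tag == 'alt':
        return match(r[1], s, i, k, count) or match(r[2], s, i, k, count)
    if tag == 'seq':
        return match(r[1], s, i,
                     lambda j: match(r[2], s, j, k, count), count)
    if tag == 'star':
        # try one more (non-empty) iteration, otherwise stop
        return match(r[1], s, i,
                     lambda j: j > i and match(r, s, j, k, count),
                     count) or k(i)

def calls(n):
    r = ('star', ('alt', ('char', 'a'),
                  ('seq', ('char', 'a'), ('char', 'a'))))
    count = [0]
    ok = match(r, 'a' * n + 'b', 0, lambda i: i == n + 1, count)
    assert not ok                  # the trailing b can never be matched
    return count[0]

print(calls(10), calls(20))        # the second count is vastly larger
```

The call count grows roughly like the number of compositions of $n$ into ones and twos, i.e. exponentially in $n$, which is exactly the backtracking worst case described above.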
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of
%states active and transitions taken each time is bounded by a
%constant $C$.
%But modern regex libraries in popular language engines
% often want to support much richer constructs than just
% sequences and Kleene stars,
%such as negation, intersection,
%bounded repetitions and back-references.
%And de-sugaring these "extended" regular expressions
%into basic ones might bloat the size exponentially.
%TODO: more reference for exponential size blowup on desugaring.

\subsection{Why $\mathit{DFA}$s can be slow in the first phase}


The good thing about $\mathit{DFA}$s is that once
generated, they are fast and stable, unlike
backtracking algorithms.
However, they do not scale well with bounded repetitions.

\subsubsection{Problems with Bounded Repetitions}


So we have practical implementations
of regular expression matching/lexing which are fast
but do not come with any guarantees that they will not grind to a halt
or give wrong answers.
Our goal is to have a regex lexing algorithm that comes with