--- a/etnms/etnms.tex Wed Mar 04 13:25:52 2020 +0000
+++ b/etnms/etnms.tex Fri Apr 10 11:58:11 2020 +0100
@@ -91,75 +91,59 @@
\maketitle
\begin{abstract}
- Brzozowski introduced in 1964 a beautifully simple algorithm for
- regular expression matching based on the notion of derivatives of
- regular expressions. In 2014, Sulzmann and Lu extended this
- algorithm to not just give a YES/NO answer for whether or not a
- regular expression matches a string, but in case it does also
- answers with \emph{how} it matches the string. This is important for
- applications such as lexing (tokenising a string). The problem is to
- make the algorithm by Sulzmann and Lu fast on all inputs without
- breaking its correctness. Being fast depends on a complete set of
- simplification rules, some of which
- have been put forward by Sulzmann and Lu. We have extended their
- rules in order to obtain a tight bound on the size of regular expressions.
- We have tested these extended rules, but have not
- formally established their correctness. We have also not yet looked
- at extended regular expressions, such as bounded repetitions,
- negation and back-references.
+Regular expressions are a useful concept in computer science,
+initially designed for matching patterns in strings,
+and they need fast matching algorithms.
+If we do matching by naively converting
+the regular expression to a
+Nondeterministic Finite Automaton (NFA) and simulating it
+with backtracking,
+the matcher can run into trouble on certain evil regular expression
+and string pairs.
+This necessitates a
+method based on derivatives of regular expressions~\cite{Brzozowski1964}.
+With some variations introduced by Sulzmann and Lu~\cite{Sulzmann2014},
+it elegantly circumvents the problem of
+catastrophic backtracking that plagues
+backtracking NFA-based regex engines.
+We believe that such a method can help to address
+the status quo, where badly written code for
+regular expression matching can bring software to grief.
+We have come up with a set of simplification rules
+that keeps the space and time requirements of the algorithm
+within a tight bound.
+We have proved certain properties of the algorithm
+and of the simplification rules.
+We have not yet worked out a proof of correctness,
+although test data suggest that it holds.
\end{abstract}
+
\section{Introduction}
-%Regular expressions' derivatives, which have received
-%renewed interest in the new millenium, is a beautiful....
-While we believe derivatives of regular expressions, written
-$r\backslash s$, are a beautiful concept (in terms of ease of
-implementing them in functional programming languages and in terms of
-reasoning about them formally), they have one major drawback: every
-derivative step can make regular expressions grow drastically in
-size. This in turn has negative effect on the runtime of the
-corresponding lexing algorithms. Consider for example the regular
-expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The
-corresponding derivative contains already 8668 nodes where we assume
-the derivative is given as a tree. The reason for the poor runtime of
-the derivative-based lexing algorithms is that they need to traverse
-such trees over and over again. The solution is to find a complete set
-of simplification rules that keep the sizes of derivatives uniformly
-small.
-
-This has been partially addressed by the function $\blexer_{simp}$,
-which after the simplification the $(a+aa)^*$ example's 8000 nodes will be
-reduced to just 6 and stays constant in each derivative step.
-The part that still needs more work is the correctness proof of this
-function, namely,
-\begin{equation}\label{mainthm}
-\blexers \; r \; s = \blexer \;r\;s
-\end{equation}
-
-\noindent
-and this is what this report is mainly about. A condensed
-version of the last report will be provided in the next section
-to help the reader understand the report, and the attempts
-on the problem will follow.
-
-
-\section{Recapitulation of Concepts From the Last Report}
-
-\subsection*{Regular Expressions and Derivatives}
-Suppose (basic) regular expressions are given by the following grammar:
-
-\[ r ::= \ZERO \mid \ONE
- \mid c
- \mid r_1 \cdot r_2
- \mid r_1 + r_2
- \mid r^*
-\]
-
-\noindent
-The ingenious contribution of Brzozowski is the notion of \emph{derivatives} of
-regular expressions, written~$\_ \backslash \_$. It uses the auxiliary notion of
-$\nullable$ defined below.
+The following brief introduction gives the background for
+derivatives: how they were originally defined, and the
+variant~\cite{Sulzmann2014} that is more suitable for lexing.
+Regular expressions can be formalized as follows:
+\begin{center}
+	$r ::= \ZERO \mid \ONE \mid c \mid r_1 + r_2 \mid r_1 \cdot r_2 \mid r^*$
+\end{center}
+The derivative of a regular expression can be
+seen as the regular expression corresponding to
+the language obtained by computing
+the left quotient of the original regular language
+w.r.t.~a character.
+The derivative of a regular expression is again a
+regular expression, because
+the left quotient of a regular language is also regular.
+Hence we know that such a function exists, without having
+to worry that it might be undefined.
+In fact, derivatives of regular expressions
+are not just well-defined but also simple and elegant:
+they are computed by two recursive functions,
+both of which have an intuitive explanation.
+The function
+$\nullable$, defined below,
+tests whether a regular expression can match
+the empty string.
\begin{center}
\begin{tabular}{lcl}
@@ -170,7 +154,24 @@
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
\end{tabular}
- \end{center}
+\end{center}
+\noindent
+The empty set does not contain any string and
+therefore does not contain the empty string; the regular expression
+$\ONE$ contains the empty string
+by definition; a character regular expression
+is the singleton language containing just that character,
+and therefore does not contain the empty string.
+For an alternative regular expression (or ``or'' expression),
+any one of its children being nullable
+suffices for the whole to be nullable. A sequence regular expression
+requires both children to contain the empty string
+in order to compose an empty string, and the Kleene star
+naturally contains the empty string as the case of zero iterations.
+All nice and easy.
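+For instance, using the clauses above we can compute
+\begin{center}
+$\nullable((a + \ONE)\cdot b^*) \;=\; \nullable(a + \ONE) \wedge \nullable(b^*)
+\;=\; (\mathit{false} \vee \mathit{true}) \wedge \mathit{true} \;=\; \mathit{true}$.
+\end{center}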
+This definition is used in the derivative operation
+as a condition:
\begin{center}
\begin{tabular}{lcl}
@@ -186,11 +187,26 @@
\end{tabular}
\end{center}
\noindent
-And defines how a regular expression evolves into
+The derivative function, written $\backslash c$,
+defines how a regular expression evolves into
+a new regular expression when
+the head character $c$ is chopped off all the strings it matches.
+The most involved cases are the sequence
+and star cases.
+The sequence case says that if the first regular expression
+contains the empty string, then the second component of the sequence
+may also be chosen as the regular expression from which the
+head character is chopped off.
+The star case unwraps one iteration of the
+regular expression and attaches the star regular expression
+to its back again, to make sure that 0 or more iterations
+can follow this unfolded iteration.
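+For example, using the sequence and star clauses we can compute
+\begin{center}
+$((a\cdot b)^*)\backslash a \;=\; ((a\cdot b)\backslash a)\cdot (a\cdot b)^*
+\;=\; (\ONE\cdot b)\cdot (a\cdot b)^*$.
+\end{center}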
-The main property of the derivative operation is that
+
+The main property of the derivative operation
+that enables us to reason about the correctness of
+an algorithm using derivatives is
\begin{center}
$c\!::\!s \in L(r)$ holds
@@ -232,18 +248,61 @@
relatively easily shown that this matcher is correct (that is given
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
-
+A beautiful and simple definition.
+
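+As a minimal sketch in Scala (assuming the \texttt{Rexp} datatype
+together with \texttt{der} and \texttt{nullable} from above;
+\texttt{ders} and \texttt{matcher} are our names for the lifted
+functions), the matcher can be written as:
+\begin{verbatim}
+// lift the derivative from characters to strings
+def ders(s: List[Char], r: Rexp): Rexp = s match {
+  case Nil => r
+  case c :: cs => ders(cs, der(c, r))
+}
+
+// a string matches r iff the derivative w.r.t. the whole
+// string can still match the empty string
+def matcher(r: Rexp, s: String): Boolean =
+  nullable(ders(s.toList, r))
+\end{verbatim}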
+If we implement the above algorithm naively, however,
+the algorithm can be excruciatingly slow. For example, when starting with the regular
+expression $(a + aa)^*$ and building 12 successive derivatives
+w.r.t.~the character $a$, one obtains a derivative regular expression
+with more than 8000 nodes (when viewed as a tree). Operations like
+$\backslash$ and $\nullable$ need to traverse such trees and
+consequently the bigger the size of the derivative the slower the
+algorithm.
+
+Brzozowski was quick to observe that during this process a lot of
+useless $\ONE$s and $\ZERO$s are generated, making the naive algorithm
+far from optimal.
+He therefore introduced some ``similarity rules'', such
+as $P+(Q+R) = (P+Q)+R$, to merge syntactically
+different but language-equivalent sub-regexes and further decrease the size
+of the intermediate regexes.
+
+More simplifications are possible, such as deleting duplicates
+and opening up nested alternatives to trigger even more simplifications.
+Suppose we apply simplification after each derivative step, and compose
+these two operations together into an atomic one: $a \backslash_{simp}\,c \dn
+\textit{simp}(a \backslash c)$. Then we can build
+a matcher without suffering from cumbersome regular expressions.
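+
+In Scala this composition can be sketched as follows (assuming a
+simplification function \texttt{simp}; one possible version is
+sketched further below):
+\begin{verbatim}
+// derivative followed by simplification, as one atomic step
+def der_simp(c: Char, r: Rexp): Rexp = simp(der(c, r))
+
+def ders_simp(s: List[Char], r: Rexp): Rexp = s match {
+  case Nil => r
+  case c :: cs => ders_simp(cs, der_simp(c, r))
+}
+\end{verbatim}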
+
+
+If we want the size of derivatives in the algorithm to
+stay even lower, we need more aggressive simplifications.
+Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
+delete duplicates whenever possible. For example, the parentheses in
+$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
+\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
+example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
+$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
+to achieve a very tight size bound, namely,
+the same size bound as that of the \emph{partial derivatives}.
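+
+A sketch of such a simplification in Scala (one possible
+implementation; \texttt{ALTS} is the $n$-ary alternative constructor
+used in our implementation, and duplicates are removed only up to
+syntactic equality):
+\begin{verbatim}
+// open up nested alternatives and delete ZEROs
+def flts(rs: List[Rexp]): List[Rexp] = rs match {
+  case Nil => Nil
+  case ZERO :: rest => flts(rest)
+  case ALTS(rs1) :: rest => rs1 ::: flts(rest)
+  case r :: rest => r :: flts(rest)
+}
+
+def simp(r: Rexp): Rexp = r match {
+  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, _) => ZERO
+    case (_, ZERO) => ZERO
+    case (ONE, r2s) => r2s
+    case (r1s, r2s) => SEQ(r1s, r2s)
+  }
+  case ALTS(rs) => flts(rs.map(simp)).distinct match {
+    case Nil => ZERO
+    case r1 :: Nil => r1
+    case rs1 => ALTS(rs1)
+  }
+  case r => r
+}
+\end{verbatim}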
+
+Building derivatives and then simplifying them:
+so far so good. But what if we want to
+do lexing instead of just getting a YES/NO answer?
+This requires us to go back again, for a moment, to the world
+without simplification.
+Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
+elegant (arguably as beautiful as the original
+derivatives definition) solution for this.
+
\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
-One limitation of Brzozowski's algorithm is that it only produces a
-YES/NO answer for whether a string is being matched by a regular
-expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
-to allow generation of an actual matching, called a \emph{value} or
+
+They first defined the datatype for storing the
+lexing information, called a \emph{value} or
sometimes also \emph{lexical value}. These values and regular
expressions correspond to each other as illustrated in the following
table:
-
\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
@@ -271,6 +330,36 @@
\end{center}
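+
+In the accompanying Scala implementation these values can be
+represented by a datatype along the following lines (the constructor
+names match the ones used in the code):
+\begin{verbatim}
+abstract class Val
+case object Empty extends Val                  // for ONE
+case class Chr(c: Char) extends Val            // for a character
+case class Sequ(v1: Val, v2: Val) extends Val  // for a sequence
+case class Left(v: Val) extends Val            // left alternative
+case class Right(v: Val) extends Val           // right alternative
+case class Stars(vs: List[Val]) extends Val    // star iterations
+\end{verbatim}
+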
\noindent
+One regular expression can have multiple lexical values. For example,
+the regular expression $(a+b)^*$ has an infinite list of
+values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
+$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
+$\ldots$, and vice versa.
+Even for a regular expression matching a particular string, there can
+still be more than one value corresponding to it.
+Take the example where $r= (a^*\cdot a^*)^*$ and the string
+$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
+The number of different ways of matching
+without allowing any value under a star to be flattened
+to an empty string can be given by the following formula:
+\begin{center}
+ $C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
+\end{center}
+and a closed form formula can be calculated to be
+\begin{equation}
+ C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
+\end{equation}
+which clearly grows exponentially.
+A lexer aiming to produce all possible values has an exponential
+worst-case runtime. Therefore it is impractical to try to generate
+all possible matches in a run. In practice, we are usually
+interested in POSIX values, which intuitively always
+match the leftmost regular expression when there is a choice,
+and always match a sub-part as much as possible before proceeding
+to the next token. For example, the above example has the POSIX value
+$\Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
+The output we want from a lexing algorithm is such a POSIX match
+encoded as a value.
The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
@@ -298,7 +387,7 @@
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm from the right to left. For the first value $v_n$, we call the
-function $\textit{mkeps}$, which builds the lexical value
+function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as
@@ -320,7 +409,8 @@
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
-matches $s$. For this Sulzmann and Lu defined a function that reverses
+matches $s$. The POSIX value is maintained throughout the process.
+For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
@@ -345,55 +435,54 @@
\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
-
-
-\subsection*{Simplification of Regular Expressions}
+The clauses basically do one thing: identify the ``hole'' in the
+value into which the character has to be injected back.
+For instance, consider the last clause, which injects back into a value
+that will turn into a star value corresponding to a star regex.
+We know the value being injected into must be a sequence value, whose first
+component corresponds to the child regex of the star
+with its first character chopped off, namely the iteration of the star
+that has just been unfolded. This value is followed by the already
+matched star iterations we collected before. So we inject the character
+back into the first value and form a new value, with this new iteration
+added to the front of the previous list of iterations, all under the
+$\Stars$ constructor at the top level.
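+In the Scala implementation below, this clause is a one-liner:
+\begin{verbatim}
+// inject c back into the just-unfolded iteration v1, then
+// cons it onto the iterations vs collected so far
+case (STAR(r), Sequ(v1, Stars(vs))) => Stars(inj(r, c, v1)::vs)
+\end{verbatim}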
-The main drawback of building successive derivatives according
-to Brzozowski's definition is that they can grow very quickly in size.
-This is mainly due to the fact that the derivative operation generates
-often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, if
-implemented naively both algorithms by Brzozowski and by Sulzmann and Lu
-are excruciatingly slow. For example when starting with the regular
-expression $(a + aa)^*$ and building 12 successive derivatives
-w.r.t.~the character $a$, one obtains a derivative regular expression
-with more than 8000 nodes (when viewed as a tree). Operations like
-$\textit{der}$ and $\nullable$ need to traverse such trees and
-consequently the bigger the size of the derivative the slower the
-algorithm.
+We have mentioned before that derivatives without simplification
+can get clumsy, and this is true for values as well: they reflect
+the size of the regular expressions by definition.
-Fortunately, one can simplify regular expressions after each derivative
-step.
-Various simplifications of regular expressions are possible, such
-as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
-\cdot \ONE$, and $r + r$ to just $r$.
-Suppose we apply simplification after each derivative step, and compose
-these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
-\textit{simp}(a \backslash c)$. Then we can build values without having
-a cumbersome regular expression, and fortunately if we are careful
-enough in making some extra rectifications, the POSIX value of how
+One can introduce simplification on the regexes and values, but one has to
+be careful not to break the correctness, as the injection
+function relies heavily on the structure of the regexes and values
+being correct and matching each other.
+This can be achieved by recording some extra rectification functions
+during the derivative step, and applying these rectifications
+during the injection phase.
+With this one can prove that the POSIX value of how
 regular expressions match strings will not be affected---although this is much harder
to establish. Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.
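+
+A sketch of this idea in Scala (a hypothetical \texttt{simp\_rect};
+binary \texttt{ALT}/\texttt{SEQ} constructors are assumed, and the
+second component of the result rectifies a value for the simplified
+regex back into a value for the original regex):
+\begin{verbatim}
+def simp_rect(r: Rexp): (Rexp, Val => Val) = r match {
+  case ALT(r1, r2) =>
+    val (r1s, f1) = simp_rect(r1)
+    val (r2s, f2) = simp_rect(r2)
+    (r1s, r2s) match {
+      case (ZERO, _) => (r2s, v => Right(f2(v)))
+      case (_, ZERO) => (r1s, v => Left(f1(v)))
+      case _ => (ALT(r1s, r2s),
+                 { case Left(v) => Left(f1(v))
+                   case Right(v) => Right(f2(v)) })
+    }
+  case SEQ(r1, r2) =>
+    val (r1s, f1) = simp_rect(r1)
+    val (r2s, f2) = simp_rect(r2)
+    (r1s, r2s) match {
+      case (ZERO, _) => (ZERO, v => v) // no value exists here anyway
+      case (_, ZERO) => (ZERO, v => v)
+      case (ONE, _)  => (r2s, v => Sequ(f1(Empty), f2(v)))
+      case _ => (SEQ(r1s, r2s),
+                 { case Sequ(v1, v2) => Sequ(f1(v1), f2(v2)) })
+    }
+  case r => (r, v => v)
+}
+\end{verbatim}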
+%Brzozowski, after giving the derivatives and simplification,
+%did not explore lexing with simplification, or he may well have been
+%stuck on an efficient simplification with a proof.
+%He went on to explore the use of derivatives together with
+%automata, and did not try lexing using derivatives.
-If we want the size of derivatives in Sulzmann and Lu's algorithm to
-stay even lower, we would need more aggressive simplifications.
-Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
-deleting duplicates whenever possible. For example, the parentheses in
-$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
-\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
-example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
-$a^*+a+\ONE$. Adding these more aggressive simplification rules help us
-to achieve a very tight size bound, namely,
- the same size bound as that of the \emph{partial derivatives}.
-And we want to get rid of complex and fragile rectification of values.
+We want to get rid of the complex and fragile rectification of values.
+Can we avoid creating those intermediate values $v_1,\ldots, v_n$
+altogether, and extract the lexing information, which is already there, while
+doing derivatives in one pass, without a second injection phase?
+In the meantime, can we make sure that simplifications
+are easily handled without breaking the correctness of the algorithm?
+Sulzmann and Lu solved this problem by
+adding additional information to the
+regular expressions, called \emph{bitcodes}.
-In order to implement the idea of ``spilling out alternatives'' and to
-make them compatible with the $\textit{inj}$-mechanism, we use
-\emph{bitcodes}. They were first introduced by Sulzmann and Lu.
-Here bits and bitcodes (lists of bits) are defined as:
+\subsection*{Bit-coded Algorithm}
+Bits and bitcodes (lists of bits) are defined as:
\begin{center}
$b ::= 1 \mid 0 \qquad
@@ -674,14 +763,18 @@
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).
+Remember that one of the important reasons we introduced bitcodes
+is that they make simplification more structured, and therefore
+help with guaranteeing correctness.
\subsection*{Our Simplification Rules}
-The main point of the bitcodes and annotated regular expressions is that
-we can apply rather aggressive (in terms of size) simplification rules
-in order to keep derivatives small. We have developed such
-``aggressive'' simplification rules and generated test data that show
-that the expected bound can be achieved. Obviously we could only
+In this section we introduce aggressive (in terms of size) simplification rules
+on annotated regular expressions
+in order to keep derivatives small. Such simplifications are promising,
+as we have generated test data showing
+that a tight size bound can be achieved. Obviously we could only
partially cover the search space as there are infinitely many regular
expressions and strings.
@@ -1756,6 +1849,92 @@
Hopefully in the next few weeks we will be able to find one.
This would be an important milestone for my dissertation.
+
+\section{Future Plans}
+
+Before the proof comes out, a plan is needed to make
+sure that progress keeps happening.
+We plan to get
+a practical implementation of our current
+method, which involves extending the set of allowed syntax
+to include common regular expression constructs
+such as bounded repetitions, negation (complement)
+and character ranges, or even Boolean functions
+(a derivative clause for bounded repetitions is sketched
+right after this paragraph).
+Adding these concise ways of expressing regular expressions,
+while keeping the simplifications correct,
+makes the work more practical.
+For example, using the added constructs we can
+demonstrate that our implementation could actually
+have helped the Cloudflare web-application firewall
+to run smoothly even when fed with the
+evil regular expression
+$.*(?:.*=.*)))$
+and strings like $x=xxxxxxxxxx....$.
+This could help show that our algorithm is not just
+nice theoretical work but is actually competent in practice,
+as this regular expression and string pair made the Cloudflare
+servers grind
+to a halt on July 2nd, 2019.
+After all, the ``basic'' regular expressions
+we gave in the introduction are really so basic that
+even Brzozowski's 1964 paper included more
+varieties than that (Boolean functions instead of
+just alternatives)~\cite{Brzozowski1964}.
+An implementation covering only the basic constructs
+would be of limited practical interest.
+Making the algorithm practical also means making it fast.
+Right now the Scala implementation
+is slower than the naive algorithm without bitcodes, even with those
+drastic simplification measures, which is not what the theory suggests.
+We might need to implement it in other languages such as C
+to see if the problem is language specific.
+This is about 3 months' work.
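+
+As a first step, a derivative clause for bounded repetitions
+$r^{\{n\}}$ could look as follows (a sketch for matching only, not yet
+for lexing with bitcodes; \texttt{NTIMES} is a hypothetical
+constructor):
+\begin{verbatim}
+case class NTIMES(r: Rexp, n: Int) extends Rexp  // r{n}
+
+// inside nullable: r{0} contains the empty string;
+// otherwise every copy must be nullable
+case NTIMES(r1, n) => n == 0 || nullable(r1)
+
+// inside der: chop c off the first copy and
+// decrement the counter
+case NTIMES(r1, n) =>
+  if (n == 0) ZERO
+  else SEQ(der(c, r1), NTIMES(r1, n - 1))
+\end{verbatim}
+The point of such constructs is conciseness: without \texttt{NTIMES},
+$r^{\{n\}}$ has to be expanded into $n$ copies of $r$, blowing up the
+size of the regular expression before any derivative is taken.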
+
+We believe that there
+is still potential for more simplification, as there are many more
+rules for regular expression similarity; however,
+this needs to be explored and checked.
+Looking for maximal simplification and the
+best size bound, and proving them, would take around 3 months to complete.
+
+Context-free languages are a bit harder to deal with by derivatives
+than regular expressions.
+The main reason is that a child node of a context-free grammar
+can be the grammar itself; this left-recursion makes
+a direct computation of derivatives loop forever.
+An example would be the following~\cite{might2011cfgder}:
+\begin{center}
+$L = L \cdot x \mid \ONE$
+\end{center}
+Taking the derivative then gives us:
+\begin{center}
+$L\backslash x = (L \backslash x)\cdot x \mid \ONE$
+\end{center}
+which has essentially the same shape as before: the definition of
+$L\backslash x$ refers to $L\backslash x$ itself.
+This recursion can go on forever if we do not use some
+other method, or manually decide that the two equations
+are actually equivalent, i.e.~that $L$ and $L\backslash x$
+denote the same language, and then give a mathematical proof
+that this is actually the case.
+The latter seems well beyond what any mechanical algorithm can
+be expected to do.
+People have tried various ways to deal with the problem
+of looping when computing derivatives of context-free grammars,
+such as memoization and laziness, with some degree of success~\cite{might2011cfgder}.
+However this field seems largely unexplored, and further
+optimizations seem possible. It would be great
+if we could find a simple augmentation of the current
+derivative method so that the subset of context-free languages
+of interest can be handled.
+This may take longer than the two previous tasks, but we still
+give it 3 months and see how it goes.
+
\bibliographystyle{plain}
\bibliography{root}
--- a/lex_blex_Frankensteined.scala Wed Mar 04 13:25:52 2020 +0000
+++ b/lex_blex_Frankensteined.scala Fri Apr 10 11:58:11 2020 +0100
@@ -137,7 +137,9 @@
case Stars(v::vs) => S::code(v) ::: code(Stars(vs))
}
-
+  //note that Left and Right are not recorded in the bits,
+  //although they guide the retrieval,
+  //in contrast with Stars
def retrieve(r: ARexp, v: Val): Bits = (r,v) match {
case (AONE(bs), Empty) => bs
case (ACHAR(bs, c), Chr(d)) => bs
@@ -256,7 +258,40 @@
case RECD(x, r) => Rec(x, mkeps(r))
//case PLUS(r) => Stars(List(mkeps(r)))
}
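+  //haschr tests whether r can match some single-character string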
+ def haschr(r: Rexp) : Boolean = r match {
+ case CHAR(c) => true
+ case STAR(r0) => haschr(r0)
+ case SEQ(r1, r2) => haschr(r1) && nullable(r2) || haschr(r2) && nullable(r1)
+ case ALTS(List(r1, r2)) => haschr(r1) || haschr(r2)
+ case ONE => false
+ case ZERO => false
+ }
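+  //haschar tests whether r can match the single-character string c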
+  def haschar(r: Rexp, c: Char) : Boolean = r match {
+    case CHAR(d) => c == d
+    case SEQ(r1, r2) => (haschar(r1, c) && nullable(r2)) || (haschar(r2, c) && nullable(r1))
+    case STAR(r) => haschar(r, c)
+    case ALTS(List(r1, r2)) => haschar(r1, c) || haschar(r2, c)
+    case ONE => false
+    case ZERO => false
+  }
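+  //mkchr builds a value witnessing how r matches some single-character
+  //string (assumes haschr(r) holds)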
+ def mkchr(r: Rexp) : Val = r match {
+ case SEQ(r1, r2) =>
+ if(haschr(r1) && nullable(r2)) Sequ(mkchr(r1), mkeps(r2))
+ else if(nullable(r1) && haschr(r2)) Sequ(mkeps(r1), mkchr(r2))
+ else throw new Exception(r.toString)
+ case ALTS(List(r1, r2)) => if (haschr(r1)) Left(mkchr(r1)) else Right(mkchr(r2))
+ case STAR(r0) => Stars(List(mkchr(r0)))
+ case CHAR(c) => Chr(c)
+ case _ => throw new Exception("the regex you gave me can't make a char")
+ }
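+  //mkchar builds a value witnessing how r matches the single-character
+  //string c (assumes haschar(r, c) holds)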
+  def mkchar(r: Rexp, c: Char) : Val = r match {
+    case CHAR(d) => Chr(c) //assumes c == d, guaranteed by haschar
+    case ALTS(List(r1, r2)) =>
+      if (haschar(r1, c)) Left(mkchar(r1, c)) else Right(mkchar(r2, c))
+    case SEQ(r1, r2) => if(haschar(r1, c)) Sequ(mkchar(r1, c), mkeps(r2)) else Sequ(mkeps(r1), mkchar(r2, c))
+    case STAR(r) => Stars(List(mkchar(r, c)))
+    case _ => throw new Exception("the regex you gave me can't make a char")
+  }
def inj(r: Rexp, c: Char, v: Val) : Val = (r, v) match {
case (STAR(r), Sequ(v1, Stars(vs))) => Stars(inj(r, c, v1)::vs)
case (SEQ(r1, r2), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
@@ -268,6 +303,28 @@
case (RECD(x, r1), _) => Rec(x, inj(r1, c, v))
//case (PLUS(r), Sequ(v1, Stars(vs))) => Stars(inj(r, c, v1)::vs)
}
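+  //fuzzy_inj repairs a value whose shape no longer matches the
+  //annotated regex before injecting the character back;
+  //shapes it cannot handle are returned unchanged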
+ def fuzzy_inj(r: ARexp, c: Char, v: Val) : Val = v match {
+ case Stars(vs) => r match {//vs
+ case ASTAR(bs, r0) => inj( erase(r), c, Sequ(mkeps(erase(bder(c, r0))), v) )
+ case ASEQ(bs, r1, r2) => inj(erase(r), c, Sequ(mkeps(erase(bder(c, r1))), v) )
+      case _ => v
+    }
+ case Sequ(v1, v2) => r match {
+ case ASTAR(bs, r0) => inj(erase(r), c, Sequ(mkchar(erase(bder(c, r0)), c), v2))
+ case _ => v
+ }
+ case _ => v
+ }
+ /*def gen_rect(r: Rexp) : Val => Val = {
+  //inspiration, don't let it scatter
+ val r1 = bsimp(r)
+
+ }
+ def fuzzy_inj(r: ARexp, c: Char, v: Val) : Val = {
+ val f = gen_rect(r)
+ val vo = f(v)
+ inj(r, c, vo)
+ }*/
def lex(r: Rexp, s: List[Char]) : Val = s match {
case Nil => if (nullable(r)) mkeps(r) else throw new Exception("Not matched")
case c::cs => inj(r, c, lex(der(c, r), cs))
@@ -453,16 +510,16 @@
if(rel_dist >0){//the regex we are dealing with is not what v points at
rs match{
case Nil => throw new Exception("Trying to simplify a non-existent value")
- case AZERO :: rs1 => flats_vsimp(rs1, rel_dist - 1, remove(v))
- case AALTS(bs, rs1) :: rs2 => flats_vsimp(rs2, rel_dist - 1, augment(v, rs1.length - 1))//rs1 is guaranteed to have a len geq 2
- case r1 :: rs2 => flats_vsimp(rs2, rel_dist - 1, v)
+ case AZERO :: rs1 => right_shift(rs1, rel_dist - 1, remove(v))
+ case AALTS(bs, rs1) :: rs2 => right_shift(rs2, rel_dist - 1, augment(v, rs1.length - 1))//rs1 is guaranteed to have a len geq 2
+ case r1 :: rs2 => right_shift(rs2, rel_dist - 1, v)
}
}
else if(rel_dist == 0){//we are dealing with regex v is pointing to -- "v->r itself"
rs match{//r1 cannot be zero
- AALTS(bs, rs1) :: rs2 => flats_vsimp( )
+ AALTS(bs, rs1) :: rs2 => right_shift( )
AZERO::rs2 => throw new Exception("Value corresponds to zero")
- r1::rs2 => flats_vsimp(rs2, rel_dist - 1, v)
+ r1::rs2 => right_shift(rs2, rel_dist - 1, v)
}
}
@@ -470,12 +527,13 @@
}
*/
- def flats_vsimp(rs: List[ARexp], position: Int): Int = (rs, position) match {
+  //gives how much the regex at index i has shifted right after flatten (on a list of simplified regexes)
+ def right_shift(rs: List[ARexp], i: Int): Int = (rs, i) match {
case (_, 0) => 0
case (Nil, _) => 0
- case (AZERO :: rs1, _) => flats_vsimp(rs1, position - 1) - 1
- case (AALTS(bs, rs1) :: rs2, _) => rs1.length - 1 + flats_vsimp(rs2, position - 1)
- case (r1 :: rs2, _) => flats_vsimp(rs2, position - 1)
+ case (AZERO :: rs1, _) => right_shift(rs1, i - 1) - 1
+ case (AALTS(bs, rs1) :: rs2, _) => rs1.length - 1 + right_shift(rs2, i - 1)
+ case (r1 :: rs2, _) => right_shift(rs2, i - 1)
}
def rflats(rs: List[Rexp]): List[Rexp] = rs match {
case Nil => Nil
@@ -537,6 +595,12 @@
}
//only print at the top level
+  //find_pos returns the index of a certain regex in a list of regexes:
+  //the leftmost regex is given the index 0, the regex next to it
+  //is given 1, and so on;
+  //it needs the value to point out which regex to get the index of.
+  //find_pos_aux does essentially the same thing as find_pos, except that
+  //it receives an alts instead of a list of regular expressions
def find_pos(v: Val, rs: List[ARexp]): Int = (v, rs) match{
case (v, r::Nil) => 0
case (Right(v), r::rs) => find_pos(v, rs) + 1
@@ -559,21 +623,134 @@
case Right(v) => return simple_end(v)
case v => return true
}
- def isend(v: Val, rs: List[ARexp], position: Int): Boolean = {//TODO: here the slice api i am not familiar with so this call might be incorrect and crash the bsimp2
- val rsbh = rs.slice(position + 1, rs.length)
+
+  //tells if rs(i) after flatten is at the right end
+  def isend(v: Val, rs: List[ARexp], i: Int): Boolean = {
+    //rs.slice(i + 1, rs.length) takes the regexes strictly after index i
+    val rsbh = rs.slice(i + 1, rs.length)
val out_end = if(flats(rsbh) == Nil) true else false
val inner_end = simple_end(v)
inner_end && out_end
}
+  //get_coat takes the coat of v and puts it on vs
def get_coat(v: Val, rs: List[Rexp], vs: Val): Val = (v, rs) match{//the dual operation of remove(so-called by myself)
case (Right(v), r::Nil) => Right(vs)
case (Left(v), r::rs) => Left(vs)
case (Right(v), r::rs) => Right(get_coat(v, rs, vs))
}
+  //coat wraps a value in the given number of Right constructors
def coat(v: Val, i: Int) : Val = i match {
case 0 => v
case i => coat(Right(v), i - 1)
}
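+  //decoat is the inverse of coat: it strips i Right constructors from v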
+ def decoat(v:Val, i: Int) : Val = i match {
+ case 0 => v
+ case i => v match {
+ case Right(v0) => decoat(v0, i - 1)
+ case _ => throw new Exception("bad args decoat")
+ }
+ }
+  //given a regex and a value, return the rectification function that rebuilds the original value from the simplified value
+
+ def vunsimp(r: ARexp, v: Val) : Val => Val = (r, v) match {
+ case (ASEQ(bs1, r1, r2), Sequ(v1, v2)) => (bsimp2(r1, v1), bsimp2(r2, v2)) match {
+ case ((AZERO, _), (_, _) )=> throw new Exception("bad arguments")
+ case ((_, _), (AZERO, _)) => throw new Exception("bad arguments")
+ case ((AONE(bs2), v1s) , (r2s, v2s)) => (vtails => Sequ(v1, vunsimp(r2, v2)(vtails))) //(fuse(bs1 ++ bs2, r2s), v2s )//v2 tells how to retrieve bits in r2s, which is enough as the bits of the first part of the sequence has already been integrated to the top level of the second regx.
+ case ((r1s, v1s), (r2s, v2s)) => (vsmall => vsmall match {
+ case Sequ(v1small, v2small) => Sequ(vunsimp(r1, v1)(v1small), vunsimp(r2, v2)(v2small))
+ case _ => {
+ println(vsmall) ;
+ throw new Exception("bad arguments sequ")
+ }
+ //(ASEQ(bs1, r1s, r2s), Sequ(v1s, v2s))
+ })
+ }
+ case (AALTS(bs1, rs), v) => {
+ val init_ind = find_pos(v, rs)
+      val rightend1 = init_ind + 1 == rs.length
+ val inner_rectfunct = vunsimp(rs(init_ind), remove(v, rs))//remove all the outer layers of left and right in v to match the regx rs[i]
+ val vpointr = bsimp2(rs(init_ind), remove(v, rs))
+ val target_vs = vpointr._2
+ val target_rs = vpointr._1
+
+ val rs_simp = rs.map(bsimp)
+ val target_vs_kernel = target_rs match {
+ case AALTS(bs2, rs2) => remove(target_vs, rs2)//remove the secondary layer of left and right
+ case r => target_vs
+ }
+      val target_vs_outerlayers = target_rs match {
+        case AALTS(bs2, rs2) => find_pos(target_vs, rs2)//index of the branch within the secondary alts
+        case r => 0
+      }
+      val rightend2 = target_rs match {
+        case AALTS(bs2, rs2) => find_pos(target_vs, rs2) == rs2.length - 1
+        case r => false
+      }
+ val isalts = target_rs match {
+ case AALTS(bs2, rs2) => true
+ case r => false
+ }
+
+
+ val flat_res = flats(rs_simp)
+      val flats_shift1 = right_shift(rs_simp, init_ind)
+      val flats_shift2 = find_pos_aux(target_vs, target_rs)
+      val flats_shift = flats_shift1 + flats_shift2//right_shift used to be called flats_vsimp
+ val new_ind = init_ind + flats_shift
+
+ val dist_res = distinctBy(flat_res, erase)
+ val front_part = distinctBy(flat_res.slice(0, new_ind + 1), erase)
+
+ val vdb = if(dist_res.length == front_part.length )//that means the regex we are interested in is at the end of the list
+ {
+ coat(target_vs_kernel, front_part.length - 1)
+ }
+ else{
+ coat(Left(target_vs_kernel), front_part.length - 1)
+ }
+ if(rightend1){
+ if(rightend2){
+ kernel_coated: Val =>
+ decoat(kernel_coated, front_part.length - 1) match {
+ case Left(vk) => coat(inner_rectfunct(coat(vk, target_vs_outerlayers)), init_ind)//add clause: if rs_simp(init_ind) is an alts
+ case vk => coat(inner_rectfunct(coat(vk, target_vs_outerlayers)), init_ind)
+ }
+ }
+ else{
+ kernel_coated: Val =>
+ decoat(kernel_coated, front_part.length - 1) match {
+ case Left(vk) => if(isalts) coat(inner_rectfunct(coat(Left(vk), target_vs_outerlayers)), init_ind)
+ else coat(inner_rectfunct(coat((vk), target_vs_outerlayers)), init_ind)//add clause: if rs_simp(init_ind) is an alts
+ case vk => if(isalts) coat(inner_rectfunct(coat(Left(vk), target_vs_outerlayers)), init_ind)
+ else coat(inner_rectfunct(coat((vk), target_vs_outerlayers)), init_ind)
+ }
+ }
+ }
+ else{
+ if(rightend2){
+ kernel_coated: Val =>
+ decoat(kernel_coated, front_part.length - 1) match {
+ case Left(vk) => coat(Left(inner_rectfunct(coat(vk, target_vs_outerlayers))), init_ind)//add clause: if rs_simp(init_ind) is an alts
+ case vk => coat(Left(inner_rectfunct(coat(vk, target_vs_outerlayers))), init_ind)
+ }
+ }
+ else{
+ kernel_coated: Val =>
+ decoat(kernel_coated, front_part.length - 1) match {
+ case Left(vk) => if(isalts) coat(Left(inner_rectfunct(coat(Left(vk), target_vs_outerlayers))), init_ind)
+ else coat(Left(inner_rectfunct(coat(vk, target_vs_outerlayers))), init_ind)//add clause: if rs_simp(init_ind) is an alts
+ case vk => if(isalts) coat(Left(inner_rectfunct(coat(Left(vk), target_vs_outerlayers))), init_ind)
+ else coat(Left(inner_rectfunct(coat(vk, target_vs_outerlayers))), init_ind)
+ }
+ }
+
+ }
+
+ }
+ //case ASTAR(bs, r) => ASTAR(bs, bsimp(r))
+ case (r, v) => (v => v)
+ }
//This version takes a regex and a value, return a simplified regex and its corresponding simplified value
def bsimp2(r: ARexp, v: Val): (ARexp, Val) = (r,v) match{
case (ASEQ(bs1, r1, r2), Sequ(v1, v2)) => (bsimp2(r1, v1), bsimp2(r2, v2)) match {
@@ -583,37 +760,32 @@
case ((r1s, v1s), (r2s, v2s)) => (ASEQ(bs1, r1s, r2s), Sequ(v1s, v2s))
}
case (AALTS(bs1, rs), v) => {
- //phase 1 transformation so that aalts(bs1, rs) => aalts(bs1, rsf) and v => vf
val init_ind = find_pos(v, rs)
- val vs = bsimp2(rs(init_ind), remove(v, rs))//remove all the outer layers of left and right in v to match the regx rs[i]
- //println(vs)
+ val vpointr = bsimp2(rs(init_ind), remove(v, rs))//remove all the outer layers of left and right in v to match the regx rs[i]
+ val target_sv = vpointr._2
+ val target_sr = vpointr._1
+
val rs_simp = rs.map(bsimp)
- val vs_kernel = rs_simp(init_ind) match {
- case AALTS(bs2, rs2) => remove(vs._2, rs2)//remove the secondary layer of left and right
- case r => vs._2
+ val target_vs_kernel = target_sr match {
+ case AALTS(bs2, rs2) => remove(target_sv, rs2)//if rs(init_ind) after simp is also an alts, we remove the R(....L(v)) outside v
+ case r => target_sv
}
val flat_res = flats(rs_simp)
- val vs_for_coating = if(isend(vs._2, rs_simp, init_ind)||flat_res.length == 1) vs_kernel else Left(vs_kernel)
- val r_s = rs_simp(init_ind)//or vs._1
- val shift = flats_vsimp(rs_simp, init_ind) + find_pos_aux(vs._2, rs_simp(init_ind))
- val new_ind = init_ind + shift
- val vf = coat(vs_for_coating, new_ind)
- //flats2 returns a list of regex and a single v
- //now |- vf: ALTS(bs1, flat_res)
- //phase 2 transformation so that aalts(bs1, rsf) => aalts(bs, rsdb) and vf => vdb
+ val flats_shift1 = right_shift(rs_simp, init_ind)
+ val flats_base = find_pos_aux(target_sv, target_sr)
+ val flats_shift = flats_shift1 + flats_base//right_shift used to be called flats_vsimp
+ val new_ind = init_ind + flats_shift
+
val dist_res = distinctBy(flat_res, erase)
val front_part = distinctBy(flat_res.slice(0, new_ind + 1), erase)
- //val size_reduction = new_ind + 1 - front_part.length
+
val vdb = if(dist_res.length == front_part.length )//that means the regex we are interested in is at the end of the list
{
- coat(vs_kernel, front_part.length - 1)
+ coat(target_vs_kernel, front_part.length - 1)
}
else{
- coat(Left(vs_kernel), front_part.length - 1)
+ coat(Left(target_vs_kernel), front_part.length - 1)
}
- //println(vdb)
- //we don't need to transform vdb as this phase will not make enough changes to the regex to affect value.
- //the above statement needs verification. but can be left as is now.
dist_res match {
case Nil => (AZERO, undefined)
case s :: Nil => (fuse(bs1, s), vdb)
@@ -623,6 +795,13 @@
//case ASTAR(bs, r) => ASTAR(bs, bsimp(r))
case (r, v) => (r, v)
}
+ //the below are all residuals from the bsimp2 function
+ //val vs_for_coating = if(isend(vs._2, rs_simp, init_ind)||flat_res.length == 1) vs_kernel else Left(vs_kernel)
+ //val vf = coat(vs_for_coating, new_ind)
+ //flats2 returns a list of regex and a single v
+ //now |- vf: ALTS(bs1, flat_res)
+ //phase 2 transformation so that aalts(bs1, rsf) => aalts(bs, rsdb) and vf => vdb
+ //val size_reduction = new_ind + 1 - front_part.length
def vsimp(r: ARexp, v: Val): Val = bsimp2(r, v)._2
/*This version was first intended for returning a function however a value would be simpler.
def bsimp2(r: ARexp, v: Val): (ARexp, Val => Val) = (r,v) match{
@@ -643,7 +822,7 @@
val vs_for_coating = if(isend(vs, rs_simp, init_ind)) vs_kernel else Left(vs_kernel)
val r_s = rs_simp[init_ind]
- val shift = flats_vsimp(vs, rs_simp, init_ind) + find_pos(vs, rs_simp[init_ind])
+ val shift = right_shift(vs, rs_simp, init_ind) + find_pos(vs, rs_simp[init_ind])
val vf = coat(vs_for_coating, shift + init_ind)
val flat_res = flats(rs_simp)//flats2 returns a list of regex and a single v
@@ -724,16 +903,16 @@
}
*/
// main unsimplified lexing function (produces a value)
- def blex(r: ARexp, s: List[Char]) : Bits = s match {
+ def blex(s: List[Char], r: ARexp) : Bits = s match {
case Nil => if (bnullable(r)) mkepsBC(r) else throw new Exception("Not matched")
case c::cs => {
val der_res = bder(c,r)
- blex(der_res, cs)
+ blex(cs, der_res)
}
}
- def bpre_lexing(r: Rexp, s: String) = blex(internalise(r), s.toList)
- def blexing(r: Rexp, s: String) : Val = decode(r, blex(internalise(r), s.toList))
+ def bpre_lexing(r: Rexp, s: String) = blex( s.toList, internalise(r) )
+ def blexing(s: String, r: Rexp) : Val = decode(r, blex( s.toList, internalise(r) ) )
var bder_time = 0L
var bsimp_time = 0L