lexing: changeset 539:7cf9f17aa179

--- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex	Thu Jun 09 12:57:53 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex	Thu Jun 09 22:07:44 2022 +0100
@@ -118,8 +118,37 @@
 
 
 
-
-
+\section{ $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
+We 
+Unlike in \ref{fig:BetterWaterloo}, the size with simplification now stay nicely constant with current
+simplification rules (we put the graphs together to show the contrast)
+\begin{tabular}{ll}
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n$},
+    ylabel={derivative size},
+        width=7cm,
+    height=4cm, 
+    legend entries={Lexer with $\bsimp$},  
+    legend pos=  south east,
+    legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {BitcodedLexer.data};
+\end{axis}
+\end{tikzpicture} %\label{fig:BitcodedLexer}
+&
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n$},
+    ylabel={derivative size},
+    width = 7cm,
+    height = 4cm,
+    legend entries={Lexer without $\bsimp$},  
+    legend pos=  north west,
+    legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
+\end{axis}
+\end{tikzpicture} 
+\end{tabular}

--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Thu Jun 09 12:57:53 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Thu Jun 09 22:07:44 2022 +0100
@@ -289,16 +289,14 @@
 \end{lemma}
 
 \noindent
-
-
 The main property of the derivative operation
 that enables us to reason about the correctness of
 an algorithm using derivatives is 
 
-\begin{center}
+\begin{lemma}\label{derStepwise}
 $c\!::\!s \in L(r)$ holds
 if and only if $s \in L(r\backslash c)$.
-\end{center}
+\end{lemma}
 
 \noindent
 We can generalise the derivative operation shown above for single characters
@@ -314,8 +312,7 @@
 \noindent
 When there is no ambiguity we will use  $\backslash$ to denote
 string derivatives for brevity.
-
-and then define Brzozowski's  regular-expression matching algorithm as:
+Brzozowski's  regular-expression matcher algorithm can then be described as:
 
 \begin{definition}
 $\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)$
@@ -332,16 +329,106 @@
 \end{equation}
 
 \noindent
+ It can  be
+relatively  easily shown that this matcher is correct:
+\begin{lemma}
+$\textit{match} \; s\; r  = \textit{true} \Longleftrightarrow s \in L(r)$
+\end{lemma}
+\begin{proof}
+By the stepwise property of $\backslash$ (\ref{derStepwise})
+\end{proof}
+\noindent
+If we implement the above algorithm naively, however,
+the algorithm can be excruciatingly slow.
 
 
+\begin{figure}
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n$},
+    ylabel={time in secs},
+    ymode = log,
+    legend entries={Naive Matcher},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
+\end{axis}
+\end{tikzpicture} 
+\caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
+\end{figure}
+   
+\noindent
+For this we need to introduce certain 
+rewrite rules for the intermediate results,
+such as $r + r \rightarrow r$,
+and make sure those rules do not change the 
+language of the regular expression.
+We have a simplification function (that is as simple as possible
+while having much power on making a regex simpler):
+\begin{verbatim}
+def simp(r: Rexp) : Rexp = r match {
+  case SEQ(r1, r2) => 
+    (simp(r1), simp(r2)) match {
+      case (ZERO, _) => ZERO
+      case (_, ZERO) => ZERO
+      case (ONE, r2s) => r2s
+      case (r1s, ONE) => r1s
+      case (r1s, r2s) => SEQ(r1s, r2s)
+    }
+  case ALTS(r1, r2) => {
+    (simp(r1), simp(r2)) match {
+      case (ZERO, r2s) => r2s
+      case (r1s, ZERO) => r1s
+      case (r1s, r2s) =>
+        if(r1s == r2s) r1s else ALTS(r1s, r2s)
+    }
+  }
+  case r => r
+}
+\end{verbatim}
+If we repeatedly incorporate these 
+rules during the matching algorithm, 
+we have a lexer with simplification:
+\begin{verbatim}
+def ders_simp(s: List[Char], r: Rexp) : Rexp = s match {
+  case Nil => simp(r)
+  case c :: cs => ders_simp(cs, simp(der(c, r)))
+}
+
+def simp_matcher(s: String, r: Rexp) : Boolean = 
+  nullable(ders_simp(s.toList, r))
+
+\end{verbatim}
+\noindent
+After putting in those rules, the example of \ref{NaiveMatcher}
+is now very tame in the length of inputs:
+
+
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n$},
+    ylabel={time in secs},
+    ymode = log,
+    xmode = log,
+    legend entries={Matcher With Simp},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
+\end{axis}
+\end{tikzpicture} \label{fig:BetterMatcher}
+
+
+\noindent
+Note how the x-axis is in logarithmic scale.
 Building derivatives and then testing the existence
-of empty string in the resulting regular expression's language.
+of empty string in the resulting regular expression's language,
+and add simplification rules when necessary.
 So far, so good. But what if we want to 
 do lexing instead of just getting a YES/NO answer?
-Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
+\citeauthor{Sulzmann2014} first came up with a nice and 
 elegant (arguably as beautiful as the definition of the original derivative) solution for this.
 
-\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
+\section{Values and the Lexing Algorithm by Sulzmann and Lu}
 Here we present the hybrid phases of a regular expression lexing 
 algorithm using the function $\inj$, as given by Sulzmann and Lu.
 They first defined the datatypes for storing the 
@@ -430,6 +517,24 @@
 \begin{lemma}
 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
 \end{lemma}
+\begin{proof}
+By induction on $s$, $r$ and $v_1$. The induction principle is
+the \POSIX rules. Each case is proven by a combination of
+the induction rules for $\POSIX$ values and the inductive hypothesis.
+Probably the most cumbersome cases are the sequence and star with non-empty iterations.
+
+We give the reasoning about the sequence case as follows:
+When we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$, 
+we know that there could not be a longer string $r_1'$ such that $(s_1', r_1) \rightarrow v_1'$
+and $(s_2', r_2) \rightarrow v2'$ and $s_1' @s_2' = s$ all hold.
+For possible values of $s_1'$ and $s_2'$ where $s_1'$ is shorter, they cannot
+possibly form a $\POSIX$ for $s$.
+If we have some other values $v_1'$ and $v_2'$ such that 
+$(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$,
+Then by induction hypothesis $v_1' = v_1$ and $v_2'= v_2$, 
+which means this "different" $\POSIX$ value $\Seq(v_1', v_2')$
+is the same as $\Seq(v_1, v_2)$. 
+\end{proof}
 Now we know what a $\POSIX$ value is, the problem is how do we achieve 
 such a value in a lexing algorithm, using derivatives?
 
@@ -437,7 +542,7 @@
 
 The contribution of Sulzmann and Lu is an extension of Brzozowski's
 algorithm by a second phase (the first phase being building successive
-derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value 
+derivatives---see \ref{graph:successive_ders}). In this second phase, a POSIX value 
 is generated if the regular expression matches the string.
 Two functions are involved: $\inj$ and $\mkeps$.
 The function $\mkeps$ constructs a value from the last
@@ -534,7 +639,7 @@
 back to the first value and form a new value with this latest iteration
 being added to the previous list of iterations, all under the $\Stars$
 top level.
-The POSIX value is maintained throughout the process.
+The POSIX value is maintained throughout the process:
 \begin{lemma}
 $(r \backslash c, s) \rightarrow v \implies (r, c :: s) \rightarrow (\inj r \; c\; v)$
 \end{lemma}\label{injPosix}
@@ -544,11 +649,11 @@
 and taking into consideration the possibility of a non-match,
 we have a lexer with the following recursive definition:
 \begin{center}
-\begin{tabular}{lcr}
+\begin{tabular}{lcl}
 $\lexer \; r \; [] $ & $=$ & $\textit{if} (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
 $\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer (r\backslash c) s) \textit{of} $\\
-& & $\None \implies \None$\\
-& & $\mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
+& & $\quad \None \implies \None$\\
+& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
 \end{tabular}
 \end{center}
  \noindent
@@ -568,17 +673,14 @@
  by lemma \ref{injPosix}.
  \end{proof}
 
-For convenience, we shall employ the following notations: the regular
+
+Pictorially, the algorithm is as follows (
+For convenience, we employ the following notations: the regular
 expression we start with is $r_0$, and the given string $s$ is composed
-of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
-left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
-to the characters $c_0$, $c_1$  until we exhaust the string and obtain
-the derivative $r_n$. We test whether this derivative is
-$\textit{nullable}$ or not. If not, we know the string does not match
-$r$, and no value needs to be generated. If yes, we start building the
-values incrementally by \emph{injecting} back the characters into the
-earlier values $v_n, \ldots, v_0$.
-Pictorially, the algorithm is as follows:
+of characters $c_0 c_1 \ldots c_{n-1}$. The
+values built incrementally by \emph{injecting} back the characters into the
+earlier values are $v_n, \ldots, v_0$. Corresponding values and characters
+are always in the same subscript, i.e. $\vdash v_i : r_i$):
 
 \begin{ceqn}
 \begin{equation}\label{graph:2}
@@ -589,74 +691,67 @@
 \end{equation}
 \end{ceqn}
 
-
 \noindent
- This is the second phase of the
-algorithm from right to left. For the first value $v_n$, we call the
-function $\textit{mkeps}$, which builds a POSIX lexical value
-for how the empty string has been matched by the (nullable) regular
-expression $r_n$. This function is defined as
-
-
-
-We have mentioned before that derivatives without simplification 
-can get clumsy, and this is true for values as well--they reflect
-the size of the regular expression by definition.
-
-One can introduce simplification on the regex and values but have to
+As we did earlier in this chapter on the matcher, one can 
+introduce simplification on the regex.
+However, now we need to do a backward phase and make sure
+the values align with the regular expressions.
+Therefore one has to
 be careful not to break the correctness, as the injection 
 function heavily relies on the structure of the regexes and values
 being correct and matching each other.
 It can be achieved by recording some extra rectification functions
 during the derivatives step, and applying these rectifications in 
 each run during the injection phase.
+
+\ChristianComment{Do I introduce the lexer with rectification here?}
 And we can prove that the POSIX value of how
 regular expressions match strings will not be affected---although it is much harder
 to establish. 
 Some initial results in this regard have been
 obtained in \cite{AusafDyckhoffUrban2016}. 
 
-
-
-%Brzozowski, after giving the derivatives and simplification,
-%did not explore lexing with simplification, or he may well be 
-%stuck on an efficient simplification with proof.
-%He went on to examine the use of derivatives together with 
-%automaton, and did not try lexing using products.
-
-We want to get rid of the complex and fragile rectification of values.
-Can we not create those intermediate values $v_1,\ldots v_n$,
-and get the lexing information that should be already there while
-doing derivatives in one pass, without a second injection phase?
-In the meantime, can we make sure that simplifications
-are easily handled without breaking the correctness of the algorithm?
+However, even with these simplification rules, we could still end up in
+trouble, when we encounter cases that require more involved and aggressive
+simplifications.
+\section{A Case Requring More Aggressive Simplification}
+For example, when starting with the regular
+expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)
+w.r.t.~the character $a$, one obtains a derivative regular expression
+with more than 9000 nodes (when viewed as a tree)
+even with simplification.
+\begin{figure}
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n$},
+    ylabel={size},
+    legend entries={Naive Matcher},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
+\end{axis}
+\end{tikzpicture} 
+\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
+\end{figure}\label{fig:BetterWaterloo}
+   
+That is because our lexing algorithm currently keeps a lot of 
+"useless values that will never not be used. 
+These different ways of matching will grow exponentially with the string length.
 
-Sulzmann and Lu solved this problem by
-introducing additional information to the 
-regular expressions called \emph{bitcodes}.
-
-
-
-
-
-With the formally-specified rules for what a POSIX matching is,
-they proved in Isabelle/HOL that the algorithm gives correct results.
-But having a correct result is still not enough, 
-we want at least some degree of $\mathbf{efficiency}$.
-
-
-A pair of regular expression and string can have multiple lexical values. 
-Take the example where $r= (a^*\cdot a^*)^*$ and the string 
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-If we do not allow any empty iterations in its lexical values,
-there will be $n - 1$ "splitting points" on $s$ we can choose to 
+For $r= (a^*\cdot a^*)^*$ and  
+$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
+if we do not allow any empty iterations in its lexical values,
+there will be $n - 1$ "splitting points" on $s$ we can independently choose to 
 split or not so that each sub-string
-segmented by those chosen splitting points will form different iterations:
+segmented by those chosen splitting points will form different iterations.
+For example when $n=4$,
 \begin{center}
 \begin{tabular}{lcr}
-$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$\\
-$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$\\
-$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$\\
+$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration, this iteration will be divided between the inner sequence $a^*\cdot a^*$)\\
+$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$ (two iterations)\\
+$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$ (two iterations)\\
+$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
+$a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, a} \,v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
  & $\textit{etc}.$ &
  \end{tabular}
 \end{center}
@@ -670,112 +765,27 @@
 have an exponential runtime on ambiguous regular expressions.
 With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
 of a match. This means Sulzmann and Lu's injection-based algorithm 
-will be exponential by nature.
-Somehow one has to decide which lexical value to keep and
-output in a lexing algorithm.
+ exponential by nature.
+Somehow one has to make sure which
+ lexical values are $\POSIX$ and need to be kept in a lexing algorithm.
 
 
  For example, the above $r= (a^*\cdot a^*)^*$  and 
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
-The output of an algorithm we want would be a POSIX matching
-encoded as a value.
+We want to keep this value only, and remove all the regular expression subparts
+not corresponding to this value during lexing.
+To do this, a two-phase algorithm with rectification is a bit too fragile.
+Can we not create those intermediate values $v_1,\ldots v_n$,
+and get the lexing information that should be already there while
+doing derivatives in one pass, without a second injection phase?
+In the meantime, can we make sure that simplifications
+are easily handled without breaking the correctness of the algorithm?
 
+Sulzmann and Lu solved this problem by
+introducing additional information to the 
+regular expressions called \emph{bitcodes}.
 
 
 
 
-%kind of redundant material
-
-
-where we start with  a regular expression  $r_0$, build successive
-derivatives until we exhaust the string and then use \textit{nullable}
-to test whether the result can match the empty string. It can  be
-relatively  easily shown that this matcher is correct  (that is given
-an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
-
-Beautiful and simple definition.
-
-If we implement the above algorithm naively, however,
-the algorithm can be excruciatingly slow. 
-
-
-\begin{figure}
-\centering
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=10000,
-    ytick={0,1000,...,10000},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={JavaScript},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
-\end{axis}
-\end{tikzpicture}\\
-\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
-           of the form $\underbrace{aa..a}_{n}$.}
-\end{tabular}    
-\caption{EightThousandNodes} \label{fig:EightThousandNodes}
-\end{figure}
-
-
-(8000 node data to be added here)
-For example, when starting with the regular
-expression $(a + aa)^*$ and building a few successive derivatives (around 10)
-w.r.t.~the character $a$, one obtains a derivative regular expression
-with more than 8000 nodes (when viewed as a tree)\ref{EightThousandNodes}.
-The reason why $(a + aa) ^*$ explodes so drastically is that without
-pruning, the algorithm will keep records of all possible ways of matching:
-\begin{center}
-$(a + aa) ^* \backslash [aa] = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
-\end{center}
-
-\noindent
-Each of the above alternative branches correspond to the match 
-$aa $, $a \quad a$ and $a \quad a \cdot (a)$(incomplete).
-These different ways of matching will grow exponentially with the string length,
-and without simplifications that throw away some of these very similar matchings,
-it is no surprise that these expressions grow so quickly.
-Operations like
-$\backslash$ and $\nullable$ need to traverse such trees and
-consequently the bigger the size of the derivative the slower the
-algorithm. 
-
-Brzozowski was quick in finding that during this process a lot useless
-$\ONE$s and $\ZERO$s are generated and therefore not optimal.
-He also introduced some "similarity rules", such
-as $P+(Q+R) = (P+Q)+R$ to merge syntactically 
-different but language-equivalent sub-regexes to further decrease the size
-of the intermediate regexes. 
-
-More simplifications are possible, such as deleting duplicates
-and opening up nested alternatives to trigger even more simplifications.
-And suppose we apply simplification after each derivative step, and compose
-these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
-\textit{simp}(a \backslash c)$. Then we can build
-a matcher with simpler regular expressions.
-
-If we want the size of derivatives in the algorithm to
-stay even lower, we would need more aggressive simplifications.
-Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
-delete duplicates whenever possible. For example, the parentheses in
-$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
-\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
-example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
-$a^*+a+\ONE$.  These more aggressive simplification rules are for
- a very tight size bound, possibly as low
-  as that of the \emph{partial derivatives}\parencite{Antimirov1995}. 
-
-
-

--- a/thys2/blexer2.sc	Thu Jun 09 12:57:53 2022 +0100
+++ b/thys2/blexer2.sc	Thu Jun 09 22:07:44 2022 +0100
@@ -21,6 +21,7 @@
 case object ANYCHAR extends Rexp
 case class CHAR(c: Char) extends Rexp
 case class ALTS(r1: Rexp, r2: Rexp) extends Rexp 
+case class ALTSS(rs: List[Rexp]) extends Rexp
 case class SEQ(r1: Rexp, r2: Rexp) extends Rexp 
 case class STAR(r: Rexp) extends Rexp 
 case class RECD(x: String, r: Rexp) extends Rexp  
@@ -270,6 +271,7 @@
   case CHAR(_) => false
   case ANYCHAR => false
   case ALTS(r1, r2) => nullable(r1) || nullable(r2)
+  case ALTSS(rs) => rs.exists(nullable)
   case SEQ(r1, r2) => nullable(r1) && nullable(r2)
   case STAR(_) => true
   case RECD(_, r1) => nullable(r1)
@@ -284,6 +286,7 @@
   case CHAR(d) => if (c == d) ONE else ZERO
   case ANYCHAR => ONE 
   case ALTS(r1, r2) => ALTS(der(c, r1), der(c, r2))
+  case ALTSS(rs) => ALTSS(rs.map(der(c, _)))
   case SEQ(r1, r2) => 
     if (nullable(r1)) ALTS(SEQ(der(c, r1), r2), der(c, r2))
     else SEQ(der(c, r1), r2)
@@ -294,6 +297,84 @@
   case NOT(r) =>  NOT(der(c, r))
 }
 
+def ders(s: List[Char], r: Rexp) : Rexp = s match {
+  case Nil => r
+  case c :: cs => ders(cs, der(c, r))
+}
+
+def ders_simp(s: List[Char], r: Rexp) : Rexp = s match {
+  case Nil => simp(r)
+  case c :: cs => ders_simp(cs, simp(der(c, r)))
+}
+
+
+def simp2(r: Rexp) : Rexp = r match {
+  case SEQ(r1, r2) => 
+    (simp2(r1), simp2(r2)) match {
+      case (ZERO, _) => ZERO
+      case (_, ZERO) => ZERO
+      case (ONE, r2s) => r2s
+      case (r1s, ONE) => r1s
+      case (r1s, r2s) => 
+        SEQ(r1s, r2s)
+    }
+  case ALTS(r1, r2) => 
+    val r1r2s = fltsplain(simp2(r1) :: simp2(r2) :: Nil).distinct 
+    r1r2s match {
+      case Nil => ZERO
+      case r :: Nil => r
+      case r1 :: r2 :: rs => 
+        ALTSS(r1 :: r2 :: rs)
+    }
+  case ALTSS(rs) => 
+    // println(rs)
+    (fltsplain(rs.map(simp2))).distinct match {
+      case Nil => ZERO
+      case r :: Nil => r
+      case r1 :: r2 :: rs =>
+        ALTSS(r1 :: r2 :: rs)
+    }
+  case r => r
+}
+
+
+def simp(r: Rexp) : Rexp = r match {
+  case SEQ(r1, r2) => 
+    (simp(r1), simp(r2)) match {
+      case (ZERO, _) => ZERO
+      case (_, ZERO) => ZERO
+      case (ONE, r2s) => r2s
+      case (r1s, ONE) => r1s
+      case (r1s, r2s) => SEQ(r1s, r2s)
+    }
+  case ALTS(r1, r2) => {
+    (simp(r1), simp(r2)) match {
+      case (ZERO, r2s) => r2s
+      case (r1s, ZERO) => r1s
+      case (r1s, r2s) =>
+        if(r1s == r2s) r1s else ALTS(r1s, r2s)
+    }
+  }
+  case r => r
+}
+
+def fltsplain(rs: List[Rexp]) : List[Rexp] = rs match {
+  case Nil => Nil
+  case ZERO :: rs => fltsplain(rs)
+  case ALTS(r1, r2) :: rs => r1 :: r2 :: fltsplain(rs) 
+  case ALTSS(rs) :: rs1 => rs ::: fltsplain(rs1)
+  case r :: rs => r :: fltsplain(rs)
+}
+
+
+
+
+def matcher(s: String, r: Rexp) : Boolean = 
+  nullable(ders(s.toList, r))
+
+def simp_matcher(s: String, r: Rexp) : Boolean = 
+  nullable(ders_simp(s.toList, r))
+
 
 // extracts a string from a value
 def flatten(v: Val) : String = v match {
@@ -1258,6 +1339,7 @@
       println(asize(bders_simp(prog0.toList, internalise(STARREG))))
       // println(asize(bdersStrong7(prog0.toList, internalise(STARREG))))
     }
+    
   }
 }
 // small()
@@ -1269,8 +1351,36 @@
     println(asize(bders_simp(prog.toList, internalise(r))))
   }
 }
-aaa_star()
+// aaa_star()
+def naive_matcher() {
+  val r = STAR(STAR("a") ~ STAR("a"))
 
+  for(i <- 0 to 20) {
+    val s = "a" * i 
+    val t1 = System.nanoTime
+    matcher(s, r)
+    val duration = (System.nanoTime - t1) / 1e9d
+    print(i)
+    print(" ")
+    // print(duration)
+    // print(" ")
+    print(asize(bders_simp(s.toList, internalise(r))))
+    //print(size(ders_simp(s.toList, r)))
+    println()
+  }
+  // for(i <- 1 to 40000000 by 500000) {
+  //   val s = "a" * i
+  //   val t1 = System.nanoTime
+  //   val derssimp_result = ders_simp(s.toList, r)
+  //   val duration = (System.nanoTime - t1) / 1e9d
+  //   print(i)
+  //   print(" ")
+  //   print(duration)
+  //   println()
+  // }
+  
+}
+naive_matcher()
 def generator_test() {
 
   test(rexp(4), 1000000) { (r: Rexp) =>

author	Chengsong
	Thu, 09 Jun 2022 22:07:44 +0100
changeset 539	7cf9f17aa179
parent 538	8016a2480704
child 540	3a1fd5ea2484

ChengsongTanPhdThesis/Chapters/Bitcoded2.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/Chapters/Inj.tex		file \| annotate \| diff \| comparison \| revisions
thys2/blexer2.sc		file \| annotate \| diff \| comparison \| revisions