lexing: changeset 500:4d9eecfc936a

--- a/ChengsongTanPhdThesis/Chapters/Chapter1.tex	Sun May 01 11:50:46 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter1.tex	Mon May 02 00:23:39 2022 +0100
@@ -13,13 +13,19 @@
 \newcommand{\file}[1]{\texttt{\bfseries#1}}
 \newcommand{\option}[1]{\texttt{\itshape#1}}
 
-
+\newcommand{\bderssimp}[2]{{#1} \backslash_{bsimp} {#2}}
 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
 \newcommand{\ZERO}{\mbox{\bf 0}}
 \newcommand{\ONE}{\mbox{\bf 1}}
 \def\lexer{\mathit{lexer}}
 \def\mkeps{\mathit{mkeps}}
 
+\def\AZERO{\textit{AZERO}}
+\def\AONE{\textit{AONE}}
+\def\ACHAR{\textit{ACHAR}}
+\def\ASEQ{\textit{ASEQ}}
+\def\AALTS{\textit{AALTS}}
+\def\ASTAR{\textit{ASTAR}}
 \def\DFA{\textit{DFA}}
 \def\bmkeps{\textit{bmkeps}}
 \def\retrieve{\textit{retrieve}}
@@ -37,11 +43,25 @@
 \def\Z{\mathit{Z}}
 \def\S{\mathit{S}}
 \def\rup{r^\uparrow}
+%\def\bderssimp{\mathit{bders}\_\mathit{simp}}
+\def\distinctWith{\textit{distinctWith}}
+
 \def\simp{\mathit{simp}}
 \def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
 \def\map{\mathit{map}}
 \def\distinct{\mathit{distinct}}
 \def\blexersimp{\mathit{blexer}\_\mathit{simp}}
+\def\map{\textit{map}}
+\def\vsuf{\textit{vsuf}}
+\def\sflataux{\textit{sflat}\_\textit{aux}}
+\def\rrexp{\textit{rrexp}}
+\def\rsize{\textit{rsize}}
+\def\asize{\textit{asize}}
+\def\rerase{\textit{rerase}}
+\def\erase{\textit{erase}}
+\def\STAR{\textit{STAR}}
+\def\flts{\textit{flts}}
+
 %----------------------------------------------------------------------------------------
 %This part is about regular expressions, Brzozowski derivatives,
 %and a bit-coded lexing algorithm with proven correctness and time bounds.
@@ -50,13 +70,14 @@
 
 
 Regular expressions are widely used in computer science: 
-be it in IDEs with syntax hightlighting and auto completion, 
+be it in text-editors\parencite{atomEditor} with syntax hightlighting and auto completion, 
 command line tools like $\mathit{grep}$ that facilitates easy 
 text processing , network intrusion
 detection systems that rejects suspicious traffic, or compiler
-front ends--there is always an important phase of the
-task that involves matching a regular 
-exression with a string.
+front ends--the majority of the solutions to these tasks 
+involve lexing with regular 
+expressions.
+
 Given its usefulness and ubiquity, one would imagine that
 modern regular expression matching implementations
 are mature and fully-studied.

--- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex	Sun May 01 11:50:46 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter2.tex	Mon May 02 00:23:39 2022 +0100
@@ -4,6 +4,9 @@
 
 \label{ChapterX} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
 
+
+
+
 %----------------------------------------------------------------------------------------
 %	SECTION 1
 %----------------------------------------------------------------------------------------
@@ -53,17 +56,120 @@
 
 Nunc posuere quam at lectus tristique eu ultrices augue venenatis. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Aliquam erat volutpat. Vivamus sodales tortor eget quam adipiscing in vulputate ante ullamcorper. Sed eros ante, lacinia et sollicitudin et, aliquam sit amet augue. In hac habitasse platea dictumst.
 
+
+
+
+
+
+
+
+
 %-----------------------------------
-%	SUBSECTION 2
+%	SECTION 2
 %-----------------------------------
 
-\subsection{Subsection 2}
-Morbi rutrum odio eget arcu adipiscing sodales. Aenean et purus a est pulvinar pellentesque. Cras in elit neque, quis varius elit. Phasellus fringilla, nibh eu tempus venenatis, dolor elit posuere quam, quis adipiscing urna leo nec orci. Sed nec nulla auctor odio aliquet consequat. Ut nec nulla in ante ullamcorper aliquam at sed dolor. Phasellus fermentum magna in augue gravida cursus. Cras sed pretium lorem. Pellentesque eget ornare odio. Proin accumsan, massa viverra cursus pharetra, ipsum nisi lobortis velit, a malesuada dolor lorem eu neque.
+\section{Finiteness Property}
+In this section let us sketch our argument for why the size of the simplified
+derivatives with the aggressive simplification function can be finitely bounded.
+For convenience, we use a new datatype which we call $\textit{rrexp}$ to denote
+the difference between it and annotated regular expressions. 
+For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, but keeps
+everything else intact.
+It is similar to annotated regular expression being $\erase$ed, but with all its structure being intact
+(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
+$\AALTS$.
+We denote the operation of erasing the bits and turning an annotated regular expression 
+into an $\rrexp$ as $\rerase$.
+%TODO: definition of rerase
+That we can bound the size of annotated regular expressions by 
+$\rrexp$ is that the bitcodes grow linearly w.r.t the input string, and don't contribute to the structural size of 
+an annotated regular expression:
+$\rsize (\rerase a) = \asize a$
+
+ Suppose
+we have a size function for bitcoded regular expressions, written
+$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
+(we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$
+there exists a bound $N$
+such that 
+
+\begin{center}
+$\forall s. \; \llbracket{\bderssimp r s}\rrbracket \leq N$
+\end{center}
+
+\noindent
+We prove this by induction on $r$. The base cases for $\AZERO$,
+$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
+for sequences of the form $\ASEQ \textit{bs} r_1 r_2$. In this case our induction
+hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp(r_1, s) \rrbracket \leq N_1$ and
+$\exists N_2. \forall s. \; \llbracket \bderssimp(r_2, s) \rrbracket \leq N_2$. We can reason as follows
+%
+\begin{center}
+\begin{tabular}{lcll}
+& & $ \llbracket \bderssimp( (\ASEQ\, \textit{bs} \, r_1 \,r_2),  s) \rrbracket$\\
+& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;((ASEQ [] (\bderssimp r_1 s) r_2) ::
+    [\bderssimp r_2 s' \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
+& $\leq$ &
+    $\llbracket\textit{\distinctWith}\,((\ASEQ [] (\bderssimp r_1 s) r_2$) ::
+    [$\bderssimp r_2 s' \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
+& $\leq$ & $\llbracket\ASEQ [] (\bderssimp r_1 s) r_2\rrbracket +
+             \llbracket\textit{distinctWith}\,[\bderssimp r_2 s' \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
+& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
+      \llbracket \distinctWith\,[ \bderssimp r_2 s' \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
+& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
+\end{tabular}
+\end{center}
+
+% tell Chengsong about Indian paper of closed forms of derivatives
+
+\noindent
+where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp r_1 s'$ is nullable ($s'$ being a suffix of $s$).
+The reason why we could write the derivatives of a sequence this way is described in section 2.
+The term (2) is used to control (1) since it we know that you can obtain an overall
+smaller regex list
+by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
+Section 3 is dedicated to its proof.
+In (3) we know that  $\llbracket\ASEQ [] (\bderssimp r_1 s) r_2\rrbracket$ is 
+bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
+than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
+for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
+We reason similarly for  $\STAR$.\medskip
+
+\noindent
+Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
+far from the actual bound we can expect. We can do better than this, but this does not improve
+the finiteness property we are proving. If we are interested in a polynomial bound,
+one would hope to obtain a similar tight bound as for partial
+derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
+ $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
+partial derivatives). Unfortunately to obtain the exact same bound would mean
+we need to introduce simplifications, such as
+ $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
+which exist for partial derivatives. However, if we introduce them in our
+setting we would lose the POSIX property of our calculated values. We leave better
+bounds for future work.\\[-6.5mm]
+
 
 %----------------------------------------------------------------------------------------
 %	SECTION 2
 %----------------------------------------------------------------------------------------
 
-\section{Main Section 2}
+\section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
+There is a nice property about the compound regular expression 
+$r_1 \cdot r_2$ in general,
+that the derivatives of it against a string $s$ can be described by
+the derivatives w.r.t $r_1$ and $r_2$ over substrings of $s$:
+\begin{lemma}
+$\textit{sflat}\_{aux} ( (r_1 \cdot r_2) \backslash s) = (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf s r_1))$
+\end{lemma}
+
+
+%----------------------------------------------------------------------------------------
+%	SECTION 3
+%----------------------------------------------------------------------------------------
+
+\section{interaction between $\distinctWith$ and $\flts$}
+Note that it is not immediately obvious that 
+$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket   \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket  $
 
 Sed ullamcorper quam eu nisl interdum at interdum enim egestas. Aliquam placerat justo sed lectus lobortis ut porta nisl porttitor. Vestibulum mi dolor, lacinia molestie gravida at, tempus vitae ligula. Donec eget quam sapien, in viverra eros. Donec pellentesque justo a massa fringilla non vestibulum metus vestibulum. Vestibulum in orci quis felis tempor lacinia. Vivamus ornare ultrices facilisis. Ut hendrerit volutpat vulputate. Morbi condimentum venenatis augue, id porta ipsum vulputate in. Curabitur luctus tempus justo. Vestibulum risus lectus, adipiscing nec condimentum quis, condimentum nec nisl. Aliquam dictum sagittis velit sed iaculis. Morbi tristique augue sit amet nulla pulvinar id facilisis ligula mollis. Nam elit libero, tincidunt ut aliquam at, molestie in quam. Aenean rhoncus vehicula hendrerit.

--- a/ChengsongTanPhdThesis/example.bib	Sun May 01 11:50:46 2022 +0100
+++ b/ChengsongTanPhdThesis/example.bib	Mon May 02 00:23:39 2022 +0100
@@ -135,7 +135,13 @@
 	url = {https://regex101.com/},
 	keywords = {regex tester debugger}
 }
-
+@misc{atomEditor,
+	title = {Atom Editor},
+	author = {Dunno},
+	year = {2022},
+	url = {https://atom.io},
+	keywords = {text editor}
+}
 @techreport{grathwohl2014crash,
   title={A Crash-Course in Regular Expression Parsing and Regular Expressions as Types},
   author={Grathwohl, Niels Bj{\o}rn Bugge and Henglein, Fritz and Rasmussen, Ulrik Terp},

--- a/ChengsongTanPhdThesis/main.tex	Sun May 01 11:50:46 2022 +0100
+++ b/ChengsongTanPhdThesis/main.tex	Mon May 02 00:23:39 2022 +0100
@@ -44,6 +44,7 @@
 \usepackage{mathpazo} % Use the Palatino font by default
 \usepackage{hyperref}
 \usepackage[backend=bibtex,style=authoryear,natbib=true]{biblatex} % Use the bibtex backend with the authoryear citation style (which resembles APA)
+\usepackage{stmaryrd}
 
 \addbibresource{example.bib} % The filename of the bibliography
 
@@ -253,12 +254,25 @@
 \textbf{LAH} & \textbf{L}ist \textbf{A}bbreviations \textbf{H}ere\\
 \textbf{WSF} & \textbf{W}hat (it) \textbf{S}tands \textbf{F}or\\
 
+
+\newtheorem{theorem}{Theorem}
+\newtheorem{lemma}{Lemma}
+
+
+
+\newcommand{\bderssimp}[2]{{#1} \backslash_{bsimp} {#2}}
 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
 \newcommand{\ZERO}{\mbox{\bf 0}}
 \newcommand{\ONE}{\mbox{\bf 1}}
 \def\lexer{\mathit{lexer}}
 \def\mkeps{\mathit{mkeps}}
 
+\def\AZERO{\textit{AZERO}}
+\def\AONE{\textit{AONE}}
+\def\ACHAR{\textit{ACHAR}}
+\def\ASEQ{\textit{ASEQ}}
+\def\AALTS{\textit{AALTS}}
+\def\ASTAR{\textit{ASTAR}}
 \def\DFA{\textit{DFA}}
 \def\bmkeps{\textit{bmkeps}}
 \def\retrieve{\textit{retrieve}}
@@ -276,6 +290,24 @@
 \def\Z{\mathit{Z}}
 \def\S{\mathit{S}}
 \def\rup{r^\uparrow}
+%\def\bderssimp{\mathit{bders}\_\mathit{simp}}
+\def\distinctWith{\textit{distinctWith}}
+
+\def\simp{\mathit{simp}}
+\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
+\def\map{\mathit{map}}
+\def\distinct{\mathit{distinct}}
+\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
+\def\map{\textit{map}}
+\def\vsuf{\textit{vsuf}}
+\def\sflataux{\textit{sflat}\_\textit{aux}}
+\def\rrexp{\textit{rrexp}}
+\def\rsize{\textit{rsize}}
+\def\asize{\textit{asize}}
+\def\rerase{\textit{rerase}}
+\def\erase{\textit{erase}}
+\def\STAR{\textit{STAR}}
+\def\flts{\textit{flts}}
 
 \end{abbreviations}

--- a/thys2/blexer2.sc	Sun May 01 11:50:46 2022 +0100
+++ b/thys2/blexer2.sc	Mon May 02 00:23:39 2022 +0100
@@ -137,7 +137,7 @@
   if(len <= 0) single(Nil)
   else{
     for { 
-      c <- chars_range('a', 'd')
+      c <- chars_range('a', 'c')
       tail <- char_list(len - 1)
     } yield c :: tail
   }
@@ -492,17 +492,18 @@
     case AALTS(bs1, rs) => {
         // println("alts case")
           val rs_simp = rs.map(strongBsimp5(_))
-
           val flat_res = flats(rs_simp)
-
-          val dist_res = distinctBy5(flat_res)//distinctBy(flat_res, erase)
-
-          dist_res match {
+          var dist_res = distinctBy5(flat_res)//distinctBy(flat_res, erase)
+          var dist2_res = distinctBy5(dist_res)
+          while(dist_res != dist2_res){
+            dist_res = dist2_res
+            dist2_res = distinctBy5(dist_res)
+          }
+          (dist2_res) match {
             case Nil => AZERO
             case s :: Nil => fuse(bs1, s)
             case rs => AALTS(bs1, rs)  
           }
-        
     }
     case r => r
   }
@@ -604,20 +605,34 @@
 def L(acc: Rexp, rs: List[Rexp]) : Rexp = rs match {
   case Nil => acc
   case r :: rs1 => 
-    if(acc == ONE) 
-      L(r, rs1) 
-    else
+    // if(acc == ONE) 
+    //   L(r, rs1) 
+    // else
       L(SEQ(acc, r), rs1)
 }
 
-def pAKC(rs: List[Rexp], r: ARexp, ctx: List[Rexp]) : ARexp = {
-  if (breakIntoTerms(L(erase(r), ctx)).map(oneSimp).forall(rs.contains)) {//rs.flatMap(breakIntoTerms
+def rprint(r: Rexp) : Unit = println(shortRexpOutput(r))
+def rsprint(rs: List[Rexp]) = rs.foreach(r => println(shortRexpOutput(r)))
+def aprint(a: ARexp) = println(shortRexpOutput(erase(a)))
+def asprint(as: List[ARexp]) = as.foreach(a => println(shortRexpOutput(erase(a))))
+
+def pAKC(acc: List[Rexp], r: ARexp, ctx: List[Rexp]) : ARexp = {
+  // println("pakc")
+  // println(shortRexpOutput(erase(r)))
+  // println("acc")
+  // rsprint(acc)
+  // println("ctx---------")
+  // rsprint(ctx)
+  // println("ctx---------end")
+  // rsprint(breakIntoTerms(L(erase(r), ctx)).map(oneSimp))
+
+  if (breakIntoTerms(L(erase(r), ctx)).map(oneSimp).forall(acc.contains)) {//acc.flatMap(breakIntoTerms
     AZERO
   }
   else{
     r match {
       case ASEQ(bs, r1, r2) => 
-      (pAKC(rs, r1, erase(r2) :: ctx)) match{
+      (pAKC(acc, r1, erase(r2) :: ctx)) match{
         case AZERO => 
           AZERO
         case AONE(bs1) => 
@@ -631,9 +646,9 @@
         // ctx.foreach(r => println(shortRexpOutput(r)))
         // println(s"rs0 is ")
         // rs0.foreach(r => println(shortRexpOutput(erase(r))))
-        // println(s"rs is ")
-        // rs.foreach(r => println(shortRexpOutput(r)))
-        rs0.map(r => pAKC(rs, r, ctx)).filter(_ != AZERO) match {
+        // println(s"acc is ")
+        // acc.foreach(r => println(shortRexpOutput(r)))
+        rs0.map(r => pAKC(acc, r, ctx)).filter(_ != AZERO) match {
           case Nil => 
             // println("after pruning Nil")
             AZERO
@@ -1091,22 +1106,54 @@
   //   })
   //   !boolList.exists(b => b == false)
   // }
-  //test(rexp(3), 10000000) { (r: Rexp) => 
-  test(single(STAR(SEQ(SEQ(ALTS(CHAR('b'),STAR(CHAR('b'))),CHAR('b')),ALTS(ALTS(STAR(CHAR('c')),SEQ(CHAR('a'),CHAR('c'))),ALTS(ZERO,ONE))))), 1) {(r: Rexp) =>
+  //*** example where bdStrong5 has a smaller size than bdStrong
+  // test(single(STAR(SEQ(ALTS(SEQ(STAR(CHAR('a')),ALTS(ALTS(ONE,ZERO),SEQ(ONE,ONE))),CHAR('a')),ONE))), 1) { (r: Rexp) => 
+  //*** depth 5 example bdStrong5 larger size than bdStrong
+  // test(single(STAR(SEQ(SEQ(ALTS(CHAR('b'),STAR(CHAR('b'))),CHAR('b')),(ALTS(STAR(CHAR('c')), ONE))))), 1) {(r: Rexp) =>
+ 
+ 
+ 
+  //sanity check from Christian's request
+  // val r = ("a" | "ab") ~ ("bc" | "c")
+  // val a = internalise(r)
+  // val aval = blexing_simp(r, "abc")
+  // println(aval)
+
+  //sample counterexample:(depth 7)
+  //STAR(SEQ(ALTS(STAR(STAR(STAR(STAR(CHAR(c))))),ALTS(CHAR(c),CHAR(b))),ALTS(ZERO,SEQ(ALTS(ALTS(STAR(CHAR(c)),SEQ(CHAR(b),CHAR(a))),CHAR(c)),STAR(ALTS(ALTS(ONE,CHAR(a)),STAR(CHAR(c))))))))
+  //(depth5)
+  //STAR(SEQ(ALTS(ALTS(STAR(CHAR(b)),SEQ(ONE,CHAR(b))),SEQ(STAR(CHAR(a)),CHAR(b))),ALTS(ZERO,ALTS(STAR(CHAR(b)),STAR(CHAR(a))))))
+  test(rexp(4), 10000000) { (r: Rexp) => 
+  // ALTS(SEQ(SEQ(ONE,CHAR('a')),STAR(CHAR('a'))),SEQ(ALTS(CHAR('c'),ONE),STAR(ZERO))))))), 1) { (r: Rexp) => 
     val ss = stringsFromRexp(r)
     val boolList = ss.map(s => {
       val bdStrong = bdersStrong(s.toList, internalise(r))
       val bdStrong5 = bdersStrong5(s.toList, internalise(r))
+      val bdFurther5 = strongBsimp5(bdStrong5)
+      val sizeFurther5 = asize(bdFurther5)
+      val pdersSet = pderUNIV(r)
       val size5 = asize(bdStrong5)
       val size0 = asize(bdStrong)
-      println(s)
-      println(shortRexpOutput(erase(bdStrong5)))
-      println(shortRexpOutput(erase(bdStrong)))
-      println(size5, size0)
       // println(s)
-      size5 <= size0
+      // println("pdersSet size")
+      // println(pdersSet.size)
+      // pdersSet.foreach(r => println(shortRexpOutput(r)))
+      // println("after bdStrong5")
+      
+      // println(shortRexpOutput(erase(bdStrong5)))
+      // println(breakIntoTerms(erase(bdStrong5)).size)
+      
+      // println("after bdStrong")
+      // println(shortRexpOutput(erase(bdStrong)))
+      // println(breakIntoTerms(erase(bdStrong)).size)
+      // println(size5, size0, sizeFurther5)
+      // aprint(strongBsimp5(bdStrong))
+      // println(asize(strongBsimp5(bdStrong5)))
+      // println(s)
+      size5 <= size0 
     })
     // println(boolList)
+    //!boolList.exists(b => b == false)
     !boolList.exists(b => b == false)
   }

author	Chengsong
	Mon, 02 May 2022 00:23:39 +0100
changeset 500	4d9eecfc936a
parent 498	ab626b60ee64
child 501	a6d72af04096

ChengsongTanPhdThesis/Chapters/Chapter1.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/Chapters/Chapter2.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/example.bib		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/main.tex		file \| annotate \| diff \| comparison \| revisions
thys2/blexer2.sc		file \| annotate \| diff \| comparison \| revisions