lexing: changeset 541:5bf9f94c02e1

--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Thu Jun 09 22:08:06 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Sun Jun 12 17:03:09 2022 +0100
@@ -10,29 +10,33 @@
 
 In this chapter, we define the basic notions 
 for regular languages and regular expressions.
+This is essentially a description in "English"
+of your formalisation in Isabelle/HOL.
 We also give the definition of what $\POSIX$ lexing means.
 
 \section{Basic Concepts}
-Usually in formal language theory there is an alphabet 
+Usually formal language theory starts with an alphabet 
 denoting a set of characters.
-Here we only use the datatype of characters from Isabelle,
-which roughly corresponds to the ASCII character.
-Then using the usual $[]$ notation for lists,
-we can define strings using chars:
+Here we just use the datatype of characters from Isabelle,
+which roughly corresponds to the ASCII characters.
+In what follows we shall leave the information about the alphabet
+implicit.
+Then using the usual bracket notation for lists,
+we can define strings made up of characters: 
 \begin{center}
 \begin{tabular}{lcl}
-$\textit{string}$ & $\dn$ & $[] | c  :: cs$\\
-& & $(c\; \text{has char type})$
+$\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
 \end{tabular}
 \end{center}
-And strings can be concatenated to form longer strings,
-in the same way as we concatenate two lists,
-which we denote as $@$. We omit the precise 
+Where $c$ is a variable ranging over characters.
+Strings can be concatenated to form longer strings in the same
+way as we concatenate two lists, which we write as @.
+We omit the precise 
 recursive definition here.
 We overload this concatenation operator for two sets of strings:
 \begin{center}
 \begin{tabular}{lcl}
-$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
+$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
 \end{tabular}
 \end{center}
 We also call the above \emph{language concatenation}.
@@ -41,11 +45,11 @@
 \begin{center}
 \begin{tabular}{lcl}
 $A^0 $ & $\dn$ & $\{ [] \}$\\
-$A^{n+1}$ & $\dn$ & $A^n @ A$
+$A^{n+1}$ & $\dn$ & $A @ A^n$
 \end{tabular}
 \end{center}
 The union of all the natural number powers of a language   
-is defined as the Kleene star operator:
+is usually defined as the Kleene star operator:
 \begin{center}
 \begin{tabular}{lcl}
  $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
@@ -65,28 +69,28 @@
 \inferrule{\\s_1 \in A \land \; s_2 \in A*}{s_1 @ s_2 \in A*}
 \end{mathpar}
 \end{center}
-
-We also define an operation of "chopping of" a character from
-a language, which we call $\Der$, meaning "Derivative for a language":
+\ChristianComment{Yes, used the inferrule command in mathpar}
+We also define an operation of "chopping off" a character from
+a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
 \end{tabular}
 \end{center}
 \noindent
-This can be generalised to "chopping off" a string from all strings within set $A$,
-with the help of the concatenation operator:
+This can be generalised to "chopping off" a string from all strings within set $A$, 
+namely:
 \begin{center}
 \begin{tabular}{lcl}
-$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
+$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
 \end{tabular}
 \end{center}
 \noindent
-which is essentially the left quotient $A \backslash L'$ of $A$ against 
-the singleton language $L' = \{w\}$
+which is essentially the left quotient $A \backslash L$ of $A$ against 
+the singleton language with $L = \{w\}$
 in formal language theory.
-For this dissertation the $\textit{Ders}$ definition with 
-a single string suffices.
+However for the purposes here, the $\textit{Ders}$ definition with 
+a single string is sufficient.
 
 With the  sequencing, Kleene star, and $\textit{Der}$ operator on languages,
 we have a  few properties of how the language derivative can be defined using 
@@ -510,7 +514,12 @@
 For instance, when lexing a code snippet 
 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
 as an identifier rather than a keyword.
-
+\ChristianComment{Do I also introduce lexical values $LV$ here?}
+We know that $\POSIX$ values are also part of the normal values:
+\begin{lemma}
+$(r, s) \rightarrow v \implies \vdash v: r$
+\end{lemma}
+\noindent
 The good property about a $\POSIX$ value is that 
 given the same regular expression $r$ and string $s$,
 one can always uniquely determine the $\POSIX$ value for it:
@@ -659,12 +668,12 @@
  \noindent
  The central property of the $\lexer$ is that it gives the correct result by
  $\POSIX$ standards:
- \begin{lemma}
+ \begin{theorem}
  \begin{tabular}{l}
- $s \in L(r) \Longleftrightarrow  (\exists v. \; r \; s = \Some(v) \land (r, \; s) \rightarrow v)$\\
- $s \notin L(r) \Longleftrightarrow (\lexer \; r\; s = \None)$
+ $\lexer \; r \; s = \Some(v) \Longleftrightarrow (r, \; s) \rightarrow v$\\
+ $\lexer \;r \; s = \None \Longleftrightarrow \neg(\exists v. (r, s) \rightarrow v)$
  \end{tabular}
- \end{lemma}
+ \end{theorem}
  
  
  \begin{proof}
@@ -674,7 +683,7 @@
  \end{proof}
 
 
-Pictorially, the algorithm is as follows (
+We now give a pictorial view of the algorithm (
 For convenience, we employ the following notations: the regular
 expression we start with is $r_0$, and the given string $s$ is composed
 of characters $c_0 c_1 \ldots c_{n-1}$. The
@@ -735,7 +744,7 @@
 \end{figure}\label{fig:BetterWaterloo}
    
 That is because our lexing algorithm currently keeps a lot of 
-"useless values that will never not be used. 
+"useless" values that will not be used. 
 These different ways of matching will grow exponentially with the string length.
 
 For $r= (a^*\cdot a^*)^*$ and  
@@ -744,10 +753,10 @@
 there will be $n - 1$ "splitting points" on $s$ we can independently choose to 
 split or not so that each sub-string
 segmented by those chosen splitting points will form different iterations.
-For example when $n=4$,
+For example when $n=4$, we give out a few of the many possibilities of splitting:
 \begin{center}
 \begin{tabular}{lcr}
-$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration, this iteration will be divided between the inner sequence $a^*\cdot a^*$)\\
+$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration)\\
 $a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$ (two iterations)\\
 $aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$ (two iterations)\\
 $a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\

--- a/thys2/blexer2.sc	Thu Jun 09 22:08:06 2022 +0100
+++ b/thys2/blexer2.sc	Sun Jun 12 17:03:09 2022 +0100
@@ -915,6 +915,11 @@
   res.toSet
 }
 
+def attachCtxcc(r: Rexp, ctx: List[Rexp]) : Set[Rexp] = {
+  val res = turnIntoTerms((L(r, ctx))).map(oneSimp)
+  res.toSet
+}
+
 def ABIncludedByC[A, B, C](a: A, b: B, c: C, f: (A, B) => C, subseteqPred: (C, C) => Boolean) : Boolean = {
   subseteqPred(f(a, b), c)
 }
@@ -923,7 +928,9 @@
   res
 }
 def prune6(acc: Set[Rexp], r: ARexp, ctx: List[Rexp]) : ARexp = {
-  if (ABIncludedByC(r, ctx, acc, attachCtx, rs1_subseteq_rs2)) {//acc.flatMap(breakIntoTerms
+  if (ABIncludedByC(a = r, b = ctx, c = acc, 
+                    f = attachCtx, subseteqPred = rs1_subseteq_rs2)) 
+  {//acc.flatMap(breakIntoTerms
     AZERO
   }
   else{
@@ -951,6 +958,36 @@
   }
 }
 
+def prune6cc(acc: Set[Rexp], r: Rexp, ctx: List[Rexp]) : Rexp = {
+  if (ABIncludedByC(a = r, b = ctx, c = acc, 
+                    f = attachCtxcc, subseteqPred = rs1_subseteq_rs2)) 
+  {//acc.flatMap(breakIntoTerms
+    ZERO
+  }
+  else{
+    r match {
+      case SEQ(r1, r2) => 
+      (prune6cc(acc, r1, r2 :: ctx)) match{
+        case ZERO => 
+          ZERO
+        case ONE => 
+          r2
+        case r1p => 
+          SEQ(r1p, r2)
+      }
+      case ALTS(r1, r2) => 
+        List(r1, r2).map(r => prune6cc(acc, r, ctx)).filter(_ != AZERO) match {
+          case Nil => 
+            ZERO
+          case r :: Nil => 
+            r 
+          case ra :: rb :: Nil => 
+            ALTS(ra, rb)
+        }
+      case r => r
+    }
+  }
+}
 
 def distinctBy6(xs: List[ARexp], acc: Set[Rexp] = Set()) : List[ARexp] = xs match {
   case Nil => 
@@ -973,6 +1010,26 @@
   }
 }
 
+def distinctByacc(xs: List[Rexp], acc: Set[Rexp] = Set()) : Set[Rexp] = xs match {
+  case Nil => 
+    acc
+  case x :: xs => {
+    if(acc.contains(x)){
+      distinctByacc(xs, acc)
+    }
+    else{
+      val pruned = prune6cc(acc, x, Nil)
+      val newTerms = turnIntoTerms(pruned)
+      pruned match {
+        case ZERO => 
+          distinctByacc(xs, acc)
+        case xPrime => 
+          distinctByacc(xs, newTerms.map(oneSimp) ++: acc)//distinctBy5(xs, addToAcc.map(oneSimp(_)) ::: acc)
+      }
+    }
+  }
+}
+
 def breakIntoTerms(r: Rexp) : List[Rexp] = r match {
   case SEQ(r1, r2)  => breakIntoTerms(r1).map(r11 => SEQ(r11, r2))
   case ALTS(r1, r2) => breakIntoTerms(r1) ::: breakIntoTerms(r2)
@@ -1380,21 +1437,23 @@
   // }
   
 }
-naive_matcher()
+// naive_matcher()
 def generator_test() {
 
-  test(rexp(4), 1000000) { (r: Rexp) => 
+  test(single(SEQ(SEQ(STAR(CHAR('b')),STAR(STAR(SEQ(CHAR('a'),CHAR('b'))))),
+  SEQ(SEQ(CHAR('b'),STAR(ALTS(CHAR('a'),ONE))),ONE))), 1) { (r: Rexp) => 
   // ALTS(SEQ(SEQ(ONE,CHAR('a')),STAR(CHAR('a'))),SEQ(ALTS(CHAR('c'),ONE),STAR(ZERO))))))), 1) { (r: Rexp) => 
-    val ss = Set("b")//stringsFromRexp(r)
-    val boolList = ss.filter(s => s != "").map(s => {
+    val ss = stringsFromRexp(r)
+    val boolList = ss.map(s => {
       //val bdStrong = bdersStrong(s.toList, internalise(r))
-      val bdStrong6 = bdersStrong7(s.toList, internalise(r))
+      val bdStrong6 = bdersStrong6(s.toList, internalise(r))
       val bdStrong6Set = turnIntoTerms(erase(bdStrong6))
       val pdersSet = pderUNIV(r)//.flatMap(r => turnIntoTerms(r))
       val pdersSetBroken = pdersSet.flatMap(r => turnIntoTerms(r))
-      bdStrong6Set.size <= pdersSet.size || bdStrong6Set.size <= pdersSetBroken.size
+      rs1_subseteq_rs2(bdStrong6Set.toSet, distinctByacc(pdersSet.toList))
+      //bdStrong6Set.size <= pdersSet.size || bdStrong6Set.size <= pdersSetBroken.size ||
+      //rs1_subseteq_rs2(bdStrong6Set.toSet, pdersSet union pdersSetBroken)//|| bdStrong6Set.size <= pdersSetBroken.size
     })
-    //println(boolList)
     //!boolList.exists(b => b == false)
     !boolList.exists(b => b == false)
   }
@@ -1408,13 +1467,19 @@
           //  CHAR('c')))))//SEQ(STAR(CHAR('c')),STAR(SEQ(STAR(CHAR('c')),ONE)))//STAR(SEQ(ALTS(STAR(CHAR('c')),CHAR('c')),SEQ(ALTS(CHAR('c'),ONE),ONE)))
 //counterexample1: STAR(SEQ(ALTS(STAR(ZERO),ALTS(CHAR(a),CHAR(b))),SEQ(ONE,ALTS(CHAR(a),CHAR(b)))))
 //counterexample2: SEQ(ALTS(SEQ(CHAR(a),STAR(ONE)),STAR(ONE)),ALTS(CHAR(a),SEQ(ALTS(CHAR(c),CHAR(a)),CHAR(b))))
+
+//new ce1 : STAR(SEQ(ALTS(ALTS(ONE,CHAR(a)),SEQ(ONE,CHAR(b))),ALTS(CHAR(a),ALTS(CHAR(b),CHAR(a)))))
+//new ce2 : ALTS(CHAR(b),SEQ(ALTS(ZERO,ALTS(CHAR(b),CHAR(b))),ALTS(ALTS(CHAR(a),CHAR(b)),SEQ(CHAR(c),ONE))))
+//new ce3 : SEQ(CHAR(b),ALTS(ALTS(ALTS(ONE,CHAR(a)),SEQ(CHAR(c),ONE)),SEQ(STAR(ZERO),SEQ(ONE,CHAR(b)))))
 def counterexample_check() {
-  val r = SEQ(SEQ(STAR(ALTS(CHAR('a'),CHAR('b'))),
-    ALTS(ALTS(CHAR('c'),CHAR('b')),STAR(ONE))),STAR(CHAR('b')))
+  val r = SEQ(SEQ(STAR(CHAR('b')),STAR(STAR(SEQ(CHAR('a'),CHAR('b'))))),
+  SEQ(SEQ(CHAR('b'),STAR(ALTS(CHAR('a'),ONE))),ONE))//SEQ(SEQ(STAR(ALTS(CHAR('a'),CHAR('b'))),
+    //ALTS(ALTS(CHAR('c'),CHAR('b')),STAR(ONE))),STAR(CHAR('b')))
   val s = "b"
   val bdStrong5 = bdersStrong7(s.toList, internalise(r))
-  val bdStrong5Set = breakIntoTerms(erase(bdStrong5))
+  val bdStrong5Set = turnIntoTerms(erase(bdStrong5))
   val pdersSet = pderUNIV(r)//.map(r => erase(bsimp(internalise(r))))//.flatMap(r => turnIntoTerms(r))//.map(oneSimp).flatMap(r => breakIntoTerms(r))
+  val apdersSet = pdersSet.map(internalise)
   println("original regex ")
   rprint(r)
   println("after strong bsimp")
@@ -1423,9 +1488,23 @@
   rsprint(bdStrong5Set)
   println("after pderUNIV")
   rsprint(pdersSet.toList)
+  println("pderUNIV distinctBy6")
+  //asprint(distinctBy6(apdersSet.toList))
+  rsprint(distinctByacc(pdersSet.toList))
+  // rsprint(turnIntoTerms(pdersSet.toList(3)))
+  // println("NO 3 not into terms")
+  // rprint((pdersSet.toList()))
+  // println("after pderUNIV broken")
+  // rsprint(pdersSet.flatMap(r => turnIntoTerms(r)).toList)
 
 }
-// counterexample_check()
+counterexample_check()
+
+def breakable(r: Rexp) : Boolean = r match {
+  case SEQ(ALTS(_, _), _) => true
+  case SEQ(r1, r2) => breakable(r1)
+  case _ => false
+}
 
 def linform_test() {
   val r = STAR(SEQ(STAR(CHAR('c')),ONE))//SEQ(STAR(CHAR('c')),STAR(SEQ(STAR(CHAR('c')),ONE))) //

author	Chengsong
	Sun, 12 Jun 2022 17:03:09 +0100
changeset 541	5bf9f94c02e1
parent 540	3a1fd5ea2484
child 542	a7344c9afbaf

ChengsongTanPhdThesis/Chapters/Inj.tex		file \| annotate \| diff \| comparison \| revisions
thys2/blexer2.sc		file \| annotate \| diff \| comparison \| revisions