# HG changeset patch
# User Chengsong
# Date 1655049789 -3600
# Node ID 5bf9f94c02e17101e16027e4b78e534ed97ac4a1
# Parent  3a1fd5ea2484222c56a2c42ac1c869d73a4f9eb1
some comments implemented

diff -r 3a1fd5ea2484 -r 5bf9f94c02e1 ChengsongTanPhdThesis/Chapters/Inj.tex
--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Thu Jun 09 22:08:06 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Sun Jun 12 17:03:09 2022 +0100
@@ -10,29 +10,33 @@
 In this chapter, we define the basic notions
 for regular languages and regular expressions.
+This is essentially a description in "English"
+of your formalisation in Isabelle/HOL.
 We also give the definition of what $\POSIX$ lexing means.
 
 \section{Basic Concepts}
-Usually in formal language theory there is an alphabet
+Usually formal language theory starts with an alphabet
 denoting a set of characters.
-Here we only use the datatype of characters from Isabelle,
-which roughly corresponds to the ASCII character.
-Then using the usual $[]$ notation for lists,
-we can define strings using chars:
+Here we just use the datatype of characters from Isabelle,
+which roughly corresponds to the ASCII characters.
+In what follows we shall leave the information about the alphabet
+implicit.
+Then using the usual bracket notation for lists,
+we can define strings made up of characters:
 \begin{center}
 \begin{tabular}{lcl}
-$\textit{string}$ & $\dn$ & $[] | c :: cs$\\
-& & $(c\; \text{has char type})$
+$\textit{s}$ & $\dn$ & $[] \; |\; c :: s$
 \end{tabular}
 \end{center}
-And strings can be concatenated to form longer strings,
-in the same way as we concatenate two lists,
-which we denote as $@$. We omit the precise
+Where $c$ is a variable ranging over characters.
+Strings can be concatenated to form longer strings in the same
+way as we concatenate two lists, which we write as @.
+We omit the precise
 recursive definition here.
 We overload this concatenation operator for two sets of strings:
 \begin{center}
 \begin{tabular}{lcl}
-$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
+$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
 \end{tabular}
 \end{center}
 We also call the above \emph{language concatenation}.
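As a side note, the language concatenation $A @ B$ defined above can be pictured with a tiny Scala sketch over Set[String]; the helper name langConcat is only for illustration and is not a function in blexer2.sc:

// Illustrative sketch of A @ B = { s_A @ s_B | s_A in A, s_B in B } over sets of strings.
def langConcat(A: Set[String], B: Set[String]): Set[String] =
  for (sA <- A; sB <- B) yield sA + sB

// For example, langConcat(Set("ab"), Set("c", "d")) gives Set("abc", "abd").
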
@@ -41,11 +45,11 @@
 \begin{center}
 \begin{tabular}{lcl}
 $A^0 $ & $\dn$ & $\{ [] \}$\\
-$A^{n+1}$ & $\dn$ & $A^n @ A$
+$A^{n+1}$ & $\dn$ & $A @ A^n$
 \end{tabular}
 \end{center}
 The union of all the natural number powers of a language
-is defined as the Kleene star operator:
+is usually defined as the Kleene star operator:
 \begin{center}
 \begin{tabular}{lcl}
 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
@@ -65,28 +69,28 @@
 \inferrule{\\s_1 \in A \land \; s_2 \in A*}{s_1 @ s_2 \in A*}
 \end{mathpar}
 \end{center}
-
-We also define an operation of "chopping of" a character from
-a language, which we call $\Der$, meaning "Derivative for a language":
+\ChristianComment{Yes, used the inferrule command in mathpar}
+We also define an operation of "chopping off" a character from
+a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
 \end{tabular}
 \end{center}
 \noindent
-This can be generalised to "chopping off" a string from all strings within set $A$,
-with the help of the concatenation operator:
+This can be generalised to "chopping off" a string from all strings within set $A$,
+namely:
 \begin{center}
 \begin{tabular}{lcl}
-$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
+$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
 \end{tabular}
 \end{center}
 \noindent
-which is essentially the left quotient $A \backslash L'$ of $A$ against
-the singleton language $L' = \{w\}$
+which is essentially the left quotient $A \backslash L$ of $A$ against
+the singleton language $L = \{s\}$
 in formal language theory.
-For this dissertation the $\textit{Ders}$ definition with
-a single string suffices.
+However for the purposes here, the $\textit{Ders}$ definition with
+a single string is sufficient.
 With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
 we have
 a few properties of how the language derivative can be defined using
@@ -510,7 +514,12 @@
 For instance, when lexing a code snippet
 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$,
 we want $\textit{iffoo}$ to be recognized as an identifier rather than a keyword.
-
+\ChristianComment{Do I also introduce lexical values $LV$ here?}
+We know that $\POSIX$ values are also part of the normal values:
+\begin{lemma}
+$(r, s) \rightarrow v \implies \vdash v: r$
+\end{lemma}
+\noindent
 The good property about a $\POSIX$ value is that given the same regular expression
 $r$ and string $s$, one can always uniquely determine the $\POSIX$ value for it:
@@ -659,12 +668,12 @@
 \noindent
 The central property of the $\lexer$ is that it gives the correct result by
 $\POSIX$ standards:
- \begin{lemma}
+ \begin{theorem}
 \begin{tabular}{l}
- $s \in L(r) \Longleftrightarrow (\exists v. \; r \; s = \Some(v) \land (r, \; s) \rightarrow v)$\\
- $s \notin L(r) \Longleftrightarrow (\lexer \; r\; s = \None)$
+ $\lexer \; r \; s = \Some(v) \Longleftrightarrow (r, \; s) \rightarrow v$\\
+ $\lexer \;r \; s = \None \Longleftrightarrow \neg(\exists v. (r, s) \rightarrow v)$
 \end{tabular}
- \end{lemma}
+ \end{theorem}
 
 
 \begin{proof}
@@ -674,7 +683,7 @@
 \end{proof}
 
-Pictorially, the algorithm is as follows (
+We now give a pictorial view of the algorithm (
 For convenience, we employ the following notations:
 the regular expression we start with is $r_0$,
 and the given string $s$ is composed of characters $c_0 c_1 \ldots c_{n-1}$.
 The
@@ -735,7 +744,7 @@
 \end{figure}\label{fig:BetterWaterloo}
 
 That is because our lexing algorithm currently keeps a lot of
-"useless values that will never not be used.
+"useless" values that will not be used.
 These different ways of matching will grow exponentially with the string length.
 
 For $r= (a^*\cdot a^*)^*$ and
@@ -744,10 +753,10 @@
 there will be $n - 1$ "splitting points" on $s$ we can independently choose to
 split or not so that each sub-string segmented by those chosen splitting points
 will form different iterations.
-For example when $n=4$,
+For example when $n=4$, we list a few of the many possibilities of splitting:
 \begin{center}
 \begin{tabular}{lcr}
-$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration, this iteration will be divided between the inner sequence $a^*\cdot a^*$)\\
+$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration)\\
 $a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$ (two iterations)\\
 $aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$ (two iterations)\\
 $a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
diff -r 3a1fd5ea2484 -r 5bf9f94c02e1 thys2/blexer2.sc
--- a/thys2/blexer2.sc	Thu Jun 09 22:08:06 2022 +0100
+++ b/thys2/blexer2.sc	Sun Jun 12 17:03:09 2022 +0100
@@ -915,6 +915,11 @@
   res.toSet
 }
 
+def attachCtxcc(r: Rexp, ctx: List[Rexp]) : Set[Rexp] = {
+  val res = turnIntoTerms((L(r, ctx))).map(oneSimp)
+  res.toSet
+}
+
 def ABIncludedByC[A, B, C](a: A, b: B, c: C, f: (A, B) => C, subseteqPred: (C, C) => Boolean) : Boolean = {
   subseteqPred(f(a, b), c)
 }
@@ -923,7 +928,9 @@
   res
 }
 def prune6(acc: Set[Rexp], r: ARexp, ctx: List[Rexp]) : ARexp = {
-  if (ABIncludedByC(r, ctx, acc, attachCtx, rs1_subseteq_rs2)) {//acc.flatMap(breakIntoTerms
+  if (ABIncludedByC(a = r, b = ctx, c = acc,
+    f = attachCtx, subseteqPred = rs1_subseteq_rs2))
+  {//acc.flatMap(breakIntoTerms
     AZERO
   }
   else{
@@ -951,6 +958,36 @@
   }
 }
 
+def prune6cc(acc: Set[Rexp], r: Rexp, ctx: List[Rexp]) : Rexp = {
+  if (ABIncludedByC(a = r, b = ctx, c = acc,
+    f = attachCtxcc, subseteqPred = rs1_subseteq_rs2))
+  {//acc.flatMap(breakIntoTerms
+    ZERO
+  }
+  else{
+    r match {
+      case SEQ(r1, r2) =>
+        (prune6cc(acc, r1, r2 :: ctx)) match{
+          case ZERO =>
+            ZERO
+          case ONE =>
+            r2
+          case r1p =>
+            SEQ(r1p, r2)
+        }
+      case ALTS(r1, r2) =>
+        List(r1, r2).map(r => prune6cc(acc, r, ctx)).filter(_ != ZERO) match {
+          case Nil =>
+            ZERO
+          case r :: Nil =>
+            r
+          case ra :: rb :: Nil =>
+            ALTS(ra, rb)
+        }
+      case r => r
+    }
+  }
+}
 def distinctBy6(xs: List[ARexp], acc: Set[Rexp] = Set()) : List[ARexp] = xs match {
   case Nil =>
@@ -973,6 +1010,26 @@
   }
 }
 
+def distinctByacc(xs: List[Rexp], acc: Set[Rexp] = Set()) : Set[Rexp] = xs match {
+  case Nil =>
+    acc
+  case x :: xs => {
+    if(acc.contains(x)){
+      distinctByacc(xs, acc)
+    }
+    else{
+      val pruned = prune6cc(acc, x, Nil)
+      val newTerms = turnIntoTerms(pruned)
+      pruned match {
+        case ZERO =>
+          distinctByacc(xs, acc)
+        case xPrime =>
+          distinctByacc(xs, newTerms.map(oneSimp) ++: acc)//distinctBy5(xs, addToAcc.map(oneSimp(_)) ::: acc)
+      }
+    }
+  }
+}
+
 def breakIntoTerms(r: Rexp) : List[Rexp] = r match {
   case SEQ(r1, r2)  => breakIntoTerms(r1).map(r11 => SEQ(r11, r2))
   case ALTS(r1, r2) => breakIntoTerms(r1) ::: breakIntoTerms(r2)
@@ -1380,21 +1437,23 @@
 //   }
 }
 
-naive_matcher()
+// naive_matcher()
 def generator_test() {
 
-  test(rexp(4), 1000000) { (r: Rexp) =>
+  test(single(SEQ(SEQ(STAR(CHAR('b')),STAR(STAR(SEQ(CHAR('a'),CHAR('b'))))),
+    SEQ(SEQ(CHAR('b'),STAR(ALTS(CHAR('a'),ONE))),ONE))), 1) { (r: Rexp) =>
 // ALTS(SEQ(SEQ(ONE,CHAR('a')),STAR(CHAR('a'))),SEQ(ALTS(CHAR('c'),ONE),STAR(ZERO))))))), 1) { (r: Rexp) =>
-    val ss = Set("b")//stringsFromRexp(r)
-    val boolList = ss.filter(s => s != "").map(s => {
+    val ss = stringsFromRexp(r)
+    val boolList = ss.map(s => {
       //val bdStrong = bdersStrong(s.toList, internalise(r))
-      val bdStrong6 = bdersStrong7(s.toList, internalise(r))
+      val bdStrong6 = bdersStrong6(s.toList, internalise(r))
       val bdStrong6Set = turnIntoTerms(erase(bdStrong6))
       val pdersSet = pderUNIV(r)//.flatMap(r => turnIntoTerms(r))
       val pdersSetBroken = pdersSet.flatMap(r => turnIntoTerms(r))
-      bdStrong6Set.size <= pdersSet.size || bdStrong6Set.size <= pdersSetBroken.size
+      rs1_subseteq_rs2(bdStrong6Set.toSet, distinctByacc(pdersSet.toList))
+      //bdStrong6Set.size <= pdersSet.size || bdStrong6Set.size <= pdersSetBroken.size ||
+      //rs1_subseteq_rs2(bdStrong6Set.toSet, pdersSet union pdersSetBroken)//|| bdStrong6Set.size <= pdersSetBroken.size
     })
-    //println(boolList)
     //!boolList.exists(b => b == false)
     !boolList.exists(b => b == false)
   }
@@ -1408,13 +1467,19 @@
 //       CHAR('c')))))//SEQ(STAR(CHAR('c')),STAR(SEQ(STAR(CHAR('c')),ONE)))//STAR(SEQ(ALTS(STAR(CHAR('c')),CHAR('c')),SEQ(ALTS(CHAR('c'),ONE),ONE)))
 //counterexample1: STAR(SEQ(ALTS(STAR(ZERO),ALTS(CHAR(a),CHAR(b))),SEQ(ONE,ALTS(CHAR(a),CHAR(b)))))
 //counterexample2: SEQ(ALTS(SEQ(CHAR(a),STAR(ONE)),STAR(ONE)),ALTS(CHAR(a),SEQ(ALTS(CHAR(c),CHAR(a)),CHAR(b))))
+
+//new ce1 : STAR(SEQ(ALTS(ALTS(ONE,CHAR(a)),SEQ(ONE,CHAR(b))),ALTS(CHAR(a),ALTS(CHAR(b),CHAR(a)))))
+//new ce2 : ALTS(CHAR(b),SEQ(ALTS(ZERO,ALTS(CHAR(b),CHAR(b))),ALTS(ALTS(CHAR(a),CHAR(b)),SEQ(CHAR(c),ONE))))
+//new ce3 : SEQ(CHAR(b),ALTS(ALTS(ALTS(ONE,CHAR(a)),SEQ(CHAR(c),ONE)),SEQ(STAR(ZERO),SEQ(ONE,CHAR(b)))))
 def counterexample_check() {
-  val r = SEQ(SEQ(STAR(ALTS(CHAR('a'),CHAR('b'))),
-    ALTS(ALTS(CHAR('c'),CHAR('b')),STAR(ONE))),STAR(CHAR('b')))
+  val r = SEQ(SEQ(STAR(CHAR('b')),STAR(STAR(SEQ(CHAR('a'),CHAR('b'))))),
+    SEQ(SEQ(CHAR('b'),STAR(ALTS(CHAR('a'),ONE))),ONE))//SEQ(SEQ(STAR(ALTS(CHAR('a'),CHAR('b'))),
+    //ALTS(ALTS(CHAR('c'),CHAR('b')),STAR(ONE))),STAR(CHAR('b')))
   val s = "b"
   val bdStrong5 = bdersStrong7(s.toList, internalise(r))
-  val bdStrong5Set = breakIntoTerms(erase(bdStrong5))
+  val bdStrong5Set = turnIntoTerms(erase(bdStrong5))
   val pdersSet = pderUNIV(r)//.map(r => erase(bsimp(internalise(r))))//.flatMap(r => turnIntoTerms(r))//.map(oneSimp).flatMap(r => breakIntoTerms(r))
+  val apdersSet = pdersSet.map(internalise)
   println("original regex ")
   rprint(r)
   println("after strong bsimp")
@@ -1423,9 +1488,23 @@
   rsprint(bdStrong5Set)
   println("after pderUNIV")
   rsprint(pdersSet.toList)
+  println("pderUNIV distinctBy6")
+  //asprint(distinctBy6(apdersSet.toList))
+  rsprint(distinctByacc(pdersSet.toList))
+  // rsprint(turnIntoTerms(pdersSet.toList(3)))
+  // println("NO 3 not into terms")
+  // rprint((pdersSet.toList()))
+  // println("after pderUNIV broken")
+  // rsprint(pdersSet.flatMap(r => turnIntoTerms(r)).toList)
 }
 
-// counterexample_check()
+counterexample_check()
+
+def breakable(r: Rexp) : Boolean = r match {
+  case SEQ(ALTS(_, _), _) => true
+  case SEQ(r1, r2) => breakable(r1)
+  case _ => false
+}
 
 def linform_test() {
   val r = STAR(SEQ(STAR(CHAR('c')),ONE))//SEQ(STAR(CHAR('c')),STAR(SEQ(STAR(CHAR('c')),ONE))) //
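
As a side note, the property that the new generator_test checks can be packaged for a single regular expression and string roughly as follows; the sketch reuses bdersStrong6, internalise, erase, turnIntoTerms, pderUNIV, distinctByacc and rs1_subseteq_rs2 exactly as they are called in the diff above, while the wrapper name single_subset_check is only for illustration and is not part of blexer2.sc:

// Illustrative sketch: for one r and s, every term of the strongly simplified
// bitcoded derivative should occur among the pruned partial-derivative terms.
def single_subset_check(r: Rexp, s: String) : Boolean = {
  val bdStrong6Set = turnIntoTerms(erase(bdersStrong6(s.toList, internalise(r))))
  rs1_subseteq_rs2(bdStrong6Set.toSet, distinctByacc(pderUNIV(r).toList))
}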
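
Similarly, the breakable predicate added above only looks for an ALTS along the left spine of nested SEQs; a few illustrative checks (the name breakable_examples is not in blexer2.sc) would be:

def breakable_examples() : Unit = {
  // an ALTS directly on the left of a SEQ is breakable
  assert(breakable(SEQ(ALTS(CHAR('a'), CHAR('b')), CHAR('c'))))
  // an ALTS further down the left spine of nested SEQs is still breakable
  assert(breakable(SEQ(SEQ(ALTS(CHAR('a'), ONE), CHAR('b')), CHAR('c'))))
  // an ALTS under a STAR is not inspected
  assert(!breakable(STAR(ALTS(CHAR('a'), CHAR('b')))))
}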