# HG changeset patch # User Chengsong # Date 1690302509 -3600 # Node ID 660cf698eb268c39730d92291792bc77574ec3e0 # Parent 6da4516ea87d6aa44b12c764a1d39a44c5ad8746 added example of how inj and lexer works diff -r 6da4516ea87d -r 660cf698eb26 ChengsongTanPhdThesis/Chapters/Bitcoded1.tex --- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Mon Jul 24 11:09:48 2023 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Tue Jul 25 17:28:29 2023 +0100 @@ -10,14 +10,30 @@ %Chapter 3\ref{Chapter3}. In this chapter, we are going to describe the bit-coded algorithm introduced by Sulzmann and Lu \parencite{Sulzmann2014} and their correctness proof. +Just like in chapter \ref{Inj}, the algorithms and proofs have been included +for self-containedness reasons, +even though they have been originally found and described by +Sulzmann and Lu (\cite{Sulzmann2014}) and +Ausaf et al. (\cite{AusafDyckhoffUrban2016} and \cite{Ausaf}). +%In addition to this, the +The details of the proofs in this thesis +also follow more closely the actual Isabelle formalisation. +For example, lemma \ref{flexStepwise} and \ref{retrieveStepwise} +are not included in the publications by Ausaf et al., despite them being +some of the key lemmas leading to the correctness result. + +We will first motivate the bit-coded algorithm in section \ref{bitMotivate}, +and then introduce their formal definitions in section \ref{bitDef}, +followed by a description of the correctness proof of $\blexer$ in section \ref{blexerProof}. + %to address the growth problem of %derivatives of %regular expressions. -We have implemented their algorithm in Scala and Isabelle, -and found problems -in their algorithm, such as de-duplication not working properly and redundant -fixpoint construction. -\section{The Motivation Behind Using Bitcodes} +%We have implemented their algorithm in Scala and Isabelle, +%and found problems +%in their algorithm, such as de-duplication not working properly and redundant +%fixpoint construction. +\section{The Motivation Behind Using Bitcodes}\label{bitMotivate} Let us give again the definition of $\lexer$ from Chapter \ref{Inj}: \begin{center} \begin{tabular}{lcl} @@ -316,7 +332,7 @@ The stack seems to grow at least quadratically with respect to the input (not taking into account the size bloat of $r_i$), which can be inefficient and prone to stack overflows. -\section{Bitcoded Algorithm} +\section{Bitcoded Algorithm}\label{bitDef} To address this, Sulzmann and Lu defined a new datatype called \emph{annotated regular expression}, @@ -1042,7 +1058,7 @@ %----------------------------------- % SUBSECTION 1 %----------------------------------- -\section{Correctness Proof of $\textit{Blexer}$} +\section{Correctness Proof of $\textit{Blexer}$}\label{blexerProof} Why is $\blexer$ correct? In other words, why is it the case that $\blexer$ outputs the same value as $\lexer$? diff -r 6da4516ea87d -r 660cf698eb26 ChengsongTanPhdThesis/Chapters/Inj.tex --- a/ChengsongTanPhdThesis/Chapters/Inj.tex Mon Jul 24 11:09:48 2023 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Tue Jul 25 17:28:29 2023 +0100 @@ -1332,7 +1332,7 @@ %TODO: FILL in the other defs \begin{center} \begin{tabular}{lcl} -$L \; \ZERO$ & $\dn$ & $\phi$\\ +$L \; \ZERO$ & $\dn$ & $\varnothing$\\ $L \; \ONE$ & $\dn$ & $\{[]\}$\\ $L \; c$ & $\dn$ & $\{[c]\}$\\ $L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\ @@ -2072,15 +2072,7 @@ injected back to that sub-value; $\inj$ assembles all parts to form a new value. -\subsection{An Example of how Injection Works} -We will provide a few examples on why -\begin{center} - \begin{tabular}{lcl} - $\inj \; (a\cdot b)\cdot c \; \Seq \; \ONE \; b$ & $=$ & $(a+e)$\\ - $A$ & $B$ & $C$ - \end{tabular} -\end{center} For instance, the last clause is an injection into a sequence value $v_{i+1}$ whose second child @@ -2111,6 +2103,101 @@ wrapped under the $\Stars$ constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$. +Putting all together, Sulzmann and Lu obtained the following algorithm +outlined in the injection-based lexing diagram \ref{graph:inj}: +\begin{center} +\begin{tabular}{lcl} + $\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\ + $\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\ + & & $\quad \phantom{\mid}\; \None \implies \None$\\ + & & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$ +\end{tabular} +\end{center} +\noindent + +\subsection{Examples on How Injection and Lexer Works} +We will provide a few examples on how $\inj$ and $\lexer$ +works by showing their values in each recursive call on some +concrete examples. +We start with the call $\lexer \; (a+aa)^*\cdot c\; aac$ +on the lexer, note that the value's character constructor +$\Char \;c$ is abbreviated as $c$ for readability. +Similarly the last derivative's sub-expression +is abbreviated as $r_{=0}$\footnote{which is equal to +$((\ZERO + (\ZERO a+ \ZERO))\cdot(a+aa)^* + (\ZERO + \ZERO a)\cdot (a+aa)^*)+ +((\ZERO+\ZERO a)\cdot(a+aa)^* + (\ZERO+\ZERO a)\cdot(a+aa)^*), +$.} +whose language interpretation is equivalent to that of +$\ZERO$ and therefore not crucial to be displayed fully expanded, +as they will not be injected into.%our example run. +\begin{center} + \begin{tabular}{lcl} + $(a+\textcolor{magenta}{a}\textcolor{blue}{a})^* \cdot \textcolor{red}{c}$ & $\stackrel{\backslash \textcolor{magenta}{a}}{\rightarrow}$ & $((\ONE + \textcolor{magenta}{\ONE} \textcolor{blue}{a})\cdot (a+aa)^*)\cdot \textcolor{red}{c}+\ZERO$\\ + %& $\stackrel{\rightarrow}{\backslash a}$ & $((\ONE + \ONE a)\cdot (a+aa)^*)\cdot c + \ZERO$\\ + & $\stackrel{\backslash \textcolor{blue}{a}}{\rightarrow}$ & + $(((\ZERO+(\ZERO a+ \textcolor{blue}{\ONE}))\cdot (a+aa)^* + (\ONE+\ONE a)\cdot (a+aa)^* )\cdot \textcolor{red}{\mathbf{c}} + \ZERO)+\ZERO$\\ + & $\stackrel{\backslash \textcolor{red}{c}}{\rightarrow}$ & + $((r_{=0}\cdot c + \textcolor{red}{\ONE})+\ZERO)+\ZERO$\\ + & $\stackrel{\mkeps}{\rightarrow}$ & $\Left (\Left \; (\Right \; \textcolor{red}{\Empty}))$ \\ + & $\stackrel{\inj \;\textcolor{red}{c} }{\rightarrow}$ & + $\Left \; (\Left \; (\Seq \;(\Left \; (\Seq \; (\Right \; (\Right\; \textcolor{blue}{\Empty})) \; \Stars \, [])) \; \textcolor{red}{c}))$\\ + & $\stackrel{\inj \;\textcolor{blue}{a}}{\rightarrow}$ & + $\Left\; (\Seq \; (\Seq\; (\Right \; (\Seq \; \textcolor{magenta}{\Empty }\; \textcolor{blue}{ \mathbf{a}}) ) + \;\Stars \,[]) \; \textcolor{red}{c})$\\ + & $\stackrel{\inj \;\textcolor{magenta}{a}}{\rightarrow}$ & $\Seq \; (\Stars \; [\Right \; (\Seq \; \mathbf{\textcolor{magenta}{a}} \; \textcolor{blue}{a})]) \; \textcolor{red}{ c}$ + + %$\inj \; r \; c\A$ + %$\inj \; (a\cdot b)\cdot c \; \Seq \; \ONE \; b$ & $=$ & $(a+e)$\\ + \end{tabular} +\end{center} +\noindent +We have assigned different colours for each character, +as well as their corresponding locations in values and regular expressions. +The most recently injected character is marked with a bold font. +%Their corresponding derivative steps have been marked with the same +%colour. We also mark the specific location where the coloured character +%was injected into with the same colour, before and after the injection occurred. +%TODO: can be also used to motivate injection->blexer in Bitcoded1 +To show the details of how $\inj$ works, +we zoom in to the second injection above, +illustrating the recursive calls involved: +\begin{center} + \begin{tabular}{lcl} +$\inj \;\quad ((\ONE + {\ONE} + \textcolor{blue}{a})\cdot (a+aa)^*)\cdot + c + \ZERO \; \quad \textcolor{blue}{a} \; $ & &\\ +$\quad \Left \; +(\Left \; (\Seq \;(\Left \; (\Seq \; (\Right \; (\Right\; + \textcolor{blue}{\Empty})) \; \Stars \, [])) \; c))$ & \\$=$\\ + $\Left \; (\inj \; ((\ONE + \ONE + \textcolor{blue}{a})\cdot (a+aa)^*)\cdot + c \quad \textcolor{blue}{a} \quad (\Left \; (\Seq\; ( \Left \; (\Seq \; (\Right \; (\Right\; + \textcolor{blue}{\Empty})) \; \Stars \, []) ) \; c ) )\;\;)$ &\\ $=$\\ + $\Left \; (\Seq \; (\inj \; (\ONE + \ONE \textcolor{blue}{a})\cdot(a+aa)^* \quad \textcolor{blue}{a} \quad + (\Left \; (\Seq \; (\Right \; (\Right\;\textcolor{blue}{\Empty})))) ) \; c ) $ & \\$=$\\ + $\Left \; (\Seq \; (\Seq \; (\inj \quad (\ONE + \ONE \textcolor{blue}{a}) \quad \textcolor{blue}{a}\quad \Right \;(\Right \; \textcolor{blue}{\Empty}) ) \; \Stars \,[])\; c)$ &\\ $=$ \\ + $\Left \; (\Seq \; (\Seq \; (\Right \; (\inj \; \ONE \textcolor{blue}{a} \quad \textcolor{blue}{a}\quad (\Right \;\textcolor{blue}{\Empty})))) \; \Stars \, [])$ + & \\ $=$\\ + $\Left \; (\Seq \; (\Seq \; (\Right \; (\Seq \; (\mkeps \; \ONE)\;(\inj \;\textcolor{blue}{a} \; \textcolor{blue}{a} \; \textcolor{blue}{\Empty})) ) ) \; \Stars \, [] )$ + & \\ $=$\\ + $\Left \; (\Seq \; (\Seq \; (\Right \; (\Seq \; \Empty \; \textcolor{blue}{a}) ) ) \; \Stars \, [] )$ + \end{tabular} +\end{center} +\noindent +We will now introduce the proofs related to $\inj$ and +$\lexer$. These proofs have originally been found by Ausaf et al. +in their 2016 work \cite{AusafDyckhoffUrban2016}. +Proofs to some of these lemmas have also been discussed in more detail in +Ausaf's thesis \cite{Ausaf}. +Nevertheless, we still introduce these proofs, to make this thesis +self-contained as we show inductive variants used in these proofs break +after more simplifications are introduced. +%Also note that lemmas like \ref{}described in this thesis is a more faithful +%representation of the actual accompanying Isabelle formalisation, and +%therefore + + + Recall that lemma \ref{mePosix} tells us that $\mkeps$ always generates the POSIX value. @@ -2212,17 +2299,14 @@ The star case can be proven similarly. \end{proof} \noindent -Putting all together, Sulzmann and Lu obtained the following algorithm -outlined in the diagram \ref{graph:inj}: -\begin{center} -\begin{tabular}{lcl} - $\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\ - $\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\ - & & $\quad \phantom{\mid}\; \None \implies \None$\\ - & & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$ -\end{tabular} -\end{center} -\noindent + + + + + + + +%TODO: Cut from previous lexer def, need to make coherent The central property of the $\lexer$ is that it gives the correct result according to POSIX rules. diff -r 6da4516ea87d -r 660cf698eb26 RegexExplosionPlot/injToSimp.sc --- a/RegexExplosionPlot/injToSimp.sc Mon Jul 24 11:09:48 2023 +0100 +++ b/RegexExplosionPlot/injToSimp.sc Tue Jul 25 17:28:29 2023 +0100 @@ -393,14 +393,14 @@ def simp_lexer(r: Rexp, s: List[Char]): Val = s match{ case Nil => {//println("mkeps!"); println(r); val v = mkeps(r); - //println(v); + println(v); v} case c::cs => { val derc1 = der(c, r); val derc = light_simp(derc1); println("derivative size: "+ size(derc)); val v = simp_lexer(derc, cs); - //println("Injection!");println("r: "+r); println("c: "+c); println("v:"+v); + println("Injection!");println("r: "+r); println("c: "+c); println("v:"+v); inj(r, c, v ) } } @@ -415,7 +415,7 @@ //val derc = light_simp(derc1); println("derivative size: "+ size(derc)); val v = print_lexer(derc, cs); - //println("Injection!");println("r: "+r); println("c: "+c); println("v:"+v); + println("Injection!");println("r: "+r); println("c: "+c); println("v:"+v); inj(r, c, v ) } } val r = STAR("a"|"aa") ~ ("c") @@ -1229,9 +1229,10 @@ // println(" " + (t1 - t0)/1e9 + "") // result //} -//val astarstarB = STAR(STAR("a"))~("b") -//val data_points_x_axis = List.range(0, 60000, 3000) +val astarstarB = STAR(STAR("a"))~("b") +val data_points_x_axis = List.range(0, 60000, 3000) //for (i <- data_points_x_axis) {print(i); time { blexing(astarstarB, ("a"*i)) } } ////println(data_points_x_axis.zip( List(0,1,3,4))) //for(i <- List.range(0, 40000, 3000)) print(i + " ") +//for (i <- data_points_x_axis) {print(i); time { simp_lexer(astarstarB, ("a"*i).toList) } }