updated
authorChristian Urban <urbanc@in.tum.de>
Wed, 31 May 2017 09:14:39 +0100
changeset 495 7d9d86dc7aa0
parent 494 d0fc671bcbbf
child 496 5c9de27a5b30
updated
LINKS
handouts/ho03.pdf
handouts/ho03.tex
progs/Matcher.thy
--- a/LINKS	Sun May 21 07:35:35 2017 +0100
+++ b/LINKS	Wed May 31 09:14:39 2017 +0100
@@ -6,12 +6,75 @@
 webassembly
 https://sourceware.org/ml/binutils/2017-03/msg00044.html
 https://hacks.mozilla.org/2017/02/a-cartoon-intro-to-webassembly/
+https://webassembly.github.io/spec/
 
 webassembly explorer
 https://mbebenita.github.io/WasmExplorer/
 
+ARM
+https://azeria-labs.com/writing-arm-assembly-part-1/
 
+JVM
+https://www.toptal.com/scala/scala-bytecode-and-the-jvm
+
+Growing a compiler
+http://www.cs.dartmouth.edu/~mckeeman/cs48/mxcom/gem/html/GrowingCompiler.html
 
 free books
 https://github.com/vhf/free-programming-books/blob/master/free-programming-books.md
-https://john.cs.olemiss.edu/~hcc/csci658/notes/Free_Prog_Lang_Textbooks.html
\ No newline at end of file
+https://john.cs.olemiss.edu/~hcc/csci658/notes/Free_Prog_Lang_Textbooks.html
+
+MIPS
+http://courses.missouristate.edu/kenvollmar/mars/
+
+PEG
+
+https://github.com/taocpp/PEGTL
+
+
+Parser
+https://www.reddit.com/r/programming/comments/615hoz/how_to_write_a_recursive_descent_parser/
+https://www.reddit.com/r/ProgrammingLanguages/comments/60gmgc/writing_a_recursive_descent_expression_parser/
+http://www.craftinginterpreters.com/parsing-expressions.html
+
+
+small languages
+https://github.com/Michael2109/cobalt
+http://www.red-lang.org/p/about.html
+http://craftinginterpreters.com/contents.html
+https://michaelhaywoodblog.wordpress.com
+https://ruslanspivak.com/lsbasi-part1/
+http://selfie.cs.uni-salzburg.at
+
+
+automata
+https://www7.in.tum.de/um/courses/auto/ws1314/script/autonotes.pdf
+
+Reges helpers
+https://regex101.com
+http://www.regular-expressions.info/tutorial.html
+
+
+Regex performance benchmark
+https://rust-leipzig.github.io/regex/2017/03/28/comparison-of-regex-engines/
+https://github.com/k-takata/Onigmo
+
+Sulzmann
+https://github.com/pippijn/dreml/
+
+
+Scala parser
+http://enear.github.io/2016/03/31/parser-combinators/
+
+
+ANTLR megatutorial
+https://tomassetti.me/antlr-mega-tutorial/
+
+From regex to LLVM
+https://www.youtube.com/watch?v=Ukqb6nMjFyk
+
+
+
+
+Static code analysis
+https://medium.com/@Coder_HarryLee/videos-about-static-code-analysis-7654b40f9a3b
\ No newline at end of file
Binary file handouts/ho03.pdf has changed
--- a/handouts/ho03.tex	Sun May 21 07:35:35 2017 +0100
+++ b/handouts/ho03.tex	Wed May 31 09:14:39 2017 +0100
@@ -47,7 +47,7 @@
 everywhere: so it can be the case that given a character there is no
 next state, in which case we need to raise a kind of ``failure
 exception''.  That means actually we have \emph{partial} functions as
-transitions---see the Scala implementation of DFAs later on.  A
+transitions---see the Scala implementation for DFAs later on.  A
 typical example of a DFA is
 
 \begin{center}
@@ -94,6 +94,10 @@
 \end{array}
 \]
 
+\noindent
+Please check that this table represents the same transition function
+as the graph above.
+
 We need to define the notion of what language is accepted by
 an automaton. For this we lift the transition function
 $\delta$ from characters to strings as follows:
@@ -235,13 +239,14 @@
 
 \noindent
 This NFA happens to have only one starting state, but in general there
-could be more.  Notice that in state $Q_0$ we might go to state $Q_1$
-\emph{or} to state $Q_2$ when receiving an $a$. Similarly in state
-$Q_1$ and receiving an $a$, we can stay in state $Q_1$ \emph{or} go to
-$Q_2$. This kind of choice is not allowed with DFAs. The downside of
-this choice in NFAs is that when it comes to deciding whether a string is
-accepted by a NFA we potentially have to explore all possibilities. I
-let you think which strings the above NFA accepts.
+could be more than one.  Notice that in state $Q_0$ we might go to
+state $Q_1$ \emph{or} to state $Q_2$ when receiving an $a$. Similarly
+in state $Q_1$ and receiving an $a$, we can stay in state $Q_1$
+\emph{or} go to $Q_2$. This kind of choice is not allowed with
+DFAs. The downside of this choice in NFAs is that when it comes to
+deciding whether a string is accepted by a NFA we potentially have to
+explore all possibilities. I let you think which strings the above NFA
+accepts.
 
 
 There are a number of additional points you should note about
@@ -530,8 +535,8 @@
 \noindent
 I let you think whether the NFAs can match exactly those strings the
 regular expressions can match. To do this translation in code we need
-a way to construct states programatically...and as an additional
-constrain Scala needs to recognise that these states are being distinct.
+a way to construct states ``programatically''...and as an additional
+constraint Scala needs to recognise that these states are being distinct.
 For this I implemented in Figure~\ref{thompson1} a class
 \texttt{TState} that includes a counter and a companion object that
 increases this counter whenever a new state is created.\footnote{You might
@@ -546,7 +551,8 @@
   increased in the companion object of \texttt{TState}
   whenever a new state is created. The code in Lines 24--40
   constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$.
-  Compare the pictures given in \eqref{simplecases}.
+  Compare this code with the pictures given in \eqref{simplecases} on
+  Page~\pageref{simplecases}.
   \label{thompson1}}
 \end{figure}
 
@@ -558,7 +564,7 @@
   The implicit class about rich partial functions
   implements the infix operation \texttt{+++} which
   combines an $\epsilon$NFA transition with a NFA transition
-  (both given as partial functions).\label{thompson2}}
+  (both are given as partial functions---but with different type!).\label{thompson2}}
 \end{figure}
 
 The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more
@@ -658,8 +664,10 @@
 $r_1$ and $r_2$ respectively. Each NFA has some starting states and
 some accepting states. We obtain a NFA for the regular expression $r_1
 + r_2$ by composing the transition functions (this crucially depends
-on knowing that the states of each component NFA are distinct); and
-also combine the starting states and accepting functions.
+on knowing that the states of each component NFA are distinct---recall
+we implemented for this to hold some bespoke code for states). We also
+need to combine the starting states and accepting functions
+appropriately.
 
 \begin{center}
 \begin{tabular}[t]{ccc}
@@ -732,13 +740,14 @@
 state, because $r^*$ can recognise the empty string.
 
 \begin{center}
-\begin{tabular}[b]{@{\hspace{-4mm}}ccc@{}}  
+\begin{tabular}[b]{@{}ccc@{}}  
 \begin{tikzpicture}[node distance=3mm,
     >=stealth',very thick,
     every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
     baseline=(current bounding box.north)]
-\node at (0,0)  (1)  {$\mbox{}$};
-\node[state, initial]  (2)  [right=16mm of 1] {$\mbox{}$};
+\node (2)  {$\mbox{}$};
+\node[state, initial]  (4)  [above=1mm of 2] {$\mbox{}$};
+\node[state, initial]  (5)  [below=1mm of 2] {$\mbox{}$};  
 \node (a)  [right=of 2] {$\ldots$};
 \node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
 \node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
@@ -756,13 +765,16 @@
     every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},
     baseline=(current bounding box.north)]
 \node at (0,0) [state, initial,accepting]  (1)  {$\mbox{}$};
-\node[state]  (2)  [right=16mm of 1] {$\mbox{}$};
+\node (2)  [right=16mm of 1] {$\mbox{}$};
+\node[state]  (4)  [above=1mm of 2] {$\mbox{}$};
+\node[state]  (5)  [below=1mm of 2] {$\mbox{}$};  
 \node (a)  [right=of 2] {$\ldots$};
 \node[state]  (a1)  [right=of a] {$\mbox{}$};
 \node[state]  (a2)  [above=of a1] {$\mbox{}$};
 \node[state]  (a3)  [below=of a1] {$\mbox{}$};
-\path[->] (1) edge node [above]  {$\epsilon$} (2);
-\path[->] (a1) edge [bend left=45] node [above]  {$\epsilon$} (1);
+\path[->] (1) edge node [below]  {$\epsilon$} (4);
+\path[->] (1) edge node [below]  {$\epsilon$} (5);
+\path[->] (a1) edge [bend left=45] node [below]  {$\epsilon$} (1);
 \path[->] (a2) edge [bend right] node [below]  {$\epsilon$} (1);
 \path[->] (a3) edge [bend left=45] node [below]  {$\epsilon$} (1);
 \begin{pgfonlayer}{background}
@@ -1076,25 +1088,25 @@
 state in the subset is accepting (that is true) in the NFA.\medskip
 
 \noindent
-You might be able to spend some quality tinkering with this code and
-time to ponder about it. Then you will probably notice it is actually
-a bit silly. The whole point of translating the NFA into a DFA via the
-subset construction is to make the decision of whether a string is
-accepted or not faster. Given the code above, the generated DFA will
-be exactly as fast, or as slow, as the NFA we started with (actually
-it will even be a tiny bit slower). The reason is that we just re-use
-the \texttt{nexts} function from the NFA. This function implements the
-non-deterministic breadth-first search.  You might be thinking: This
-is cheating! \ldots{} Well, not quite as you will see later, but in
-terms of speed we still need to work a bit in order to get
-sometimes(!) a faster DFA. Let's do this next.
+You might be able to spend some quality time tinkering with this code
+and time to ponder about it. Then you will probably notice that it is
+actually a bit silly. The whole point of translating the NFA into a
+DFA via the subset construction is to make the decision of whether a
+string is accepted or not faster. Given the code above, the generated
+DFA will be exactly as fast, or as slow, as the NFA we started with
+(actually it will even be a tiny bit slower). The reason is that we
+just re-use the \texttt{nexts} function from the NFA. This function
+implements the non-deterministic breadth-first search.  You might be
+thinking: This is cheating! \ldots{} Well, not quite as you will see
+later, but in terms of speed we still need to work a bit in order to
+get sometimes(!) a faster DFA. Let's do this next.
 
 \subsection*{DFA Minimisation}
 
 As seen in \eqref{subsetdfa}, the subset construction from NFA to a
 DFA can result in a rather ``inefficient'' DFA. Meaning there are
 states that are not needed. There are two kinds of such unneeded
-states: \emph{unreachable} states and \emph{nondistinguishable}
+states: \emph{unreachable} states and \emph{non-distinguishable}
 states. The first kind of states can just be removed without affecting
 the language that can be recognised (after all they are
 unreachable). The second kind can also be recognised and thus a DFA
@@ -1253,18 +1265,19 @@
 
 \subsection*{Brzozowski's Method}
 
-I know tyhis is already a long, long rant: but after all it is a topic
-that has been researched for more than 60 years. If you reflect on
-what you have read so far, the story is that you can take a regular
-expression, translate it via the Thompson Construction into an
+I know this handout is already a long, long rant: but after all it is
+a topic that has been researched for more than 60 years. If you
+reflect on what you have read so far, the story is that you can take a
+regular expression, translate it via the Thompson Construction into an
 $\epsilon$NFA, then translate it into a NFA by removing all
 $\epsilon$-transitions, and then via the subset construction obtain a
 DFA. In all steps we made sure the language, or which strings can be
-recognised, stays the same. After the last section, we can even
-minimise the DFA (maybe not in code). But again we made sure the same
-language is recognised. You might be wondering: Can we go into the
-other direction?  Can we go from a DFA and obtain a regular expression
-that can recognise the same language as the DFA?\medskip
+recognised, stays the same. Of couse we should have proved this in
+each step, but let us cut corners here.  After the last section, we
+can even minimise the DFA (maybe not in code). But again we made sure
+the same language is recognised. You might be wondering: Can we go
+into the other direction?  Can we go from a DFA and obtain a regular
+expression that can recognise the same language as the DFA?\medskip
 
 \noindent
 The answer is yes. Again there are several methods for calculating a
@@ -1520,15 +1533,37 @@
 
 \subsection*{Where Have Derivatives Gone?}
 
-By now you are probably fed up by this text. It is now way too long
+By now you are probably fed up with this text. It is now way too long
 for one lecture, but there is still one aspect of the
-automata-connection I like to highlight for you. Perhaps by now you
-are asking yourself: Where have the derivatives gone? Did we just
-forget them?  Well, they have a place in the picture of calculating a
-DFA from the regular expression.
+automata-regular-expression-connection I like to describe. Perhaps by
+now you are asking yourself: Where have the derivatives gone? Did we
+just forget them?  Well, they have a place in the picture of
+calculating a DFA from the regular expression.
 
 To be done
 
+\begin{center}
+\begin{tikzpicture}
+  [level distance=25mm,very thick,auto,
+   level 1/.style={sibling distance=30mm},
+   level 2/.style={sibling distance=15mm},
+   every node/.style={minimum size=30pt,
+                    inner sep=0pt,circle,draw=blue!50,very thick,
+                    fill=blue!20}]
+  \node {$r$} [grow=right]
+  child[->] {node (cn) {$d_{c_n}$}
+    child { node {$dd_{c_nc_n}$}}
+    child { node {$dd_{c_nc_1}$}}
+    %edge from parent node[left] {$c_n$}
+  }
+  child[->] {node (c1) {$d_{c_1}$}
+    child { node {$dd_{c_1c_n}$}}
+    child { node {$dd_{c_1c_1}$}}
+    %edge from parent node[left] {$c_1$}
+  };
+  %%\draw (cn) -- (c1) node {\vdots}; 
+\end{tikzpicture}  
+\end{center}
 
 %\section*{Further Reading}
 
--- a/progs/Matcher.thy	Sun May 21 07:35:35 2017 +0100
+++ b/progs/Matcher.thy	Wed May 31 09:14:39 2017 +0100
@@ -6,8 +6,8 @@
 section {* Regular Expressions *}
 
 datatype rexp =
-  NULL
-| EMPTY
+  ZERO
+| ONE
 | CHAR char
 | SEQ rexp rexp
 | ALT rexp rexp
@@ -63,8 +63,8 @@
 fun
   L :: "rexp \<Rightarrow> string set"
 where
-  "L (NULL) = {}"
-| "L (EMPTY) = {[]}"
+  "L (ZERO) = {}"
+| "L (ONE) = {[]}"
 | "L (CHAR c) = {[c]}"
 | "L (SEQ r1 r2) = (L r1) ;; (L r2)"
 | "L (ALT r1 r2) = (L r1) \<union> (L r2)"
@@ -75,180 +75,61 @@
 fun
  nullable :: "rexp \<Rightarrow> bool"
 where
-  "nullable (NULL) = False"
-| "nullable (EMPTY) = True"
+  "nullable (ZERO) = False"
+| "nullable (ONE) = True"
 | "nullable (CHAR c) = False"
 | "nullable (ALT r1 r2) = (nullable r1 \<or> nullable r2)"
 | "nullable (SEQ r1 r2) = (nullable r1 \<and> nullable r2)"
 | "nullable (STAR r) = True"
 
-fun
- noccurs :: "rexp \<Rightarrow> bool"
-where
-  "noccurs (NULL) = True"
-| "noccurs (EMPTY) = False"
-| "noccurs (CHAR c) = False"
-| "noccurs (ALT r1 r2) = (noccurs r1 \<or> noccurs r2)"
-| "noccurs (SEQ r1 r2) = (noccurs r1 \<or> noccurs r2)"
-| "noccurs (STAR r) = (noccurs r)"
 
-lemma
-  "\<not> noccurs r \<Longrightarrow> L r \<noteq> {}"
-apply(induct r)
-apply(auto simp add: Seq_def)
-done
-
-lemma
-  "L r = {} \<Longrightarrow> noccurs r"
-apply(induct r)
-apply(auto simp add: Seq_def)
-done
-
-lemma does_not_hold:
-  "noccurs r \<Longrightarrow> L r = {}"
-apply(induct r)
-apply(auto simp add: Seq_def)
-oops
-
-fun
- der :: "char \<Rightarrow> rexp \<Rightarrow> rexp"
-where
-  "der c (NULL) = NULL"
-| "der c (EMPTY) = NULL"
-| "der c (CHAR c') = (if c = c' then EMPTY else NULL)"
-| "der c (ALT r1 r2) = ALT (der c r1) (der c r2)"
-| "der c (SEQ r1 r2) = ALT (SEQ (der c r1) r2) (if nullable r1 then der c r2 else NULL)"
-| "der c (STAR r) = SEQ (der c r) (STAR r)"
-
-fun 
- ders :: "string \<Rightarrow> rexp \<Rightarrow> rexp"
-where
-  "ders [] r = r"
-| "ders (c # s) r = ders s (der c r)"
-
-fun
-  matcher :: "rexp \<Rightarrow> string \<Rightarrow> bool"
-where
-  "matcher r s = nullable (ders s r)"
-
-
-section {* Correctness Proof of the Matcher *}
+section {* Correctness Proof for Nullable *}
 
 lemma nullable_correctness:
   shows "nullable r  \<longleftrightarrow> [] \<in> (L r)"
-by (induct r) (auto simp add: Seq_def) 
-section {* Left-Quotient of a Set *}
-
-fun
- zeroable :: "rexp \<Rightarrow> bool"
-where
-  "zeroable (NULL) = True"
-| "zeroable (EMPTY) = False"
-| "zeroable (CHAR c) = False"
-| "zeroable (ALT r1 r2) = (zeroable r1 \<and> zeroable r2)"
-| "zeroable (SEQ r1 r2) = (zeroable r1 \<or> zeroable r2)"
-| "zeroable (STAR r) = False"
-
-
-lemma zeroable_correctness:
-  shows "zeroable r  \<longleftrightarrow>  (L r = {})"
 apply(induct r)
-apply(auto simp add: Seq_def)[6]
-done
-
-section {* Left-Quotient of a Set *}
-
-definition
-  Der :: "char \<Rightarrow> string set \<Rightarrow> string set"
-where
-  "Der c A \<equiv> {s. [c] @ s \<in> A}"
-
-lemma Der_null [simp]:
-  shows "Der c {} = {}"
-unfolding Der_def
-by auto
-
-lemma Der_empty [simp]:
-  shows "Der c {[]} = {}"
-unfolding Der_def
-by auto
-
-lemma Der_char [simp]:
-  shows "Der c {[d]} = (if c = d then {[]} else {})"
-unfolding Der_def
-by auto
-
-lemma Der_union [simp]:
-  shows "Der c (A \<union> B) = Der c A \<union> Der c B"
-unfolding Der_def
-by auto
-
-lemma Der_seq [simp]:
-  shows "Der c (A ;; B) = (Der c A) ;; B \<union> (if [] \<in> A then Der c B else {})"
-unfolding Der_def Seq_def
-by (auto simp add: Cons_eq_append_conv)
+(* ZERO case *)
+apply(simp only: nullable.simps)
+apply(simp only: L.simps)
+apply(simp)
+(* ONE case *)
+apply(simp only: nullable.simps)
+apply(simp only: L.simps)
+apply(simp)
+(* CHAR case *)
+apply(simp only: nullable.simps)
+apply(simp only: L.simps)
+apply(simp)
+prefer 2
+(* ALT case *)
+apply(simp (no_asm) only: nullable.simps)
+apply(simp only:)
+apply(simp only: L.simps)
+apply(simp)
+(* SEQ case *)
+oops
 
-lemma Der_star [simp]:
-  shows "Der c (A\<star>) = (Der c A) ;; A\<star>"
-proof -    
-  have "Der c (A\<star>) = Der c ({[]} \<union> A ;; A\<star>)"
-    by (simp only: star_cases[symmetric])
-  also have "... = Der c (A ;; A\<star>)"
-    by (simp only: Der_union Der_empty) (simp)
-  also have "... = (Der c A) ;; A\<star> \<union> (if [] \<in> A then Der c (A\<star>) else {})"
-    by simp
-  also have "... =  (Der c A) ;; A\<star>"
-    unfolding Seq_def Der_def
-    by (auto dest: star_decomp)
-  finally show "Der c (A\<star>) = (Der c A) ;; A\<star>" .
-qed
-
-
-lemma der_correctness:
-  shows "L (der c r) = Der c (L r)"
-by (induct r) 
-   (simp_all add: nullable_correctness)
-
-lemma matcher_correctness:
-  shows "matcher r s \<longleftrightarrow> s \<in> L r"
-by (induct s arbitrary: r)
-   (simp_all add: nullable_correctness der_correctness Der_def)
-
-section {* Examples *}
-
-definition 
-  "CHRA \<equiv> CHAR (CHR ''a'')"
+lemma nullable_correctness:
+  shows "nullable r  \<longleftrightarrow> [] \<in> (L r)"
+apply(induct r)
+apply(simp_all)
+(* all easy subgoals are proved except the last 2 *)
+(* where the definition of Seq needs to be unfolded. *)
+oops
 
-definition 
-  "ALT1 \<equiv> ALT CHRA EMPTY"
-
-definition 
-  "SEQ3 \<equiv> SEQ (SEQ ALT1 ALT1) ALT1"
-
-value "matcher SEQ3 ''aaa''"
+lemma nullable_correctness:
+  shows "nullable r  \<longleftrightarrow> [] \<in> (L r)"
+apply(induct r)
+apply(simp_all add: Seq_def)
+(* except the star case every thing is proved *)
+(* we need to use the rule for Star.start *)
+oops
 
-value "matcher NULL []"
-value "matcher (CHAR (CHR ''a'')) [CHR ''a'']"
-value "matcher (CHAR a) [a,a]"
-value "matcher (STAR (CHAR a)) []"
-value "matcher (STAR (CHAR a))  [a,a]"
-value "matcher (SEQ (CHAR (CHR ''a'')) (SEQ (STAR (CHAR (CHR ''b''))) (CHAR (CHR ''c'')))) ''abbbbc''"
-value "matcher (SEQ (CHAR (CHR ''a'')) (SEQ (STAR (CHAR (CHR ''b''))) (CHAR (CHR ''c'')))) ''abbcbbc''"
-
-section {* Incorrect Matcher - fun-definition rejected *}
-
-fun
-  match :: "rexp list \<Rightarrow> string \<Rightarrow> bool"
-where
-  "match [] [] = True"
-| "match [] (c # s) = False"
-| "match (NULL # rs) s = False"  
-| "match (EMPTY # rs) s = match rs s"
-| "match (CHAR c # rs) [] = False"
-| "match (CHAR c # rs) (d # s) = (if c = d then match rs s else False)"         
-| "match (ALT r1 r2 # rs) s = (match (r1 # rs) s \<or> match (r2 # rs) s)" 
-| "match (SEQ r1 r2 # rs) s = match (r1 # r2 # rs) s"
-| "match (STAR r # rs) s = (match rs s \<or> match (r # (STAR r) # rs) s)"
+lemma nullable_correctness:
+  shows "nullable r  \<longleftrightarrow> [] \<in> (L r)"
+apply(induct r)
+apply(simp_all add: Seq_def Star.start)
+done
 
 
 end    
\ No newline at end of file