--- a/LINKS Sat May 05 10:31:00 2018 +0100
+++ b/LINKS Fri Jun 01 15:28:37 2018 +0100
@@ -1,13 +1,31 @@
-Pictures from teh Starting Forth book
+Compiler courses
+http://www.cse.chalmers.se/edu/year/2017/course/TDA283/lectures/
+http://www.cs.columbia.edu/~sedwards/classes/2017/4115-fall/index.html
+http://www.cs.dartmouth.edu/~mckeeman/cs48/index.html
+===============================
+
+Pictures from the Starting Forth book
https://www.forth.com/starting-forth/2-stack-manipulation-operators-arithmetic/
--------------------------------
+Java bytecode for Scala
+https://www.toptal.com/scala/scala-bytecode-and-the-jvm
+------------------
Yeti - ML for the JVM
https://mth.github.io/yeti/
http://dot.planet.ee/yeti/intro.html
+------------------
+OCaml for LLVM
+https://github.com/artagnon/rhine-ml
+
+---------------
+Type inference
+http://www.calebh.io/Type-Inference-by-Solving-Constraints/
+
+
JVM languages
https://en.wikipedia.org/wiki/List_of_JVM_languages
@@ -67,6 +85,11 @@
https://michaelhaywoodblog.wordpress.com
https://ruslanspivak.com/lsbasi-part1/
http://selfie.cs.uni-salzburg.at
+https://academy.realm.io/posts/tryswift-samuel-giddins-building-tiny-compiler-swift-ios/
+https://legacy.gitbook.com/book/luxlang/the-lux-programming-language/details
+https://github.com/rhysd/gocaml
+https://github.com/aalhour/awesome-compilers#educational-and-toy-projects
+https://rsms.me/hue-intro
automata
@@ -99,4 +122,39 @@
Static code analysis
-https://medium.com/@Coder_HarryLee/videos-about-static-code-analysis-7654b40f9a3b
\ No newline at end of file
+https://medium.com/@Coder_HarryLee/videos-about-static-code-analysis-7654b40f9a3b
+
+
+Brainfuck compiler
+https://github.com/PurpleMyst/bf_compiler
+https://www.reddit.com/r/programming/comments/8371tk/a_brainfuck_to_llvm_ir_compiler_written_in_python/
+http://www.wilfred.me.uk/blog/2015/02/21/my-first-llvm-compiler/
+
+===============================
+A simple recursive regular expression matcher written in Scala.
+https://github.com/marconilanna/RegexMatcher
+
+===============================
+A neat little tool to build presentations using the Scala REPL
+https://github.com/marconilanna/REPLesent
+
+
+======================
+Minimal Static Single Assignment Form
+Max Wagner and Denis Lohner
+
+This formalization is an extension to "Verified Construction of Static Single
+Assignment Form". In their work, the authors have shown that Braun et al.'s
+static single assignment (SSA) construction algorithm produces minimal SSA form
+for input programs with a reducible control flow graph (CFG). However, Braun et
+al. also proposed an extension to their algorithm that they claim produces
+minimal SSA form even for irreducible CFGs.
+In this formalization we support that claim by giving a mechanized proof.
+
+As the extension of Braun et al.'s algorithm aims at removing so-called
+redundant strongly connected components of phi functions, we show that this
+suffices to guarantee minimality according to Cytron et al.
+
+https://www.isa-afp.org/entries/Minimal_SSA.shtml
+
+
Binary file handouts/ho01.pdf has changed
--- a/handouts/ho01.tex Sat May 05 10:31:00 2018 +0100
+++ b/handouts/ho01.tex Fri Jun 01 15:28:37 2018 +0100
@@ -36,7 +36,7 @@
%https://www.youtube.com/watch?v=gmhMQfFQu20
\begin{document}
-\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
+\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018}
\section*{Handout 1}
@@ -46,8 +46,8 @@
Knuth-Morris-Pratt algorithm, which is currently the most efficient
general string search algorithm. But often we do \emph{not} just look
for a particular string, but for string patterns. For example, in
-program code we need to identify what are the keywords (if, then,
-while, for, etc), what are the identifiers (variable names). A pattern for
+program code we need to identify the keywords (\texttt{if}, \texttt{then},
+\texttt{while}, \texttt{for}, etc) and the identifiers (variable names). A pattern for
identifiers could be stated as: they start with a letter, followed by
zero or more letters, numbers and underscores. Often we also face the
problem that we are given a string (for example some user input) and
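As a quick sanity check, the identifier pattern just described can be tried out with the JDK's String.matches (an illustrative Scala sketch; the concrete character classes are my assumption, not the handout's definition):

  // a letter, followed by zero or more letters, digits and underscores
  val ident = "[a-zA-Z][a-zA-Z0-9_]*"
  "foo_bar1".matches(ident)   // true
  "1foo".matches(ident)       // false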
@@ -106,7 +106,7 @@
\noindent according to the regular expression we specified in line
\eqref{email} above. Whether this is intended or not is a different
question (the second email above is actually an acceptable email
-address acording to the RFC 5322 standard for email addresses).
+address according to the RFC 5322 standard for email addresses).
As mentioned above, identifiers, or variables, in program code
are often required to satisfy the constraints that they start
@@ -140,7 +140,7 @@
\pcode{re\{n\}} & matches exactly \pcode{n} number of
occurrences of preceding expression\\
\pcode{re\{n,m\}} & matches at least \pcode{n} and at most {\tt m}
-occurences of the preceding expression\\
+occurrences of the preceding expression\\
\pcode{[...]} & matches any single character inside the
brackets\\
\pcode{[^...]} & matches any single character not inside the
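The counting operators from the table can be checked the same way (again only a sketch via String.matches):

  "aaa".matches("a{2,3}")    // true: between two and three a's
  "aaaa".matches("a{2,3}")   // false: one a too many
  "c".matches("[abc]")       // true: c is inside the brackets
  "c".matches("[^abc]")      // false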
@@ -191,7 +191,7 @@
before (remember the PRA module?). Why on earth then is there any
interest in studying them again in depth in this module? Well, one
answer is in the following two graphs about regular expression
-matching in Python, Ruby and Java.
+matching in Python, Ruby and Java (Version 8).
\begin{center}
\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
@@ -211,7 +211,7 @@
axis lines=left,
width=5.5cm,
height=4.5cm,
- legend entries={Python, Java},
+ legend entries={Python, Java 8},
legend pos=north west,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
@@ -273,7 +273,7 @@
\end{center}
\noindent
-A similar problem also occured in the Atom editor:
+A similar problem also occurred in the Atom editor:
\begin{center}
\url{http://davidvgalbraith.com/how-i-fixed-atom/}
@@ -283,14 +283,15 @@
Such troublesome regular expressions are sometimes called \emph{evil
regular expressions} because they have the potential to make regular
expression matching engines topple over, like in Python, Ruby and
-Java. This ``toppling over'' is also sometimes called \emph{catastrophic
- backtracking}. The problem with evil regular expressions is that
-they can have some serious consequences, for example, if you use them
-in your web-application. The reason is that hackers can look for these
-instances where the matching engine behaves badly and then mount a
-nice DoS-attack against your application. These attacks are already
-have their own name: \emph{Regular Expression Denial of Service
- Attacks (ReDoS)}.
+Java. This ``toppling over'' is also sometimes called
+\emph{catastrophic backtracking}. I have also seen the term
+\emph{eternal matching} used for this. The problem with evil regular
+expressions is that they can have some serious consequences, for
+example, if you use them in your web-application. The reason is that
+hackers can look for these instances where the matching engine behaves
+badly and then mount a nice DoS-attack against your application. These
+attacks already have their own name: \emph{Regular Expression
+ Denial of Service Attacks (ReDoS)}.
It will be instructive to look behind the ``scenes'' to find
out why Python and Ruby (and others) behave so badly when
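This ``toppling over'' is easy to reproduce on the JVM; a minimal Scala sketch using Java's built-in backtracking matcher (the timing helper is my own, modelled on the time_needed function that appears later in the course code):

  // how long Java's backtracking matcher needs for (a*)*b on a^n
  def time(code: => Boolean) : Double = {
    val start = System.nanoTime()
    code
    (System.nanoTime() - start) / 1.0e9
  }

  for (n <- 5 to 25 by 5)
    println(n + " a's: " + time(("a" * n).matches("(a*)*b")) + " secs")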
@@ -405,23 +406,23 @@
will be helpful. For example we will write $(r_1 + r_2)^*$,
which is different from, say $r_1 + (r_2)^*$. The former means
roughly zero or more times $r_1$ or $r_2$, while the latter
-means $r_1$ or zero or more times $r_2$. This will turn out to
+means $r_1$, or zero or more times $r_2$. This will turn out to
be two different patterns, which match in general different
strings. We should also write $(r_1 + r_2) + r_3$, which is
different from the regular expression $r_1 + (r_2 + r_3)$, but
in case of $+$ and $\cdot$ we actually do not care about the
order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2
-\cdot r_3$, respectively. The reasons for this will become
+\cdot r_3$, respectively. The reasons for this carelessness will become
clear shortly.
In the literature you will often find that the choice $r_1 +
r_2$ is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also,
often our $\ZERO$ and $\ONE$ are written $\varnothing$ and
$\epsilon$, respectively. Following the convention in the
-literature, we will often omit the $\cdot$ all together. This
+literature, we will often omit the $\cdot$. This
is to make some concrete regular expressions more readable.
For example the regular expression for email addresses shown
-in \eqref{email} would look like
+in \eqref{email} would, fully expanded, look like
\[
\texttt{[...]+} \;\cdot\; \texttt{@} \;\cdot\;
@@ -559,7 +560,8 @@
\]
\noindent That means two regular expressions are said to be
-equivalent if they match the same set of strings. Therefore we
+equivalent if they match the same set of strings. That is,
+their meaning is the same. Therefore we
do not really distinguish between the different regular
expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
because they are equivalent. I leave to you the question
@@ -584,7 +586,7 @@
was really a long time ago.} I got utterly confused about $\ONE$
(which my lecturer wrote as $\epsilon$) and the empty string (which he
also wrote as $\epsilon$).\footnote{Obviously the lecturer must have
- been bad ;o)} Since my then, I have used regular expressions for
+ been bad ;o)} Since then, I have used regular expressions for
implementing lexers and parsers as I have always been interested in
all kinds of programming languages and compilers, which invariably
need regular expressions in some form or shape.
@@ -596,7 +598,7 @@
them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
provers are systems in which you can formally reason about
mathematical concepts, but also about programs. In this way
-theorem provers can help with the manacing problem of writing bug-free code. Theorem provers have
+theorem provers can help with the menacing problem of writing bug-free code. Theorem provers have
proved already their value in a number of cases (even in
terms of hard cash), but they are still clunky and difficult
to use by average programmers.
@@ -607,7 +609,7 @@
expressions, not needing any automata (you will see we only touch
briefly on automata in lecture 3). Automata are usually the main
object of study in formal language courses. The avoidance of automata
-is crucial because automata are graphs and it is rather difficult to
+is crucial for me because automata are graphs and it is rather difficult to
reason about graphs in theorem provers. In contrast, reasoning about
regular expressions is easy-peasy in theorem provers. Is this
important? I think yes, because according to Kuklewicz nearly all
@@ -629,20 +631,21 @@
theorem we can show that regular languages are closed under
complementation, something which Gasarch in his
blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be
-shown via automata. Even sombody who has written a 700+-page
+shown via automata. So even somebody who has written a 700+-page
book\footnote{\url{http://goo.gl/fD0eHx}} on regular
-exprssions did not know better. Well, we showed it can also be
+expressions did not know better. Well, we showed it can also be
done with regular expressions only.\footnote{\url{http://nms.kcl.ac.uk/christian.urban/Publications/posix.pdf}}
What a feeling when you are an outsider to the subject!
To conclude: Despite my early ignorance about regular expressions, I
-find them now very interesting. They have a beautiful mathematical
-theory behind them, which can be sometimes quite deep and which
-sometimes contains hidden snares. They have practical importance
+find them now extremely interesting. They have practical importance
(remember the shocking runtime of the regular expression matchers in
Python, Ruby and Java in some instances and the problems in Stack
-Exchange and the Atom editor). People who are not very familiar with
-the mathematical background of regular expressions get them
+Exchange and the Atom editor). They are used in tools like Snort and
+Bro in order to monitor network traffic. They have a beautiful mathematical
+theory behind them, which can sometimes be quite deep and which
+sometimes contains hidden snares. People who are not very familiar
+with the mathematical background of regular expressions get them
consistently wrong (this is surprising given they are supposed to be a
core skill for computer scientists). The hope is that we can do better
in the future---for example by proving that the algorithms actually
@@ -653,7 +656,7 @@
expressions have their shortcomings. There are some well-known
``theoretical'' shortcomings, for example recognising strings of the
form $a^{n}b^{n}$ is not possible with regular expressions. This means
-for example if we try to regognise whether parentheses are well-nested
+for example if we try to recognise whether parentheses are well-nested
in an expression is impossible with (basic) regular expressions. I am
not so bothered by these shortcomings. What I am bothered about is
when regular expressions are in the way of practical programming. For
@@ -702,7 +705,7 @@
\begin{figure}[p]\small
- \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+ \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/crawler1.scala}
\caption{The Scala code for a simple web-crawler that checks
@@ -716,7 +719,7 @@
\begin{figure}[p]\small
- \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+ \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/crawler2.scala}
\caption{A version of the web-crawler that only follows links
@@ -731,7 +734,7 @@
\end{figure}
\begin{figure}[p]\small
- \lstinputlisting[linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+ \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/crawler3.scala}
\caption{A small email harvester---whenever we download a
Binary file handouts/ho02.pdf has changed
--- a/handouts/ho02.tex Sat May 05 10:31:00 2018 +0100
+++ b/handouts/ho02.tex Fri Jun 01 15:28:37 2018 +0100
@@ -27,8 +27,8 @@
\begin{center}
Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
\begin{tabular}{@{}cc@{}}
-\begin{tikzpicture}
-\begin{axis}[
+\begin{tikzpicture}[baseline=(current bounding box.north)]
+ \begin{axis}[
xlabel={$n$},
x label style={at={(1.05,0.0)}},
ylabel={time in secs},
@@ -41,7 +41,7 @@
axis lines=left,
width=5cm,
height=5cm,
- legend entries={Java, Python},
+ legend entries={Java 8, Python},
legend pos=north west,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
@@ -49,7 +49,7 @@
\end{axis}
\end{tikzpicture}
&
-\begin{tikzpicture}
+\begin{tikzpicture}[baseline=(current bounding box.north)]
\begin{axis}[
xlabel={$n$},
x label style={at={(1.1,0.0)}},
@@ -745,9 +745,9 @@
ymax=35,
ytick={0,5,...,30},
axis lines=left,
- %scaled ticks=false,
+ %%scaled ticks=false,
x label style={at={(1.09,0.0)}},
- %xmax=7700000,
+ %%xmax=7700000,
width=9cm,
height=5cm,
legend entries={Scala V3},
@@ -761,7 +761,7 @@
\subsection*{Epilogue}
-(23/Aug/2016) I recently found another place where this algorithm can
+(23/Aug/2016) I found another place where this algorithm can
be sped up (this idea is not integrated with what is coming next, but
I present it nonetheless). The idea is to define \texttt{ders}
so that it iterates the derivative not character-by-character, but in bigger
Binary file handouts/notation.pdf has changed
--- a/handouts/notation.tex Sat May 05 10:31:00 2018 +0100
+++ b/handouts/notation.tex Fri Jun 01 15:28:37 2018 +0100
@@ -5,7 +5,7 @@
\begin{document}
-\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
+\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018}
\section*{A Crash-Course on Notation}
@@ -89,8 +89,8 @@
\noindent
However, we do not want to do this kind of explicit coercion in our
pencil-and-paper, everyday arguments. So in our (mathematical)
-definitions we regard strings as lists of characters, we will also
-write \dq{$hello$} as
+definitions we regard strings as lists of characters and we will also
+write \dq{$hello$} as the list
\[
[\text{\it h, e, l, l, o}] \qquad\text{or simply}\qquad \textit{hello}
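In Scala the coercion is explicit, which makes the list reading of strings easy to see (a one-line illustration, not part of the handout's code):

  "hello".toList            // List(h, e, l, l, o)
  "hello".toList.mkString   // back to "hello"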
@@ -114,7 +114,7 @@
\dq{\textit{foo}} $@$ \dq{\textit{bar}}, gives
\dq{\textit{foobar}}. But as said above, we will often
simplify our life and just drop the double quotes whenever it
-is clear we are talking about strings, So we will often just
+is clear we are talking about strings. So we will often just
write \textit{foo}, \textit{bar}, \textit{foobar} or
\textit{foo $@$ bar}.
@@ -153,25 +153,29 @@
the set of all natural numbers $\{0, 1, 2, \ldots\}$ is infinite. This
set is often also abbreviated as $\mathbb{N}$. Lists can be very large, but they cannot contain infinitely many elements.
-We can define sets by giving all elements, for example $\{0, 1\}$ for
-the set containing just $0$ and $1$, but also by \defn{set
- comprehensions}. For example the set of all even natural numbers can
-be defined as
+We can define sets by giving all their elements, for example $\{0, 1\}$ for
+the set containing just $0$ and $1$. But often we need to use \defn{set
+ comprehensions} to define sets. For example the set of all \emph{even}
+natural numbers can be defined as
\[
\{n\;|\;n\in\mathbb{N} \wedge n\;\text{is even}\}
\]
-\noindent Though silly, but the set $\{0, 1, 2\}$ could also be
+\noindent Set comprehensions consist of a ``base set'' (in this case
+all the natural numbers) and a predicate (here evenness).
+
+Though silly, the set $\{0, 1, 2\}$ could also be
defined by the following set comprehension
\[
-\{n\;|\; n^2 < 9 \wedge n \in \mathbb{N}\}
+\{n\;|\; n \in \mathbb{N} \wedge n^2 < 9\}
\]
\noindent Can you see why this defines the set $\{0, 1, 2\}$? Notice
-that set comprehensions could be used to define set union,
-intersection and difference:
+that set comprehensions are quite powerful constructions. For example they
+could be used to define set union,
+set intersection and set difference:
\begin{eqnarray*}
A \cup B & \dn & \{x\;|\; x \in A \vee x \in B\}\\
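These definitions carry over directly to Scala's sets; a small sketch with a finite base set (we obviously cannot range over all of $\mathbb{N}$ in code):

  // {n | n in 0..20 and n even} over a finite base set
  val evens = (0 to 20).filter(n => n % 2 == 0).toSet

  val A = Set(0, 1, 2)
  val B = Set(2, 3)
  A union B       // Set(0, 1, 2, 3)
  A intersect B   // Set(2)
  A diff B        // Set(0, 1)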
@@ -208,17 +212,18 @@
\noindent but using the big union notation is more concise.
-As an aside: While this stuff about sets might all look trivial or even needlessly
-pedantic, \emph{Nature} is never simple. If you want to be amazed how
-complicated sets can get, watch out for the last lecture just before
-Christmas where I want to convince you of the fact that some sets are
-more infinite than others. Yes, you read correctly, there can be sets
-that are ``more infinite'' then others. If you think this is obvious:
-say you have the infinite set $\mathbb{N}\backslash\{0\} = \{1, 2, 3, 4, \ldots\}$ which is all
-the natural numbers except $0$, and then compare it to the set
-$\{0, 1, 2, 3, 4, \ldots\}$ which contains the $0$. If you think,
-the second must be more infinite\ldots{} well, then think again. Because the two
-infinite sets
+As an aside: While this stuff about sets might all look trivial or
+even needlessly pedantic, \emph{Nature} is never simple. If you want
+to be amazed how complicated sets can get, watch out for the last
+lecture just before Christmas where I want to convince you of the fact
+that some sets are more infinite than other sets. Yes, you read
+correctly, there can be sets that are ``more infinite'' than
+others. If you think this is obvious: say you have the infinite set
+$\mathbb{N}\backslash\{0\} = \{1, 2, 3, 4, \ldots\}$ which is all the
+natural numbers except $0$, and then compare it to the set
+$\{0, 1, 2, 3, 4, \ldots\}$ which contains the $0$ and all other
+numbers. If you think the second must be more infinite\ldots{} well,
+then think again. Because the two infinite sets
\begin{center}
$\{1, 2, 3, 4, \ldots\}$ and
@@ -226,12 +231,12 @@
\end{center}
\noindent
-contain actually the same number of elements. Does this make sense?
-Though this might all look strange this about infinite sets will be a
+actually contain the same number of elements. Does this make sense?
+Though this might all look strange, infinite sets will be a
topic that is very relevant to the material of this module. It tells
us what we can compute with a computer (actually algorithm) and what
we cannot. But during the first 9 lectures we can go by without this
-``weird'' stuff.
+``weird'' stuff. End of aside.\smallskip
Another important notion in this module is that of \defn{languages}, which
are sets of strings. One of the main goals for us will be how to
@@ -302,7 +307,7 @@
\end{equation}
\noindent This star operation is often also called
-\emph{Kleene-star}. Unfolding the definition in \eqref{star}
+\emph{Kleene-star}. Unfolding the definition in~\eqref{star}
gives
\[
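Since the star in \eqref{star} is defined as the union of the powers $A^n$, a bounded version is easy to sketch in Scala (pow, star and the bound are my hypothetical names, for illustration only):

  // A^n: all strings built from n concatenations of strings in A
  def pow(A: Set[String], n: Int) : Set[String] =
    if (n == 0) Set("") else for (s <- A; t <- pow(A, n - 1)) yield s + t

  // A* cut off at a given bound on n
  def star(A: Set[String], bound: Int) : Set[String] =
    (0 to bound).toSet.flatMap((n: Int) => pow(A, n))

  star(Set("a", "b"), 2)  // Set("", a, b, aa, ab, ba, bb)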
Binary file hws/hw01.pdf has changed
--- a/hws/hw01.tex Sat May 05 10:31:00 2018 +0100
+++ b/hws/hw01.tex Fri Jun 01 15:28:37 2018 +0100
@@ -22,17 +22,19 @@
\item {\bf (Optional)} Have a look at the crawler programs.
Can you find a usage for them in your daily programming
- life? Can you improve them? (For example in cases there
+ life? Can you improve them? For example, in cases there
are links that appear on different recursion levels, the
crawlers visit such web-pages several times. Can this be
- avoided?)
+ avoided? Also, the crawlers flag as problematic any page
+ that gives an error, but probably only 404 Not Found
+ errors should be flagged. Can you change that?)
\item Read the handout of the first lecture and the handout
about notation. Make sure you understand the concepts of
strings and languages. In the context of the CFL-course,
what is meant by the term \emph{language}?
- \item Give the definition for regular expressions---this is an
+\item Give the definition for regular expressions---this is an
inductive datatype. What is the
meaning of a regular expression? (Hint: The meaning is
defined recursively.)
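For the 404 part of the crawler question, one possible approach (a hypothetical sketch, not the intended solution) is to inspect the HTTP status code before flagging a page:

  import java.net.{HttpURLConnection, URL}

  // flag a page only if the server really answers 404 Not Found
  def is404(url: String) : Boolean = {
    val conn = new URL(url).openConnection.asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("HEAD")
    conn.getResponseCode == 404
  }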
--- a/progs/crawler1.scala Sat May 05 10:31:00 2018 +0100
+++ b/progs/crawler1.scala Fri Jun 01 15:28:37 2018 +0100
@@ -20,7 +20,6 @@
def get_all_URLs(page: String) : Set[String] =
http_pattern.findAllIn(page).map(unquote).toSet /*@\label{findallline}@*/
-
// naive version of crawl - searches until a given depth,
// visits pages potentially more than once
def crawl(url: String, n: Int) : Unit = {
@@ -32,9 +31,8 @@
}
// some starting URLs for the crawler
-val startURL = """http://www.inf.kcl.ac.uk/staff/urbanc"""
-//val startURL = """http://www.inf.kcl.ac.uk/staff/mcburney"""
+val startURL = """https://nms.kcl.ac.uk/christian.urban/"""
+//val startURL = """https://nms.kcl.ac.uk/luc.moreau/"""
+crawl(startURL, 3)
-crawl(startURL, 2)
-
--- a/progs/crawler2.scala Sat May 05 10:31:00 2018 +0100
+++ b/progs/crawler2.scala Fri Jun 01 15:28:37 2018 +0100
@@ -13,7 +13,8 @@
// regexes for URLs and "my" domain
val http_pattern = """"https?://[^"]*"""".r
-val my_urls = """urbanc""".r /*@\label{myurlline}@*/
+val my_urls = """urban""".r /*@\label{myurlline}@*/
+//val my_urls = """kcl.ac.uk""".r
def unquote(s: String) = s.drop(1).dropRight(1)
@@ -28,15 +29,16 @@
} /*@\label{changeendline}@*/
else {
println(s"Visiting: $n $url")
- for (u <- get_all_URLs(get_page(url))) crawl(u, n - 1)
+ for (u <- get_all_URLs(get_page(url)).par) crawl(u, n - 1)
}
}
// starting URL for the crawler
-val startURL = """http://www.inf.kcl.ac.uk/staff/urbanc"""
-val startURL = """http://www.inf.kcl.ac.uk/staff/urbanc/bsc-projects-16.html"""
+val startURL = """https://nms.kcl.ac.uk/christian.urban/"""
+//val startURL = """https://nms.kcl.ac.uk/christian.urban/bsc-projects-17.html"""
+
// can now deal with depth 3 and beyond
-crawl(startURL, 2)
+crawl(startURL, 3)
--- a/progs/crawler3.scala Sat May 05 10:31:00 2018 +0100
+++ b/progs/crawler3.scala Fri Jun 01 15:28:37 2018 +0100
@@ -25,7 +25,7 @@
def crawl(url: String, n: Int) : Unit = {
if (n == 0) ()
else {
- println(s"Visiting: $n $url")
+ println(s" Visiting: $n $url")
val page = get_page(url)
print_str(email_pattern.findAllIn(page).mkString("\n")) /*@\label{mainline}@*/
for (u <- get_all_URLs(page).par) crawl(u, n - 1)
@@ -33,6 +33,7 @@
}
// starting URL for the crawler
-val startURL = """http://www.inf.kcl.ac.uk/staff/urbanc"""
+val startURL = """https://nms.kcl.ac.uk/christian.urban/"""
+
crawl(startURL, 3)
--- a/progs/cw1.scala Sat May 05 10:31:00 2018 +0100
+++ b/progs/cw1.scala Fri Jun 01 15:28:37 2018 +0100
@@ -232,17 +232,19 @@
TEST("/f&", COMMENT3)
TEST("/f& ", COMMENT3)
-
-
//test: ("a" | "aa") ~ ("a" | "aa")*
-for (i <- 1 to 100 by 1) {
- println(i + " " + "%.5f".format(time_needed(2, matcher(EVIL3, "a" * i))) + " size: " + size(ders(("a" * i).toList, EVIL3)))
-}
-
val auxEVIL3 = ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a')))
val EVIL3 = SEQ(auxEVIL3, STAR(auxEVIL3))
val EVIL3p = FROMNTIMES(auxEVIL3, 1)
+
+for (i <- 1 to 100 by 1) {
+ println(i + " " + "%.5f".format(time_needed(2, matcher(EVIL3, "a" * i))) + " size: " +
+ size(ders(("a" * i).toList, EVIL3)))
+}
+
+
+
val t1 = EVIL3
val t2 = simp(der('a', t1))
val t3 = simp(der('a', t2))
--- a/progs/re1.scala Sat May 05 10:31:00 2018 +0100
+++ b/progs/re1.scala Fri Jun 01 15:28:37 2018 +0100
@@ -50,10 +50,10 @@
der('b', r)
der('c', r)
-val r = SEQ(SEQ(CHAR('x'), CHAR('y')), CHAR('z'))
-der('x', r)
-der('y', der('x', r))
-der('z', der('y', der('x', r)))
+val r2 = SEQ(SEQ(CHAR('x'), CHAR('y')), CHAR('z'))
+der('x', r2)
+der('y', der('x', r2))
+der('z', der('y', der('x', r2)))
//optional regular expression (one or zero times)
--- a/progs/re3.scala Sat May 05 10:31:00 2018 +0100
+++ b/progs/re3.scala Fri Jun 01 15:28:37 2018 +0100
@@ -13,6 +13,32 @@
case class NTIMES(r: Rexp, n: Int) extends Rexp
+// string of a regular expression - for testing purposes
+def string(r: Rexp) : String = r match {
+ case ZERO => "0"
+ case ONE => "1"
+ case CHAR(c) => c.toString
+ case ALT(r1, r2) => s"(${string(r1)} + ${string(r2)})"
+ case SEQ(CHAR(c), CHAR(d)) => s"${c}${d}"
+ case SEQ(ONE, CHAR(c)) => s"1${c}"
+ case SEQ(r1, r2) => s"(${string(r1)} ~ ${string(r2)})"
+ case STAR(r) => s"(${string(r)})*"
+ case NTIMES(r, n) => s"(${string(r)}){${n}}"
+}
+
+// size of a regular expression - for testing purposes
+def size(r: Rexp) : Int = r match {
+ case ZERO => 1
+ case ONE => 1
+ case CHAR(_) => 1
+ case ALT(r1, r2) => 1 + size(r1) + size(r2)
+ case SEQ(r1, r2) => 1 + size(r1) + size(r2)
+ case STAR(r) => 1 + size(r)
+ case NTIMES(r, _) => 1 + size(r)
+}
+
+
+
// nullable function: tests whether the regular
// expression can recognise the empty string
def nullable (r: Rexp) : Boolean = r match {
@@ -39,27 +65,38 @@
if (i == 0) ZERO else SEQ(der(c, r), NTIMES(r, i - 1))
}
-def simp(r: Rexp) : Rexp = r match {
- case ALT(r1, r2) => (simp(r1), simp(r2)) match {
- case (ZERO, r2s) => r2s
- case (r1s, ZERO) => r1s
- case (r1s, r2s) => if (r1s == r2s) r1s else ALT (r1s, r2s)
+
+
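+// simp below threads through a set of (sub)expressions that have
+// already been seen; a duplicate is simplified to ZERO, which prunes
+// repeated summands in nested alternatives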
+def simp(r: Rexp, seen: Set[Rexp]) : (Rexp, Set[Rexp]) = r match {
+ case ALT(r1, r2) => {
+ val (r1s, seen1) = simp(r1, seen)
+ val (r2s, seen2) = simp(r2, seen1)
+ (r1s, r2s) match {
+ case (ZERO, r2s) => (r2s, seen2)
+ case (r1s, ZERO) => (r1s, seen2)
+ case (r1s, r2s) => (ALT(r1s, r2s), seen2)
+ }
}
- case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
- case (ZERO, _) => ZERO
- case (_, ZERO) => ZERO
- case (ONE, r2s) => r2s
- case (r1s, ONE) => r1s
- case (r1s, r2s) => SEQ(r1s, r2s)
+ case SEQ(r1, r2) => {
+ val (r1s, _) = simp(r1, Set())
+ val (r2s, _) = simp(r2, Set())
+ if (seen.contains(SEQ(r1s, r2s))) (ZERO, seen)
+ else (r1s, r2s) match {
+ case (ZERO, _) => (ZERO, seen)
+ case (_, ZERO) => (ZERO, seen)
+ case (ONE, r2s) => (r2s, seen + r2s)
+ case (r1s, ONE) => (r1s, seen + r1s)
+ case (r1s, r2s) => (SEQ(r1s, r2s), seen + SEQ(r1s, r2s))
+ }
}
- case r => r
+ case r => if (seen.contains(r)) (ZERO, seen) else (r, seen + r)
}
// derivative w.r.t. a string (iterates der)
def ders (s: List[Char], r: Rexp) : Rexp = s match {
case Nil => r
- case c::s => ders(s, simp(der(c, r)))
+ case c::s => ders(s, simp(der(c, r), Set())._1)
}
@@ -71,6 +108,75 @@
def OPT(r: Rexp) = ALT(r, ONE)
+
+
+
+
+
+
+
+def time_needed[T](i: Int, code: => T) = {
+ val start = System.nanoTime()
+ for (j <- 1 to i) code
+ val end = System.nanoTime()
+ (end - start)/(i * 1.0e9)
+}
+
+
+// star example
+val Tstar = STAR(STAR(STAR(CHAR('a'))))
+
+string(ders("".toList, Tstar))
+size(ders("".toList, Tstar)) // 4
+string(ders("a".toList, Tstar))
+size(ders("a".toList, Tstar)) // 11
+string(ders("aa".toList, Tstar))
+size(ders("aa".toList, Tstar)) // 11
+string(ders("aaa".toList, Tstar))
+size(ders("aaa".toList, Tstar)) // 11
+string(ders("aaaa".toList, Tstar))
+size(ders("aaaa".toList, Tstar)) // 11
+string(ders("aaaa".toList, Tstar))
+size(ders("aaaaa".toList, Tstar)) // 11
+string(ders("aaaab".toList, Tstar))
+size(ders("aaaaab".toList, Tstar)) // 1
+
+// test: ("a" | "aa")*
+val EVIL3 = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
+
+for (i <- 1 to 100 by 1) {
+ println(i + " " + "%.5f".format(time_needed(2, matcher(EVIL3, "a" * i))) +
+ " size: " + size(ders(("a" * i).toList, EVIL3)))
+}
+
+
+println("start " + string(EVIL3) + " " + size(EVIL3))
+val t1 = der('a', EVIL3)
+println(string(t1) + " " + size(t1))
+val t1s = simp(t1, Set())._1
+println("simplified " + string(t1s) + " " + size(t1s))
+val t2 = der('a', t1s)
+println(string(t2) + " " + size(t2))
+val t2s = simp(t2, Set())._1
+println("simplified " + string(t2s) + " " + size(t2s))
+val t3 = der('a', t2s)
+println(string(t3) + " " + size(t3))
+val t3s = simp(t3, Set())._1
+println("simplified " + string(t3s) + " " + size(t3s))
+val t4 = der('a', t3s)
+val t4s = simp(t4, Set())._1
+
+
+
+
+
+
+
+
+println(string(t4) + " " + size(t4))
+println("simplified " + string(t4s) + " " + size(t4s))
+
+
// Test Cases
//evil regular expressions: (a?){n} a{n} and (a*)* b
@@ -143,3 +249,8 @@
for (i <- 1 to 4501 by 500) {
println(i + ": " + "%.5f".format(time_needed(1, matcher(sulzmann, "ab" * i))))
}
+
+
+
+
+// leftover REPL output (commented out so that the file still compiles):
+// (((1 + 1a) ~ ((a + aa))*) + (((0 + 1) ~ ((a + aa))*) + ((1 + 1a) ~ ((a + aa))*)))
--- a/progs/re4.scala Sat May 05 10:31:00 2018 +0100
+++ b/progs/re4.scala Fri Jun 01 15:28:37 2018 +0100
@@ -112,3 +112,50 @@
+// size of a regular expression - for testing purposes
+def size(r: Rexp) : Int = r match {
+ case ZERO => 1
+ case ONE => 1
+ case CHAR(_) => 1
+ case ALT(r1, r2) => 1 + size(r1) + size(r2)
+ case SEQ(r1, r2) => 1 + size(r1) + size(r2)
+ case STAR(r) => 1 + size(r)
+ case NTIMES(r, _) => 1 + size(r)
+}
+
+
+// string of a regular expression - for testing purposes
+def string(r: Rexp) : String = r match {
+ case ZERO => "0"
+ case ONE => "1"
+ case CHAR(c) => c.toString
+ case ALT(r1, r2) => s"(${string(r1)} + ${string(r2)})"
+ case SEQ(CHAR(c), CHAR(d)) => s"${c}${d}"
+ case SEQ(r1, r2) => s"(${string(r1)} ~ ${string(r2)})"
+ case STAR(r) => s"(${string(r)})*"
+ case NTIMES(r, n) => s"(${string(r)}){${n}}"
+}
+
+
+// test: ("a" | "aa")*
+val EVIL3 = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
+
+//test: ("" | "a" | "aa")*
+val EVIL3 = STAR(ALT(ONE, ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a')))))
+
+val t1 = ders2("a".toList, EVIL3)
+val t1s = simp(t1)
+val t2 = ders2("aa".toList, t1s)
+val t2s = simp(t2)
+val t3 = ders2("aaa".toList, t2s)
+val t3s = simp(t3)
+val t4 = ders2("aaaa".toList, t3s)
+val t4s = simp(t4)
+println(string(t1) + " " + size(t1))
+println("s " + string(t1s) + " " + size(t1s))
+println(string(t2) + " " + size(t2))
+println("s " + string(t2s) + " " + size(t2s))
+println(string(t3) + " " + size(t3))
+println("s " + string(t3s) + " " + size(t3s))
+println(string(t4) + " " + size(t4))
+println("s " + string(t4s) + " " + size(t4s))
--- a/progs/token.scala Sat May 05 10:31:00 2018 +0100
+++ b/progs/token.scala Fri Jun 01 15:28:37 2018 +0100
@@ -105,6 +105,15 @@
case c::s => ders(s, der(c, r))
}
+val test : Rexp = STAR("a" | "aa")
+size(test)
+size(der('a', test))
+size(der('a', der('a', test)))
+
+size(ders("aaaaaa".toList, test))
+string(ders("aaaaaa".toList, test))
+
+
// extracts a string from value
def flatten(v: Val) : String = v match {
case Empty => ""
Binary file slides/slides05.pdf has changed
--- a/slides/slides05.tex Sat May 05 10:31:00 2018 +0100
+++ b/slides/slides05.tex Fri Jun 01 15:28:37 2018 +0100
@@ -478,6 +478,25 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{frame}[c]
+\frametitle{What Parsing is Not}
+
+Usually parsing does not check semantic correctness, e.g.
+
+\begin{itemize}
+\item whether a function is used before it
+  is defined
+\item whether a function is called with the correct number of
+  arguments and whether they are of the correct type
+\item whether a variable is declared twice in a scope
+\end{itemize}
+
+\end{frame}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
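+% A tiny illustration (my hypothetical example, not on the slide): the
+% following Scala snippet parses fine, but fails the semantic checks above:
+%   def double(x: Int) : Int = x * 2
+%   double(3, 4)       // wrong number of arguments
+%   double("three")    // wrong argument type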
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\begin{frame}[c]
\frametitle{Regular Languages}
While regular expressions are very useful for lexing, there is