handouts/ho01.tex
changeset 924 6ad0f63e1968
parent 905 15973df32613
child 925 ddb521b57e0c
equal deleted inserted replaced
923:437e4f8d35d8 924:6ad0f63e1968
    40 % 10 MOST(LY DEAD) INFLUENTIAL PROGRAMMING LANGUAGES
    40 % 10 MOST(LY DEAD) INFLUENTIAL PROGRAMMING LANGUAGES
    41 % https://www.hillelwayne.com/post/influential-dead-languages/
    41 % https://www.hillelwayne.com/post/influential-dead-languages/
    42 
    42 
    43 
    43 
    44 \begin{document}
    44 \begin{document}
    45 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021}
    45 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2023}
    46 
    46 
    47 \section*{Handout 1}
    47 \section*{Handout 1}
    48 
    48 
    49 The purpose of a compiler is to transform a program a human can read and
    49 The purpose of a compiler is to transform a program a human can read and
    50 write into code machines can run as fast as possible. Developing a
    50 write into code machines can run as fast as possible. Developing a
    51 compiler is an old craft going back to 1952 with the first compiler
    51 compiler is an old craft going back to 1952 with the first compiler
    52 written by Grace Hopper.\footnote{Who many years ago was invited on a
    52 written by Grace Hopper.\footnote{Who many years ago was invited on a
    53 talk show hosted by David Letterman.
    53 talk show hosted by David Letterman.
    54 \here{https://youtu.be/3N_ywhx6_K0?t=31}} Why studying compilers
    54 \here{https://youtu.be/3N_ywhx6_K0?t=31}} Why studying compilers
    55 nowadays?  An interesting answer is given by John Regher in his compiler
    55 nowadays?  An interesting answer is given by John Regehr in his compiler
    56 blog:\here{http://blog.regehr.org/archives/1419}
    56 blog:\here{http://blog.regehr.org/archives/1419}
    57 
    57 
    58 \begin{quote}\it{}``We can start off with a couple of observations
    58 \begin{quote}\it{}``We can start off with a couple of observations
    59   about the role of compilers. First, hardware is getting weirder
    59   about the role of compilers. First, hardware is getting weirder
    60   rather than getting clocked faster: almost all processors are
    60   rather than getting clocked faster: almost all processors are
   203 \pcode{(re)}	& groups regular expressions and remembers 
   203 \pcode{(re)}	& groups regular expressions and remembers 
   204 matched text
   204 matched text
   205 \end{tabular}
   205 \end{tabular}
   206 \end{center}
   206 \end{center}
   207 
   207 
   208 \noindent With this table you can figure out the purpose of the regular
   208 The syntax is pretty universal and can be found in many regular
   209 expressions in the web-crawlers shown Figures \ref{crawler1} and
   209 expression libraries. If you need a quick recap about regular
   210 \ref{crawler3}. In Figure~\ref{crawler1}, however, be careful with
   210 expressions and how the match strings, here is a quick video:
   211 the regular expression for http-addresses in Line~\ref{httpline}. It is
   211 \url{https://youtu.be/bgBWp9EIlMM}.
   212 intended to be
       
   213 
       
   214 \[
       
   215 \pcode{"https?://[^"]*"}
       
   216 \]
       
   217 
       
   218 \noindent specifying that http-addresses need to start with a double
       
   219 quote, then comes \texttt{http} followed by an optional \texttt{s} and
       
   220 so on\ldots{}until the closing double quote comes at the end of the
       
   221 address. Normally we would have to escape the double quotes in order to
       
   222 make sure we interpret the double quote as character, not as double
       
   223 quote for a string. But Scala's trick with triple quotes allows us to
       
   224 omit this kind of ugly escaping. As a result we can just write:
       
   225 
       
   226 \[
       
   227 \pcode{""""https?://[^"]*"""".r}
       
   228 \]
       
   229 
       
   230 \noindent The convention in Scala is that \texttt{.r} converts a string
       
   231 into a regular expression. I leave it to you to ponder whether this
       
   232 regular expression really captures all possible web-addresses. If you
       
   233 need a quick recap about regular expressions and how the match strings,
       
   234 here is a quick video: \url{https://youtu.be/bgBWp9EIlMM}.
       
   235 
   212 
   236 \subsection*{Why Study Regular Expressions?}
   213 \subsection*{Why Study Regular Expressions?}
   237 
   214 
   238 Regular expressions were introduced by Kleene in the 1950ies and they
   215 Regular expressions were introduced by Kleene in the 1950ies and they
   239 have been object of intense study since then. They are nowadays pretty
   216 have been object of intense study since then. They are nowadays pretty
   241 implementing regular expressions. I am sure you have come across them
   218 implementing regular expressions. I am sure you have come across them
   242 before (remember the PRA or PEP modules?). 
   219 before (remember the PRA or PEP modules?). 
   243 
   220 
   244 Why on earth then is there any interest in studying them again in depth
   221 Why on earth then is there any interest in studying them again in depth
   245 in this module? Well, one answer is in the following two graphs about
   222 in this module? Well, one answer is in the following two graphs about
   246 regular expression matching in Python, Ruby, JavaScript and Java
   223 regular expression matching in Python, Ruby, JavaScript, Swift and Java
   247 (Version 8).
   224 (Version 8).
   248 
   225 
   249 \begin{center}
   226 \begin{center}
   250 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1.5mm}}c@{}}
   227 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1.5mm}}c@{}}
   251 \begin{tikzpicture}
   228 \begin{tikzpicture}
   262     ytick={0,5,...,30},
   239     ytick={0,5,...,30},
   263     scaled ticks=false,
   240     scaled ticks=false,
   264     axis lines=left,
   241     axis lines=left,
   265     width=5.5cm,
   242     width=5.5cm,
   266     height=4.5cm, 
   243     height=4.5cm, 
   267     legend entries={Python, Java 8, JavaScript},  
   244     legend entries={Python, Java 8, JavaScript, Swift},  
   268     legend pos=north west,
   245     legend pos=north west,
   269     legend cell align=left]
   246     legend cell align=left]
   270 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
   247 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
   271 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
   248 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
   272 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
   249 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
       
   250 \addplot[magenta,mark=*, mark options={fill=white}] table {re-swift.data};
   273 \end{axis}
   251 \end{axis}
   274 \end{tikzpicture}
   252 \end{tikzpicture}
   275 &
   253 &
   276 \begin{tikzpicture}
   254 \begin{tikzpicture}
   277 \begin{axis}[
   255 \begin{axis}[
   297 \end{axis}
   275 \end{axis}
   298 \end{tikzpicture}
   276 \end{tikzpicture}
   299 \end{tabular}
   277 \end{tabular}
   300 \end{center}
   278 \end{center}
   301 
   279 
   302 \noindent The first graph shows that Python, JavaScript and Java 8 need
   280 \noindent The first graph shows that Python, JavaScript, Swift and Java 8 need
   303 approximately 30 seconds to find out that the regular expression
   281 approximately 30 seconds to find out that the regular expression
   304 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly,
   282 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly,
   305 the second shows that Python and Ruby need approximately 29 seconds for finding
   283 the second shows that Python and Ruby need approximately 29 seconds for finding
   306 out whether a string of 28 \texttt{a}s matches the regular expression
   284 out whether a string of 28 \texttt{a}s matches the regular expression
   307 \texttt{a?\{28\}\,a\{28\}}.\footnote{In this
   285 \texttt{a?\{28\}\,a\{28\}}.\footnote{In this
   336 \noindent
   314 \noindent
   337 and also when somebody tried to match web-addresses using a 
   315 and also when somebody tried to match web-addresses using a 
   338 relatively simple regular expression
   316 relatively simple regular expression
   339 
   317 
   340 \begin{center}
   318 \begin{center}
   341 \url{https://www.tutorialdocs.com/article/regex-trap.html}
   319 \url{https://archive.ph/W5Ogx#selection-141.1-141.36}
   342 \end{center}  
   320 \end{center}  
   343 
   321 
   344 \noindent
   322 \noindent
   345 Finally, on 2 July 2019 Cloudflare had a global outage because of a 
   323 Finally, on 2 July 2019 Cloudflare had a global outage because of a 
   346 regular expression (they had no outage for the 6 years before).  What
   324 regular expression (they had no outage for the 6 years before).  What
   694   regular expressions}. This notion allows one to do almost all
   672   regular expressions}. This notion allows one to do almost all
   695 calculations with regular expressions on the level of regular
   673 calculations with regular expressions on the level of regular
   696 expressions, not needing any automata (you will see we only touch
   674 expressions, not needing any automata (you will see we only touch
   697 briefly on automata in lecture 3). Automata are usually the main
   675 briefly on automata in lecture 3). Automata are usually the main
   698 object of study in formal language courses.  The avoidance of automata
   676 object of study in formal language courses.  The avoidance of automata
   699 is crucial for me because automata are graphs and it is rather difficult to
   677 is crucial for me because automata are graphs and it is rather
   700 reason about graphs in theorem provers. In contrast, reasoning about
   678 difficult to reason about graphs in theorem provers. In contrast,
   701 regular expressions is easy-peasy in theorem provers. Is this
   679 reasoning about regular expressions is easy-peasy in theorem
   702 important? I think yes, because according to Kuklewicz nearly all
   680 provers. Is this important? I think yes, because according to
   703 POSIX-based regular expression matchers are
   681 Kuklewicz nearly all POSIX-based regular expression matchers are
   704 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
   682 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
   705 With my PhD student Fahad Ausaf I proved the correctness for one such
   683 With my PhD students Fahad Ausaf and Chengsong Tan, I proved the
   706 matcher that was proposed by Sulzmann and Lu in
   684 correctness for two such matchers that were proposed by Sulzmann and Lu
   707 2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can prove that
   685 in 2014.\footnote{\url{http://goo.gl/bz0eHp}} A variant of which you have
   708 the machine code(!)  that implements this code efficiently is correct
   686 already seen in PEP as CW3 and you will see again in the CFL in the first
   709 also. Writing programs in this way does not leave any room for
   687 two CWs. What we have not yet figured out that our matchers are
   710 potential errors or bugs. How nice!
   688 universally fast, meaning they do not explode on any input.
       
   689 Hopefully we can also prove
       
   690 that the machine code(!)  that implements our matchers efficiently is
       
   691 correct also. Writing programs in this way does not leave any room for
       
   692 any errors or bugs. How nice!
   711 
   693 
   712 What also helped with my fascination with regular expressions
   694 What also helped with my fascination with regular expressions
   713 is that we could indeed find out new things about them that
   695 is that we could indeed find out new things about them that
   714 have surprised some experts. Together with two colleagues from China, I was
   696 have surprised some experts. Together with two colleagues from China, I was
   715 able to prove the Myhill-Nerode theorem by only using regular
   697 able to prove the Myhill-Nerode theorem by only using regular
   719 complementation, something which Bill Gasarch in his Computational Complexity
   701 complementation, something which Bill Gasarch in his Computational Complexity
   720 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be
   702 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be
   721 shown via automata. So even somebody who has written a 700+-page
   703 shown via automata. So even somebody who has written a 700+-page
   722 book\footnote{\url{http://goo.gl/fD0eHx}} on regular
   704 book\footnote{\url{http://goo.gl/fD0eHx}} on regular
   723 expressions did not know better. Well, we showed it can also be
   705 expressions did not know better. Well, we showed it can also be
   724 done with regular expressions only.\footnote{\url{http://nms.kcl.ac.uk/christian.urban/Publications/posix.pdf}}
   706 done with regular expressions only.\footnote{\url{https://nms.kcl.ac.uk/christian.urban/Publications/rexp.pdf}}
   725 What a feeling when you are an outsider to the subject!
   707 What a feeling when you are an outsider to the subject!
   726 
   708 
   727 To conclude: Despite my early ignorance about regular expressions, I
   709 To conclude: Despite my early ignorance about regular expressions, I
   728 find them now extremely interesting. They have practical importance
   710 find them now extremely interesting. They have practical importance
   729 (remember the shocking runtime of the regular expression matchers in
   711 (remember the shocking runtime of the regular expression matchers in
   730 Python, Ruby and Java in some instances and the problems in Stack
   712 Python, Ruby, Swift and Java in some instances and the problems in Stack
   731 Exchange and the Atom editor). They are used in tools like Snort and
   713 Exchange and the Atom editor---even regex libraries in more modern programming languages, like Rust, have their problems). They are used in tools like Snort and
   732 Zeek in order to monitor network traffic. They have a beautiful mathematical
   714 Zeek in order to monitor network traffic. They have a beautiful mathematical
   733 theory behind them, which can be sometimes quite deep and which
   715 theory behind them, which can be sometimes quite deep and which
   734 sometimes contains hidden snares.  People who are not very familiar
   716 sometimes contains hidden snares.  People who are not very familiar
   735 with the mathematical background of regular expressions get them
   717 with the mathematical background of regular expressions get them
   736 consistently wrong (this is surprising given they are a supposed to be
   718 consistently wrong (this is surprising given they are a supposed to be
   789   best solution is to use regular expressions; now you have two
   771   best solution is to use regular expressions; now you have two
   790   problems.''
   772   problems.''
   791 \end{quote}  
   773 \end{quote}  
   792 
   774 
   793 
   775 
   794 \begin{figure}[p]\small
   776 %\begin{figure}[p]\small
   795   \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
   777 %  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
   796                   {../progs/crawler1.scala}
   778 %                  {../progs/crawler1.scala}
   797 
   779 %
   798 \caption{The Scala code for a simple web-crawler that checks
   780 %\caption{The Scala code for a simple web-crawler that checks
   799 for broken links in web-pages. It uses the regular expression
   781 %for broken links in web-pages. It uses the regular expression
   800 \texttt{http\_pattern} in Line~\ref{httpline} for recognising
   782 %\texttt{http\_pattern} in Line~\ref{httpline} for recognising
   801 URL-addresses. It finds all links using the library function
   783 %URL-addresses. It finds all links using the library function
   802 \texttt{findAllIn} in Line~\ref{findallline} (this function 
   784 %\texttt{findAllIn} in Line~\ref{findallline} (this function 
   803 is part of Scala's regular expression library).\label{crawler1}}
   785 %is part of Scala's regular expression library).\label{crawler1}}
   804 
   786 %
   805 \end{figure}
   787 %\end{figure}
   806 
   788 
   807  
   789  
   808 
   790 
   809 %\begin{figure}[p]\small
   791 %\begin{figure}[p]\small
   810 %  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
   792 %  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
   818 %Lines~\ref{changestartline}--\ref{changeendline} where there
   800 %Lines~\ref{changestartline}--\ref{changeendline} where there
   819 %is a test whether URL is in ``my'' domain or
   801 %is a test whether URL is in ``my'' domain or
   820 %not.\label{crawler2}}
   802 %not.\label{crawler2}}
   821 %\end{figure}
   803 %\end{figure}
   822 
   804 
   823 \begin{figure}[p]\small
   805 %\begin{figure}[p]\small
   824   \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
   806 %  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\co%lor{capri!3}\fi}]
   825                   {../progs/crawler2.scala}
   807 %                  {../progs/crawler2.scala}
   826 
   808 %
   827 \caption{A small email harvester---whenever we download a
   809 %\caption{A small email harvester---whenever we download a
   828 web-page, we also check whether it contains any email
   810 %web-page, we also check whether it contains any email
   829 addresses. For this we use the regular expression
   811 %addresses. For this we use the regular expression
   830 \texttt{email\_pattern} in Line~\ref{emailline}. The main
   812 %\texttt{email\_pattern} in Line~\ref{emailline}. The main
   831 change is in Line~\ref{mainline} where all email addresses
   813 %change is in Line~\ref{mainline} where all email addresses
   832 that can be found in a page are printed.\label{crawler3}}
   814 %that can be found in a page are printed.\label{crawler3}}
   833 
   815 %
   834 \end{figure}
   816 %\end{figure}
   835 
   817 
   836 \begin{figure}[p]
   818 \begin{figure}[p]
   837 \tiny
   819 \tiny
   838 \begin{center}
   820 \begin{center}
   839 \begin{minipage}{0.8\textwidth}
   821 \begin{minipage}{0.8\textwidth}