40 % 10 MOST(LY DEAD) INFLUENTIAL PROGRAMMING LANGUAGES |
40 % 10 MOST(LY DEAD) INFLUENTIAL PROGRAMMING LANGUAGES |
41 % https://www.hillelwayne.com/post/influential-dead-languages/ |
41 % https://www.hillelwayne.com/post/influential-dead-languages/ |
42 |
42 |
43 |
43 |
44 \begin{document} |
44 \begin{document} |
45 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021} |
45 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2023} |
46 |
46 |
47 \section*{Handout 1} |
47 \section*{Handout 1} |
48 |
48 |
49 The purpose of a compiler is to transform a program a human can read and |
49 The purpose of a compiler is to transform a program a human can read and |
50 write into code machines can run as fast as possible. Developing a |
50 write into code machines can run as fast as possible. Developing a |
51 compiler is an old craft going back to 1952 with the first compiler |
51 compiler is an old craft going back to 1952 with the first compiler |
52 written by Grace Hopper.\footnote{Who many years ago was invited on a |
52 written by Grace Hopper.\footnote{Who many years ago was invited on a |
53 talk show hosted by David Letterman. |
53 talk show hosted by David Letterman. |
54 \here{https://youtu.be/3N_ywhx6_K0?t=31}} Why studying compilers |
54 \here{https://youtu.be/3N_ywhx6_K0?t=31}} Why studying compilers |
55 nowadays? An interesting answer is given by John Regher in his compiler |
55 nowadays? An interesting answer is given by John Regehr in his compiler |
56 blog:\here{http://blog.regehr.org/archives/1419} |
56 blog:\here{http://blog.regehr.org/archives/1419} |
57 |
57 |
58 \begin{quote}\it{}``We can start off with a couple of observations |
58 \begin{quote}\it{}``We can start off with a couple of observations |
59 about the role of compilers. First, hardware is getting weirder |
59 about the role of compilers. First, hardware is getting weirder |
60 rather than getting clocked faster: almost all processors are |
60 rather than getting clocked faster: almost all processors are |
203 \pcode{(re)} & groups regular expressions and remembers |
203 \pcode{(re)} & groups regular expressions and remembers |
204 matched text |
204 matched text |
205 \end{tabular} |
205 \end{tabular} |
206 \end{center} |
206 \end{center} |
207 |
207 |
208 \noindent With this table you can figure out the purpose of the regular |
208 The syntax is pretty universal and can be found in many regular |
209 expressions in the web-crawlers shown Figures \ref{crawler1} and |
209 expression libraries. If you need a quick recap about regular |
210 \ref{crawler3}. In Figure~\ref{crawler1}, however, be careful with |
210 expressions and how the match strings, here is a quick video: |
211 the regular expression for http-addresses in Line~\ref{httpline}. It is |
211 \url{https://youtu.be/bgBWp9EIlMM}. |
212 intended to be |
|
213 |
|
214 \[ |
|
215 \pcode{"https?://[^"]*"} |
|
216 \] |
|
217 |
|
218 \noindent specifying that http-addresses need to start with a double |
|
219 quote, then comes \texttt{http} followed by an optional \texttt{s} and |
|
220 so on\ldots{}until the closing double quote comes at the end of the |
|
221 address. Normally we would have to escape the double quotes in order to |
|
222 make sure we interpret the double quote as character, not as double |
|
223 quote for a string. But Scala's trick with triple quotes allows us to |
|
224 omit this kind of ugly escaping. As a result we can just write: |
|
225 |
|
226 \[ |
|
227 \pcode{""""https?://[^"]*"""".r} |
|
228 \] |
|
229 |
|
230 \noindent The convention in Scala is that \texttt{.r} converts a string |
|
231 into a regular expression. I leave it to you to ponder whether this |
|
232 regular expression really captures all possible web-addresses. If you |
|
233 need a quick recap about regular expressions and how the match strings, |
|
234 here is a quick video: \url{https://youtu.be/bgBWp9EIlMM}. |
|
235 |
212 |
236 \subsection*{Why Study Regular Expressions?} |
213 \subsection*{Why Study Regular Expressions?} |
237 |
214 |
238 Regular expressions were introduced by Kleene in the 1950ies and they |
215 Regular expressions were introduced by Kleene in the 1950ies and they |
239 have been object of intense study since then. They are nowadays pretty |
216 have been object of intense study since then. They are nowadays pretty |
241 implementing regular expressions. I am sure you have come across them |
218 implementing regular expressions. I am sure you have come across them |
242 before (remember the PRA or PEP modules?). |
219 before (remember the PRA or PEP modules?). |
243 |
220 |
244 Why on earth then is there any interest in studying them again in depth |
221 Why on earth then is there any interest in studying them again in depth |
245 in this module? Well, one answer is in the following two graphs about |
222 in this module? Well, one answer is in the following two graphs about |
246 regular expression matching in Python, Ruby, JavaScript and Java |
223 regular expression matching in Python, Ruby, JavaScript, Swift and Java |
247 (Version 8). |
224 (Version 8). |
248 |
225 |
249 \begin{center} |
226 \begin{center} |
250 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1.5mm}}c@{}} |
227 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1.5mm}}c@{}} |
251 \begin{tikzpicture} |
228 \begin{tikzpicture} |
262 ytick={0,5,...,30}, |
239 ytick={0,5,...,30}, |
263 scaled ticks=false, |
240 scaled ticks=false, |
264 axis lines=left, |
241 axis lines=left, |
265 width=5.5cm, |
242 width=5.5cm, |
266 height=4.5cm, |
243 height=4.5cm, |
267 legend entries={Python, Java 8, JavaScript}, |
244 legend entries={Python, Java 8, JavaScript, Swift}, |
268 legend pos=north west, |
245 legend pos=north west, |
269 legend cell align=left] |
246 legend cell align=left] |
270 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
247 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
271 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
248 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
272 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data}; |
249 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data}; |
|
250 \addplot[magenta,mark=*, mark options={fill=white}] table {re-swift.data}; |
273 \end{axis} |
251 \end{axis} |
274 \end{tikzpicture} |
252 \end{tikzpicture} |
275 & |
253 & |
276 \begin{tikzpicture} |
254 \begin{tikzpicture} |
277 \begin{axis}[ |
255 \begin{axis}[ |
297 \end{axis} |
275 \end{axis} |
298 \end{tikzpicture} |
276 \end{tikzpicture} |
299 \end{tabular} |
277 \end{tabular} |
300 \end{center} |
278 \end{center} |
301 |
279 |
302 \noindent The first graph shows that Python, JavaScript and Java 8 need |
280 \noindent The first graph shows that Python, JavaScript, Swift and Java 8 need |
303 approximately 30 seconds to find out that the regular expression |
281 approximately 30 seconds to find out that the regular expression |
304 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly, |
282 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly, |
305 the second shows that Python and Ruby need approximately 29 seconds for finding |
283 the second shows that Python and Ruby need approximately 29 seconds for finding |
306 out whether a string of 28 \texttt{a}s matches the regular expression |
284 out whether a string of 28 \texttt{a}s matches the regular expression |
307 \texttt{a?\{28\}\,a\{28\}}.\footnote{In this |
285 \texttt{a?\{28\}\,a\{28\}}.\footnote{In this |
336 \noindent |
314 \noindent |
337 and also when somebody tried to match web-addresses using a |
315 and also when somebody tried to match web-addresses using a |
338 relatively simple regular expression |
316 relatively simple regular expression |
339 |
317 |
340 \begin{center} |
318 \begin{center} |
341 \url{https://www.tutorialdocs.com/article/regex-trap.html} |
319 \url{https://archive.ph/W5Ogx#selection-141.1-141.36} |
342 \end{center} |
320 \end{center} |
343 |
321 |
344 \noindent |
322 \noindent |
345 Finally, on 2 July 2019 Cloudflare had a global outage because of a |
323 Finally, on 2 July 2019 Cloudflare had a global outage because of a |
346 regular expression (they had no outage for the 6 years before). What |
324 regular expression (they had no outage for the 6 years before). What |
694 regular expressions}. This notion allows one to do almost all |
672 regular expressions}. This notion allows one to do almost all |
695 calculations with regular expressions on the level of regular |
673 calculations with regular expressions on the level of regular |
696 expressions, not needing any automata (you will see we only touch |
674 expressions, not needing any automata (you will see we only touch |
697 briefly on automata in lecture 3). Automata are usually the main |
675 briefly on automata in lecture 3). Automata are usually the main |
698 object of study in formal language courses. The avoidance of automata |
676 object of study in formal language courses. The avoidance of automata |
699 is crucial for me because automata are graphs and it is rather difficult to |
677 is crucial for me because automata are graphs and it is rather |
700 reason about graphs in theorem provers. In contrast, reasoning about |
678 difficult to reason about graphs in theorem provers. In contrast, |
701 regular expressions is easy-peasy in theorem provers. Is this |
679 reasoning about regular expressions is easy-peasy in theorem |
702 important? I think yes, because according to Kuklewicz nearly all |
680 provers. Is this important? I think yes, because according to |
703 POSIX-based regular expression matchers are |
681 Kuklewicz nearly all POSIX-based regular expression matchers are |
704 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}} |
682 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}} |
705 With my PhD student Fahad Ausaf I proved the correctness for one such |
683 With my PhD students Fahad Ausaf and Chengsong Tan, I proved the |
706 matcher that was proposed by Sulzmann and Lu in |
684 correctness for two such matchers that were proposed by Sulzmann and Lu |
707 2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can prove that |
685 in 2014.\footnote{\url{http://goo.gl/bz0eHp}} A variant of which you have |
708 the machine code(!) that implements this code efficiently is correct |
686 already seen in PEP as CW3 and you will see again in the CFL in the first |
709 also. Writing programs in this way does not leave any room for |
687 two CWs. What we have not yet figured out that our matchers are |
710 potential errors or bugs. How nice! |
688 universally fast, meaning they do not explode on any input. |
|
689 Hopefully we can also prove |
|
690 that the machine code(!) that implements our matchers efficiently is |
|
691 correct also. Writing programs in this way does not leave any room for |
|
692 any errors or bugs. How nice! |
711 |
693 |
712 What also helped with my fascination with regular expressions |
694 What also helped with my fascination with regular expressions |
713 is that we could indeed find out new things about them that |
695 is that we could indeed find out new things about them that |
714 have surprised some experts. Together with two colleagues from China, I was |
696 have surprised some experts. Together with two colleagues from China, I was |
715 able to prove the Myhill-Nerode theorem by only using regular |
697 able to prove the Myhill-Nerode theorem by only using regular |
719 complementation, something which Bill Gasarch in his Computational Complexity |
701 complementation, something which Bill Gasarch in his Computational Complexity |
720 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be |
702 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be |
721 shown via automata. So even somebody who has written a 700+-page |
703 shown via automata. So even somebody who has written a 700+-page |
722 book\footnote{\url{http://goo.gl/fD0eHx}} on regular |
704 book\footnote{\url{http://goo.gl/fD0eHx}} on regular |
723 expressions did not know better. Well, we showed it can also be |
705 expressions did not know better. Well, we showed it can also be |
724 done with regular expressions only.\footnote{\url{http://nms.kcl.ac.uk/christian.urban/Publications/posix.pdf}} |
706 done with regular expressions only.\footnote{\url{https://nms.kcl.ac.uk/christian.urban/Publications/rexp.pdf}} |
725 What a feeling when you are an outsider to the subject! |
707 What a feeling when you are an outsider to the subject! |
726 |
708 |
727 To conclude: Despite my early ignorance about regular expressions, I |
709 To conclude: Despite my early ignorance about regular expressions, I |
728 find them now extremely interesting. They have practical importance |
710 find them now extremely interesting. They have practical importance |
729 (remember the shocking runtime of the regular expression matchers in |
711 (remember the shocking runtime of the regular expression matchers in |
730 Python, Ruby and Java in some instances and the problems in Stack |
712 Python, Ruby, Swift and Java in some instances and the problems in Stack |
731 Exchange and the Atom editor). They are used in tools like Snort and |
713 Exchange and the Atom editor---even regex libraries in more modern programming languages, like Rust, have their problems). They are used in tools like Snort and |
732 Zeek in order to monitor network traffic. They have a beautiful mathematical |
714 Zeek in order to monitor network traffic. They have a beautiful mathematical |
733 theory behind them, which can be sometimes quite deep and which |
715 theory behind them, which can be sometimes quite deep and which |
734 sometimes contains hidden snares. People who are not very familiar |
716 sometimes contains hidden snares. People who are not very familiar |
735 with the mathematical background of regular expressions get them |
717 with the mathematical background of regular expressions get them |
736 consistently wrong (this is surprising given they are a supposed to be |
718 consistently wrong (this is surprising given they are a supposed to be |
789 best solution is to use regular expressions; now you have two |
771 best solution is to use regular expressions; now you have two |
790 problems.'' |
772 problems.'' |
791 \end{quote} |
773 \end{quote} |
792 |
774 |
793 |
775 |
794 \begin{figure}[p]\small |
776 %\begin{figure}[p]\small |
795 \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}] |
777 % \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}] |
796 {../progs/crawler1.scala} |
778 % {../progs/crawler1.scala} |
797 |
779 % |
798 \caption{The Scala code for a simple web-crawler that checks |
780 %\caption{The Scala code for a simple web-crawler that checks |
799 for broken links in web-pages. It uses the regular expression |
781 %for broken links in web-pages. It uses the regular expression |
800 \texttt{http\_pattern} in Line~\ref{httpline} for recognising |
782 %\texttt{http\_pattern} in Line~\ref{httpline} for recognising |
801 URL-addresses. It finds all links using the library function |
783 %URL-addresses. It finds all links using the library function |
802 \texttt{findAllIn} in Line~\ref{findallline} (this function |
784 %\texttt{findAllIn} in Line~\ref{findallline} (this function |
803 is part of Scala's regular expression library).\label{crawler1}} |
785 %is part of Scala's regular expression library).\label{crawler1}} |
804 |
786 % |
805 \end{figure} |
787 %\end{figure} |
806 |
788 |
807 |
789 |
808 |
790 |
809 %\begin{figure}[p]\small |
791 %\begin{figure}[p]\small |
810 % \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}] |
792 % \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}] |
818 %Lines~\ref{changestartline}--\ref{changeendline} where there |
800 %Lines~\ref{changestartline}--\ref{changeendline} where there |
819 %is a test whether URL is in ``my'' domain or |
801 %is a test whether URL is in ``my'' domain or |
820 %not.\label{crawler2}} |
802 %not.\label{crawler2}} |
821 %\end{figure} |
803 %\end{figure} |
822 |
804 |
823 \begin{figure}[p]\small |
805 %\begin{figure}[p]\small |
824 \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}] |
806 % \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\co%lor{capri!3}\fi}] |
825 {../progs/crawler2.scala} |
807 % {../progs/crawler2.scala} |
826 |
808 % |
827 \caption{A small email harvester---whenever we download a |
809 %\caption{A small email harvester---whenever we download a |
828 web-page, we also check whether it contains any email |
810 %web-page, we also check whether it contains any email |
829 addresses. For this we use the regular expression |
811 %addresses. For this we use the regular expression |
830 \texttt{email\_pattern} in Line~\ref{emailline}. The main |
812 %\texttt{email\_pattern} in Line~\ref{emailline}. The main |
831 change is in Line~\ref{mainline} where all email addresses |
813 %change is in Line~\ref{mainline} where all email addresses |
832 that can be found in a page are printed.\label{crawler3}} |
814 %that can be found in a page are printed.\label{crawler3}} |
833 |
815 % |
834 \end{figure} |
816 %\end{figure} |
835 |
817 |
836 \begin{figure}[p] |
818 \begin{figure}[p] |
837 \tiny |
819 \tiny |
838 \begin{center} |
820 \begin{center} |
839 \begin{minipage}{0.8\textwidth} |
821 \begin{minipage}{0.8\textwidth} |