4 |
4 |
5 \begin{document} |
5 \begin{document} |
6 |
6 |
7 \section*{Coursework 8 (Scala, Regular Expressions} |
7 \section*{Coursework 8 (Scala, Regular Expressions} |
8 |
8 |
9 This coursework is worth 10\% and is due on XXXX at |
9 This coursework is worth 10\%. It is about regular expressions and |
10 16:00. You are asked to implement a regular expression matcher. |
10 pattern matching. The first part is due on 30 November at 11pm; the |
11 |
11 second, more advanced part, is due on 7 December at 11pm. The |
12 Make sure the files |
12 second part is not yet included. For the first part you are |
|
13 asked to implement a regular expression matcher. Make sure the files |
13 you submit can be processed by just calling \texttt{scala |
14 you submit can be processed by just calling \texttt{scala |
14 <<filename.scala>>}.\bigskip |
15 <<filename.scala>>}.\bigskip |
15 |
16 |
16 \noindent |
17 \noindent |
17 \textbf{Important:} Do not use any mutable data structures in your |
18 \textbf{Important:} Do not use any mutable data structures in your |
18 submissions! They are not needed. This excluded the use of |
19 submissions! They are not needed. This excluded the use of |
19 \texttt{ListBuffer}s, for example. Do not use \texttt{return} in your |
20 \texttt{ListBuffer}s, for example. Do not use \texttt{return} in your |
20 code! It has a different meaning in Scala, than in Java. |
21 code! It has a different meaning in Scala, than in Java. Do not use |
21 Do not use \texttt{var}! This declares a mutable variable. Feel free to |
22 \texttt{var}! This declares a mutable variable. Make sure the |
22 copy any code you need from files \texttt{knight1.scala}, |
|
23 \texttt{knight2.scala} and \texttt{knight3.scala}. Make sure the |
|
24 functions you submit are defined on the ``top-level'' of Scala, not |
23 functions you submit are defined on the ``top-level'' of Scala, not |
25 inside a class or object. |
24 inside a class or object. |
26 |
25 |
27 |
26 |
28 \subsection*{Disclaimer} |
27 \subsection*{Disclaimer!!!!!!!!} |
29 |
28 |
30 It should be understood that the work you submit represents |
29 It should be understood that the work you submit represents |
31 your own effort. You have not copied from anyone else. An |
30 your own effort! You have not copied from anyone else. An |
32 exception is the Scala code I showed during the lectures or |
31 exception is the Scala code I showed during the lectures or |
33 uploaded to KEATS, which you can freely use.\bigskip |
32 uploaded to KEATS, which you can freely use.\bigskip |
34 |
33 |
35 |
34 |
36 \subsubsection*{Task} |
35 \subsection*{Part 1 (6 Marks)} |
37 |
36 |
38 The task is to implement a regular expression matcher based on |
37 The task is to implement a regular expression matcher based on |
39 derivatives of regular expressions. The implementation should |
38 derivatives of regular expressions. The implementation can deal |
40 be able to deal with the usual (basic) regular expressions |
39 with the following regular expressions, which have been predefined |
|
40 file re.scala: |
41 |
41 |
42 \begin{center} |
42 \begin{center} |
43 \begin{tabular}{lcll} |
43 \begin{tabular}{lcll} |
44 $r$ & $::=$ & $\ZERO$ & cannot match anything\\ |
44 $r$ & $::=$ & $\ZERO$ & cannot match anything\\ |
45 & $|$ & $\ONE$ & can only match the empty string\\ |
45 & $|$ & $\ONE$ & can only match the empty string\\ |
46 & $|$ & $c$ & can match a character $c$\\ |
46 & $|$ & $c$ & can match a character (in this case $c$)\\ |
47 & $|$ & $r_1 + r_2$ & can match either with $r_1$ or with $r_2$\\ |
47 & $|$ & $r_1 + r_2$ & can match a string either with $r_1$ or with $r_2$\\ |
48 & $|$ & $r_1 \cdot r_2$ & can match first with $r_1$ and then with $r_2$\\ |
48 & $|$ & $r_1\cdot r_2$ & can match the first part of a string with $r_1$ and\\ |
|
49 & & & then the second part with $r_2$\\ |
49 & $|$ & $r^*$ & can match zero or more times $r$\\ |
50 & $|$ & $r^*$ & can match zero or more times $r$\\ |
50 & $|$ & $r^{\{\uparrow n\}}$ & can match zero upto $n$ times $r$\\ |
|
51 & $|$ & $r^{\{n\}}$ & can match exactly $n$ times $r$\\ |
|
52 \end{tabular} |
51 \end{tabular} |
53 \end{center} |
52 \end{center} |
54 |
53 |
55 \noindent |
54 \noindent |
56 Implement a function called \textit{nullable} by recursion over |
55 Why? Knowing how to match regular expressions and strings fast will |
57 regular expressions: |
56 let you solve a lot of problems that vex other humans. Regular |
|
57 expressions are one of the fastest and simplest ways to match patterns |
|
58 in text, and are endlessly useful for searching, editing and |
|
59 analysing text in all sorts of places. However, you need to be |
|
60 fast, otherwise you will stumble upon problems such as recently reported at |
|
61 |
|
62 {\small |
|
63 \begin{itemize} |
|
64 \item[$\bullet$] \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} |
|
65 \item[$\bullet$] \url{https://vimeo.com/112065252} |
|
66 \item[$\bullet$] \url{http://davidvgalbraith.com/how-i-fixed-atom/} |
|
67 \end{itemize}} |
|
68 |
|
69 \subsection*{Tasks (file re.scala)} |
|
70 |
|
71 \begin{itemize} |
|
72 \item[(1a)] Implement a function, called \textit{nullable}, by recursion over |
|
73 regular expressions. This function test whether a regular expression can match |
|
74 the empty string. |
58 |
75 |
59 \begin{center} |
76 \begin{center} |
60 \begin{tabular}{lcl} |
77 \begin{tabular}{lcl} |
61 $\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\ |
78 $\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\ |
62 $\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\ |
79 $\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\ |
63 $\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\ |
80 $\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\ |
64 $\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\ |
81 $\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\ |
65 $\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\ |
82 $\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\ |
66 $\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\\ |
83 $\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\\ |
67 $\textit{nullable}(r^{\{\uparrow n\}})$ & $\dn$ & $\textit{true}$\\ |
|
68 $\textit{nullable}(r^{\{n\}})$ & $\dn$ & |
|
69 $\textit{if}\;n = 0\; \textit{then} \; \textit{true} \; \textit{else} \; \textit{nullable}(r)$\\ |
|
70 \end{tabular} |
84 \end{tabular} |
71 \end{center} |
85 \end{center}\hfill[1 Mark] |
|
86 |
|
87 \item[(1b)] Implement a function, called \textit{der}, by recursion over |
|
88 regular expressions. It takes a character and a regular expression |
|
89 as arguments and calculates the derivative regular expression. |
72 |
90 |
73 \begin{center} |
91 \begin{center} |
74 \begin{tabular}{lcl} |
92 \begin{tabular}{lcl} |
75 $\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\ |
93 $\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\ |
76 $\textit{der}\;c\;(\ONE)$ & $\dn$ & $\ZERO$\\ |
94 $\textit{der}\;c\;(\ONE)$ & $\dn$ & $\ZERO$\\ |
78 $\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\ |
96 $\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\ |
79 $\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\ |
97 $\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\ |
80 & & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\ |
98 & & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\ |
81 & & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\ |
99 & & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\ |
82 $\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\ |
100 $\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\ |
83 $\textit{der}\;c\;(r^{\{\uparrow n\}})$ & $\dn$ & $\textit{if}\;n = 0\;\textit{then}\;\ZERO\;\text{else}\; |
|
84 (\textit{der}\;c\;r)\cdot (r^{\{\uparrow n-1\}})$\\ |
|
85 $\textit{der}\;c\;(r^{\{n\}})$ & $\dn$ & |
|
86 $\textit{if}\;n = 0\; \textit{then} \; \ZERO\; \textit{else}\;$\\ |
|
87 & & $\textit{if} \;\textit{nullable}(r)\;\textit{then}\;(\textit{der}\;c\;r)\cdot (r^{\{\uparrow n-1\}})$\\ |
|
88 & & $\textit{else}\;(\textit{der}\;c\;r)\cdot (r^{\{n-1\}})$ |
|
89 \end{tabular} |
101 \end{tabular} |
90 \end{center} |
102 \end{center}\hfill[1 Mark] |
91 |
103 |
|
104 \item[(1c)] Implement the function \textit{simp}, which recursively |
|
105 traverses a regular expression from inside to outside, and |
|
106 simplifies every sub-regular-expressions on the left to |
|
107 the regular expression on the right, except it does not simplify inside |
|
108 ${}^*$-regular expressions. |
92 |
109 |
93 Be careful that your implementation of \textit{nullable} and |
110 \begin{center} |
94 \textit{der}\;c\; satisfies for every $r$ the following two |
|
95 properties (see also Question 2): |
|
96 |
|
97 \begin{itemize} |
|
98 \item $\textit{nullable}(r)$ if and only if $[]\in L(r)$ |
|
99 \item $L(der\,c\,r) = Der\,c\,(L(r))$ |
|
100 \end{itemize} |
|
101 |
|
102 \noindent {\bf Important!} Your implementation should have |
|
103 explicit cases for the basic regular expressions, but also |
|
104 explicit cases for the extended regular expressions. That |
|
105 means do not treat the extended regular expressions by just |
|
106 translating them into the basic ones. See also Question 2, |
|
107 where you are asked to explicitly give the rules for |
|
108 \textit{nullable} and \textit{der}\;c\; for the extended regular |
|
109 expressions. |
|
110 |
|
111 |
|
112 \subsection*{Question 1} |
|
113 |
|
114 What is your King's email address (you will need it in |
|
115 Question 3)? |
|
116 |
|
117 \subsection*{Question 2} |
|
118 |
|
119 This question does not require any implementation. From the |
|
120 lectures you have seen the definitions for the functions |
|
121 \textit{nullable} and \textit{der}\;c\; for the basic regular |
|
122 expressions. Give the rules for the extended regular |
|
123 expressions: |
|
124 |
|
125 \begin{center} |
|
126 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} |
|
127 $\textit{nullable}([c_1 c_2 \ldots c_n])$ & $\dn$ & $?$\\ |
|
128 $\textit{nullable}(r^+)$ & $\dn$ & $?$\\ |
|
129 $\textit{nullable}(r^?)$ & $\dn$ & $?$\\ |
|
130 $\textit{nullable}(r^{\{n,m\}})$ & $\dn$ & $?$\\ |
|
131 $\textit{nullable}(\sim{}r)$ & $\dn$ & $?$\medskip\\ |
|
132 $der\, c\, ([c_1 c_2 \ldots c_n])$ & $\dn$ & $?$\\ |
|
133 $der\, c\, (r^+)$ & $\dn$ & $?$\\ |
|
134 $der\, c\, (r^?)$ & $\dn$ & $?$\\ |
|
135 $der\, c\, (r^{\{n,m\}})$ & $\dn$ & $?$\\ |
|
136 $der\, c\, (\sim{}r)$ & $\dn$ & $?$\\ |
|
137 \end{tabular} |
|
138 \end{center} |
|
139 |
|
140 \noindent |
|
141 Remember your definitions have to satisfy the two properties |
|
142 |
|
143 \begin{itemize} |
|
144 \item $\textit{nullable}(r)$ if and only if $[]\in L(r)$ |
|
145 \item $L(der\,c\,r)) = Der\,c\,(L(r))$ |
|
146 \end{itemize} |
|
147 |
|
148 \subsection*{Question 3} |
|
149 |
|
150 Implement the following regular expression for email addresses |
|
151 |
|
152 \[ |
|
153 ([a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot ([a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot ([a\mbox{-}z\,.]^{\{2,6\}}) |
|
154 \] |
|
155 |
|
156 \noindent and calculate the derivative according to your email |
|
157 address. When calculating the derivative, simplify all regular |
|
158 expressions as much as possible by applying the |
|
159 following 7 simplification rules: |
|
160 |
|
161 \begin{center} |
|
162 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}ll} |
111 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}ll} |
163 $r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ |
112 $r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ |
164 $\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ |
113 $\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ |
165 $r \cdot \ONE$ & $\mapsto$ & $r$\\ |
114 $r \cdot \ONE$ & $\mapsto$ & $r$\\ |
166 $\ONE \cdot r$ & $\mapsto$ & $r$\\ |
115 $\ONE \cdot r$ & $\mapsto$ & $r$\\ |
167 $r + \ZERO$ & $\mapsto$ & $r$\\ |
116 $r + \ZERO$ & $\mapsto$ & $r$\\ |
168 $\ZERO + r$ & $\mapsto$ & $r$\\ |
117 $\ZERO + r$ & $\mapsto$ & $r$\\ |
169 $r + r$ & $\mapsto$ & $r$\\ |
118 $r + r$ & $\mapsto$ & $r$\\ |
170 \end{tabular} |
119 \end{tabular} |
|
120 \end{center} |
|
121 |
|
122 For example |
|
123 \[(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)\] |
|
124 |
|
125 simplifies to just $r_1$. |
|
126 \hfill[1 Mark] |
|
127 |
|
128 \item[(1d)] Implement two functions: The first, called \textit{ders}, |
|
129 takes a list of characters as arguments and a regular expression and |
|
130 buids the derivative as follows: |
|
131 |
|
132 \begin{center} |
|
133 \begin{tabular}{lcl} |
|
134 $\textit{ders}\;Nil\;(r)$ & $\dn$ & $r$\\ |
|
135 $\textit{ders}\;c::cs\;(r)$ & $\dn$ & |
|
136 $\textit{ders}\;cs\;(\textit{simp}(\textit{der}\;c\;r))$\\ |
|
137 \end{tabular} |
171 \end{center} |
138 \end{center} |
172 |
139 |
173 \noindent Write down your simplified derivative in a readable |
140 The second, called \textit{matcher}, takes a string and a regular expression |
174 notation using parentheses where necessary. That means you |
141 as arguments. It builds first the derivatives according to \textit{ders} |
175 should use the infix notation $+$, $\cdot$, $^*$ and so on, |
142 and at the end tests whether the resulting redular expression can match |
176 instead of code. |
143 the empty string (using \textit{nullable}). |
177 |
144 For example the \textit{matcher} will produce true if given the |
178 \subsection*{Question 4} |
145 regular expression $a\cdot b\cdot c$ and the string $abc$. |
|
146 \hfill[1 Mark] |
179 |
147 |
180 Suppose \textit{[a-z]} stands for the range regular expression |
148 \item[(1e)] Implement the function \textit{replace}: it searches (from the left to |
181 $[a,b,c,\ldots,z]$. Consider the regular expression $/ \cdot * \cdot |
149 right) in string $s_1$ all the non-empty substrings that match the |
182 (\sim{}([a\mbox{-}z]^* \cdot * \cdot / \cdot [a\mbox{-}z]^*)) \cdot * |
150 regular expression---these substrings are assumed to be |
183 \cdot /$ and decide wether the following four strings are matched by |
151 the longest substrings matched by the regular expression and |
184 this regular expression. Answer yes or no. |
152 assumed to be non-overlapping. All these substrings in $s_1$ are replaced |
|
153 by $s_2$. For example given the regular expression |
185 |
154 |
186 \begin{enumerate} |
155 \[(a \cdot a)^* + (b \cdot b)\] |
187 \item \texttt{"/**/"} |
|
188 \item \texttt{"/*foobar*/"} |
|
189 \item \texttt{"/*test*/test*/"} |
|
190 \item \texttt{"/*test/*test*/"} |
|
191 \end{enumerate} |
|
192 |
156 |
193 \noindent |
157 \noindent the string $aabbbaaaaaaabaaaaabbaaaabb$ and |
194 Also test your regular expression matcher with the regular |
158 replacement string $c$ yields the string |
195 expression $a^{\{3,5\}}$ and the strings |
|
196 |
159 |
197 \begin{enumerate} |
160 \[ |
198 \setcounter{enumi}{4} |
161 ccbcabcaccc |
199 \item \texttt{aa} |
162 \] |
200 \item \texttt{aaa} |
|
201 \item \texttt{aaaaa} |
|
202 \item \texttt{aaaaaa} |
|
203 \end{enumerate} |
|
204 |
163 |
205 \noindent |
164 \hfill[2 Mark] |
206 Does your matcher produce the expected results? |
165 \end{itemize} |
207 |
166 |
208 \subsection*{Question 5} |
167 \subsection*{Part 2 (4 Marks)} |
209 |
168 |
210 Let $r_1$ be the regular expression $a\cdot a\cdot a$ and $r_2$ be |
169 Coming soon. |
211 $(a^{\{19,19\}}) \cdot (a^?)$. Decide whether the following three |
|
212 strings consisting of $a$s only can be matched by $(r_1^+)^+$. |
|
213 Similarly test them with $(r_2^+)^+$. Again answer in all six cases |
|
214 with yes or no. \medskip |
|
215 |
|
216 \noindent |
|
217 These are strings are meant to be entirely made up of $a$s. Be careful |
|
218 when copy-and-pasting the strings so as to not forgetting any $a$ and |
|
219 to not introducing any other character. |
|
220 |
|
221 \begin{enumerate} |
|
222 \item \texttt{"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\\ |
|
223 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\\ |
|
224 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"} |
|
225 \item \texttt{"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\\ |
|
226 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\\ |
|
227 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"} |
|
228 \item \texttt{"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\\ |
|
229 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\\ |
|
230 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"} |
|
231 \end{enumerate} |
|
232 |
|
233 |
170 |
234 \end{document} |
171 \end{document} |
|
172 |
235 |
173 |
236 %%% Local Variables: |
174 %%% Local Variables: |
237 %%% mode: latex |
175 %%% mode: latex |
238 %%% TeX-master: t |
176 %%% TeX-master: t |
239 %%% End: |
177 %%% End: |