19 readable and understandable), but other times we need pedantic |
19 readable and understandable), but other times we need pedantic |
20 precision for actual programs. |
20 precision for actual programs. |
21 |
21 |
22 \subsubsection*{Characters and Strings} |
22 \subsubsection*{Characters and Strings} |
23 |
23 |
24 The most important concept in this module are strings. Strings |
24 The most basic concept in this module are strings. Strings |
25 are composed of \defn{characters}. While characters are surely |
25 are composed of \defn{characters}. While characters are surely |
26 a familiar concept, we will make one subtle distinction in |
26 a familiar concept, we will make one subtle distinction in |
27 this module. If we want to refer to concrete characters, like |
27 this module. If we want to refer to concrete characters, like |
28 \code{a}, \code{b}, \code{c} and so on, we use a typewriter font. |
28 \code{a}, \code{b}, \code{c} and so on, we will use a typewriter font. |
29 Accordingly if we want to refer to the concrete characters of |
29 Accordingly if we want to refer to the concrete characters of |
30 my email address we shall write |
30 my email address we shall write |
31 |
31 |
32 \begin{center} |
32 \begin{center} |
33 \pcode{christian.urban@kcl.ac.uk} |
33 \pcode{christian.urban@kcl.ac.uk} |
53 \noindent In this string, we do not really care what the |
53 \noindent In this string, we do not really care what the |
54 characters stand for, except we do care about the fact that |
54 characters stand for, except we do care about the fact that |
55 for example the character $a$ is not equal to $b$ and so on. |
55 for example the character $a$ is not equal to $b$ and so on. |
56 Why do I make this distinction? Because we often need to |
56 Why do I make this distinction? Because we often need to |
57 define functions using variables ranging over characters. We |
57 define functions using variables ranging over characters. We |
58 need to somehow say this is a variable, say $c$, ranging over |
58 need to somehow say ``this-is-a-variable'' and give it a name. |
59 characters, while this is the actual character \pcode{c}. |
59 In such cases we use the italic font. |
|
60 |
60 |
61 |
61 An \defn{alphabet} is a (non-empty) finite set of characters. |
62 An \defn{alphabet} is a (non-empty) finite set of characters. |
62 Often the letter $\Sigma$ is used to refer to an alphabet. For |
63 Often the letter $\Sigma$ is used to refer to an alphabet. For |
63 example the ASCII characters \pcode{a} to \pcode{z} form an |
64 example the ASCII characters \pcode{a} to \pcode{z} form an |
64 alphabet. The digits $0$ to $9$ are another alphabet. The |
65 alphabet. The digits $0$ to $9$ are another alphabet. The |
76 typed programming languages, such as Scala, strings have a special |
77 typed programming languages, such as Scala, strings have a special |
77 type---namely \pcode{String} which is different from the type |
78 type---namely \pcode{String} which is different from the type |
78 for lists of characters. This is because strings can be |
79 for lists of characters. This is because strings can be |
79 efficiently represented in memory, unlike lists. Since |
80 efficiently represented in memory, unlike lists. Since |
80 \code{String} and the type of lists of characters |
81 \code{String} and the type of lists of characters |
81 (\code{List[Char]}) are not the same, we need to explicitly |
82 (namely \code{List[Char]}) are not the same, we need to explicitly |
82 coerce elements between the two types, for example |
83 coerce elements between the two types, for example |
83 |
84 |
84 \begin{lstlisting}[numbers=none] |
85 \begin{lstlisting}[numbers=none] |
85 scala> "abc".toList |
86 scala> "abc".toList |
86 res01: List[Char] = List(a, b, c) |
87 res01: List[Char] = List(a, b, c) |
102 say \dq{\textit{ello}} when making definitions about strings. |
103 say \dq{\textit{ello}} when making definitions about strings. |
103 There are also some subtleties with the empty string, |
104 There are also some subtleties with the empty string, |
104 sometimes written as \dq{} but also as the empty list of |
105 sometimes written as \dq{} but also as the empty list of |
105 characters $[\,]$.\footnote{In the literature you can also |
106 characters $[\,]$.\footnote{In the literature you can also |
106 often find that $\varepsilon$ or $\lambda$ is used to |
107 often find that $\varepsilon$ or $\lambda$ is used to |
107 represent the empty string.} |
108 represent the empty string. But we are not going to use this notation.} |
108 |
109 |
109 Two strings, say $s_1$ and $s_2$, can be \defn{concatenated}, |
110 Two strings, say $s_1$ and $s_2$, can be \defn{concatenated}, |
110 which we write as $s_1 @ s_2$. If we regard $s_1$ and $s_2$ as |
111 which we write as $s_1 @ s_2$. If we regard $s_1$ and $s_2$ as |
111 lists of characters, then $@$ is the list-append function. |
112 lists of characters, then $@$ is the list-append function. |
112 Suppose we are given two strings \dq{\textit{foo}} and |
113 Suppose we are given two strings \dq{\textit{foo}} and |
113 \dq{\textit{bar}}, then their concatenation, written |
114 \dq{\textit{bar}}, then their concatenation, written |
114 \dq{\textit{foo}} $@$ \dq{\textit{bar}}, gives |
115 \dq{\textit{foo}} $@$ \dq{\textit{bar}}, gives |
115 \dq{\textit{foobar}}. But as said above, we will often |
116 \dq{\textit{foobar}}. But as said above, we will often |
116 simplify our life and just drop the double quotes whenever it |
117 simplify our life and just drop the double quotes whenever it |
117 is clear we are talking about strings. So we will often just |
118 is clear we are talking about strings. So we will just |
118 write \textit{foo}, \textit{bar}, \textit{foobar} or |
119 write \textit{foo}, \textit{bar}, \textit{foobar} |
119 \textit{foo $@$ bar}. |
120 \textit{foo $@$ bar} and so on. |
120 |
121 |
121 Occasionally we will use the notation $a^n$ for strings, which stands |
122 Occasionally we will use the notation $a^n$ for strings, which stands |
122 for the string of $n$ repeated $a$s. So $a^{n}b^{n}$ is a string that |
123 for the string of $n$ repeated $a$s. So $a^{n}b^{n}$ is a string that |
123 has some number of $a$s followed by the same number of $b$s. A simple |
124 has some number of $a$s followed by the same number of $b$s. |
124 property of string concatenation is \emph{associativity}, meaning |
125 Confusingly, in Scala the notation is ``times'' for this opration. |
|
126 So you can write |
|
127 |
|
128 \begin{lstlisting}[numbers=none] |
|
129 scala> "a" * 13 |
|
130 val res02: String = aaaaaaaaaaaaa |
|
131 \end{lstlisting} |
|
132 |
|
133 \noindent |
|
134 A simple property of string concatenation is \emph{associativity}, meaning |
125 |
135 |
126 \[(s_1 @ s_2) @ s_3 = s_1 @ (s_2 @ s_3)\] |
136 \[(s_1 @ s_2) @ s_3 = s_1 @ (s_2 @ s_3)\] |
127 |
137 |
128 \noindent are always equal strings. The empty string behaves |
138 \noindent are always equal strings. The empty string behaves |
129 like a \emph{unit element}, therefore |
139 like a \emph{unit element}, therefore |
183 A \backslash B & \dn & \{x\;|\; x \in A \wedge x \not\in B\} |
193 A \backslash B & \dn & \{x\;|\; x \in A \wedge x \not\in B\} |
184 \end{eqnarray*} |
194 \end{eqnarray*} |
185 |
195 |
186 \noindent In general set comprehensions are of the form |
196 \noindent In general set comprehensions are of the form |
187 $\{a\;|\;P\}$ which stands for the set of all elements $a$ |
197 $\{a\;|\;P\}$ which stands for the set of all elements $a$ |
188 (from some set) for which some property $P$ holds. |
198 (from some set) for which some property $P$ holds. If programming |
|
199 is more your-kind-of-thing, you might recognise the similarities |
|
200 with for-comprehensions, for example for the silly set above you |
|
201 could write |
|
202 |
|
203 \begin{lstlisting}[numbers=none] |
|
204 scala> for (n <- (0 to 10).toSet; if n * n < 9) yield n |
|
205 val res03: Set[Int] = Set(0, 1, 2) |
|
206 \end{lstlisting} |
|
207 |
|
208 \noindent |
|
209 This is pretty much the same as $\{n\;|\; n \in \mathbb{N} \wedge n^2 < 9\}$ |
|
210 just in Scala syntax. |
189 |
211 |
190 For defining sets, we will also often use the notion of the |
212 For defining sets, we will also often use the notion of the |
191 ``big union''. An example is as follows: |
213 ``big union''. An example is as follows: |
192 |
214 |
193 \begin{equation}\label{bigunion} |
215 \begin{equation}\label{bigunion} |
232 |
254 |
233 \noindent |
255 \noindent |
234 contain actually the same amount of elements. Does this make sense? |
256 contain actually the same amount of elements. Does this make sense? |
235 Though this might all look strange, infinite sets will be a |
257 Though this might all look strange, infinite sets will be a |
236 topic that is very relevant to the material of this module. It tells |
258 topic that is very relevant to the material of this module. It tells |
237 us what we can compute with a computer (actually algorithm) and what |
259 us what we can compute with a computer (actually an algorithm) and what |
238 we cannot. But during the first 9 lectures we can go by without this |
260 we cannot. But during the first 9 lectures we can go by without this |
239 ``weird'' stuff. End of aside.\smallskip |
261 ``weird'' stuff. End of aside.\smallskip |
240 |
262 |
241 Another important notion in this module are \defn{languages}, which |
263 Another important notion in this module are \defn{languages}, which |
242 are sets of strings. One of the main goals for us will be how to |
264 are sets of strings. One of the main goals for us will be how to |
246 hardness is meant in terms of Turing decidable, for example.} |
268 hardness is meant in terms of Turing decidable, for example.} |
247 Note that the language containing the empty string $\{\dq{}\}$ |
269 Note that the language containing the empty string $\{\dq{}\}$ |
248 is not equal to $\varnothing$, the empty language (or empty |
270 is not equal to $\varnothing$, the empty language (or empty |
249 set): The former contains one element, namely \dq{} (also |
271 set): The former contains one element, namely \dq{} (also |
250 written $[\,]$), but the latter does not contain any |
272 written $[\,]$), but the latter does not contain any |
251 element. |
273 element at all! Make sure you see the difference. |
252 |
274 |
253 For languages we define the operation of \defn{language |
275 For languages we define the operation of \defn{language |
254 concatenation}, written like in the string case as $A @ B$: |
276 concatenation}, written like in the string case as $A @ B$: |
255 |
277 |
256 \begin{equation}\label{langconc} |
278 \begin{equation}\label{langconc} |
265 |
287 |
266 \[ |
288 \[ |
267 \{abzzz, abqq, abr, aczzz, acqq, acr\} |
289 \{abzzz, abqq, abr, aczzz, acqq, acr\} |
268 \] |
290 \] |
269 |
291 |
270 \noindent Recall the properties for string concatenation. For |
292 \noindent The cool thing about Scala is that we can define language |
|
293 concatenation very elegantly as |
|
294 |
|
295 \begin{lstlisting}[numbers=none] |
|
296 def concat(A: Set[String], B: Set[String]) = |
|
297 for (x <- A ; y <- B) yield x ++ y |
|
298 \end{lstlisting} |
|
299 |
|
300 \noindent |
|
301 where \code{++} is string concatenation in Scala. |
|
302 |
|
303 Recall the properties for string concatenation. For |
271 language concatenation we have the following properties |
304 language concatenation we have the following properties |
272 |
305 |
273 \begin{center} |
306 \begin{center} |
274 \begin{tabular}{ll} |
307 \begin{tabular}{ll} |
275 associativity: & $(A @ B) @ C = A @ (B @ C)$\\ |
308 associativity: & $(A @ B) @ C = A @ (B @ C)$\\ |
327 what $\{[]\}\star$ and $\varnothing\star$ are? |
362 what $\{[]\}\star$ and $\varnothing\star$ are? |
328 |
363 |
329 Recall that an alphabet is often referred to by the letter |
364 Recall that an alphabet is often referred to by the letter |
330 $\Sigma$. We can now write for the set of \emph{all} strings |
365 $\Sigma$. We can now write for the set of \emph{all} strings |
331 over this alphabet as $\Sigma\star$. In doing so we also include the |
366 over this alphabet as $\Sigma\star$. In doing so we also include the |
332 empty string as a possible string over $\Sigma$. So if $\Sigma |
367 empty string as a possible string (over $\Sigma$). Assuming $\Sigma |
333 = \{a, b\}$, then $\Sigma\star$ is |
368 = \{a, b\}$, then $\Sigma\star$ is |
334 |
369 |
335 \[ |
370 \[ |
336 \{[], a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, \ldots\} |
371 \{[], a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, \ldots\} |
337 \] |
372 \] |
338 |
373 |
339 \noindent or in other words all strings containing $a$s and |
374 \noindent or in words all strings containing $a$s and |
340 $b$s only, plus the empty string. |
375 $b$s only, plus the empty string. |
341 |
376 \bigskip |
|
377 |
|
378 \noindent |
|
379 Thanks for making it until here! There are also some personal conventions |
|
380 about regular expressions. But I will explain them in the handout for the |
|
381 first week. An exercise you can do: Implement the power operation for languages |
|
382 and try out some examples. |
342 \end{document} |
383 \end{document} |
343 |
384 |
344 %%% Local Variables: |
385 %%% Local Variables: |
345 %%% mode: latex |
386 %%% mode: latex |
346 %%% TeX-master: t |
387 %%% TeX-master: t |