9 |
9 |
10 \begin{document} |
10 \begin{document} |
11 |
11 |
12 \section*{Handout 1} |
12 \section*{Handout 1} |
13 |
13 |
14 This course is about processing of strings. Lets start with what we mean by \emph{string}. Strings |
14 This course is about the processing of strings. Lets start with what we mean by \emph{strings}. Strings |
15 are lists of characters drawn from an \emph{alphabet}. If nothing else is specified, we usually assume |
15 (they are also sometimes referred to as \emph{words}) are lists of characters drawn from an \emph{alphabet}. |
16 the alphabet are letters $a$, $b$, \ldots, $z$ and $A$, $B$, \ldots $Z$. Sometimes we explicitly |
16 If nothing else is specified, we usually assume |
17 restrict strings to only contain the letters $a$ and $b$. Then we say the alphabet is the set $\{a, b\}$. |
17 the alphabet consists of just the lower-case letters $a$, $b$, \ldots, $z$. Sometimes, however, we explicitly |
18 |
18 restrict strings to contain, for example, only the letters $a$ and $b$. In this case we say the alphabet is the set $\{a, b\}$. |
19 There are many ways how we write string. Since they are lists of characters we might write |
19 |
20 them as {\it "hello"} being enclosed by double quotes. This is a short-hand for the list |
20 There are many ways how we can write down strings. In programming languages, they are usually |
|
21 written as {\it "hello"} where the double quotes indicate that we dealing with a string. |
|
22 Essentially, strings are lists of characters which can be written for example as follows |
21 |
23 |
22 \[ |
24 \[ |
23 [\text{\it h, e, l, l, o}] |
25 [\text{\it h, e, l, l, o}] |
24 \] |
26 \] |
25 |
27 |
26 \noindent |
28 \noindent |
27 The important point is that we can always decompose strings. For example we often consider the |
29 The important point is that we can always decompose strings. For example, we will often consider the |
28 first character of a string, say $h$, and the ``rest'' of a string {\it "ello"}. |
30 first character of a string, say $h$, and the ``rest'' of a string say {\it "ello"} when making definitions |
29 There are also some subtleties with the empty string, sometimes written as {\it ""} or as the empty list |
31 about strings. There are some subtleties with the empty string, sometimes written as {\it ""} but also as |
30 of characters $[\,]$. Two strings, say $s_1$ and $s_2$, can be \emph{concatenated} which is written |
32 the empty list of characters $[\,]$. Two strings, for example $s_1$ and $s_2$, can be \emph{concatenated}, |
31 as $s_1 @ s_2$. For example if we have the two strings {\it "foo"} and {\it "bar"}, their concatenation |
33 which we write as $s_1 @ s_2$. Suppose we are given two strings {\it "foo"} and {\it "bar"}, then their concatenation |
32 gives {\it "foobar"}. |
34 gives {\it "foobar"}. |
33 |
35 |
34 We often need to talk about sets of strings. For example the set of all strings |
36 We often need to talk about sets of strings. For example the set of all strings over the alphabet $\{a, \ldots\, z\}$ |
|
37 is |
35 |
38 |
36 \[ |
39 \[ |
37 \{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\} |
40 \{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\} |
38 \] |
41 \] |
39 |
42 |
40 \noindent |
43 \noindent |
41 Any set of strings, not just the set of all strings, is often called a \emph{language}. The idea behind |
44 Any set of strings, not just the set-of-all-strings, is often called a \emph{language}. The idea behind |
42 this choice is that if we enumerate, say, all words/strings from a dictionary, like |
45 this choice of terminology is that if we enumerate, say, all words/strings from a dictionary, like |
43 |
46 |
44 \[ |
47 \[ |
45 \{\text{\it "the", "of", "milk", "name", "antidisestablishmentarianism", \ldots}\} |
48 \{\text{\it "the", "of", "milk", "name", "antidisestablishmentarianism", \ldots}\} |
46 \] |
49 \] |
47 |
50 |
48 \noindent |
51 \noindent |
49 then we have essentially described the English language, or more precisely all |
52 then we have essentially described the English language, or more precisely all |
50 strings that can be used in a sentence of the English language. French would be a |
53 strings that can be used in a sentence of the English language. French would be a |
51 different set of string, and so on. In the context of this course, a language might |
54 different set of strings, and so on. In the context of this course, a language might |
52 not necessarily make sense from a natural language perspective. For example |
55 not necessarily make sense from a natural language point of view. For example |
53 the set of all strings from above is a language, as is the empty set (of strings). The |
56 the set of all strings shown above is a language, as is the empty set (of strings). The |
54 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a |
57 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a |
55 difference between the empty set, or empty language, and the set, or language, that |
58 difference between the empty set, or empty language, and the set that |
56 contains the empty string $\{\text{""}\}$: the former has no elements, the latter has one |
59 contains only the empty string $\{\text{""}\}$: the former has no elements, whereas |
57 element. |
60 the latter has one element. |
58 |
61 |
59 As seen there are languages which contain infinitely many strings, like the set of all strings. |
62 As seen, there are languages which contain infinitely many strings, like the set of all strings. |
60 The ``natural'' languages English, French and so on contain many but only finitely many |
63 The ``natural'' languages like English, French and so on contain many but only finitely many |
61 strings (the ones listed in a good dictionary). It might be therefore surprising that the |
64 strings (namely the ones listed in a good dictionary). It might be therefore be surprising that the |
62 language consisting of all email addresses is infinite if we assume it is defined by the |
65 language consisting of all email addresses is infinite provided we assume it is defined by the |
63 regular expression\footnote{See \url{http://goo.gl/5LoVX7}} |
66 regular expression\footnote{See \url{http://goo.gl/5LoVX7}} |
64 |
67 |
65 \[ |
68 \[ |
66 ([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}}) |
69 ([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}}) |
67 \] |
70 \] |
68 |
71 |
69 \noindent |
72 \noindent |
70 The reason is that for example before the $@$-sign there can be any string you want if it |
73 The reason is that for example before the $@$-sign there can be any string you want assuming it |
71 is made up from letters, digits, underscore, dot and hyphen---there are infinitely many |
74 is made up from letters, digits, underscores, dots and hyphens---clearly there are infinitely many |
72 of those. Similarly the string after the $@$-sign can be any string. This does not mean |
75 of those. Similarly the string after the $@$-sign can be any string. However, this does not mean |
73 that every string is an email address. For example |
76 that every string is an email address. For example |
74 |
77 |
75 \[ |
78 \[ |
76 \text{\it foo}@\text{\it bar}.\text{\it c} |
79 \text{\it foo}@\text{\it bar}.\text{\it c} |
77 \] |
80 \] |
78 |
81 |
79 \noindent |
82 \noindent |
80 is not, since the top-level-domains must be of length of at least two. Note that there is |
83 is not, because the top-level-domains must be of length of at least two. (Note that there is |
81 the convention that uppercase letters are treated in email-addresses as if they were |
84 the convention that uppercase letters are treated in email-addresses as if they were |
82 lower-case. |
85 lower-case.) |
83 \bigskip |
86 \bigskip |
84 |
87 |
85 Before we expand on the topic of regular expressions, let us review some operations on |
88 Before we expand on the topic of regular expressions, let us review some operations on |
86 sets. We will use capital letters, such as $A$, $B$ and so on, to stand for sets of strings. |
89 sets. We will use capital letters $A$, $B$, $\ldots$ to stand for sets of strings. |
87 The union of two sets is written as usual as $A \cup B$. We also need to define the |
90 The union of two sets is written as usual as $A \cup B$. We also need to define the |
88 \emph{concatenation} of two sets (of strings). This can be defined as |
91 operation of \emph{concatenating} two sets of strings. This can be defined as |
89 |
92 |
90 \[ |
93 \[ |
91 A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \} |
94 A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \} |
92 \] |
95 \] |
93 |
96 |
94 \noindent |
97 \noindent |
95 which essentially means take the first string from the set $A$ and concatenate it with every |
98 which essentially means take the first string from the set $A$ and concatenate it with every |
96 string in the set $B$, then take the second string from $A$ and so on. We also need to define |
99 string in the set $B$, then take the second string from $A$ do the same and so on. You might |
97 the power of a set, written as $A^n$. This is defined inductively as follows |
100 like to think about what this definition means in case $A$ or $B$ is the empty set. |
|
101 |
|
102 We also need to define |
|
103 the power of a set, written as $A^n$ with $n$ being a natural number. This is defined inductively as follows |
98 |
104 |
99 \begin{center} |
105 \begin{center} |
100 \begin{tabular}{rcl} |
106 \begin{tabular}{rcl} |
101 $A^0$ & $\dn$ & $\{[\,]\}$ \\ |
107 $A^0$ & $\dn$ & $\{[\,]\}$ \\ |
102 $A^{n+1}$ & $\dn$ & $A @ A^n$\\ |
108 $A^{n+1}$ & $\dn$ & $A @ A^n$\\ |
103 \end{tabular} |
109 \end{tabular} |
104 \end{center} |
110 \end{center} |
105 |
111 |
106 \noindent |
112 \noindent |
107 Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union |
113 Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union |
108 of every power of $A^n$ for every $n\ge 0$. The mathematical notation for this operation is |
114 of every power of $A^n$ with $n\ge 0$. The mathematical notation for this operation is |
109 |
115 |
110 \[ |
116 \[ |
111 A^* \dn \bigcup_{0\le n} A^n |
117 A^* \dn \bigcup_{0\le n} A^n |
112 \] |
118 \] |
113 |
119 |
114 \noindent |
120 \noindent |
115 This means the star of a set $A$ contains always the empty string, one copy of every string in $A$, two |
121 This definition implies that the star of a set $A$ contains always the empty string (that is $A^0$), one |
116 copies and so on. In case $A=\{"a\}$ we have |
122 copy of every string in $A$ (that is $A^1$), two copies in $A$ (that is $A^2$) and so on. In case $A=\{"a"\}$ we therefore |
|
123 have |
117 |
124 |
118 \[ |
125 \[ |
119 A^* = \{"", "a", "aa", "aaa", \ldots\} |
126 A^* = \{"", "a", "aa", "aaa", \ldots\} |
120 \] |
127 \] |
121 |
128 |
122 \noindent |
129 \noindent |
123 Be aware that these operations sometimes have non-intuitive properties, for example |
130 Be aware that these operations sometimes have quite non-intuitive properties, for example |
124 |
131 |
125 \begin{center} |
132 \begin{center} |
126 \begin{tabular}{ccc} |
133 \begin{tabular}{ccc} |
127 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
134 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
128 $A \cup \varnothing$ & $=$ & $A$\\ |
135 $A \cup \varnothing$ & $=$ & $A$\\ |
135 $A @ \{""\}$ & $=$ & $\{""\} @ A = A$\\ |
142 $A @ \{""\}$ & $=$ & $\{""\} @ A = A$\\ |
136 \end{tabular} & |
143 \end{tabular} & |
137 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
144 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
138 $\varnothing^*$ & $=$ & $\{""\}$\\ |
145 $\varnothing^*$ & $=$ & $\{""\}$\\ |
139 $\{""\}^*$ & $=$ & $\{""\}$\\ |
146 $\{""\}^*$ & $=$ & $\{""\}$\\ |
140 $A^1$ & $=$ & $A$\\ |
147 $A^\star$ & $=$ & $\{""\} \cup A\cdot A^*$\\ |
141 \end{tabular} |
148 \end{tabular} |
142 \end{tabular} |
149 \end{tabular} |
143 \end{center} |
150 \end{center} |
144 \bigskip |
151 \bigskip |
145 |
152 |
146 \noindent |
153 \noindent |
147 \emph{Regular expressions} are meant for conveniently describing languages...at least languages |
154 \emph{Regular expressions} are meant to conveniently describe languages...at least languages |
148 we are interested in in Computer Science. For example there is no convenient regular expression |
155 we are interested in in Computer Science. For example there is no convenient regular expression |
149 for describing the English language short of enumerating all English words/strings like in a dictionary. |
156 for describing the English language short of enumerating all English words. |
150 But they seem useful for describing all permitted email addresses, as seen above. |
157 But they seem useful for describing all permitted email addresses, as seen above. |
151 |
158 |
152 Regular expressions are given by the following grammar: |
159 Regular expressions are given by the following grammar: |
153 |
160 |
154 \begin{center} |
161 \begin{center} |
155 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} |
162 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} |
156 $r$ & $::=$ & $\varnothing$ & null\\ |
163 $r$ & $::=$ & $\varnothing$ & null\\ |
157 & $\mid$ & $\epsilon$ & empty string / "" / []\\ |
164 & $\mid$ & $\epsilon$ & empty string / "" / []\\ |
158 & $\mid$ & $c$ & character\\ |
165 & $\mid$ & $c$ & single character\\ |
159 & $\mid$ & $r_1 \cdot r_2$ & sequence\\ |
166 & $\mid$ & $r_1 \cdot r_2$ & sequence\\ |
160 & $\mid$ & $r_1 + r_2$ & alternative / choice\\ |
167 & $\mid$ & $r_1 + r_2$ & alternative / choice\\ |
161 & $\mid$ & $r^*$ & star (zero or more)\\ |
168 & $\mid$ & $r^*$ & star (zero or more)\\ |
162 \end{tabular} |
169 \end{tabular} |
163 \end{center} |
170 \end{center} |
164 |
171 |
165 \noindent |
172 \noindent |
166 There are some subtleties you should be aware of. The letter $c$ stands for any character from the |
173 Because we overload our notation there are some subtleties you should be aware of. The letter $c$ stands for any character from the |
167 alphabet. Second, we will use parentheses to disambiguate |
174 alphabet at hand. Second, we will use parentheses to disambiguate |
168 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$. |
175 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from, say $r_1 + (r_2)^*$. |
169 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times |
176 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times |
170 $r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$, |
177 $r_2$. We should also write $(r_1 + r_2) + r_3$, which is different from the regular expression $r_1 + (r_2 + r_3)$, |
171 but in case of $+$ and $\cdot$ we actually do not care and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, |
178 but in case of $+$ and $\cdot$ we actually do not care about the order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, |
172 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice |
179 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice |
173 $r_1 + r_2$ is written as $r_1\mid{}r_2$. In case of $\cdot$ we will even often omit it all together. For example |
180 $r_1 + r_2$ is written as $r_1\mid{}r_2$. Also following the convention in the literature, we will in case of $\cdot$ even |
174 the regular expression for email addresses is meant to be of the form |
181 often omit it all together. For example the regular expression for email addresses shown above is meant to be of the form |
175 |
182 |
176 \[ |
183 \[ |
177 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots |
184 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots |
178 \] |
185 \] |
179 |
186 |