Saturday, 15 February 2014

Regular Expressions Special Characters Explained

“Regular expressions” (often shortened to “regex”) is a language used to represent patterns for matching text. Regular expressions are the primary text-matching mechanism in most text-processing tools, including grep, egrep, awk, and sed.
The following table contains the basic elements, along with descriptions and examples.
Regex   Description                                           Example
^       The start-of-line marker                              ^tux matches any line that starts with tux
$       The end-of-line marker                                tux$ matches any line that ends with tux
.       Matches any one character                             Hack. matches Hack1 and Hacki but not Hack12 or Hackil; exactly one additional character matches
[]      Matches any one character from the set inside []      coo[kl] matches cook or cool
[^]     Exclusion set: any one character not listed after ^   9[^01] matches 92 and 93 but not 91 or 90
[-]     Matches any one character within the given range      [1-5] matches any digit from 1 to 5
?       The preceding item must match zero or one time        colou?r matches color or colour but not colouur
+       The preceding item must match one or more times       Rollno-9+ matches Rollno-99 and Rollno-9 but not Rollno-
*       The preceding item must match zero or more times      co*l matches cl, col, and coool
()      Creates a substring (group) in the regex match        ma(tri)?x matches max or matrix
{n}     The preceding item must match exactly n times         [0-9]{3} matches any three-digit number; it expands to [0-9][0-9][0-9]
{n,}    The preceding item must match at least n times        [0-9]{2,} matches any number that is two or more digits long
{n,m}   The preceding item must match between n and m times   [0-9]{2,5} matches any number that is two to five digits long
|       Alternation: the item on either side of | may match   Oct (1st|2nd) matches Oct 1st or Oct 2nd
\       Escapes any of the special characters listed above    a\.b matches a.b but not ajb: the dot is treated as a literal period rather than as “match any one character”. Likewise, to search for the US currency symbol $ rather than the end-of-line marker, write \$
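To try a few rows of the table, here is a minimal sketch that feeds some sample words to grep -E (extended regular expressions). The word list and the helper name match_words are made up for illustration; the only assumption is a grep that supports -E.

```shell
# Sample words (illustrative only), one per line.
words='color
colour
colouur
cook
cool
coop'

# Hypothetical helper: print the sample words that match the given pattern.
match_words() {
    printf '%s\n' "$words" | grep -E "$1"
}

match_words '^colou?r$'    # ? makes the u optional: color, colour
match_words '^coo[kl]$'    # [kl] accepts exactly one of k or l: cook, cool
match_words 'u{2,}'        # {2,} needs two or more consecutive u: colouur
match_words '^co(ok|ol)$'  # | alternation inside a group: cook, cool
```

The ^ and $ anchors are used so that each pattern has to cover the whole word rather than just a part of it.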
There are a few character classes, called POSIX classes, in the format [:name:] that can be used instead of spelling out the character set each time. Note that, as the Example column of the table below shows, you need to enclose the class itself in another pair of square brackets. For example:
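$ echo -e "max\nOR\nMatrix" | sed '/[:alpha:]/d'
OR
$ echo -e "max\nOR\nMatrix" | sed '/[[:alpha:]]/d'
$
In the first case, the set is interpreted literally as the characters :, a, l, p, and h: the lines max and Matrix are deleted because they contain a, one of the letters in the set. In the second command, with another pair of square brackets around the class, all input lines are deleted, because every line contains at least one alphabetic character. Note that some implementations, including recent versions of GNU sed, may reject the first form with a character-class syntax error instead of interpreting it literally.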
Class       Description                                            Example
[:alnum:]   Alphanumeric characters                                [[:alnum:]]+
[:alpha:]   Alphabetic characters (lowercase and uppercase)        [[:alpha:]]{4}
[:blank:]   Space and tab                                          [[:blank:]]*
[:digit:]   Digits                                                 [[:digit:]]?
[:lower:]   Lowercase letters                                      [[:lower:]]{5,}
[:upper:]   Uppercase letters                                      ([[:upper:]]+)?
[:punct:]   Punctuation characters                                 [[:punct:]]
[:space:]   All whitespace, including newline and carriage return  [[:space:]]+
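The classes in the table can be tried out with grep -E, as in the following sketch. The sample lines and the helper name match_lines are invented for illustration, and the results assume an ordinary C or UTF-8 locale.

```shell
# Sample input lines (illustrative only).
lines='abc123
hello world
UPPER
42'

# Hypothetical helper: print the sample lines that match the given pattern.
match_lines() {
    printf '%s\n' "$lines" | grep -E "$1"
}

match_lines '^[[:digit:]]+$'   # digits only: 42
match_lines '[[:blank:]]'      # contains a space or tab: hello world
match_lines '^[[:upper:]]+$'   # uppercase letters only: UPPER
match_lines '^[[:alnum:]]+$'   # letters and digits only: abc123, UPPER, 42
```

Note that each class sits inside its own pair of square brackets, so the pattern uses double brackets.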
The following meta-characters are Perl-style regular expression notations that only a subset of text-processing utilities support, so check your tool before relying on them.
Regex   Description                                Example
\b      Word boundary                              \bcool\b matches only cool and not coolant
\B      Non-word boundary                          cool\B matches coolant but not cool
\d      Single digit character                     b\db matches b2b but not bcb
\D      Single non-digit character                 b\Db matches bcb but not b2b
\w      Single word character (alphanumeric or _)  \w matches 1 or a but not &
\W      Single non-word character                  \W matches & but not 1 or a
\n      Newline                                    \n matches a newline
\s      Single whitespace character                x\sx matches x x but not xx
\S      Single non-whitespace character            x\Sx matches xkx but not x x
\r      Carriage return                            \r matches a carriage return
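Support for these notations really does vary by tool. The sketch below assumes GNU grep, whose -E mode accepts \b, \B, \w, and \W as extensions; other grep implementations may not. It deliberately leaves out \d, which grep -E generally does not recognize (GNU grep offers -P for Perl-compatible patterns where it is built with that support). The sample lines and the helper name match_meta are invented for illustration.

```shell
# Sample input lines (illustrative only).
text='cool
coolant
b2b
a&b'

# Hypothetical helper: print the sample lines that match the given pattern.
# Assumes GNU grep, which accepts \b, \B, \w and \W in -E mode.
match_meta() {
    printf '%s\n' "$text" | grep -E "$1"
}

match_meta '\bcool\b'   # whole word only: cool
match_meta 'cool\B'     # cool followed by more word characters: coolant
match_meta '\W'         # contains a non-word character: a&b
```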
The above tables can be used as a reference while constructing regular expression patterns.
Let us go through a few examples of regular expressions.

Treatment of special characters

Regular expressions treat characters such as $, ^, ., *, +, {, and } as special characters. But what if we want to use these characters as normal text characters? Let’s see an example. Consider the regex [a-z]*.[0-9]
How is this interpreted? It can mean zero or more characters from [a-z] ([a-z]*), then any one character (.), then one character from the set [0-9], so that it matches abcdeO9. It can also be read as one character from [a-z], then a literal *, then a literal . (period), and then a digit, so that it matches x*.8. To resolve this ambiguity, we precede the character with a backslash \ (doing this is called “escaping the character”). Depending on the character and the tool, the \ prefix either gives a character its special meaning or takes that meaning away.
Whether special or non-special characters need to be escaped depends on the tool that you are using. In short, “special meaning” means that a character is interpreted as something other than its literal ASCII value. For example, a* matches zero or more occurrences of a (the empty string, a, aa, aaa, and so on). Here * has a special meaning, since it is not interpreted as the ASCII character “*”.
Certain characters must be escaped using \ to give them a special meaning, while others (e.g., *) are special by default and must be escaped to be used as regular ASCII characters. Here is a short list of characters that gain their special meaning when escaped, as in the basic regular expression syntax used by sed and by grep without -E: \+, \{, \}, \(, \), \?.
Characters that are special by default (you need to escape these in order to use them as regular ASCII characters): *, ., ^, $, [, ].
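The two readings of [a-z]*.[0-9] discussed above can be compared directly. This is a minimal sketch using grep -E, where *, . and the other default-special characters are made literal with a backslash; the sample lines and the helper name match_sample are invented for illustration.

```shell
# Sample input lines (illustrative only).
sample='abcdeO9
x*.8
a.b
ajb'

# Hypothetical helper: print the sample lines that match the given pattern.
match_sample() {
    printf '%s\n' "$sample" | grep -E "$1"
}

match_sample '^[a-z]*.[0-9]$'     # * and . are special: abcdeO9
match_sample '^[a-z]\*\.[0-9]$'   # \* and \. are literal: x*.8
match_sample '^a.b$'              # . is any character: a.b, ajb
match_sample '^a\.b$'             # \. is a literal period: a.b
```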
To match any line containing only the word test, and no other characters on it, use ^test$. This is interpreted as “start of line marker” followed by “test” followed by “end of line marker”.
Another good example is extracting email addresses from a given text. An email address has the format username@domain.root. We can formulate the regular expression as:
[A-Za-z0-9.]+@[A-Za-z0-9.]+\.[a-zA-Z]{2,4}
The [A-Za-z0-9.]+ before the @ states that a character from the given class should occur one or more times, and the same holds after the @. The \. matches the literal dot before the TLD (top-level domain), which here can be two to four letters in length, as specified by {2,4}.
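As a rough sketch of how such a pattern might be used, the following passes a line of text to grep with -o (print only the matched part) and -E, with the dot before the TLD escaped so that it matches a literal period. The addresses are placeholders on example domains, the helper name extract_emails is hypothetical, and the sketch assumes a grep that supports -o (GNU and BSD grep do).

```shell
# Sample text containing two placeholder addresses (illustrative only).
text='Write to admin@example.com or sales.team@example.org for details.'

# Hypothetical helper: print each email-like match on its own line.
extract_emails() {
    printf '%s\n' "$1" | grep -oE '[A-Za-z0-9.]+@[A-Za-z0-9.]+\.[a-zA-Z]{2,4}'
}

extract_emails "$text"
# prints:
# admin@example.com
# sales.team@example.org
```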

Points to remember

In sed’s default (basic) syntax, as implemented by GNU sed, “one or more” is written with a backslash as \+, while “zero or more” (*) needs no such prefix; with the -E option (extended syntax), a plain + works. We can do an inverse match using /PATTERN/!{statements}; that is, place the bang (!) after the slash that closes PATTERN.
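These points can be checked with a few one-liners. The sketch below assumes GNU sed (the \+ form is a GNU extension to the basic syntax), and all input strings are invented for illustration.

```shell
# Sample roll numbers (illustrative only).
rolls='Rollno-99
Rollno-9
Rollno-'

# Basic syntax: "one or more" needs the backslash (GNU sed).
printf '%s\n' "$rolls" | sed -n '/Rollno-9\+/p'      # Rollno-99, Rollno-9

# Extended syntax (-E): a plain + means "one or more".
printf '%s\n' "$rolls" | sed -E -n '/Rollno-9+/p'    # Rollno-99, Rollno-9

# "Zero or more" (*) needs no backslash in either syntax.
printf 'cl\ncol\ncoool\ncat\n' | sed -n '/^co*l$/p'  # cl, col, coool

# Inverse match: delete every line that does NOT match the pattern.
printf 'keep this\ndrop that\n' | sed '/keep/!d'     # keep this
```

In the last command a single d follows the bang; braces are only needed to group several commands.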
