Regular expression: From Wikipedia, the free encyclopedia
In computing, regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) provide a concise and flexible means for identifying text of interest, such as particular characters, words, or patterns of characters. Regular expressions are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.
Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. For example, Perl and Tcl have a powerful regular expression engine built directly into their syntax. Several utilities provided by Unix distributions—including the editor ed and the filter grep—were the first to popularize the concept of regular expressions.
7.1 Exercise – Fill in the following chart:
Number of Hits
In these notes we concentrate on POSIX regular expressions using egrep.
Assume we have a directory with the following contents:
Using “wild cards”:
7.2 The Grep Family
The UNIX grep utility marked the birth of global regular expression print (GREP) tools. Searching for patterns in text is an important operation in a number of domains, including program comprehension and software maintenance, structured text databases, indexing file systems, and searching natural language texts. Such a wide range of uses inspired the development of variations of the original UNIX grep. These variations range from adding new features, to employing faster algorithms, to changing the behaviour of pattern matching and printing. This survey presents all the major developments in global regular expression print tools, namely the UNIX grep family, the GNU grep family, agrep, cgrep, sgrep, nrgrep, and Perl regular expressions.
Taken from man grep:
7.3 Regexes: Some Examples
Some Examples: We start with several simple examples. Assume we have a file fruits:
Matching characters “strings” by example:
Metacharacters: Metacharacters are characters that have ‘special’ meaning. Here are the metacharacters that are defined.
. Matches any single character.
* “character*” specifies that the character can be matched zero or more times.
+ “character+” Matches that character one or more times. Pay careful attention to the difference between * and +; * matches zero or more times, so whatever’s being repeated may not be present at all, while + requires at least one occurrence. For example, ca+t will match “cat” (1 “a”) and “caaat” (3 “a”s), but won’t match “ct”.
? “character?” Matches that character either once or zero times; you can think of it as marking something as being optional. For example, home-?brew matches either “homebrew” or “home-brew”.
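The three repetition metacharacters can be compared side by side. The following is a minimal sketch, not part of the original notes: the word list is made up for illustration, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical sample words, one per line (not from the original notes).
words=$(printf '%s\n' ct cat caaat homebrew home-brew)

# ca*t: zero or more "a" between "c" and "t", so "ct" is a hit.
star=$(printf '%s\n' "$words" | grep -E 'ca*t')

# ca+t: at least one "a" is required, so "ct" is excluded.
plus=$(printf '%s\n' "$words" | grep -E 'ca+t')

# home-?brew: the hyphen is optional.
opt=$(printf '%s\n' "$words" | grep -E 'home-?brew')

printf 'ca*t:\n%s\n\nca+t:\n%s\n\nhome-?brew:\n%s\n' "$star" "$plus" "$opt"
```

The only difference between the first two results is the line “ct”, which is exactly the zero-occurrence case that * allows and + rejects.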
If you wish to search for a metacharacter literally, it must be escaped by preceding it with a backslash “\”. As an example, let’s assume we have a file such as:
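And we wish to find “188.8.131.52”. egrep '188.8.131.52' ip will NOT work as intended: an unescaped “.” matches any character, so the pattern also matches lines such as “188x8y131z52”. We must escape the “.” character: egrep '188\.8\.131\.52' ip.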
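The effect of escaping can be checked with a small experiment. This is a sketch under stated assumptions: the file contents below are invented stand-ins for the ip file, the second line is deliberately not an IP address, and grep -E stands in for egrep.

```shell
# Hypothetical stand-in for the "ip" file; the second line is a decoy.
ip=$(mktemp)
printf '%s\n' '188.8.131.52' '188x8y131z52' > "$ip"

# Unescaped: each "." matches any character, so the decoy is also a hit.
loose=$(grep -E '188.8.131.52' "$ip")

# Escaped: "\." matches only a literal dot.
strict=$(grep -E '188\.8\.131\.52' "$ip")

printf 'unescaped:\n%s\n\nescaped:\n%s\n' "$loose" "$strict"
rm -f "$ip"
```

The unescaped pattern reports both lines, while the escaped pattern reports only the real address.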
With ^ and $ you can force a regex to match only at the start or end of a line, respectively.
^ Match at the start of a line
$ Match at the end of a line
As you can see, the regex ^p matches neither apple nor grape, since neither starts with a ‘p’. The fact that they contain a ‘p’ elsewhere is irrelevant. Similarly, the regex e$ only matches apple, orange and grape. So:
^cat matches only those lines that start with cat, and
cat$ only matches lines ending with cat.
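Both anchors can be tried on a small stand-in for the fruits file. The original listing of that file is not shown in these notes, so the four names below are an assumption chosen to agree with the results described above; grep -E is used as the equivalent of egrep.

```shell
# Hypothetical contents for "fruits"; the real listing is not shown in the notes.
fruits=$(printf '%s\n' apple orange pear grape)

# ^p: only lines that start with "p". A "p" elsewhere in the line does not count.
starts_p=$(printf '%s\n' "$fruits" | grep -E '^p')

# e$: only lines that end with "e".
ends_e=$(printf '%s\n' "$fruits" | grep -E 'e$')

printf '^p:\n%s\n\ne$:\n%s\n' "$starts_p" "$ends_e"
```

With this sample, ^p reports only pear, and e$ reports apple, orange and grape.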
Mind the quotes though! In most shells, the dollar-sign has a special meaning. By putting the regex in single-quotes (not double-quotes or back-quotes), the regex will be protected from the shell, so to speak. It’s generally a good idea to single-quote your regexes.
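To see what the shell does to a pattern before grep receives it, the pattern can be stored in a variable and printed. The sketch below is illustrative only: the pattern \$5 (a literal dollar sign followed by 5) and the two input lines are made up, and grep -E stands in for egrep.

```shell
# Single quotes: the shell passes the pattern through untouched.
single='\$5'

# Double quotes: the shell consumes the backslash, leaving a bare "$".
double="\$5"

printf 'single-quoted pattern: %s\n' "$single"
printf 'double-quoted pattern: %s\n' "$double"

# Only the single-quoted form still asks grep for a literal dollar sign.
hit=$(printf '%s\n' 'costs $5' 'costs 5' | grep -E "$single")
printf 'match: %s\n' "$hit"
```

The double-quoted version reaches grep as $5, where the dollar sign is read as an end-of-line anchor rather than a literal character, which is why single quotes are the safer default.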
Moving on, ^cat$ only matches lines that contain exactly cat. You can find empty lines in a similar way with ^$. If you’re having trouble understanding that last one, just apply the definitions. The regex basically says: “Match a start-of-line, followed by an end-of-line”.
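Combining both anchors can be sketched as follows. The four input lines are hypothetical (the third one is empty), and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical input: four lines, the third one empty.
text=$(printf '%s\n' 'cat' 'cat food' '' 'tomcat')

# ^cat$: the whole line must be exactly "cat".
exact=$(printf '%s\n' "$text" | grep -E '^cat$')

# ^$: start-of-line immediately followed by end-of-line, i.e. an empty line.
# -c prints the number of matching lines, since empty lines show nothing.
empty_count=$(printf '%s\n' "$text" | grep -E -c '^$')

printf 'exact matches: %s\nempty lines: %s\n' "$exact" "$empty_count"
```

Neither “cat food” nor “tomcat” is reported by ^cat$, because each has extra text on one side of cat.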