April 16, 2014

Finding Things on Linux and Understanding Regular Expressions

The Shell Built-in Wildcard Provision

  • September 14, 2009
  • By Juliet Kemp

OK, so what might you want to do on the command line with proper regular expressions? One of the most common uses of regexps on the command line is with grep. This example finds every Perl shebang line (a line like #!/usr/bin/perl) in the files in the current working directory, and thus finds all your Perl scripts:

grep '^#!.*perl' *

(Note that this would find text files with scripts included in them, as well as executable scripts.) The ^ means 'the start of the line' ($ means the end of the line). .* means 'any character' (the period) 'any number of times' (the *). For 'one or more of the previous character', use + together with grep -E (extended syntax); in grep's default basic syntax a bare + is a literal plus sign, although GNU grep accepts \+ there.
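To see the anchor and the difference between * and + in action, here is a small sketch you can run in a scratch directory. The file names and contents are made up for illustration, and it assumes mktemp and a grep that supports -E and -l.

```shell
# Build a throwaway directory with three hypothetical files.
demo=$(mktemp -d)
printf '%s\n' '#!/usr/bin/perl' 'print 1;' > "$demo/script.pl"
printf '%s\n' '#!/bin/sh' 'echo hi' > "$demo/script.sh"
printf '%s\n' 'Notes about perl, with no shebang line' > "$demo/notes.txt"

# -l prints only the names of the files that contain a match.
perl_scripts=$(cd "$demo" && grep -l '^#!.*perl' *)
echo "$perl_scripts"

# With -E (extended syntax), + means "one or more of the previous item".
plus_matches=$(cd "$demo" && grep -lE '^#!/.+perl' *)
echo "$plus_matches"

rm -rf "$demo"
```

Both commands list only script.pl: notes.txt mentions perl but not at the start of a shebang line, and script.sh has a shebang without perl.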


Here's another grep example, this time to look for things that look like UK mobile phone numbers (07123 456789) in my 'addresses' mail folder:

grep -C 10 '07[0-9]\{3\} [0-9]\{6\}' ~/mail/addresses

-C 10 gives you 10 lines of context on each side of the match, which might be helpful here in matching number to name. The 07 looks for exactly that match. [0-9] specifies any digit in that range ([0123] would specify any of the digits 0, 1, 2, or 3). All by itself, the range will be matched only once. \{3\} means 'match the preceding range exactly 3 times'. With grep -E, [0-9]+ would match one or more digits (in the basic syntax used here, GNU grep needs \+ instead). Finally, [0-9]\{6\} means 'exactly 6 digits'.
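The sketch below compares the exact-count form with the looser 'one or more' form. The three sample lines are invented stand-ins for a mail folder, and -c is used only to count matching lines.

```shell
# Hypothetical sample lines standing in for a mail folder.
sample=$(mktemp)
printf '%s\n' 'Alice: 07123 456789' 'Bob: 0712 456789' 'Carol: 07999 000111' > "$sample"

# Basic syntax: interval braces are escaped as \{n\}.
bre_count=$(grep -c '07[0-9]\{3\} [0-9]\{6\}' "$sample")

# Extended syntax (-E): braces are unescaped, and + means "one or more".
ere_count=$(grep -cE '07[0-9]{3} [0-9]{6}' "$sample")
loose_count=$(grep -cE '07[0-9]+ [0-9]+' "$sample")

echo "$bre_count $ere_count $loose_count"   # prints "2 2 3"
rm -f "$sample"
```

Bob's number has only two digits after the 07, so the exact-count patterns skip it while the + version lets it through.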


Perl is another utility where regexps can be useful. Using perl as a stream editor gives you access to the backreference facility of regular expressions. This line would edit all files in the current directory, changing any instance of Juliet to Juliet Kemp:

perl -i.old -pe 's#\b(Juliet)\b#$1 Kemp#g' *

-i.old means that each file is edited in place, with the original saved as filename.old first. -p wraps the code in a loop that reads each line of each file in turn and prints it after the code has run, and -e supplies that code on the command line; together they set Perl up as a stream editor. The regexp \b(Juliet)\b looks for the word Juliet occurring with a word boundary on either side, so it only picks up Juliet as a word, not as part of a word (and thus wouldn't change the name of my theoretical colleague Juliette). The parentheses store the matched text in the variable $1: this is the backreference. The second part of the search-and-replace, $1 Kemp, then refers back to the first part to use that $1 variable, and adds my surname on the end.


This next example would be helpful if you have a directory on your website which has PDF files and HTML files all mixed together, and you've just decided to move the PDF files into a separate directory. To update all references in all files in the current directory, use this line:

perl -i.old -pe 's#<a href="([^/]*\.pdf)">#<a href="pdfdir/$1">#g' *

The important part here is ([^/]*\.pdf). The parentheses again identify the part to store as $1. [^/] means 'any character except /': a caret inside square brackets means negation. This is here so that you only change file.pdf, not otherdir/file.pdf or http://othersite/dir/file.pdf. The star afterwards means "any number of the previous character class", i.e. any number of any character except /. \.pdf matches exactly .pdf: the backslash is used to treat the period as a real period, not with its special "any character" meaning.


You can find a quick summary of regular expression options here.

find and locate

You can also use regular expressions with find, with the -regex switch. For example, to find all .log files in the current directory and its subdirectories:

find . -regex '.*\.log'

It's important to bear in mind that the expression is matched against the full path of each file, not just the filename, and that it has to match the whole of that path. So your regular expression needs to match /path/to/my/filename.txt, not just filename.txt.
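The full-path behaviour is easiest to see side by side. This sketch uses a made-up directory tree and assumes a find that supports -regex (GNU and BSD find both do, though it isn't in POSIX).

```shell
demo=$(mktemp -d)
mkdir -p "$demo/logs/old"
touch "$demo/app.log" "$demo/logs/old/app.log" "$demo/logs/readme.txt"

# The pattern has to match the whole path, e.g. ./logs/old/app.log.
all_logs=$(cd "$demo" && find . -regex '.*\.log' | sort)

# A pattern written for the bare filename matches nothing,
# because every path find tests here starts with "./".
name_only=$(cd "$demo" && find . -regex 'app\.log')

# Using the path on purpose: only .log files somewhere under ./logs.
under_logs=$(cd "$demo" && find . -regex '\./logs/.*\.log')

echo "$all_logs"
rm -rf "$demo"
```

The first search finds both log files, the second finds none, and the third finds only ./logs/old/app.log.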


You can also use the --regexp switch with locate:

locate --regexp '.*mail$'

will find any file path ending with mail. (Note that this uses --regexp while find uses -regex.)


In the second part of this series, I'll look at regular expressions used in editors and elsewhere.
