Regular Expressions
In this lesson we look at regular expressions (regex) and how we can use regular expression patterns for matching data. A regular expression is a string containing normal characters as well as metacharacters, which together make a pattern we can use to match data. The metacharacters are used to represent concepts such as positioning, quantity and character types. Searching through data for specific characters or groups of characters is known as pattern matching and is generally done from the left to the right of the input character sequence.
Regular expressions are a large topic which you could write an entire book on and people have, but here we will just cover the basics of pattern matching to get a feel for how to use them. The table below lists the regular expression metacharacter constructs used in the examples in this lesson; a more complete, but not exhaustive table with examples is listed at the end of the lesson.
Metacharacter | Meaning |
---|---|
Escape/Unescape | |
\ | Gives special meaning to a character that is otherwise treated literally within a regular expression, or alternatively removes the special meaning from a metacharacter so that it is matched literally |
Quantifiers | |
? | Matches preceding item 0 or 1 times |
* | Matches preceding item 0 or more times |
+ | Matches preceding item 1 or more times |
Character Classes | |
[xyz] | A character set. Matches any of the enclosed characters. You can specify a range of characters by using a hyphen. |
[^xyz] | A negated character set. Matches anything not enclosed in the brackets. You can specify a range of characters by using a hyphen. |
Predefined Character Classes | |
. | Matches any single character except line terminators, unless the DOTALL flag is specified. |
\d | Find a digit character. Same as the range check [0-9]. |
\s | Find a whitespace character. |
\w | Find a word character. A word character is a character in ranges a-z, A-Z, 0-9 and also includes the _ (underscore) symbol. Same as the range check [A-Za-z0-9_]. |
String Searches
Before we look at a working example of using a regular expression we should talk a little about the java.util.regex package and the two main classes it contains. The java.util.regex.Pattern class allows us to instantiate a compiled representation of a regular expression we have passed to the class as a string. We can then use the matcher() method on the resultant pattern to create an instance of the java.util.regex.Matcher class, which can match arbitrary character sequences against the specified regular expression. Once we have compiled a regular expression into a Pattern object we can use multiple matchers against this pattern, as all of the state involved in performing a match resides in the Matcher instance. We can then call methods of the Matcher class to see if we got any matches. There is also a convenience matches() method in the Pattern class that allows us to compile a regular expression, use a matcher and see if it matches in a single statement. Let's look at a simple search to see how it all hangs together:
package info.java8;
/*
  Simple regex string search
*/
import java.util.regex.*; // Import all classes from the java.util.regex package
class TestSimpleRegex {
    public static void main(String[] args) {
        boolean b = false;
        Pattern p = Pattern.compile("is");     // Create a regex
        Matcher m = p.matcher("mississippi");  // Our string for matching
        // Prefix region matching
        b = m.lookingAt();
        System.out.println("Did we get a part region match? " + b);
        // Full region matching
        b = m.matches();
        System.out.println("Did we get a full region match? " + b);
        // Multiple matching
        while (b = m.find()) {  // find() returns false when there are no more matches
            System.out.println("We got a match at position: " + m.start());
        }
        // Convenience all in one method
        b = Pattern.matches("is", "mississippi");
        System.out.println("Did we get a full match? " + b);
        b = Pattern.matches("mississippi", "mississippi");  // Convenience all in one method
        System.out.println("Did we get a full match? " + b);
    }
}
Save, compile and run the TestSimpleRegex test class in directory c:\_APIContents2 in the usual way.
The above screenshot shows the output of compiling and running the TestSimpleRegex class. First off we compile a regex Pattern object from the string "is". Using this object we then create a Matcher object using the string "mississippi" as the character sequence to be matched. The Matcher object finds matches in a subset of its input called the region, which by default contains all of the matcher's input. There is also a region() method which can be used to modify the region boundaries, which I will leave as an exercise for you to look at. We then perform the three different kinds of match operations on the Matcher object.
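As a starting point for that exercise, here is a minimal sketch of how region() could be used. The class name RegionSketch, the helper method and the chosen bounds (1 and 3) are assumptions made purely for illustration and are not part of the lesson's examples.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionSketch {

    // Returns true if the regex matches the whole of input[start, end)
    static boolean matchesInRegion(String regex, String input, int start, int end) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.region(start, end);  // start is inclusive, end is exclusive
        return m.matches();
    }

    public static void main(String[] args) {
        // Default region is the whole input, so "is" is not a full match
        System.out.println(Pattern.compile("is").matcher("mississippi").matches()); // prints false
        // Region 1..3 covers just the characters "is"
        System.out.println(matchesInRegion("is", "mississippi", 1, 3)); // prints true
    }
}
```

Because the region narrows what counts as the "whole" input, the same pattern that fails a full match on "mississippi" succeeds once the region is limited to indexes 1 to 3.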
The lookingAt() method does a prefix region match which returns false as the prefix of our input doesn't match the pattern we created. The matches() method does a full region match which also returns false as our entire character input doesn't match the pattern we created. The find() method scans the region looking for subsequences that match the pattern and finds these at positions 1 and 4 (think of a zero-based index). We print messages to the console showing the results.
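The find() loop can also be wrapped in a small helper that gathers the positions rather than printing them. This is only a sketch; the class name FindPositionsSketch and the method startPositions() are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindPositionsSketch {

    // Collects the zero-based start index of every subsequence found by find()
    static List<Integer> startPositions(String regex, String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(startPositions("is", "mississippi")); // prints [1, 4]
    }
}
```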
Next we use the convenience matches() method of the Pattern class, which works the same as the matches() method of the Matcher class, and print some more messages to the console. It should be noted that the matches() method of the Pattern class is less efficient than its counterpart in the Matcher class when doing repeated matches, as it doesn't allow the compiled pattern to be reused.
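To illustrate the point about reuse, the sketch below compiles a pattern once and then creates a fresh Matcher for each input. The class name ReusePatternSketch, the constant IS and the sample inputs are assumptions made for this example.

```java
import java.util.regex.Pattern;

class ReusePatternSketch {

    // Compiled once; Pattern instances are immutable so can be shared and reused
    private static final Pattern IS = Pattern.compile("is");

    // Counts how many inputs contain at least one subsequence matching the pattern
    static int countInputsContaining(String[] inputs) {
        int count = 0;
        for (String input : inputs) {
            if (IS.matcher(input).find()) {  // new Matcher per input, same Pattern
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] inputs = {"mississippi", "missouri", "ohio"};
        System.out.println(countInputsContaining(inputs)); // prints 2
    }
}
```

Calling Pattern.matches() inside the loop instead would recompile the regular expression on every iteration.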
Metacharacter Searches
OK, we have seen how we can use regex to search strings for a prefix, subsequence or whole match of the character input, but what else does the regex engine bring to the party? We mentioned at the start of the lesson how we can use metacharacters to represent concepts such as positioning, quantity and character types for pattern matching. So in this part of the lesson we will look at a few of the more common metacharacters and how we incorporate them into our regex patterns. In our first example we will look at the \d, \s and \w metacharacters, which search for digits, whitespace characters and word characters (letters, digits and the underscore symbol _) respectively.
package info.java8;
/*
  Using regex metacharacter search
*/
import java.util.regex.*; // Import all classes from the java.util.regex package
class TestMetaRegex {
    public static void main(String[] args) {
        String str = "The quick brown fox. 1+1=2";
        String str2 = "---1+1=2---";
        boolean b = false;
        Pattern p = Pattern.compile("\\d");  // Create a regex to look for digits
        Matcher m = p.matcher(str);          // Our String object for matching
        // Multiple matching
        while (b = m.find()) {  // find() returns false when there are no more matches
            System.out.println("We found a digit at position: " + m.start());
        }
        p = Pattern.compile("\\s");  // Create a regex to look for whitespace
        m = p.matcher(str);          // Our String object for matching
        // Multiple matching
        while (b = m.find()) {
            System.out.println("We found a whitespace at position: " + m.start());
        }
        p = Pattern.compile("\\w");  // Create a regex to look for word characters
        m = p.matcher(str2);         // Our String object for matching
        // Multiple matching
        while (b = m.find()) {
            System.out.println("We found a word character at position: " + m.start());
        }
    }
}
Save, compile and run the TestMetaRegex test class in directory c:\_APIContents2 in the usual way.
The above screenshot shows the output of compiling and running the TestMetaRegex class. First off we compile a regex Pattern object from the string containing the metacharacter \d (look for a single digit). We have to escape the \ symbol using the escape character, which is also the \ symbol. We have to do this or the compiler thinks this is an escape sequence such as \n for a newline, thinks hey! I don't have an escape sequence for \d and reports a compiler error. Using this object we then create a Matcher object using the String object with a reference of str. We then use the find() method to scan the region looking for subsequences that match the pattern and print messages to the console showing the results.
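The double backslash can be seen at work in the short sketch below: the Java string literal "\\d" holds just two characters, a backslash and the letter d, and it is those two characters that the regex engine receives. The class name EscapeSketch and the helper isSingleDigit() are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

class EscapeSketch {

    // True only when the whole input is exactly one digit character
    static boolean isSingleDigit(String input) {
        return Pattern.matches("\\d", input);
    }

    public static void main(String[] args) {
        String regex = "\\d";
        System.out.println(regex);               // prints \d
        System.out.println(regex.length());      // prints 2
        System.out.println(isSingleDigit("7"));  // prints true
        System.out.println(isSingleDigit("d"));  // prints false
    }
}
```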
We then compile a regex Pattern object from the string containing the metacharacter \s (look for a single whitespace). The rest is the same as above and we then print messages to the console showing the results. Finally we do the same to search for word characters using the String object with a reference of str2.
In our second example of metacharacter usage we look at the . (dot) predefined character class as well as character sets and negated character sets using [] bracket notation.
package info.java8;
/*
  Using regex metacharacter search
*/
import java.util.regex.*; // Import all classes from the java.util.regex package
class TestMetaRegex2 {
    public static void main(String[] args) {
        String str = "Our tree is getting big";
        String str2 = "facetious";
        boolean b = false;
        Pattern p = Pattern.compile("t.e");  // regex to look for t and e with any char in between
        Matcher m = p.matcher(str);          // Our String object for matching
        // Multiple matching
        while (b = m.find()) {  // find() returns false when there are no more matches
            System.out.println("We found t (any char) e at position: " + m.start());
        }
        p = Pattern.compile("[aeiou]");  // Create a regex to look for vowels
        m = p.matcher(str2);             // Our String object for matching
        // Multiple matching
        while (b = m.find()) {
            System.out.println("We found a vowel at position: " + m.start());
        }
        p = Pattern.compile("[^aeiou]");  // Create a regex to look for non vowels
        m = p.matcher(str2);              // Our String object for matching
        // Multiple matching
        while (b = m.find()) {
            System.out.println("We found a non vowel at position: " + m.start());
        }
    }
}
Save, compile and run the TestMetaRegex2 test class in directory c:\_APIContents2 in the usual way.
The above screenshot shows the output of compiling and running the TestMetaRegex2 class. First off we compile a regex Pattern object from the string containing 't' and 'e' and the metacharacter . (any character). Using this object we then create a Matcher object using the String object with a reference of str. We then use the find() method to scan the region looking for a subsequence that matches the pattern and print a message to the console showing the results.
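The . metacharacter normally stops at line terminators. The following small sketch shows the difference the DOTALL flag makes; the class name DotAllSketch, the helper method and the test strings are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class DotAllSketch {

    // Searches input for t, any single character, then e, with or without DOTALL
    static boolean found(String input, boolean dotAll) {
        Pattern p = dotAll
                ? Pattern.compile("t.e", Pattern.DOTALL)
                : Pattern.compile("t.e");
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("tree", false));  // prints true
        System.out.println(found("t\ne", false));  // prints false, . will not match the newline
        System.out.println(found("t\ne", true));   // prints true, DOTALL lets . match the newline
    }
}
```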
We then compile a regex Pattern object from the string containing a character set looking for vowels and then non vowels. The rest is the same as above except we use the String object with a reference of str2. We also print messages to the console showing the results.
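Character sets can also hold ranges written with a hyphen, as noted in the table at the start of the lesson. The sketch below counts matches for a few ranges; the class name RangeSketch and the helper countMatches() are hypothetical and exist only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RangeSketch {

    // Counts how many subsequences of input match the regex
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("[a-f]", "facetious"));   // prints 4 (f, a, c, e)
        System.out.println(countMatches("[^a-f]", "facetious"));  // prints 5 (t, i, o, u, s)
        System.out.println(countMatches("[0-9]", "1+1=2"));       // prints 3
    }
}
```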
Quantifier Searches
In our final look at regex we discuss quantifiers and the effect they have on our search results. A quantifier is a metacharacter which allows us to specify how many times the preceding item may occur in a match. The quantifiers we look at here are ? for zero or one occurrences, * for zero or more occurrences and + for one or more occurrences. The following example shows usage of the quantifiers:
package info.java8;
/*
  Using regex quantifiers
*/
import java.util.regex.*; // Import all classes from the java.util.regex package
class TestRegexQquantifiers {
    public static void main(String[] args) {
        String str = "Oh geee, the tree hit my kneee";
        boolean b = false;
        Pattern p = Pattern.compile("ee?");  // look for e and then zero or 1 more e
        Matcher m = p.matcher(str);          // Our String object for matching
        // Multiple matching
        while (b = m.find()) {  // find() returns false when there are no more matches
            System.out.println("We found e, then zero or one more e at position: " + m.start());
        }
        p = Pattern.compile("ee*");  // look for e and then zero or more e's
        m = p.matcher(str);          // Our String object for matching
        // Multiple matching
        while (b = m.find()) {
            System.out.println("We found e, then zero or more e's at position: " + m.start());
        }
        p = Pattern.compile("ee+");  // look for e and then 1 or more e's
        m = p.matcher(str);          // Our String object for matching
        // Multiple matching
        while (b = m.find()) {
            System.out.println("We found e, then 1 or more e's at position: " + m.start());
        }
    }
}
Save, compile and run the TestRegexQquantifiers test class in directory c:\_APIContents2 in the usual way.
The above screenshot shows the output of compiling and running the TestRegexQquantifiers class. When we use the ? quantifier on our input character sequence it will find all subsequences of the letter 'e' followed by zero or one more 'e'. In other words it will match 'e' and 'ee'. So where we have three 'e's in a row we get two matches. When we use the * quantifier on our input character sequence it will find all subsequences of the letter 'e' followed by zero or more 'e's. So it will match 'e', 'ee' and 'eee'. When we use the + quantifier on our input character sequence it will find all subsequences of the letter 'e' followed by 1 or more 'e's. In this case it will match 'ee' and 'eee'.
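To see the matched text itself rather than only the positions, the group() method of the Matcher class can be used after each successful find(). The following is a sketch only; the class name QuantifierGroupsSketch and the helper matchedText() are hypothetical names introduced for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierGroupsSketch {

    // Collects the text of every subsequence matched by the regex
    static List<String> matchedText(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());  // the characters matched by the last find()
        }
        return found;
    }

    public static void main(String[] args) {
        String str = "Oh geee, the tree hit my kneee";
        System.out.println(matchedText("ee?", str));  // prints [ee, e, e, ee, ee, e]
        System.out.println(matchedText("ee*", str));  // prints [eee, e, ee, eee]
        System.out.println(matchedText("ee+", str));  // prints [eee, ee, eee]
    }
}
```

The quantifiers are greedy by default, which is why ee? takes 'ee' first from 'geee' and then finds the leftover 'e' as a second match.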
As a further note you can also use quantifiers in combination but I'll leave that for you to look at next time you examine the Java API.
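One reading of "in combination" is simply using more than one quantifier within a single pattern. As a rough sketch of that idea, the pattern below uses ? for an optional minus sign and + for one or more digits. The class name CombinedQuantifiersSketch, the helper integers() and the sample sentence are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CombinedQuantifiersSketch {

    // An optional minus sign (?) followed by one or more digits (+)
    private static final Pattern INTEGER = Pattern.compile("-?\\d+");

    // Collects every integer-looking subsequence in the input
    static List<String> integers(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = INTEGER.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(integers("Temperatures: -12, 7 and 105")); // prints [-12, 7, 105]
    }
}
```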
Regex Metacharacter Examples
The table below lists more regex metacharacters with examples of how they can be used. Luckily for certification purposes you don't need to memorize the whole table, just the rows with the light blue background. For a full list of regex constructs visit the Oracle online version of documentation for the Java™ 2 Platform Standard Edition 5.0 API Specification and scroll down the top left pane and click on java.util.regex.
Metacharacter | Meaning | Examples |
---|---|---|
Escape/Unescape | ||
\ | Gives special meaning to a character that is otherwise treated literally within a regular expression, or alternatively removes the special meaning from a metacharacter so that it is matched literally | Literal content: d matches the character d; \\d matches a digit character. Unescape special characters: d+ matches one or more d characters; d\\+ matches the literal text d+ |
Quantifiers | ||
? | Matches preceding item 0 or 1 times | do? Every dig has its day Every dog has its day Shut that doooor Can you see me |
* | Matches preceding item 0 or more times | do* Every dig has its day Every dog has its day Shut that doooor Can you see me |
+ | Matches preceding item 1 or more times | do+ Every dig has its day Every dog has its day Shut that doooor Can you see me |
Characters | ||
\f | Find a formfeed character. | \\f When matched will return the formfeed character When searched will return the zero-based index position the formfeed character was found in. |
\n | Find a newline character. | \\n When matched will return the newline character When searched will return the zero-based index position the newline character was found in. |
\r | Find a carriage return character. | \\r When matched will return the carriage return character When searched will return the zero-based index position the carriage return character was found in. |
\t | Find a tab character. | \\t When matched will return the tab character When searched will return the zero-based index position the tab character was found in. |
\cA | Find a control character A, where A is the control character in the range A-Z you are looking for. | \\cC Will search for control-C in a string. |
Character Classes | ||
[xyz] | A character set. Matches any of the enclosed characters. You can specify a range of characters by using a hyphen. | [eno] one two one twoo one twooo one twoooo |
[^xyz] | A negated character set. Matches anything not enclosed in the brackets. You can specify a range of characters by using a hyphen. | [^eno] one two one twoo one twooo one twoooo |
Predefined Character Classes | ||
. | Matches any single character except line terminators, unless the DOTALL flag is specified. | .t This Time tonight this is good |
\d | Find a digit character. Same as the range check [0-9]. | \\d Was it 76 or 77 |
\D | Find a non-digit character. Same as the range check [^0-9]. | \\D Was it 76 or 77 |
\s | Find a whitespace character. | \\s Beware of the dog |
\S | Find a non-whitespace character. | \\S Beware of the dog |
\w | Find a word character. A word character is a character in ranges a-z, A-Z, 0-9 and also includes the _ (underscore) symbol. Same as the range check [A-Za-z0-9_]. | \\w 76% off_sales. £12 only |
\W | Find a non-word character. Same as the range check [^A-Za-z0-9_]. | \\W 76% off_sales. £12 only |
\xnn | Find a character that equates to hexadecimal nn, where nn is a two digit hexadecimal number. | \\x70 The quick brown fox jumps. |
\unnnn | Find a character with the hexadecimal value nnnn, where nnnn is a four digit hexadecimal number. | \\u0065 The quick brown fox jumps. |
Boundary matches | ||
^ | Matches beginning of input. If the line match flag (m) is set it will also match after a line break character. | ^A an Armadillo An Armadillo |
$ | Matches end of input. If the line match flag (m) is set it will also match before a line break character. | Z$ BuzZ BuZz BuzZ BuZZ |
\b | Find a match at the beginning or end of a word. | At Beginning \\bday the day today is saturday daytime and nighttime At End day\\b the day today is saturday day and nighttime |
\B | Find a match NOT at the beginning or end of a word. | Not At Beginning \\Bday the day today is saturday daytime and nighttime Not At End day\\B the day today is saturday day and nighttime |
Special Constructs | ||
x(?=y) | Matches regexp x only if followed by y | and(?= five) one and two and three one and two and four one and two and five |
x(?!y) | Matches regexp x only if NOT followed by y | and(?! five) one two and three one two and four one two and five |
Occurrences | ||
{n} | Matches exactly n occurrences of the preceding item, where n is a positive integer | o{4} one two one twoo one twooo one twoooo |
{n,} | Matches at least n occurrences of the preceding item, where n is a positive integer | o{3,} one two one twoo one twooo one twoooo |
{n,m} | Matches at least n and at most m occurrences of the preceding item, where n and m are positive integers | o{2,3} one two one twoo one twooo one twoooo |
Logical Operators | ||
x|y | Matches x or y | three|four one two one two three one two four |
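A few of the constructs from the table that were not covered in the lesson examples can be tried out with the sketch below. The class name ConstructsSketch, the helper matchedText() and the sample strings (adapted from the table's example column) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConstructsSketch {

    // Collects the text of every subsequence matched by the regex
    static List<String> matchedText(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Word boundary: only the standalone word starts with "day"
        System.out.println(matchedText("\\bday", "the day today is saturday"));          // prints [day]
        // Occurrences: between two and three o's in a row
        System.out.println(matchedText("o{2,3}", "one twoo one twooo one twoooo"));      // prints [oo, ooo, ooo]
        // Lookahead: only the "and" that is followed by " five"
        System.out.println(matchedText("and(?= five)", "one and two and five").size());  // prints 1
        // Logical operator: either alternative
        System.out.println(matchedText("three|four", "one two three one two four"));     // prints [three, four]
    }
}
```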
Lesson 5 Complete
In this lesson we looked at regular expressions and how we can use regular expression patterns for matching data.
What's Next?
In our final lesson of the API Contents section we look at formatting and tokenizing our data.