Some regular expressions and how to use them in PHP

Thursday, July 28, 2011

This isn't a guide on how to create regular expressions (although I would like to write one). These are a few regular expressions that I made some time ago, when I had access to a list of names, addresses, phone numbers, etc. The entries were formatted in very different ways (especially the phone numbers), so the simpler regular expressions I had been using until then had to be improved. I also explain how to use these regular expressions to validate input fields in PHP scripts.

So, let's start.

Topics
What are these regular expressions?
The regular expressions
How to use these regular expressions in PHP
Resources
References
Foot notes

What are these regular expressions?

If we compare multiple strings of text related to the same subject, let's say first names and last names, we can see that patterns emerge. Some obvious ones are that a space always separates the first name from the last name, that the first character is always uppercase, and that there is always one last name but there can be more than one first name. A regular expression is a kind of computer language that allows us to define these patterns; a program called a parser (the regular expression engine) then identifies these patterns inside a given text. As usual, xkcd explains it nicely: http://xkcd.com/208/

The regular expressions

A name

[a-zA-ZÀ-ÖØ-öø-ÿ]+\.?(( |\-)[a-zA-ZÀ-ÖØ-öø-ÿ]+\.?)*

I believe this regular expression works for most names with foreign characters, or at least it worked with everything I had available. It also allows compound names (such as Sackville-Baggins). Rather than finding names inside a text, it is meant to validate the names provided by the user, since many ordinary words would also match.
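As a minimal sketch of that validation use, the expression can be wrapped in a small function. The name is_valid_name() is a hypothetical helper, and the "u" modifier is an assumption on my part: it makes PCRE read the pattern and the input as UTF-8, so the accented ranges are treated as characters rather than bytes.

```php
<?php
// Hypothetical helper, not part of the original article.
// "u" assumes UTF-8 input; "D" stops $ from matching before a trailing newline.
function is_valid_name(string $name): bool
{
    $regex = '/^[a-zA-ZÀ-ÖØ-öø-ÿ]+\.?(( |\-)[a-zA-ZÀ-ÖØ-öø-ÿ]+\.?)*$/Du';
    return preg_match($regex, $name) === 1;
}

var_dump(is_valid_name('Lobelia Sackville-Baggins')); // bool(true)
var_dump(is_valid_name('José María'));                // bool(true)
var_dump(is_valid_name('R2 D2'));                     // bool(false), digits are not allowed
```

If your data is stored in a single-byte encoding such as ISO-8859-1 instead, the "u" modifier would have to be dropped.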

A street name

[a-zA-Z1-9À-ÖØ-öø-ÿ]+\.?(( |\-)[a-zA-Z1-9À-ÖØ-öø-ÿ]+\.?)*

In case you need separate address fields for the street name and the number, this only checks the street name. It is very similar to the previous example, except that it also allows digits (note that the class is written 1-9, so the digit 0 is not accepted in the street name; change it to 0-9 if you need it). As in the previous example, this is meant to validate what the user gives us.

A street name with an exterior number and an optional interior number

[a-zA-Z1-9À-ÖØ-öø-ÿ]+\.?(( |\-)[a-zA-Z1-9À-ÖØ-öø-ÿ]+\.?)* (((#|[nN][oO]\.?) ?)?\d{1,4}(( ?[a-zA-Z0-9\-]+)+)?)

This is in case you need to check the entire address.

Naming conventions may vary between regions. As I mentioned at the top of this article, this correctly filtered what I had available at the time, and it is very likely that the great majority of street names will be valid. Ideally, I would have a regular expression per country, maybe even per region, but I lack the necessary information to do this at the time of writing. I'll try to do it for at least the places I visit from now on.
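To make the two parts of the full address expression easier to see, here is a small sketch that builds it from a street part and a number part. The function is_valid_address() and the sample addresses are made up for illustration, and the "u" modifier again assumes UTF-8 input.

```php
<?php
// Hypothetical helper for illustration only.
function is_valid_address(string $address): bool
{
    // Street name: words separated by a single space or dash, optional trailing dots.
    $street = '[a-zA-Z1-9À-ÖØ-öø-ÿ]+\.?(( |\-)[a-zA-Z1-9À-ÖØ-öø-ÿ]+\.?)*';
    // Optional "#" or "No." prefix, 1 to 4 digits, then an optional interior part.
    $number = '(((#|[nN][oO]\.?) ?)?\d{1,4}(( ?[a-zA-Z0-9\-]+)+)?)';
    return preg_match('/^' . $street . ' ' . $number . '$/Du', $address) === 1;
}

var_dump(is_valid_address('Av. Insurgentes Sur 1602'));    // bool(true)
var_dump(is_valid_address('Calle Reforma No. 222 Int 3')); // bool(true)
var_dump(is_valid_address('Evergreen Terrace'));           // bool(false), no number
```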

A phone number

([\+]?[\d]{1,4} ?)?([\(]([\d]{2,3})[)] ?)?[0-9][0-9 \-]{6,}( ?([xX]|([eE]xt[\.]?)) ?([\d]{1,5}))?

Valid phone formats are:
+52(55)55555555
0052 (55) 55555555
(55) 55555555
55555555
55-555-555
01-800-765-8786
55555555 x23
5555-5555 Ext 23
55555555 ext23
5555-5555x43
55555555ext26
+52 (55) 5555-5555 ext 134
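A quick way to try the formats listed above is to run some of them through preg_match(). This is only a sketch: is_valid_phone() is a hypothetical helper, and the last two samples are strings I added that should be rejected.

```php
<?php
// Hypothetical helper wrapping the phone expression, anchored with ^ and $.
function is_valid_phone(string $phone): bool
{
    $regex = '/^([\+]?[\d]{1,4} ?)?([\(]([\d]{2,3})[)] ?)?[0-9][0-9 \-]{6,}'
           . '( ?([xX]|([eE]xt[\.]?)) ?([\d]{1,5}))?$/D';
    return preg_match($regex, $phone) === 1;
}

$samples = [
    '+52(55)55555555',
    '5555-5555 Ext 23',
    '55555555ext26',
    '12345',        // too short, should be rejected
    '55555555 ext', // extension marker without digits, should be rejected
];
foreach ($samples as $sample) {
    echo $sample, ': ', is_valid_phone($sample) ? 'valid' : 'NOT valid', "\n";
}
```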

A username

[a-zA-Z]((\.|_|-)?[a-zA-Z0-9]+){3}

This is the regular username. It must start with a letter and be at least 4 characters long (you can control the minimum length by subtracting one from your desired minimum and putting that number between the last curly brackets). It may contain numbers but not start with one. It can also contain underscores, dots or dashes, but not at the beginning or the end, and not two or more together (ae__, ae_- and ae._ would be invalid).
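The "minimum length minus one" rule can be sketched as a function that builds the pattern for you. username_regex() is a hypothetical name, and the returned pattern already includes the ^, $ and D needed to match the whole string.

```php
<?php
// Hypothetical helper: builds the username pattern for a given minimum length.
function username_regex(int $minLength = 4): string
{
    // The first letter counts as one character, so the group repeats minLength - 1 times.
    $repeat = max(1, $minLength - 1);
    return '/^[a-zA-Z]((\.|_|-)?[a-zA-Z0-9]+){' . $repeat . '}$/D';
}

var_dump(preg_match(username_regex(), 'jveweb'));  // int(1)
var_dump(preg_match(username_regex(), 'ae__x'));   // int(0), two separators together
var_dump(preg_match(username_regex(6), 'abcde'));  // int(0), shorter than 6
var_dump(preg_match(username_regex(6), 'abcdef')); // int(1)
```

Since each separator has to open a new repetition of the group, the number in the curly brackets also caps how many separators a username can contain.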

An e-mail

[a-z0-9_\-]+(\.[_a-z0-9\-]+)*@([_a-z0-9\-]+\.)+([a-z]{2}|aero|asia|arpa|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|xxx)

Due to the recent changes proposed by ICANN, this regular expression is very likely to change, as ICANN plans to sell new top-level domains. If that happens, it may become impractical to check for every existing top-level domain inside the regular expression. Personally, I may opt to just check the formatting, extract the top-level domain and look it up in a database of top-level domains, but we are not there yet, so this should work just fine for the time being. You may add |onion after the xxx if you need to use the Tor network, or |bit if you use the experimental nameserver bitname.

Additionally, if you use domains that allow internationalized domain names, you may need to add extra accepted characters to the regular expression. I may make new expressions suited to different regions in the future; check back on this article or send me a message through the contact form at Contact if you have any comments on this.
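The alternative mentioned above (check only the formatting, extract the top-level domain, then look it up) could look roughly like the sketch below. It is not the expression from this article: I replaced the list of top-level domains with ([a-z]{2,}) so any alphabetic ending is captured, and the array $knownTlds is a tiny made-up stand-in for a real database of top-level domains.

```php
<?php
// Hypothetical helper: format check first, then a lookup of the captured TLD.
function is_valid_email(string $email, array $knownTlds): bool
{
    $regex = '/^[a-z0-9_\-]+(\.[_a-z0-9\-]+)*@([_a-z0-9\-]+\.)+([a-z]{2,})$/D';
    if (preg_match($regex, $email, $matches) !== 1) {
        return false;
    }
    // $matches[3] is the third capturing group: the top-level domain.
    return in_array($matches[3], $knownTlds, true);
}

// Sample list only; a real one would come from a database or an up-to-date file.
$knownTlds = ['com', 'org', 'net', 'mx'];

var_dump(is_valid_email('user@example.com', $knownTlds)); // bool(true)
var_dump(is_valid_email('user@example.zzz', $knownTlds)); // bool(false), unknown TLD
```

Like the original expression, this only accepts lowercase letters, so you may want to pass the address through strtolower() first.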

An RFC number (Registro Federal de Contribuyentes, the Mexican Federal Taxpayer Registry)

[A-Z]{3,4}[ \-]?[0-9]{2}((0{1}[1-9]{1})|(1{1}[0-2]{1}))((0{1}[1-9]{1})|([1-2]{1}[0-9]{1})|(3{1}[0-1]{1}))[ \-]?[A-Z0-9]{3}
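Reading the expression from left to right, it expects 3 or 4 uppercase letters, a two-digit year, a month from 01 to 12, a day from 01 to 31 and 3 final letters or digits, with an optional space or dash between the blocks. A small sketch to try it out follows; is_valid_rfc() is a hypothetical helper and the sample values are made up.

```php
<?php
// Hypothetical helper wrapping the RFC expression, anchored with ^ and $.
function is_valid_rfc(string $rfc): bool
{
    $regex = '/^[A-Z]{3,4}[ \-]?[0-9]{2}((0{1}[1-9]{1})|(1{1}[0-2]{1}))'
           . '((0{1}[1-9]{1})|([1-2]{1}[0-9]{1})|(3{1}[0-1]{1}))[ \-]?[A-Z0-9]{3}$/D';
    return preg_match($regex, $rfc) === 1;
}

var_dump(is_valid_rfc('GODE561231GR8'));   // bool(true)
var_dump(is_valid_rfc('GODE-561231-GR8')); // bool(true)
var_dump(is_valid_rfc('GODE561331GR8'));   // bool(false), month 13
```

The expression checks the shape of the date only, so an impossible date such as February 30 still passes.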

How to use these regular expressions in PHP

To use regular expressions in PHP we use the function preg_match(). If the pattern is found in the string, preg_match() returns 1; if it is not found it returns 0, and it returns false if an error occurs. The regular expression must be enclosed in delimiters; a slash on each side is the most common choice.

Since I want to match the entire contents of the string against the regular expression, I always start it with ^ and end it with $, which represent the start and the end of the input, and add the D modifier so that $ only matches at the very end of the string (without it, $ also matches just before a trailing newline). So, to use it inside preg_match() for this purpose, the regular expression

[a-zA-Z]((\.|_|-)?[a-zA-Z0-9]+){3}

would turn into

/^[a-zA-Z]((\.|_|-)?[a-zA-Z0-9]+){3}$/D

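What follows is a little example that uses this precise regular expression to check the variable $string:

<?php
// The username to validate.
$string = 'jveweb';
// Username pattern anchored with ^ and $, plus the D modifier.
$regex = '/^[a-zA-Z]((\.|_|-)?[a-zA-Z0-9]+){3}$/D';
// preg_match() returns 1 on a match, 0 otherwise (false on error).
if (preg_match($regex, $string) === 1) {
    echo "The string is valid";
} else {
    echo "The string is NOT valid";
}
?>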
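To see what the D modifier changes, here is a short sketch that runs the same username pattern with and without it against a value ending in a newline. The variable names are only illustrative.

```php
<?php
// A value that arrived with a trailing newline.
$input = "jveweb\n";

$withoutD = '/^[a-zA-Z]((\.|_|-)?[a-zA-Z0-9]+){3}$/';
$withD    = '/^[a-zA-Z]((\.|_|-)?[a-zA-Z0-9]+){3}$/D';

var_dump(preg_match($withoutD, $input)); // int(1), $ matches before the final newline
var_dump(preg_match($withD, $input));    // int(0), $ only matches at the very end
```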

Resources

A little (and simple) script that I made to test regular expressions

test_regex.tar.g
Size: 2788 bytes
MD5: 7cbb13893004e7b7fcdac3aebd63fc1a
SHA1: 5e5be779a03fc1481782062ccac2b346ddfb0382
License: BSD-3 like

If you mark the box labeled "Multiline" it will split the contents of the textarea into multiple lines and check each line separately, otherwise it will evaluate the entire contents of the textarea.
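The "Multiline" behaviour described above can be sketched in a few lines. This is not the code from the script itself, only an approximation of the idea; check_lines() and its parameters are hypothetical names.

```php
<?php
// Hypothetical sketch: test each line separately, or the whole text at once.
function check_lines(string $regex, string $text, bool $multiline): array
{
    // Split on any common line ending when "Multiline" is requested.
    $lines = $multiline ? preg_split('/\r\n|\r|\n/', $text) : [$text];
    $results = [];
    foreach ($lines as $line) {
        $results[] = ['line' => $line, 'valid' => preg_match($regex, $line) === 1];
    }
    return $results;
}

// Two lines checked one by one against a digits-only pattern.
print_r(check_lines('/^\d+$/D', "123\nabc", true));
```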

References

Regular Expression Pocket Reference (book)
http://www.php.net/manual/en/function.preg-match.php

Foot notes

This is another one of those topics that I enjoy. I wish I had large data sets to analyze, but I usually don't, so I make some up when I get to teach regular expressions. If you have a comment on this, or if you would like a regular expression created, send me an e-mail or a message through the contact form in the Contact section.

Originally I was going to recommend the Firefox add-on "Regular Expression Tester", but it failed me on some complex expressions, which I find odd, as I remember testing some with it back then. So I made the little script to test regular expressions that I published here.

Categories: PHP, Programming