Regex: Match everything between two things

Oh, the adventures of regular expressions! I’ve been beating my head against the wall with this one. I have a long string filled with all kinds of characters and variables, and I need to grab everything between two things.

Example 1: I want to grab the email address in a url.

http://www.domain.com/index.html?p=test&email=john@domain.com

The regular expression to get the email address is easy in this one. You just need to use the look behind operator, so the expression is (?<=email=)(.+)  This uses the look behind operator ?<= and what I’m looking behind is email= . Now the look behind operator is just a pointer operation: it doesn’t actually consume anything. So what you have to do at that point is consume the characters that come after the look behind. That is accomplished using (.+)  The . will match any character (except a newline) because it is a wildcard. The + repeats the match one or more times, so it keeps going until it gets to the end of the line. (.*) would work as well; the only difference is that it also allows zero characters.
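Here is a minimal sketch of Example 1 in Python, assuming the standard re module (the post itself doesn’t name a regex engine, so the language choice and the variable names are mine):

```python
import re

# URL from Example 1: the email is the last thing on the line.
url = "http://www.domain.com/index.html?p=test&email=john@domain.com"

# (?<=email=) only checks that "email=" sits right before the match;
# it consumes nothing. (.+) then consumes the rest of the line.
match = re.search(r"(?<=email=)(.+)", url)
email = match.group(1) if match else None

print(email)  # john@domain.com
```

Because the look behind consumes nothing, the whole match and group 1 are the same text here.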

Example 2:  I want to grab the email in the address but I have another variable.

http://www.domain.com/index.html?p=test&email=john@domain.com&name=john

In this situation (?<=email=)(.*) won’t work because it will return everything until the end of the line, which is john@domain.com&name=john  What I want to do now is stop my “greedy” operator at the ampersand &. To do this I need to use the look ahead operator (?=). So my look ahead expression looks like this: (?=&). Altogether the expression looks like this: (?<=email=)(.*)(?=&)  The expression looks behind for email= and grabs characters up to the point where the look ahead finds the & . Just what I wanted. Unfortunately I can’t use this expression for the first example because there isn’t a trailing & there; instead it is the end of the line. This expression works only if there is a & after the email address.
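A quick way to check both claims, again as a sketch that assumes Python’s re module (the variable names are just for illustration):

```python
import re

# Look behind for "email=", grab characters, and require "&" right after.
pattern = re.compile(r"(?<=email=)(.*)(?=&)")

with_ampersand = "http://www.domain.com/index.html?p=test&email=john@domain.com&name=john"
without_ampersand = "http://www.domain.com/index.html?p=test&email=john@domain.com"

m = pattern.search(with_ampersand)
found = m.group(1) if m else None

# With no "&" after the email the look ahead can never succeed,
# so search() returns None for the Example 1 URL.
missing = pattern.search(without_ampersand)

print(found)    # john@domain.com
print(missing)  # None
```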

 

Example 3: I want to grab the email address but have lots of variables afterward

http://www.domain.com/index.html?p=test&email=john@domain.com&name=john&zip=90800

Our last expression is greedy, and since it is so greedy it will keep eating characters until it hits the last & which is too far: it returns john@domain.com&name=john  So how do we stop the expression at the first & instead of the last one? We do that by using a “lazy” modifier, which is a ? placed right after the * . So now our expression looks like (?<=email=)(.*?)(?=&)
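To see the greedy and lazy versions side by side, here is a small sketch, assuming Python’s re module (any engine with look behind and lazy quantifiers should behave similarly, but I have only written this one for Python):

```python
import re

url = "http://www.domain.com/index.html?p=test&email=john@domain.com&name=john&zip=90800"

# Greedy: .* runs to the end of the line, then backs up to the LAST "&".
greedy = re.search(r"(?<=email=)(.*)(?=&)", url).group(1)

# Lazy: .*? grows one character at a time and stops at the FIRST "&".
lazy = re.search(r"(?<=email=)(.*?)(?=&)", url).group(1)

print(greedy)  # john@domain.com&name=john
print(lazy)    # john@domain.com
```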

Example 4: I want to grab the email address but I don’t know where the email variable is

http://domain.com/index.html?email=john@domain.com or http://domain.com/index.html?email=john@domain.com&name=john or http://domain.com/index.html?email=john@domain.com&name=john&zip=343324

In this last situation I want to grab the email in both cases: when other variables follow it and when it is at the end of the line. To accomplish this you need two expressions joined with the OR operator | . The expression looks like this: (?<=email=)(.*?)(?=&)|(?<=email=)(.+)  The order matters: the lazy expression is tried first, and the second one is used only when there is no & after the email.

This will return the email address if it is at the end of the line or in the middle.
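Here is a sketch that runs the combined expression over the three URLs above, assuming Python’s re module. One detail to watch: the email lands in group 1 or group 2 depending on which side of the | matched, so the sketch reads the whole match with group(0) instead. The last pattern, which uses a negated character class, is my own addition and not part of the original approach; it is one common alternative that gives the same result for these inputs.

```python
import re

urls = [
    "http://domain.com/index.html?email=john@domain.com",
    "http://domain.com/index.html?email=john@domain.com&name=john",
    "http://domain.com/index.html?email=john@domain.com&name=john&zip=343324",
]

# Lazy match up to the first "&", OR everything to the end of the line.
combined = re.compile(r"(?<=email=)(.*?)(?=&)|(?<=email=)(.+)")

# group(0) is the whole match, whichever alternative succeeded.
emails = [combined.search(u).group(0) for u in urls]

# Alternative (an assumption, not from the post): take every
# character that is not "&" after "email=".
negated = re.compile(r"(?<=email=)[^&]+")
emails_alt = [negated.search(u).group(0) for u in urls]

print(emails)      # ['john@domain.com', 'john@domain.com', 'john@domain.com']
print(emails_alt)  # ['john@domain.com', 'john@domain.com', 'john@domain.com']
```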

 

 
