
Regular Expressions

Tags:

dev
Published: Apr 07 2017

What are Regular Expressions? They are patterns used to search text. I can't recite the formal theory behind them, even though I have read about it on Wikipedia.

I love Regular Expressions (RegEx). I know they aren't the best tool for everything, and I am not even that great at them. I like that they are logical, terse, and strict.

I found this fun RegEx crossword site and, after completing all the regular puzzles, I am working on the player-submitted ones. They should be much more challenging and force me to learn more of the syntax that I don't know well.

A couple of things I use RegEx for are:

  1. Find and replace text in Notepad++ (examples: replace all attributes in HTML/XML tags, remove HTML/XML tags)
  2. Simple email validation

These aren't perfect cases, though. The HTML/XML parsing needs to account for many possible types of tags, so I might run several different expressions instead of one that covers everything. The email one is usually extremely simple: it doesn't actually validate the address, it just makes sure a couple of key components are present.

Below are some actual examples of simple regular expressions that I might use in code or to get through a large file in Notepad++.

HTML

Step 1: 

<div.*?>.*?</div> 

This will match either of these two texts: "<div>test</div>", "<div class='test'>test</div>". It will fail, however, if the div spans multiple lines, since . does not match line breaks by default; that limitation is why I chose the div tag here.
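To see the single-line limitation outside of Notepad++, here is a minimal sketch using Python's re module as a stand-in engine. The variable names and the multi-line sample are made up for illustration; only the pattern comes from the post.

```python
import re

# The Step 1 pattern, copied unchanged from the post.
step1 = re.compile(r"<div.*?>.*?</div>")

# The two single-line texts mentioned above.
single_line_samples = ["<div>test</div>", "<div class='test'>test</div>"]

# A hypothetical multi-line div, added only to show the failure.
multi_line_sample = "<div>\ntest\n</div>"

for sample in single_line_samples:
    # Each single-line sample is matched in full.
    print(sample, "->", step1.fullmatch(sample) is not None)

# No match: by default "." stops at the line break.
print(step1.search(multi_line_sample))
```

Both single-line samples match, and the search over the multi-line sample returns None.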

Step 2:

(<div.*?>)(.|\r|\n)*?(</div>)

This will match a div with anything in between, across multiple lines. It still will not get everything, though. If there is a div within a div, then it will not match that in the way you might want.
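A quick way to check the multi-line behavior is a sketch like the following, again assuming Python's re module in place of Notepad++. The sample text and names are hypothetical.

```python
import re

# The Step 2 pattern, copied unchanged from the post.
step2 = re.compile(r"(<div.*?>)(.|\r|\n)*?(</div>)")

# A hypothetical div spread over several lines.
text = "<div class='test'>\nline one\nline two\n</div>"

match = step2.search(text)

# The whole multi-line div is matched.
print(match.group(0) == text)

# Group 1 is the opening tag and group 3 is the closing tag.
print(match.group(1))
print(match.group(3))
```

Group 2 is a repeated group, so in Python it only holds the last character it consumed; the useful captures are groups 1 and 3.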

For the markup: 

<div class="div1">
<div class='div2'>
test1
</div>
</div> 

The above regex will match this text only: 

<div class="div1">
<div class='div2'>
test1
</div>

This is because it will find the first opening div and keep matching only until it reaches the first closing div tag, which here belongs to the inner div. I don't have a simple solution to that. I would likely have to replace a couple of times to get the results I needed.
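The early stop on nested divs can be reproduced with a short sketch. It assumes Python's re module rather than Notepad++, and the variable names are made up; the pattern and markup are the ones from the post.

```python
import re

# The Step 2 pattern, copied unchanged from the post.
step2 = re.compile(r"(<div.*?>)(.|\r|\n)*?(</div>)")

# The nested markup from the example above.
markup = (
    '<div class="div1">\n'
    "<div class='div2'>\n"
    "test1\n"
    "</div>\n"
    "</div>"
)

match = step2.search(markup)

# The lazy quantifier stops at the first closing tag it can reach.
print(match.group(0))

# What is left over: the outer div's closing tag was never consumed.
leftover = markup[match.end():]
print(repr(leftover))
```

The leftover text is the line break plus the outer closing tag, which is why a second pass would be needed to deal with the outer div.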

If I could, I would probably target something other than the div tag. The nice thing about the capture groups is that they make replacing text much easier.

Using the RegEx: 

(<div id='test'>)(.|\r|\n)*?(</div>)

Input:

<div id='test'>
<label>first</label>
</div>

Replace: 

$1second$3

Output: 

<div id='test'>second</div>

Obviously it's not perfect, but I am sure you can see how it's useful.
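The same replacement can be sketched in Python. One assumption to flag: Notepad++ writes group references as $1 and $3, while Python's re.sub writes them as \g<1> and \g<3>, so the replacement string below is a translation and not the literal text from the post. The variable names are made up.

```python
import re

# The pattern from the post, unchanged.
pattern = re.compile(r"(<div id='test'>)(.|\r|\n)*?(</div>)")

# The input from the post.
source = "<div id='test'>\n<label>first</label>\n</div>"

# Keep the opening tag (group 1) and closing tag (group 3),
# and swap everything between them for the word "second".
result = pattern.sub(r"\g<1>second\g<3>", source)

print(result)  # <div id='test'>second</div>
```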

Email

The email example is much simpler.

.+?@.{2,}?\..{2,}

This isn't great for searching a text file, because it will match all kinds of things that contain an @ symbol. It does work well for a single textbox input, though. It doesn't test that the address is actually valid, but it makes sure there is something before the @, at least two characters after it, then a period, and at least two more characters after the period.
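As a rough illustration of the textbox check, here is a sketch in Python. The helper name looks_like_email and the sample inputs are hypothetical; only the pattern comes from the post. It uses fullmatch so the whole input has to fit the pattern, which is an assumption about how a textbox check would be wired up.

```python
import re

# The email pattern from the post, unchanged.
email_like = re.compile(r".+?@.{2,}?\..{2,}")


def looks_like_email(value: str) -> bool:
    """Return True if the whole input has the rough shape x@yy.zz."""
    return email_like.fullmatch(value) is not None


print(looks_like_email("user@example.com"))        # True
print(looks_like_email("user@example"))            # False: no period after the @
print(looks_like_email("@example.com"))            # False: nothing before the @
print(looks_like_email("a@b.co"))                  # False: only one character between @ and the period
print(looks_like_email("not an email @ all.ok"))   # True: shape is right, content is not
```

The last two lines show both sides of the trade-off: a short but plausible address is rejected, and a string that is clearly not an address passes because it has the right shape.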

I had planned to write a simple post about simple regular expressions but, as you can see, they can get complex. Some people find RegEx hard to understand, and there are a number of ways that matches may be missed or included when they aren't wanted.

 

Resources for RegEx:

https://regex101.com/

http://www.regular-expressions.info/