Love ’em or hate ’em, regular expressions are a part of Google Analytics. They provide a lot of flexibility, but at a price. Small mistakes can become magnified and result in poor data quality.
I know there’s a lot of information out there about regular expressions, but I wanted to simplify the topic. In my opinion, here are the most important things to know.
Key Concept: How GA Regular Expressions Work
Let’s start by talking about how regular expressions work in Google Analytics. In general, we apply a regular expression to a piece of data. If the expression matches ANY part of the data then the expression will return TRUE. If the expression returns TRUE then some action will occur.
It doesn’t matter where you use the reg ex. If it’s part of an exclude filter, and the expression matches the data, then the data will be excluded. If it’s part of an include filter then the data will be included. If it’s part of a report filter then the report will only contain info that matches the reg ex. You get the idea.
[In this image think of the data as the square cube and the red work bench as the regular expression. If the cube is the same shape as the hole in the bench then an action happens; the cube falls through. Get it?]
It’s really important to understand this because it simplifies the expressions we need to create. Let’s say I want to identify all the keywords in a set of data that contain the term excel. Here’s the full list:
word
excel
ms excel
excel 2003
linux
microsoft excel
excel 2007
excel makes pretty graphs
google
Rather than create some fancy regular expression, I can simply use: excel. After the expression is applied to the data we’ll have the following sub-set:
excel
ms excel
excel 2003
microsoft excel
excel 2007
excel makes pretty graphs
This simplifies the creation of your expression because you only need to match part of the data that you’re looking for. With that in mind, let’s move on to some tips that cover the most common uses of regular expressions.
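If you want to try the partial-match idea outside of GA, here is a minimal sketch in Python. It assumes Python's re module as a stand-in for GA's regular expression engine (the two may differ in details), and the keywords and matches names are just for illustration. The search function returns a hit when the pattern occurs anywhere in the string, which mirrors the "matches ANY part" behavior described above.

```python
import re

# The keyword list from the example above.
keywords = [
    "word", "excel", "ms excel", "excel 2003", "linux",
    "microsoft excel", "excel 2007", "excel makes pretty graphs", "google",
]

# search() succeeds if the pattern is found anywhere in the keyword,
# so the plain expression "excel" is enough.
pattern = re.compile(r"excel")
matches = [kw for kw in keywords if pattern.search(kw)]

for kw in matches:
    print(kw)
```

Running this should print the same six keywords listed in the sub-set above.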
Tip #1: Use Anchors
Anchors are a way to specify whether a regular expression should match the beginning of the data or the end of the data. Remember, reg ex works by matching ANY PART of a piece of data. Sometimes we’re looking for data that starts or ends a particular way and that’s why we need anchors. Let’s go back to the excel example.
word
excel
ms excel
excel 2003
linux
microsoft excel
excel 2007
excel makes pretty graphs
google
Suppose I only want to see the items that END with the word excel. Well, if I use the regular expression excel, I’m going to get all the items that contain the word excel no matter where it appears.
I need to create a reg ex that means, “ends with.” That’s done by placing a dollar sign, $, at the end of my reg ex. So the expression to find all of the keywords that END with excel would be: excel$.
It would match the following items from our list:
excel
ms excel
microsoft excel
To find all of the keywords that START with excel, use a caret, ^, at the beginning of the regular expression, like this: ^excel. It would match the following items from the list:
excel
excel 2003
excel 2007
excel makes pretty graphs
Now, let’s say I want just the keyword excel. Here’s how that expression would look: ^excel$.
Anchors, pretty handy.
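Here is a rough sketch of the three anchor patterns from this tip, again using Python's re module as an assumed stand-in for GA's engine. The filter_keywords helper is hypothetical and exists only for this illustration.

```python
import re

keywords = [
    "word", "excel", "ms excel", "excel 2003", "linux",
    "microsoft excel", "excel 2007", "excel makes pretty graphs", "google",
]

def filter_keywords(pattern, items):
    """Return the items in which the pattern is found."""
    return [item for item in items if re.search(pattern, item)]

ends_with = filter_keywords(r"excel$", keywords)    # ends with excel
starts_with = filter_keywords(r"^excel", keywords)  # starts with excel
exact = filter_keywords(r"^excel$", keywords)       # exactly excel

print(ends_with)
print(starts_with)
print(exact)
```

Using both anchors together leaves no room for anything before or after the word, which is why the last pattern keeps only the single keyword excel.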
Tip #2: Find This OR That
Many times in an analysis we’ll want to find multiple items from a set of data. For example, let’s say I want to find all the keywords that contain the name of an MS Office product. The complete list of keywords is:
word 2007
microsoft excel
outlook express
powerpoint
windows 95
mac OSX
linux
google rocks
Again, I’m only interested in the MS Office products, so I need to create an expression that includes the names of all the products. I want to find word OR excel OR outlook OR powerpoint. The pipe character, |, is used to represent OR logic. The following expression will return true if any of the items occur in the data:
word|excel|outlook|powerpoint
And here are the results:
word 2007
microsoft excel
outlook express
powerpoint
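The OR expression can be tried the same way. This is only a sketch, assuming Python's re module behaves like GA's engine for this simple pattern; the variable names are made up for the example.

```python
import re

keywords = [
    "word 2007", "microsoft excel", "outlook express", "powerpoint",
    "windows 95", "mac OSX", "linux", "google rocks",
]

# The pipe means OR: a keyword is kept if any one alternative is found in it.
office_pattern = re.compile(r"word|excel|outlook|powerpoint")
office_keywords = [kw for kw in keywords if office_pattern.search(kw)]

for kw in office_keywords:
    print(kw)
```

Because each alternative can still match any part of the data, word 2007 and outlook express are kept even though they contain extra text.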
Tip #3: If in Doubt, Escape it Out!
The dangerous thing about regular expressions is that we often don’t know what we don’t know. There are a lot of characters that have special meaning in reg ex. The plus sign, the question mark and the period are just a few. Inadvertently using a special character in an expression can lead to big trouble. There is an easy way to protect yourself: escaping.
Escaping a character means that GA will interpret the character as a LITERAL character and not as a regular expression character. To escape a character, place a backslash in front of it. Here’s the great part: it doesn’t matter if you escape a symbol that isn’t special. To me, escaping a character is like using a safety net. If you’re unsure whether a particular symbol is a special character, escape it. It can’t hurt your expression. The one thing to avoid is escaping letters and digits, because a backslash followed by a letter or digit (like \d) often has a special meaning of its own.
Time for an example. Let’s say we want to create a goal based on the following URL:
index.php?id=34
I need to turn the above into a regular expression. The question mark and period are special characters so they need to be escaped. But I’m not sure about the equal sign. I’d better escape it just to be safe. So here’s how the resulting reg ex would look: index\.php\?id\=34. By the way, the equal sign is not a special character.
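To see why escaping matters, here is a small sketch that compares the unescaped and escaped versions of the goal URL. It assumes Python's re module as a stand-in for GA's engine, and the second test string, index-phid=34, is a made-up example of data that should not match.

```python
import re

url = "index.php?id=34"

unescaped = r"index.php?id=34"     # . and ? keep their special meaning
escaped = r"index\.php\?id\=34"    # the expression from the text

# Unescaped, the ? makes the second "p" optional instead of matching a
# literal question mark, so the real URL is NOT matched.
unescaped_hits_url = bool(re.search(unescaped, url))

# Worse, the unescaped version matches data we never intended,
# because . stands for any single character.
unescaped_hits_other = bool(re.search(unescaped, "index-phid=34"))

# Escaped, the expression matches the literal URL.
escaped_hits_url = bool(re.search(escaped, url))

# Python can also do the escaping for you with re.escape().
auto_escaped_hits_url = bool(re.search(re.escape(url), url))

print(unescaped_hits_url, unescaped_hits_other)
print(escaped_hits_url, auto_escaped_hits_url)
```

The unescaped expression misses the real URL and matches the wrong data, which is exactly the kind of small mistake that turns into poor data quality.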
So there you have it. My two cents on regular expressions. These tips just scratch the surface of what you can do with reg ex. If you really want to learn about reg ex, check out my friend Robbin’s series on the subject.
My Regular Expression Tool Box is a post from: Analytics Talk by Justin Cutroni