A Better Approach for Filtering Webspam in Google Analytics

Posted by wbastianen

“Don’t throw the baby out with the bathwater” is a popular saying that’s been around since the 16th century, but is no less relevant today, especially when considered against the backdrop of webspam in Google Analytics. In fact, frustrations caused by the spam issue have led to the loss of genuine data.

Spam in analytics could be the single most irritating thing in online marketing. Numerous blog posts have been written on the topic.

One particular solution consistently surfaces as the fastest way to get rid of spam: Set up one or two filters in your analytics, and you’re free of spam forever. This strategy is based on including only valid hostnames to filter out ghost spammers, the most aggressive type of spammers.

Even though this solution is a seemingly valid option, it is also the riskiest one, because you are likely to lose valuable data and insights in the process.

Why is this seemingly valid option risky?

Using the two-filter option is risky because it uses inclusion instead of exclusion, and because it marks an unset hostname as spam.

Inclusion versus exclusion

  • Inclusion: only allows data from known genuine sources
  • Exclusion: only filters out data from known spam sources
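The difference can be made concrete with a minimal sketch. The hit records, field names, domains, and the "(not set)" label below are illustrative assumptions, not a Google Analytics API; the point is only to show which hits each approach keeps.

```python
import re

# Hypothetical hit records; the field names are made up for illustration.
hits = [
    {"hostname": "www.mydomain.com", "referrer": "google.com"},
    {"hostname": "translate.googleusercontent.com", "referrer": "google.com"},
    {"hostname": "(not set)", "referrer": "newsletter.mydomain.com"},
    {"hostname": "(not set)", "referrer": "spam-referrer.example"},
]

# Inclusion: keep only hits whose hostname matches a known-good pattern.
VALID_HOSTNAMES = re.compile(r"^(www\.|blog\.)?mydomain\.com$")

# Exclusion: drop only hits whose referrer matches a known-spam pattern.
KNOWN_SPAM = re.compile(r"(^|\.)spam-referrer\.example$")


def include_filter(records):
    """Keep a hit only if its hostname is on the allow list."""
    return [h for h in records if VALID_HOSTNAMES.search(h["hostname"])]


def exclude_filter(records):
    """Keep every hit except those from known spam referrers."""
    return [h for h in records if not KNOWN_SPAM.search(h["referrer"])]


kept_by_inclusion = include_filter(hits)
kept_by_exclusion = exclude_filter(hits)

print("inclusion keeps:", [h["hostname"] for h in kept_by_inclusion])
print("exclusion keeps:", [h["hostname"] for h in kept_by_exclusion])
```

In this sample, both approaches drop the spam hit, but only the inclusion filter also drops the translation-service visit and the genuine visit whose hostname is not set.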

What’s a hostname?

The hostname tells you which domain your website was visited from.

This can be any (sub)domain you claimed, like www.mydomain.com, mydomain.com, blog.mydomain.com or mydomain.co.uk. However, the hostname could also be the domain of translation, cache, or shopping services like translate.googleusercontent.com or paypal.com.

The inclusion strategy works perfectly in a vacuum. In real life, however, we have seen too many cases where using it could have gone terribly wrong:

  1. Over a span of months or years, you work with multiple people and agencies. They don’t always know what was previously set up.
  2. The internet and your business will evolve and more genuine sources will appear. Who will make sure they are always included from day one?
  3. Plus, a minor technical error in your code may cause your hostname to be “not set.” This would make your genuine data appear as spam. It wouldn’t pass the inclusion filter, and you’d never even know it.
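One way to guard against these scenarios is to audit what an inclusion filter would discard before applying it. The sketch below is only an illustration: the record layout, the hostnames, and the helper name dropped_by_inclusion are assumptions, and the data would in practice come from an export of your own reports.

```python
from collections import Counter

NOT_SET = "(not set)"  # label assumed here for a missing hostname


def dropped_by_inclusion(hits, valid_hostnames):
    """Count, per hostname, the hits an include-only filter would discard."""
    dropped = Counter()
    for hit in hits:
        hostname = hit.get("hostname") or NOT_SET
        if hostname not in valid_hostnames:
            dropped[hostname] += 1
    return dropped


# Hypothetical sample data covering the scenarios in the list above.
sample_hits = [
    {"hostname": "www.mydomain.com", "goal_completed": False},
    {"hostname": "shop.mydomain.com", "goal_completed": True},  # new genuine source
    {"hostname": None, "goal_completed": True},  # hostname lost by a code error
    {"hostname": "", "goal_completed": True},
]

report = dropped_by_inclusion(sample_hits, {"www.mydomain.com", "mydomain.com"})
for hostname, count in report.most_common():
    print(f"{hostname}: {count} hit(s) would be filtered out")
```

Any hostname in the report that belongs to a genuine source is data the inclusion filter would silently remove.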

Real-life data needs a real-life solution

With the inclusion strategy, any of the above real-life scenarios causes you to lose genuine data.

In fact, one of our clients would have deleted all of the brand’s conversion data if they’d used the two-filter solution, solely because of a third-party plug-in that was implemented by another agency.

Instead of using the real session, the plug-in created a new session without the hostname data.

What’s the best alternative?

Only filter spam when you’re 100 percent sure it’s spam. Working with exclusion has its downsides, of course:

  1. You have to make sure your exclusion filters are always up-to-date with the latest spammers.
  2. You will let some spam through, for instance visits with an unset hostname that actually come from spammers.
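Keeping exclusion filters up to date is easier when the filter patterns are generated from a maintained list of spam domains. The following is a rough sketch under stated assumptions: the domain names are placeholders, the helper build_exclusion_patterns is hypothetical, and the 255-character default is an assumed limit on a filter pattern field that you should check against your own analytics tool.

```python
import re


def build_exclusion_patterns(spam_domains, max_length=255):
    """Join escaped spam domains with '|' into patterns of at most max_length.

    max_length is an assumed limit on the filter pattern field.
    A single domain longer than max_length is still emitted on its own.
    """
    patterns, current = [], ""
    for domain in sorted(set(spam_domains)):
        escaped = re.escape(domain)
        candidate = f"{current}|{escaped}" if current else escaped
        if len(candidate) > max_length and current:
            patterns.append(current)  # close the full pattern, start a new one
            current = escaped
        else:
            current = candidate
    if current:
        patterns.append(current)
    return patterns


def is_known_spam(referrer, patterns):
    """Return True if the referrer matches any exclusion pattern."""
    return any(re.search(p, referrer) for p in patterns)


# Placeholder spam list; duplicates are removed before building patterns.
spam_domains = ["spam-one.example", "spam-two.example", "spam-one.example"]
patterns = build_exclusion_patterns(spam_domains)
short_patterns = build_exclusion_patterns(spam_domains, max_length=25)

print(len(patterns), "pattern(s) with the default limit")
print(len(short_patterns), "pattern(s) with a 25-character limit")
```

Each resulting pattern would go into its own exclusion filter, so adding a new spammer only means updating the list and regenerating the patterns.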

Based on the data in our clients’ accounts, these spammers account for 0.4 percent, on average, of all traffic.

This means your analytics would, on average, remain 99.6 percent accurate without the risk of losing genuine traffic.

Back to you

So, what’s your take on dealing with spam in analytics?

If you’re like us, you’d rather filter real spam while lessening the likelihood that real data is filtered out along with it.

 
