Not too long ago I participated in a topic at phpbb.com where the author was asking about blocking gmail email addresses. The general consensus from the community was that the board owner should not block gmail but instead rely on some other methods for blocking spammers. I don’t block gmail, but sometimes I would like to. In this post I think I summarized it best, saying:
hotmail, yahoo, gmail… any free email account is subject to abuse. Spammers are using the fact that board owners are, as you are, reluctant to ban gmail outright because it does have so many legitimate users.
Having said that, I decided it was time to go back and work through some numbers. Instead of guessing how bad the problem is, I wanted to get actual statistics to back up my claims. Anyone can say anything they want. Having numbers makes the claims more substantial. And graphs. Pictures are always good. The data used for this post is available as an Excel file for anyone to download and review (link at the end of the post). Here’s the summary:
Google: Your gmail system is borked. Fix it or risk it becoming irrelevant.
Logging Registration Attempts
I have written more than a few posts about my simple Checkbox Challenge MOD. I use it for board registrations as well as comment forms. For this post I am going to concentrate only on registration attempts at my largest phpBB board. I will use registration attempts from January of 2008 through June of 2009 (eighteen months).
For the first step, I ran some preliminary queries to identify the top five domains used. There are plenty of obvious spammer domains out there but that isn’t the point of this post. I know that domains such as gawab.com are the source of a lot of spam already. I can also recognize that domains like onlineovernightpharmacy.com are probably not legitimate. The point I want to drive home is how bad things are for mainstream domains, and for gmail.com specifically. In order to do that I want to focus only on the domains that are the source of higher volumes of registration attempts.
The top five domains and the total registration attempts are shown here.
| Domain | Total Attempts | % of Total |
|---|---|---|
| gmail.com | 12909 | 61% |
| yahoo.com | 2968 | 14% |
| mail.ru | 2704 | 13% |
| hotmail.com | 1606 | 8% |
| aol.com | 843 | 4% |
Notice that gmail is not only number one; it is in that position by a really large margin. No other email domain comes even close. My first piece of evidence clearly shows that gmail is a popular domain. It is so popular that if I were to consider banning or blocking it, I might lose 61% of my new members. But wait, is that really true? How many of those registration attempts were successful, and how many were blocked as bots?
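For anyone who wants to reproduce this kind of tally from their own registration log, here is a minimal Python sketch. The function names (`domain_of`, `top_domains`, `shares`) are made up for illustration and are not part of the Checkbox Challenge MOD. The percentages in the table sum to 100, so the sketch assumes they are shares of the top-five total rather than of every attempt in the log.

```python
from collections import Counter

def domain_of(email):
    """Return the lower-cased domain part of an email address."""
    return email.rsplit("@", 1)[-1].strip().lower()

def top_domains(emails, n=5):
    """Count attempts per domain and return the n busiest as (domain, count)."""
    return Counter(domain_of(e) for e in emails).most_common(n)

def shares(top):
    """Add each domain's whole-percent share of the combined top-n total."""
    total = sum(count for _, count in top)
    return [(domain, count, round(100 * count / total)) for domain, count in top]

# Counts taken from the table above.
table = [
    ("gmail.com", 12909),
    ("yahoo.com", 2968),
    ("mail.ru", 2704),
    ("hotmail.com", 1606),
    ("aol.com", 843),
]

for domain, count, percent in shares(table):
    print(f"{domain:12} {count:6} {percent:3}%")
```

With a raw list of logged addresses, `shares(top_domains(emails))` would produce the same kind of rows.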
Checkbox Challenge Data Collection Process
My Checkbox Challenge code presents a user with a standard registration form as well as a series of checkboxes. The user is instructed to click on only the marked checkbox in order to prove they are human. The development is well documented in other posts on my blog, so I won’t go into great detail here. Suffice it to say that bots seem to either ignore all of the checkboxes because they don’t expect them to be there, or they attempt to be smart and mark all of the checkboxes since they know they’re on the form. There are some humans that have issues with the system and might take multiple attempts to get through the screen but those situations are not very common, and for the sake of this post I will assume they don’t exist. Every attempt is logged, and it is that table that I am using as source material for this blog post.
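The pass/fail rule described above can be sketched in a few lines of Python. This is not the MOD's actual code (that lives in phpBB and is covered in the other posts). The function name, its parameters, and the result labels are all assumptions chosen for illustration.

```python
def classify_attempt(checked, marked, total):
    """Classify one registration attempt (hypothetical helper).

    checked: indexes of the checkboxes that arrived ticked in the form
    marked:  index of the one checkbox the user was told to tick
    total:   number of checkboxes shown on the form
    """
    checked = set(checked)
    if checked == {marked}:
        return "pass"                       # exactly the marked box: treated as human
    if not checked:
        return "fail: no boxes checked"     # bot that ignored the checkboxes
    if len(checked) == total:
        return "fail: all boxes checked"    # bot that ticked everything
    return "fail: wrong boxes checked"      # anything else, including human slips

# Example: five checkboxes, the third one (index 2) is the marked one.
print(classify_attempt([2], marked=2, total=5))
print(classify_attempt([], marked=2, total=5))
print(classify_attempt(range(5), marked=2, total=5))
```

Logging the returned label along with the email address is what makes the per-domain numbers later in this post possible.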
I listed the top five domains above. For the rest of this post I am going to drop mail.ru because most board owners know it’s a standard domain used by spammers. I am also going to drop aol.com because at 4% of the total registrations it’s not that relevant. That leaves me with three remaining domains to focus on: gmail.com, yahoo.com, and hotmail.com. (If you’re wondering who is in position six, it was gawab.com, which is another notorious spammer domain.)
Who’s Your Bot?
Any registration attempt is a potential board member. The concept behind most any anti-spam measure is to allow real people through and block bots. I have already established that gmail is by far the number one source of registration attempts. The next step is to evaluate how many of those attempts are desirable new users, and how many are bots. To do that, I retrieved the last 18 full months of data and determined the percentage of successful versus failed registrations. Here are those numbers for the three domains I have decided to focus on for this post.
| Domain | Total Success | Total Failed | % Success |
|---|---|---|---|
| gmail.com | 5644 | 7265 | 43.7% |
| yahoo.com | 2372 | 596 | 79.9% |
| hotmail.com | 1384 | 222 | 86.2% |
Now we start to see the real problem. Both yahoo and hotmail have approximately eighty percent success rates. That means that eight out of ten registration attempts from those domains are expected to be legitimate and valuable users. With gmail over half of the registration attempts fail and therefore are presumed to be bots. Not only is gmail the number one source for registration attempts, it is the worst source in terms of the human to bot ratio.
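The success percentages are simply successes divided by total attempts. Here is a short Python check using the counts from the table. The dictionary layout and function name are only for illustration.

```python
# (successful, failed) registration attempts per domain, from the table above.
attempts = {
    "gmail.com": (5644, 7265),
    "yahoo.com": (2372, 596),
    "hotmail.com": (1384, 222),
}

def success_rate(success, failed):
    """Percentage of attempts that passed the challenge."""
    return 100 * success / (success + failed)

rates = {domain: round(success_rate(s, f), 1) for domain, (s, f) in attempts.items()}

for domain, rate in rates.items():
    print(f"{domain:12} {rate}%")
```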
Is Google Doing Anything To Help?
Given that these numbers start in January of 2008, the next question I want to answer is whether the problem is getting better or worse. I have to believe that Google is aware of the issues that they’re facing. Are they doing anything to help?
Here are the gmail numbers broken down by month.
| Log Month | Domain | Success | Fail |
|---|---|---|---|
| 2008-01 | gmail.com | 297 | 79 |
| 2008-02 | gmail.com | 260 | 42 |
| 2008-03 | gmail.com | 320 | 94 |
| 2008-04 | gmail.com | 293 | 107 |
| 2008-05 | gmail.com | 290 | 65 |
| 2008-06 | gmail.com | 286 | 139 |
| 2008-07 | gmail.com | 395 | 147 |
| 2008-08 | gmail.com | 346 | 380 |
| 2008-09 | gmail.com | 316 | 398 |
| 2008-10 | gmail.com | 283 | 561 |
| 2008-11 | gmail.com | 316 | 367 |
| 2008-12 | gmail.com | 254 | 484 |
| 2009-01 | gmail.com | 291 | 898 |
| 2009-02 | gmail.com | 343 | 510 |
| 2009-03 | gmail.com | 346 | 808 |
| 2009-04 | gmail.com | 330 | 981 |
| 2009-05 | gmail.com | 291 | 614 |
| 2009-06 | gmail.com | 387 | 591 |
Here are a few things that I find interesting about these numbers. First, for the past 18 months I have averaged about 314 new members (successful registrations) per month from gmail. That number is remarkably consistent, as shown by this graph. The blue line shows the raw data, and the orange line shows the trend.
Here is the graph for failed registration attempts from gmail.
In this case the red line represents the data and the black line is the trend. The trend is not my friend in this case. Pay careful attention to the scale of those two graphs. While they are presented as the same size (approximately 400 pixels square) the top graph (successes) has a maximum scale of 450 while the bottom graph (failures) goes all the way up to 1200. Here’s a combined graph without trend lines that will help drive that point home.
The data does not look good for Google. Sometime back in 2008 (it looks like August for me) the number of valid registrations and bot registrations were about the same. Prior to that date, bot registrations were in the minority. After that date the bot usage of gmail.com has clearly soared. In February of 2009 (2009-02 on the graph) there was a dip in bot usage, at least on my board. Was it a result of something Google did? If it was, it clearly was not very successful in the longer term as bot usage popped right back up in the following months.
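The crossover month and the two trends can be checked directly from the monthly table. The sketch below assumes a straight-line least-squares fit. The trend lines in the original graphs may have been produced differently, so treat the slopes as a rough cross-check rather than a reproduction of the charts.

```python
# Monthly gmail.com counts from the table above: (month, success, fail).
monthly = [
    ("2008-01", 297, 79), ("2008-02", 260, 42), ("2008-03", 320, 94),
    ("2008-04", 293, 107), ("2008-05", 290, 65), ("2008-06", 286, 139),
    ("2008-07", 395, 147), ("2008-08", 346, 380), ("2008-09", 316, 398),
    ("2008-10", 283, 561), ("2008-11", 316, 367), ("2008-12", 254, 484),
    ("2009-01", 291, 898), ("2009-02", 343, 510), ("2009-03", 346, 808),
    ("2009-04", 330, 981), ("2009-05", 291, 614), ("2009-06", 387, 591),
]

def slope(values):
    """Least-squares slope of values against their position (change per month)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx

successes = [s for _, s, _ in monthly]
failures = [f for _, _, f in monthly]

success_trend = slope(successes)   # extra successful registrations per month
failure_trend = slope(failures)    # extra failed (bot) attempts per month

# First month in which failed attempts outnumber successful ones.
crossover = next(month for month, s, f in monthly if f > s)

print(f"success trend: {success_trend:+.1f} per month")
print(f"failure trend: {failure_trend:+.1f} per month")
print(f"failures first exceed successes in {crossover}")
```

Under that linear-fit assumption, successes grow by only a couple of registrations per month while failures grow by roughly fifty per month, and the first month where failures exceed successes is 2008-08.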
Here’s another chart that shows the value of gmail to me as a board owner. This is a percentage column chart so it ignores the overall numbers and instead presents the data as percentages.
Just how significant is this? Back at the beginning of this post I noted that for the past 18 months the average success rate for a registration attempt from a gmail.com email address was 43.7%. If I recalculate the value for the past six months it drops to 31.1%. That’s not good. Is it fair to pick on Google? During the same time that the success ratio for gmail has dropped from 43.7% to 31.1% (a difference of 12.6 percentage points) yahoo has dropped 2.4 points and hotmail has dropped 3.1 points. In other words, all of the top three domains have seen the ratio of legitimate registrations to bots drop, but the ratio for gmail has dropped at least four times as much as either of the other two.
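The six-month figure comes from applying the same success-rate calculation to only the last six rows of the monthly table. Here is a small Python sketch of that comparison. The variable names are illustrative.

```python
# Monthly gmail.com counts from the table earlier in the post: (month, success, fail).
monthly = [
    ("2008-01", 297, 79), ("2008-02", 260, 42), ("2008-03", 320, 94),
    ("2008-04", 293, 107), ("2008-05", 290, 65), ("2008-06", 286, 139),
    ("2008-07", 395, 147), ("2008-08", 346, 380), ("2008-09", 316, 398),
    ("2008-10", 283, 561), ("2008-11", 316, 367), ("2008-12", 254, 484),
    ("2009-01", 291, 898), ("2009-02", 343, 510), ("2009-03", 346, 808),
    ("2009-04", 330, 981), ("2009-05", 291, 614), ("2009-06", 387, 591),
]

def success_rate(rows):
    """Percentage of attempts in the given rows that succeeded."""
    success = sum(s for _, s, _ in rows)
    failed = sum(f for _, _, f in rows)
    return 100 * success / (success + failed)

overall = success_rate(monthly)        # all eighteen months
recent = success_rate(monthly[-6:])    # 2009-01 through 2009-06
drop = overall - recent                # difference in percentage points

print(f"18 months: {overall:.1f}%  last 6 months: {recent:.1f}%  drop: {drop:.1f} points")
```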
What Can I Do About gmail.com?
New board members are important. Without new members a community will start to get stagnant, and a stagnant community typically doesn’t thrive. As I mentioned earlier, I get an average of over 300 new members a month from gmail.com alone. For the past 18 months I have averaged 751 new members each month, and 314 or 42% of those are from gmail.com email addresses. If I were to consider banning gmail.com that’s a large chunk of my community that would disappear. I don’t think that’s a realistic action to take.
What Should I Do About gmail.com?
I think that Google should be held responsible. I can take individual steps that impact my board… Google can (and should) take steps that will protect everyone on the Internet. Am I overstating the problem? I really don’t think so. All of the numbers I have used for this post came from registration attempts on my largest (and most active) phpBB board. Here are some other numbers to chew on. All of these have been filtered to show only log entries with gmail.com email addresses.
Site Comment Form
Total attempts: 10,441
Total rejected: 10,381
Bot percent: 99.4%
Another phpBB Board
Total attempts: 2,767
Total rejected: 2,723
Bot percent: 98.4%
Still Another phpBB Board
Total attempts: 1,859
Total rejected: 1,843
Bot percent: 99.1%
What conclusion do I draw from these numbers? I submit that the problem is even worse than it appears based on the details I provided in this post! The numbers I used come from an extremely active board. Registration bots don’t pay too much attention to how many legitimate users are already registered on a board. The only goal of a bot is to find a board and register. For a smaller board this means the problem is even worse. My big board didn’t start out big. In the early days we got about 10-20 new registrations each month. Today I get more than that in one day. Because I get so many new legitimate users, it can actually mask just how bad the gmail problem really is. If you are a smaller board owner, having thousands of bogus gmail registrations can be extremely frustrating. If I didn’t have something in place that was – at least for now – somewhat effective in blocking these bogus attempts, I would very seriously have to consider blocking gmail accounts.
The problem is not new. While researching to see if I was the only one impacted by this (of course I am not) I found a post that shows how bots break the gmail CAPTCHA, and the post was from February of 2008. As we have long discussed on phpbb.com there are also services that will put real people to work breaking confirmation codes. I linked a few articles at the end of this post, and most of them are over a year old. The situation hasn’t improved since then either. If anything it has become much worse.
Google, are you listening? It’s time to fix this.
- Raw Data used in this post in Microsoft Excel format
- Breaking Google’s CAPTCHA
- Breaking Google CAPTCHAs for $3 a Day