Checkbox Challenge… Now By Google!
Google is releasing an update to their anti-spam reCAPTCHA system that includes – wait for it – a single checkbox.
I have a particularly persistent spammer who has been driving me nuts. They’re coming from Indonesian IP addresses and are clearly human spammers. As a temporary solution I modified the flood control code so that any post they make triggers the “flood warning” error message.
It’s not a permanent solution, but it’s fun.
Tonight I added some code to log their attempts so I can figure out a more permanent solution. Just like with the Checkbox Challenge, if I can capture their behavior I can start to look for patterns, and once I find a pattern I can try to do something more interesting to block them. As a final resort, I may have to finish writing up my Post Approval MOD that I started so many years ago.
I haven’t touched any phpBB2 code in years. It was fun to get back into it, even if only for a little while tonight…
As I was working through some code last night I found another “in progress” MOD that I wanted to add to the list of MODs in progress that I published yesterday. Over the years I’ve seen cases where someone from the other side of the planet has a dicey Internet connection and they end up submitting the same post twice because their browser submit times out. Or someone might post the same question in more than one forum, thinking that they’ll get more attention. Or a spammer might hit multiple forums with the same post multiple times.
I think I’ve managed to come up with something that definitely helps solve the first two scenarios and as a bonus helps the spammer problem as well. I call this my “Cross Post / Double Post” MOD, and it’s being tested on my beta board now.
The MOD design has so far turned out to be fairly simple. I tie into the flood control process and retrieve the post text for the last three posts by the user. From there I take the current post text and compare it to the prior posts. The first check is a straight equality check, meaning I check for the exact same post text. This will catch the “copy/paste” folks with very little overhead. If the post text is not identical, then next I use a function called similar_text() (see the similar_text() reference at php.net). This function takes three arguments. The first two are the two strings to compare, and the third is a variable, passed by reference, that receives the result of the comparison: a number from 0 to 100 that should be treated as a percentage. If the two posts are at least 95% similar then I check to see if the original post already in the database is in the same forum as the new post being attempted. If the forums are the same, then a “Double post” exception is triggered. If the forums are different, then a “Cross post” exception is triggered instead.
The number of posts (3) and percentage of similarity (95) are both controlled via the board configuration screen, so it’s quite flexible. Setting the percentage threshold to zero (0) is the same as turning the comparison process off.
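The flow described above can be sketched in a few lines. This is only an illustration in Python, not the MOD's actual phpBB code: the function and parameter names are made up for the example, and difflib's SequenceMatcher stands in for PHP's similar_text(), so the percentages will not match PHP's exactly.

```python
from difflib import SequenceMatcher


def similarity_percent(a: str, b: str) -> float:
    """Rough stand-in for the percentage PHP's similar_text() reports.

    Assumption: SequenceMatcher is a different algorithm, so the numbers
    are comparable in spirit but not identical.
    """
    return SequenceMatcher(None, a, b).ratio() * 100.0


def check_repeat_post(new_text, new_forum_id, recent_posts,
                      threshold=95, max_posts=3):
    """Return "double_post", "cross_post", or None.

    recent_posts is a list of (forum_id, post_text) tuples for the same
    user, newest first. threshold and max_posts mirror the two board
    configuration values; a threshold of 0 turns the check off.
    """
    if threshold <= 0:
        return None
    for forum_id, old_text in recent_posts[:max_posts]:
        # Cheap exact-match test first, similarity test only if needed.
        if new_text == old_text or \
                similarity_percent(new_text, old_text) >= threshold:
            return "double_post" if forum_id == new_forum_id else "cross_post"
    return None


# Hypothetical usage: the same question posted again in forum 2, then in forum 5.
recent = [(2, "How do I install this MOD on my board?")]
print(check_repeat_post("How do I install this MOD on my board?", 2, recent))
print(check_repeat_post("How do I install this MOD on my board?", 5, recent))
```

Running the equality test before the similarity test keeps the common copy/paste case cheap, which is the same ordering the MOD uses.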
This MOD is being tested on my “beta release” board right now. The first version of the MOD did not use the similar_text() function mentioned above. I attempted to use the soundex() function instead. However, soundex() did not look at enough text: it reduces a string to a four-character key built from its first few letters, so posts that were clearly different were still being reported as being the same. Switching functions solved that issue.
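To see why a Soundex key is a poor fit for comparing whole posts, here is a small Python sketch of the classic Soundex rules. It is a simplified illustration, not PHP's own soundex() implementation, and the sample post texts are invented for the example.

```python
# Consonant groups used by classic Soundex; vowels, H, W and Y get no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(text: str) -> str:
    """Return a four-character Soundex key (first letter plus three digits)."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
            if len(code) == 4:
                break  # everything after this point is ignored
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = digit
    return code.ljust(4, "0")


# Two clearly different (made-up) posts that open with the same words.
post_a = "Hello everyone, how do I install a MOD?"
post_b = "Hello everybody, my board is down"
print(soundex(post_a) == soundex(post_b))
```

Because the key is complete after a handful of letters, any two posts that begin with the same greeting collapse to the same value, which matches the false positives described above.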
I’m now wondering if I need to deal with setting different threshold values for different forums. I hate to do that, as it drastically increases the complexity of the code. But for example there are many forum “games” that people play in an “off topic” type of forum. Some of those games look very repetitive, and would potentially trigger the CP/DP exception handling. Then again, the current logic looks across all forums, so as long as the person is active in more areas than just the off-topic games area it might be okay. I don’t want this feature to get in the way of normal use, but I do want to help out the moderator team by capturing / rejecting double post and cross post events.
Stay tuned for details as we start user testing this week.
One of my other blogs had been hit and hit hard by spammer comments advertising headphones. This morning I noticed this one here on this blog:
That’s specifically aimed at human-powered paid-to-comment spam. I would rather already have excellent-quality comments than the next quantity of comments.tour headphones Sadly, I’m nonetheless getting an awful lot of spam comments (what’s up, Akismet?), so I think it’s time to install some additional defense layers.
The words “tour headphones” were a link, of course. Subtle, it was not. But I found it extremely ironic and ultimately amusing that the comment itself talked about spam. If you pick a few phrases from that comment you’ll find the exact same thing on other blogs / boards as well, or at least I did when I searched.
I’ve decided to contact the headphone manufacturer directly and let them know that I will never buy their products. Ever. Might not change anything, but it will make me feel better.
Oh, and I added specific code to my anti-spam process to look for this particular type of link.
I’m sure I’m not alone in seeing this new spammer tactic… I call it delayed spam. How does it work?
A spammer registers on a board. They might not do anything for a while. Then they try to post something that looks legitimate, using generic language that could be appropriate anywhere. Stuff like:
You make some good points, please keep posting
I find your arguments compelling, can you link your sources?
Thanks, it helped me
None of those add anything to the discussion, but they’re not really spam. What happens next? The spammer goes quiet for a few weeks, hoping that the topics they have posted in will fade from the front page. Then they carefully go back in and edit their post. They might change the text of the post itself, or they might add a signature that wasn’t there before. They are relying on the fact that phpBB (like other boards) does not bump a topic back to the front page when a post is edited, only when new content is added.
So far I have not come up with a programmatic solution to the problem. I am working on code that will capture the edit history of a post and allow board moderators to revert to an original version, so that at least would let me prove how the spammer added their content after the fact. That doesn’t solve the problem, it just provides an audit trail should I decide to try to take action against the spammer.
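The audit-trail idea can be pictured with a small data structure. The Python sketch below is only an illustration of the concept, not the code I am working on: the PostHistory class, its field names, and the sample text are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PostHistory:
    """A post plus every earlier version of its text (oldest first)."""
    post_id: int
    text: str
    revisions: List[str] = field(default_factory=list)

    def edit(self, new_text: str) -> None:
        # Keep the outgoing version before overwriting it.
        self.revisions.append(self.text)
        self.text = new_text

    def original(self) -> str:
        return self.revisions[0] if self.revisions else self.text

    def revert_to_original(self) -> None:
        # The revert is recorded as an edit too, so the spam version
        # stays in the history as evidence.
        if self.revisions:
            self.edit(self.revisions[0])


# Hypothetical usage: a generic post that is edited weeks later.
post = PostHistory(post_id=1, text="Thanks, it helped me")
post.edit("Thanks, it helped me. Buy cheap headphones here!")
post.revert_to_original()
print(post.text)
print(len(post.revisions))
```

Recording the revert as one more revision means a moderator can restore the original text without losing the proof of what the spammer added.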
A frequent suggestion at this point might be something along the lines of preventing someone from posting URLs or links until they reach a certain post count. That doesn’t help either, as the spammers often have five or ten posts under their belt before they come back and edit. Plus it impacts the legitimate new users that come on board with questions that require links. It’s not my favorite concept.
So today what my moderator team does is a manual process. When we get a suspected spammer, a moderator will do a web search for the username, the email address, or both. If the same username turns up on hundreds of different boards that’s a good indication they’re a spammer, especially if the user is recently registered on all of them. The moderator can also pull up posts from the user on these other boards. If those posts look similar to what the user is posting on our board, that’s another indication. All of these steps are used to decide whether to preemptively ban the spammer before they spam, or to wait.
It’s all a manual process for now. So while I’ve been away from phpBB2 for a while because of other demands on my time, this has never really been far from my mind. I just haven’t come up with an idea that can be implemented in code versus a manual process.
Guess I should check in with the BB Protection folks, and see what they’re up to at this point.
The focus for the past several years for board owners has been to prevent (or at least have some easy way to ignore) spammer registrations. When spammers thought it was useful to have an entry on a board memberlist they were often satisfied with getting through the registration process. They didn’t bother to activate their account. As a result, one of the most popular (and fortunately very easy) MODs for discussion boards was to prevent inactive members from showing up on the member list. This is the standard configuration for phpBB3, no MOD required.
Spammers reacted by altering their process so they can activate accounts. (I as well as other board owners have seen a dramatic increase in use of gmail accounts for this, so clearly Google’s registration process has been cracked and automated as well.) Like many board owners, I would like to have a “clean” database. But it wasn’t a huge imposition to get spammer registrations. If they never posted, they were not a contributing member of my board but at least they weren’t getting in the way. I had a MOD that prevented board members from entering a web site until they had a minimum number of posts on my board, so at least I didn’t get a member database sprinkled with unsavory web links. There are also MODs available that prevent zero-post users from showing up, and for pruning inactive or zero-post users after some specific period of time. All of these were okay in their day, but are not as effective anymore.
I’ve posted many times about my Checkbox Challenge code. It has served very well in protecting my blogs, several phpBB boards, and even my comment forms from spammers. However I am starting to see some issues, and that bothers me. Why? Because the new spam seems to be coming from humans rather than bots. I don’t know how we can combat that. Spammers seem to be quite creative with their posting strategies as well. More…
I don’t like most current CAPTCHA techniques. There is nothing that frustrates me more than trying to use a web site and being presented with this:
Yes, that is an actual CAPTCHA image that I was presented with. If anyone can figure out what that one is supposed to be saying, you have better eyes than I do. More…
After just cleaning up yet another gmail spammer (I so love the Spammer Hammer™ MOD, it is one of my favorites) tonight I found myself wondering: Is it worth setting up an extra activation step for gmail.com accounts? More…
It has been a while since I visited my honeypot board. I decided to have a look today…
Our users have posted a total of 385789 articles
We have 43968 registered users
And when I logged in, I had 33 unread PMs as well.
Bots have been busy. I intend to go back and see what additional patterns I can get from the data. In light of one of my recent posts about gmail being the most abused email domain, here are some stats that speak for themselves. These are the top ten email domains in use on my honeypot board:
+-----------------+----------+
| email_domain    | users    |
+-----------------+----------+
| gmail.com       |    11323 |
| mail.ru         |     6034 |
| meltmail.com    |     1179 |
| gawab.com       |      859 |
| getciallis.info |      855 |
| spambox.us      |      479 |
| serpdomains.com |      449 |
| atlantaclubs.cn |      282 |
| coolgwen.cn     |      274 |
| coolsanta.cn    |      255 |
+-----------------+----------+
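A tally like this is easy to reproduce from a list of registered email addresses. The query behind the numbers above is not shown here, so the following is just a Python sketch of the same kind of count; the function name and the sample addresses are placeholders.

```python
from collections import Counter


def top_email_domains(emails, limit=10):
    """Count users per email domain and return the most common ones."""
    # Skip entries without an "@" and compare domains case-insensitively.
    domains = (e.rsplit("@", 1)[1].lower() for e in emails if "@" in e)
    return Counter(domains).most_common(limit)


# Placeholder addresses, not real data from the honeypot board.
sample = ["user1@gmail.com", "user2@GMAIL.com", "user3@mail.ru", "not-an-email"]
for domain, users in top_email_domains(sample):
    print(domain, users)
```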
Not too long ago I participated in a topic at phpbb.com where the author was asking about blocking gmail email addresses. The general consensus from the community was that the board owner should not block gmail but instead rely on some other methods for blocking spammers. I don’t block gmail, but sometimes I would like to. In this post I think I summarized it best, saying:
hotmail, yahoo, gmail… any free email account is subject to abuse. Spammers are using the fact that board owners are, as you are, reluctant to ban gmail outright because it does have so many legitimate users.
Having said that, I decided it was time to go back and work through some numbers. Instead of guessing how bad the problem is, I wanted to get actual statistics to back up my claims. Anyone can say anything they want. Having numbers makes the claims more substantial. And graphs. Pictures are always good. The data used for this post is available as an Excel file for anyone to download and review (link at the end of the post). Here’s the summary:
Google: Your gmail system is borked. Fix it or risk it becoming irrelevant. More…