How does search work? Part IV: Evolution of a Regular Expression
This is part IV of a series of posts about the phpBB2 search process. Previous posts include:
- Part I: Table Review
- Part II: Making Effective Use of “Stop Words”
- Part III: Efficient clean_words() Function
You don’t have to read all of the prior parts in order to read this one. The last post was quite long, and so part of what I wanted to cover there was postponed until this post. In this post I’m going to analyze what one particular regex (regular expression) from the clean_words() function is doing. In very early versions of phpBB2 it worked very well at keeping short and long words out of your search index tables. In later versions it did not work so well. In this post I will explain why, and provide an extremely easy fix.
Cleaning Words
Prior to any “word” processing the clean_words() function has already removed HTML entities, BBCode, URLs, and special characters. So in theory what is left is a bunch of words separated by spaces. We want to reduce those words to the ones we’re interested in: words that are not “stop” words and that are between 3 and 20 characters long. So that’s the purpose of the regex I want to review in this post. In versions of phpBB2 through 2.0.4 it looked like this:
$entry = preg_replace('/\\b([a-z0-9]{1,2}|[a-z0-9]{21,})\\b/',' ', $entry);
The purpose of the regex is to drop any words of one or two letters and any words of 21 letters or more. Now I do not claim to be a regexpert. But I think I can figure this one out. This is a pattern match that will replace certain matches with a space. The patterns are contained between two forward slashes, so at a very basic level this regex is:
replace('/things-that-I-match/', ' ', $entry);
The magic part is the “things-that-I-match” which is everything between the two forward slash characters. So I’ll dissect that… it’s really not too hard.
The \b stands for a word boundary. That just means we’re going to work on full words as identified by PHP’s definition of a “word boundary”.
The part that looks like this [a-z0-9] is a pattern match condition that catches all lower-case letters of the alphabet and the digits zero (0) through nine (9). This structure is called a “character class” and appears quite frequently in regular expressions. The {1,2} is an interval quantifier, and says that it is required to match at least one but no more than two of the identified characters. Putting it all back together, I can see that [a-z0-9]{1,2} says match from 1 to 2 characters in the set from a-z or 0-9.
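As a quick illustration of how the class and the quantifier combine, here is a sketch using Python’s re module (its syntax for character classes and {m,n} quantifiers is the same as the PCRE engine behind preg_replace):

```python
import re

# [a-z0-9]{1,2}: one or two characters, each drawn from a-z or 0-9.
# Anchored with ^ and $ here so we test whole strings against the sub-pattern.
short_word = re.compile(r'^[a-z0-9]{1,2}$')

print(bool(short_word.match('ab')))   # True: two characters from the class
print(bool(short_word.match('a1')))   # True: digits are in the class too
print(bool(short_word.match('abc')))  # False: three characters exceed {1,2}
```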
You might be wondering about case sensitivity at this point. Not to worry, as the entire string was converted to lower case earlier in the code. A space was added to the front and back end of the string too, which will become important later on.
After that pattern there is an “or” operator (the vertical bar | is for or) and then the same pattern is repeated. But this time the interval is {21,}. This will match any string of letters or numbers of 21 characters or more. I don’t know what this does for foreign language characters, but I’ll come back to that in a bit.
So this:
/\b([a-z0-9]{1,2}|[a-z0-9]{21,})\b/
… says to match any combination of letters and numbers bounded by spaces that are 1-2 or 21+ letters long. After that match, the preg_replace() function replaces them with a space. Done. So that seems really easy, and it functions correctly on all of my boards.
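The behavior is easy to verify. Here is a sketch using Python’s re module, whose re.sub behaves like preg_replace for this pattern (the sample phrase is mine, not from phpBB):

```python
import re

# The 2.0.4 pattern, transcribed from preg_replace(); \b, the character
# class, and the {m,n} quantifiers all behave the same as in PCRE here.
pattern = r'\b([a-z0-9]{1,2}|[a-z0-9]{21,})\b'

entry = ' my supercalifragilisticexpialidocious test is ok '
cleaned = re.sub(pattern, ' ', entry)

# Every 1-2 letter word and the 34-letter word are replaced by spaces,
# even when two short words ("is ok") appear in a row.
print(cleaned.split())  # ['test']
```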
One Step Forward, Two Steps Back
But it changed. In version 2.0.6 the regex I’ve just been through changed to this:
$entry = preg_replace('/[ ]([\S]{1,2}|[\S]{21,})[ ]/',' ', $entry);
First, you’ll notice that the \b has been replaced by the character class brackets with a space, as in [ ]. It turns out this is going to be important, so remember that. Next, our nifty [a-z0-9] has been replaced by [\S] instead. The \S represents any single non-whitespace character, which certainly sounds useful. Remember that earlier I was concerned that the [a-z] portion was only going to match certain languages? I would guess, then, that the switch to \S was an attempt to be more aggressive at matching strings of characters rather than just letters from a to z.
So why doesn’t it work?
As I have been studying regex techniques one of the phrases that comes up over and over goes something like this:
Be careful what you ask for. You might get it.
I believe that’s the explanation here. Having [ ] in the regex means that the match must consume the space, so it is no longer available to serve as the leading space for the following word. In other words, every space the regex matches is spent, and the next word is left without one.
It’s easily confirmed that the “new” regex fails to remove two-letter (or shorter) words whenever two of them appear in a row. So this phrase:
To be in love, ah it is bliss
… causes problems. The word “To” would be dropped, “be” would be included, and “in” would be dropped. Why?
First, recall that I mentioned earlier that the string of words has had a space appended to the front and back as well as being converted to lower case. The line of code responsible for that operation is this:
$entry = ' ' . strip_tags(strtolower($entry)) . ' ';
Those extra spaces are important, because we’re actually requiring a space on either side of each word. After this line of code has been executed the short phrase I entered above would look like this:
" to be in love ah it is bliss "
This is after the lower case operation, the extra spaces have been added, and punctuation marks and other special symbols have been removed using the $drop_char_match array. Here’s how the regex will match what remains. Items inside [ ] are matches and are replaced by spaces; the other words are left behind. Note that with the matches eating up the spaces the second (and fourth and sixth…) two-letter word in a sequence of two-letter words will not match, as they don’t include a space. The space was sacrificed to the earlier match! So here is the string, with the matches marked out…
[ to ]be[ in ]love[ ah ]it[ is ]bliss
What is left after the items that matched the regex are replaced by a space?
be love it bliss
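The failure is easy to reproduce. Here is a sketch using Python’s re module (re.sub scans left to right with non-overlapping matches, just like preg_replace):

```python
import re

# The 2.0.6 pattern: literal spaces in place of \b word boundaries.
broken = r'[ ]([\S]{1,2}|[\S]{21,})[ ]'

entry = ' to be in love ah it is bliss '
cleaned = re.sub(broken, ' ', entry)

# Each match consumes its trailing space, so every other short word
# in a run of short words survives.
print(cleaned.split())  # ['be', 'love', 'it', 'bliss']
```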
So now I can see where the bogus two-letter words are coming from! These are the words that will be added to your search_wordmatch table. If the regex were applied recursively then this wouldn’t matter, as the new space added by the replace operation would be enough to allow the remaining two-letter words to be dropped as well. (In fact one person posted on phpbb.com that they altered their code so that the regex was executed three times in a row.) But part of the beauty of using regular expressions is being able to avoid complex looping code. You don’t have to keep doing the same thing over and over. If I were so inclined, I could write a very inefficient block of code that processed the string a character at a time.
Speaking of that… those of you that are quite observant might have noticed that the two regular expressions I’ve posted are from 2.0.4 and 2.0.6. What happened in 2.0.5?
$entry = explode(' ', $entry);
for ($i = 0; $i < sizeof($entry); $i++)
{
    $entry[$i] = trim($entry[$i]);
    if ((strlen($entry[$i]) < 3) || (strlen($entry[$i]) > 20))
    {
        $entry[$i] = '';
    }
}
$entry = implode(' ', $entry);
Remember that this code is run on the full text of every post that is saved (whether a new post or an edited one). It has to be efficient. Processing strings of data a character or even a word at a time is not very efficient. The loop in 2.0.5 was probably functional (I did not test it during my review) but I doubt that it was anywhere near as efficient.
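For reference, the 2.0.5 loop translates roughly to this Python sketch; it does produce the right answer (no space-consumption bug), it just does the work a word at a time:

```python
# A rough Python port of the 2.0.5 word-at-a-time loop, for illustration;
# the real code is PHP, and the performance concern is the same either way.
def clean_loop(entry):
    words = entry.split(' ')            # explode(' ', $entry)
    for i, word in enumerate(words):
        word = word.strip()             # trim($entry[$i])
        if len(word) < 3 or len(word) > 20:
            words[i] = ''               # blank out short and long words
        else:
            words[i] = word
    return ' '.join(words)              # implode(' ', $entry)

print(clean_loop(' to be in love ah it is bliss ').split())  # ['love', 'bliss']
```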
The Fix
The original regex seemed to work fine for me, but I never used languages other than English. I expect my words will be made up of letters from “a” to “z”. The regex has to be more flexible than that, and the inclusion of [\S] does appear to work. The problem is the switch from \b to [ ].
When you use \b it appears that the boundary is shared from one word to the next. If you use a required space as is done with [ ] then the space is not shared. Once it is found and replaced, then the regex starts with the next character. If that character is the first letter of a one or two-letter word (or even worse, a 50-letter word) then that word is included in your search database because it doesn’t have the required space at the beginning.
Without further ado, I present to you the merged regex with the best of both worlds:
$entry = preg_replace('/\\b([\S]{1,2}|[\S]{21,})\\b/',' ', $entry);
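A quick sanity check of the merged pattern, again sketched with Python’s re module (which treats \b, \S, and the quantifiers the same way PCRE does for this pattern):

```python
import re

# The merged pattern: \b boundaries (zero-width, so no space is consumed)
# plus \S so matching isn't limited to a-z0-9.
fixed = r'\b([\S]{1,2}|[\S]{21,})\b'

entry = ' to be in love ah it is bliss '
cleaned = re.sub(fixed, ' ', entry)

# Unlike the [ ] version, consecutive short words all drop.
print(cleaned.split())  # ['love', 'bliss']
```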
I updated my boards to use this regex and ran my MOD that rebuilds my search indexes. So far it has worked flawlessly. This update is included in my Efficient clean_words() Function MOD which was introduced in my last blog post. During the writing of this MOD (and this series of posts) I went back and searched through the phpBB database for other discussions about this. I found a number of folks that had provided solutions, none of them exactly the same as this. Now that I really understand what is going on, I can say that some of the solutions posted should work.
Make it Faster
This post was all about the regex used to get rid of short and long words so that they are not stored in your phpbb_search_wordlist (and wordmatch) tables. Without this fix, your tables can get filled up with words that should not be there, and that’s not a good thing. I did a lot of benchmarking during this process and initially thought that my updates to the clean_words() function were helping to improve performance. That is why I called my MOD related to these past two posts “Efficient clean_words() Function” as I thought I was making it more efficient.
It turns out that I was not. I corrected a problem where short or long words would get into your search index tables. That saves space and some processing time. But as far as the actual processing? It didn’t really add (or subtract) from the efficiency. I do absolutely believe that the changes suggested by the MOD are useful and appropriate, especially if you have a large board with very “chatty” posts made up of lots of short words. But will it dramatically improve your performance? No, not really. It will protect your search index from improper words, and it will protect your server from running extra search queries on duplicate words. Both of those are good for your board.
During my review I did, however, manage to make a change that had a side effect of improving my posting performance by about 4%. I mentioned that in the prior post, and I believe that I know what the actual cause is. I also believe that I can improve it even further.
However, once again we have run out of time and [ ] for this blog post, so you will have to come back for episode V for the big reveal.

I have tested the regex presented in this post quite thoroughly on an English language board with no issues. I have had less success with some foreign language boards. I will post some further results when I have something more concrete to share.
Comment by dave.rathbun — February 9, 2007 @ 7:25 pm
Update posted here. The news is not good; foreign language posts are not being properly processed by this expression.
Comment by dave.rathbun — February 12, 2007 @ 9:26 am
I just built a very similar regex for a similar purpose this morning (to remove one and two letter terms from a search).
It is worth mentioning that the \b boundary notation will break words at any character other than a-z0-9, so the regex shown above will turn the word: “you’re” into “you” because it will see “you’re” as two words: “you” and “re” and the “re” is only 2 characters so it is dropped.
I have not yet found a nice way to get around this.
Comment by Scott — November 5, 2007 @ 11:41 am
Well actually the following method will work to remove any one or two character words from a string without using the \b tags which break words like “don’t” and “you’re”:
Not the prettiest solution ever, but it is working for me.
Comment by Scott — November 5, 2007 @ 11:51 am
Hi, Scott, thanks for your comments. In the phpBB search process special characters like the ‘ in you’re are removed prior to the regex, leaving the word youre instead. So they have that part covered.
I’ve been using the code that I posted earlier on an English language board without issues, but it does cause problems on non-English boards. I might still have some data that I was loaned for testing that I can use… when I get time I will try your suggestion. Thanks!
PS – Hope you don’t mind but I edited your comment to include the “pre” tag so that your extra space will show up as you wanted. I moved one of your inline comments to a separate line to keep the screen from scrolling horizontally.
Comment by Dave Rathbun — November 5, 2007 @ 10:49 pm