Optimizing Random Users Via Last Visit Time
While I have not started in-depth MODding on phpBB3 yet, I do read the phpBB3 MODders forum from time to time just to start to get the flavor of how things have changed. The other day a database (query) question came up and I suggested an answer that I originally thought was only slightly different from what had already been proposed. However, after being asked which of the two solutions would be the least CPU intensive I did a bit more investigating.
I discovered that one solution was clearly better than the other, but only if the proper index was created.
Disclaimer: I tested on phpBB2. The index that I created does not exist in a standard phpBB2, nor does it exist in a standard phpBB3 install, so I suspect this post applies to both.
Defining the Problem
Here is a partial quote of the original question in the phpBB3 MOD forum:
I want to select 5 random memebers from last 30 active users.
Simple enough, yes? First, get the last 30 members that have visited the board, then randomly select 5 of those. This could easily be done procedurally via php code, but it can also be done directly in the database with the correct SQL code.
Order By Random
The first suggestion given was this:
$start = 0; $number = 30; $sql = 'SELECT * FROM ' . USERS_TABLE . ' ORDER BY user_lastvisit DESC, RAND()' ; $result = $db->sql_query_limit($sql, $number, $start);
With apologies to evil<3 who posted this there are a couple of things that I suggested changing. First is to avoid using the * to select every column in the table unless it’s absolutely needed. In this case, I made the assumption that the original poster was looking for a way to display a random five “recent visitors” somewhere on a page on the site. To do this doesn’t require every single bit of information about the user, just certain columns. If you select * then the entire row is returned. There are 73 columns in a standard phpBB3 users table, several of them varchar(255), and in my test installation all of the fields are mandatory. That means every single column has a value, even if it is just a space or other placeholder value. By setting up a specific list of columns to request in the query the amount of I/O is reduced. Less I/O should mean if all other things are equal the query will run faster because there are fewer bits and bytes to move around.
The other issue with the query as provided is that it’s not very efficient. It has two order by columns, one of which cannot possibly be indexed. (You can’t index something that doesn’t exist until the runtime of the query, so the rand() function result is impossible to tune.) Here is an abbreviated display for the explain plan for this query:
+-------------+------+---------------+------+---------+------+-------+---------------------------------+ | select_type | type | possible_keys | key | key_len | ref | rows | Extra | +-------------+------+---------------+------+---------+------+-------+---------------------------------+ | SIMPLE | ALL | NULL | NULL | NULL | NULL | 43367 | Using temporary; Using filesort | +-------------+------+---------------+------+---------+------+-------+---------------------------------+
This shows that no indexes will be used for this query at all, which is not good. I ran this query five times in a row on my large user table and got execution times of 0.12, 0.12, 0.12, 0.13, and 0.12 seconds.
Select Last 30 Then Random 5
As you can see from the explain data above, I have 43,367 rows in my users table right now, which is fairly large. Instead of scanning the entire table, it would be much more efficient to get the last 30 visitors in one pass and then pick five members randomly from that list. I suggested this SQL to do that:
SELECT u.user_id , u.username FROM phpbb_users u , (SELECT user_id FROM phpbb_users v ORDER BY user_lastvisit desc limit 30) v WHERE u.user_id = v.user_id ORDER BY rand() limit 5;
This is an interesting technique called “inline tables” as I am creating a new table on the fly by writing SQL code inside the FROM clause. Every database I have worked with supports this technique, so it should be portable. (I do not count Microsoft Access as a real database. ) What this SQL code will do is run the inline table to return a list of 30 users, then join that virtual table to the real table by
user_id
(which is a unique key) and randomly select five users from the joined result set.
Is it more efficient?
I ran this query five times (as I did with the other one) and got run times of 0.10, 0.12, 0.10, 0.10, and 0.11 seconds. It seems that it’s not really that much more effective, so is there really a clear winner?
Index Key Columns
I ran this query:
show indexes from phpbb_users
Side Note: I run my SQL directly on the database using the MySQL command line, rather than phpMyAdmin. If you use the GUI interface, then you can check for keys by looking at the appropriate screen instead of doing as I documented here.
The results of the query did not show an index on user_lastvisit
which is crucial to this solution. Here is the explain plan for my query without the index:
+-------------+--------+---------------+---------+-----------+-------+---------------------------------+ | select_type | type | possible_keys | key | ref | rows | Extra | +-------------+--------+---------------+---------+-----------+-------+---------------------------------+ | PRIMARY | ALL | NULL | NULL | NULL | 30 | Using temporary; Using filesort | | PRIMARY | eq_ref | PRIMARY | PRIMARY | v.user_id | 1 | | | DERIVED | ALL | NULL | NULL | NULL | 43367 | Using filesort | +-------------+--------+---------------+---------+-----------+-------+---------------------------------+
Notice that is also scans all 43,367 user rows. That’s okay. What isn’t okay is that it does so without the benefit of an index and it also has to do some additional work since more than one table is involved. It would seem that the first query should be more efficient since it only has one explain step and the second one has three.
However, the magic of a database indexing can fix this. The driver for this entire question is the user_lastvisit
column. After creating an index on this field (which is not indexed by default in either phpBB2 or phpBB3) here is the new explain plan.
+-------------+--------+---------------+----------------+-----------+-------+---------------------------------+ | select_type | type | possible_keys | key | ref | rows | Extra | +-------------+--------+---------------+----------------+-----------+-------+---------------------------------+ | PRIMARY | ALL | NULL | NULL | NULL | 30 | Using temporary; Using filesort | | PRIMARY | eq_ref | PRIMARY | PRIMARY | v.user_id | 1 | | | DERIVED | index | NULL | user_lastvisit | NULL | 43367 | | +-------------+--------+---------------+----------------+-----------+-------+---------------------------------+
Yup, still have to look at (or so the database optimizer thinks) all 43,367 users. But this time we do so with the benefit of an index. What is the impact?
The query without an index, remember, ran in 0.10, 0.12, 0.10, 0.10, and 0.11 seconds. After creating the index I ran the same query five times and got 0.00 seconds of execution time on every trial.
Does the index help the first query? Interestingly enough, it does not. The explain plan is identical with or without the index, and the query execution times do not improve either.
However, it gets worse. It doesn’t perform as required. My query gets the last 30 users that have logged in and then randomly selects five of those. The ORDER BY clause happens after all of the select and join process is complete, so I am ordering by RAND() on at most 30 rows. The other suggestion will pick the same five users nearly every single time. Why? The secondary sort column (the “random factor”) will only come into play if two users have exactly the same last visit time (down to the second). When you have two columns in the ORDER BY clause, the first column is the primary sort and every row returned will first be sorted by that column. If there are ties in the first column then and only then will the second column be sorted.
So the first solution suggested is the worst of both worlds: not only is it slower, it is also incorrect.
Conclusion
There may be other solutions to this. I don’t mean to present this post as the ultimate answer to this question. What I hoped to accomplish with this post was to show how two solutions that look the same are not always equivalent. Subtle differences can have a huge impact on functionality.
The second lesson is that if you’re going to be asking the same question from your database over and over you should carefully consider indexing the columns used in the WHERE or ORDER BY clauses. I did some work for someone a while back (their board is one of the top ten phpBB boards on the “big boards” site). They wanted to display the top ten posters on their index. The code as written was taking over ten seconds just to run the query and then the php code / template process still had to complete. I rewrote the code and added an index on the
user_posts
column and the code ran in less than a hundredth of a second.
On the other hand, too many indexes can also be a problem, so don’t go out and create an index for every single column in your database. Just the ones that truly matter. In this case, it makes a substantial difference.

My conclusion over a year ago was to avoid using RAND() against a large data set. Better to select your data, then use PHP for the rest.
Comment by Dog Cow — September 27, 2009 @ 2:36 pm