A little (?) brain exercise

I wanted to try out something that first seemed to be quite simple, but then I had to see that I didn’t manage to get the desired result in one single query. Maybe one of you has an idea how to do it.

Here’s the table that you need for this example and a little Stored Procedure to quickly fill the table:

CREATE TABLE tt (
  id int unsigned NOT NULL auto_increment,
  d1 int unsigned NOT NULL,
  d2 int unsigned NOT NULL,
  PRIMARY KEY (id),
  KEY d1 (d1)
) ENGINE=InnoDB;

… and the Stored Procedure …

DELIMITER //

DROP PROCEDURE IF EXISTS insertIntoTT //
CREATE PROCEDURE insertIntoTT(in _rows int)
begin
declare counter int default 0;

while counter < _rows do
  set @a := floor(rand() * 20) + 1;
  set @b := floor(rand() * 1000000);
  insert into tt (d1, d2) values (@a, @b);
  set counter := counter + 1;
end while;

end //

DELIMITER ;

With

CALL insertIntoTT(10000);

you can insert 10,000 rows and end up with a table like this:

mysql> SELECT * FROM tt LIMIT 10;
+----+----+--------+
| id | d1 | d2     |
+----+----+--------+
| 1  | 16 | 578121 |
| 2  | 12 | 209260 |
| 3  | 6  | 772868 |
| 4  | 1  | 801153 |
| 5  | 19 | 266106 |
| 6  | 11 | 845643 |
| 7  | 13 | 672265 |
| 8  | 9  | 162255 |
| 9  | 11 | 32461  |
| 10 | 14 | 156472 |
+----+----+--------+
10 rows in set (0.00 sec)

The numbers in column d1 are in the range from 1 to 20 and those in d2 are from 0 to 999,999.

Now I wanted to get the top 3 values from the d2 column for each value in d1, the output should look like this (at the bottom of this article, I’ll show you the Stored Procedure that produced the output):

mysql> CALL getOutput();
+------+----+--------+
| id   | d1 | d2     |
+------+----+--------+
| 8590 | 1  | 992748 |
| 5378 | 1  | 991187 |
| 8010 | 1  | 989804 |
| 4271 | 2  | 995870 |
| 6825 | 2  | 995212 |
| 7195 | 2  | 993449 |
| 7518 | 3  | 998659 |
| 327  | 3  | 994499 |
| 7523 | 3  | 992530 |
| 9637 | 4  | 997258 |
| 9505 | 4  | 986400 |
| 3577 | 4  | 984371 |
| 4474 | 5  | 998324 |
| 9616 | 5  | 990006 |
| 7856 | 5  | 987046 |
| 8149 | 6  | 995602 |
| 5339 | 6  | 994606 |
| 4791 | 6  | 991693 |
| 5956 | 7  | 999078 |
| 6360 | 7  | 995711 |
| 1280 | 7  | 994592 |
| 1449 | 8  | 997588 |
| 4993 | 8  | 995742 |
| 3966 | 8  | 994909 |
| 3755 | 9  | 996587 |
| 2537 | 9  | 996559 |
| 5372 | 9  | 994984 |
| 4907 | 10 | 998818 |
| 2512 | 10 | 997335 |
| 1156 | 10 | 996677 |
| 1653 | 11 | 999390 |
| 9901 | 11 | 998331 |
| 8307 | 11 | 998075 |
| 2526 | 12 | 998190 |
| 5922 | 12 | 996840 |
| 3005 | 12 | 991681 |
| 2632 | 13 | 991479 |
| 906  | 13 | 991448 |
| 2023 | 13 | 990165 |
| 3903 | 14 | 998612 |
| 7725 | 14 | 993921 |
| 5169 | 14 | 993021 |
| 6371 | 15 | 999566 |
| 5788 | 15 | 996766 |
| 7331 | 15 | 996545 |
| 8483 | 16 | 998472 |
| 1061 | 16 | 997509 |
| 6350 | 16 | 994854 |
| 2927 | 17 | 998590 |
| 8081 | 17 | 993340 |
| 6595 | 17 | 993104 |
| 4960 | 18 | 999275 |
| 6432 | 18 | 998478 |
| 2029 | 18 | 996769 |
| 1971 | 19 | 999992 |
| 1067 | 19 | 997117 |
| 3812 | 19 | 996488 |
| 9869 | 20 | 999248 |
| 8045 | 20 | 999102 |
| 3520 | 20 | 998395 |
+------+----+--------+
60 rows in set (0.11 sec)

Query OK, 0 rows affected (0.11 sec)

My question is, can I get the same output without a Stored Procedure, only with a normal query (including a subquery or a join). I tried three approaches.

My first one:

mysql> SELECT id, d1, d2
    -> FROM tt AS a
    -> WHERE id IN
    ->   (
    ->   SELECT id
    ->   FROM tt
    ->   WHERE d1 = a.d1
    ->   ORDER BY d2 DESC
    ->   LIMIT 3
    ->   )
    -> ORDER BY d1, d2 DESC;
ERROR 1235 (42000): This version of MySQL doesn't 
yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

The subquery should retrieve the IDs of the rows that have values in the top 3 for column d2 – the where clause should make the connection. Unfortunately, the LIMIT clause is not allowed here.

My second attempt:

mysql> SELECT id, d1, d2
    -> FROM tt AS a
    -> WHERE d2 >=
    ->   (
    ->   SELECT d2
    ->   FROM tt
    ->   WHERE d1 = a.d1
    ->   ORDER BY d2 DESC
    ->   LIMIT 3, 1
    ->   )
    -> ORDER BY d1, d2 DESC;
ERROR 1242 (21000): Subquery returns more than 1 row

This time, I tried to get the third highest d2 value for each d1 value inside the subquery and get all columns, where d2 is greater or equal the retrieved value. Even though the subquery can only result in one row (because of the LIMIT clause), MySQL complains that it returns more than one row.

My third attempt was a bit desperate and might look a bit strange:

mysql> SELECT a.id, a.d1, a.d2
    -> FROM tt AS a INNER JOIN
    ->   (
    ->   SELECT id
    ->   FROM tt
    ->   WHERE d1 = a.d1
    ->   ORDER BY d2 DESC
    ->   LIMIT 3
    ->   ) AS b;
ERROR 1054 (42S22): Unknown column 'a.d1' in 
'where clause'

The derived second table in this join should again retrieve the IDs of all the top 3 rows. But the WHERE clause in the subquery can’t get the connection to the outer query.

Any ideas how to get the result in one query?

So, here’s the promised Stored Procedure that solved the problem in quite a complex manner:

DELIMITER //

DROP PROCEDURE IF EXISTS getOutput //

CREATE PROCEDURE getOutput()
BEGIN
DECLARE _found INT DEFAULT 0;
DECLARE _d1 INT DEFAULT 0;
DECLARE cur CURSOR FOR
  SELECT DISTINCT d1
  FROM tt
  ORDER BY d1;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET _found := 1;

DROP TEMPORARY TABLE IF EXISTS tt_temp;

CREATE TEMPORARY TABLE tt_temp (
  id INT UNSIGNED NOT NULL,
  d1 INT UNSIGNED NOT NULL,
  d2 INT UNSIGNED NOT NULL);

OPEN cur;

d1_loop:LOOP

  FETCH cur INTO _d1;

  IF _found = 1 THEN
    LEAVE d1_loop;
  END IF;

  INSERT INTO tt_temp
    SELECT id, d1, d2
    FROM tt
    WHERE d1 = _d1
    ORDER BY d2 DESC
    LIMIT 3;

END LOOP;

SELECT * FROM tt_temp;

END //


DELIMITER ;

The downside of information_schema

In general, information_schema is a great thing – you can get all kinds of information out of it in quite a simple way.

But it has a downside – on a server with lots of databases and tables, it can be very slow. However, more and more applications rely on information_schema, and that can be a problem.

I just tried to update phpMyAdmin for db4free.net from 2.7.0-pl1 to 2.8.0.3. The update itself was a smooth thing, but when I tested the new version and tried to open a database, phpMyAdmin just couldn’t make it.

So I looked for the cause, and I found it quite quickly. I searched the whole phpMyAdmin directory for the string “information_schema” and there were quite a lot of results. When I did the same for the directory of phpMyAdmin 2.7.0-pl1, there were no results at all.

So while information_schema is good for many things, it might not be so good to use it extensively in administration applications, if it can be avoided.