Getting mySQL to use indexes, Part 1

I recently became a partner in a firm that is hosting a Drupal site with 19 million business listings.  /admin/content/node, that displays a filtered, paged listing of all the nodes in a site, is one of the most common pages for the admin of a Drupal site.  On a site with 19M nodes, just creating this page could take forever.  (The actual CPU time was a bit over 85 seconds, but the real time was a lot longer.)  There were two things that made a huge impact:

The first had to do with the query being run to generate this listing.





SELECT DISTINCT n.*, u.name FROM node n INNER JOIN users u ON n.uid = u.uid WHERE ...


For reasons that are still obscure to me, the DISTINCT keyword precludes the query optimizer from using the primary index on the user table.  (This was explained to me by a very helpful database expert at Rackspace.)  So getting the query to run without the DISTINCT keyword the ticket to getting this page to display without the long delay.

Exploration of the core Drupal code shows that this query gets generated by the function node_admin_nodes() in the core file ./modules/node/node.admin.inc.  

 $result = pager_query(db_rewrite_sql('SELECT n.*, u.name FROM {node} n '. $filter['join'] .' INNER JOIN {users} u ON n.uid = u.uid '. $filter['where'] .' ORDER BY n.changed DESC'), 50, 0, NULL, $filter['args']);

Note that the query goes through db_rewrite_sql() that gives any module implementing hook_rewrite_sql() the opportunity to modify the query.  The node module implements its own over-ride of hook_rewrite_sql() that looks like this: The page /admin/content/node displays a button whose label is "Rebuild Permissions".  While the next page warns you that this might take a long time, in my case it was painless, and resulted in a node_access table with one row:

function node_db_rewrite_sql($query, $primary_table, $primary_field) {
  if ($primary_field == 'nid' && !node_access_view_all_nodes()) {
    $return['join'] = _node_access_join_sql($primary_table);
    $return['where'] = _node_access_where_sql();
    $return['distinct'] = 1;
    return $return;
  }
}

This return value tells db_rewrite_sql() to do a join with the node_access table and to insert the DISTINCT keyword into the query.  Because it is doing a join, and because there is a one-to-many relationship between node and node_access (dues to multiple values for gid and realm), the DISTINCT keyword is necessary.  

But looking again at the function node_db_rewrite_sql(), we see that this modification is conditional on the value of $primary_field and the return value of node_access_view_all_nodes().  Now I can't do anything about the primary field.  But I might be able to stop node_access_view_all_nodes() from returning FALSE.   node_access_view_all_nodes() can be found in ./modules/node/node.module.  It returns TRUE if the query 

SELECT COUNT(*) FROM {node_access} WHERE nid = 0 $grants_sql AND grant_view >= 1

returns a non-zero value.

Examining the node_access table revealed that it was empty!  The the value returned by this query would always be 0.  Fortunately, there is an easy way to correct this.

The page /admin/content/node-settings displays a button labelled "Rebuild permissions".  Hitting it brings up a page warning you that this might take a long time.  Fortunately, in my case it took only a few seconds and resulted in a node access table with a single row:


Full Textsnidgidrealmgrant_viewgrant_updategrant_delete
EditDelete00all100


This says essentially, "Let anyone view anything".  (Permission to update or delete a node is deferred to other factors.)


Now the core node module built the node-listing query without the DISTINCT keyword and it would take no more than a few seconds waiting to see the page display.  


I was in Drupal-admin heaven, but only briefly.  As soon as I tried to filter the nodes by limiting them to only say, nodes of type 'business_listing_premium' things got really slow again.  To find out what I did to speed things up again, read my next post.

Popular Posts