devel:scalability_proposal

DISCLAIMER: I am not a DokuWiki core programmer and am not used to DokuWiki's code. I may not 100% understand the functions I point out here, and therefore my points may be mistaken.

From my previous email

After being kicked off our provider's SAN share for too heavy usage, then buying our own NFS server and trying to put DokuWiki on it, we understood why we got kicked out :) It seems that DokuWiki performs many, many file accesses, which would probably go unnoticed on local, fast disks, but which are completely killing performance when DokuWiki is put on NFS.

After some investigation (sigh :D) we found a couple of issues which could be improved. But let's start with the hardware: DokuWiki is running on a dual quad-core Xeon at 1.6 GHz with 8 GB of RAM; the NFS server is a dual 2.8 GHz hyperthreaded machine with 4 GB of RAM and 10k SCSI disks in software RAID 10. Right now, if I put DokuWiki's data on the NFS server, it holds for, well, a good 3 minutes before the load reaches 20 :P

So, investigation: we found the following function:

/**
 * Return a list of available and existing page revisions from the attic
 *
 * @author Andreas Gohr <andi@splitbrain.org>
 * @see    getRevisions()
 */
function getRevisionsFromAttic($id,$sorted=true){
  $revd = dirname(wikiFN($id,'foo'));
  $revs = array();
  $clid = cleanID($id);
  if(strrpos($clid,':')) $clid = substr($clid,strrpos($clid,':')+1); //remove path
  $clid = utf8_encodeFN($clid);
 
  if (is_dir($revd) && $dh = opendir($revd)) {
    while (($file = readdir($dh)) !== false) {
      if (is_dir($revd.'/'.$file)) continue;
      if (preg_match('/^'.$clid.'\.(\d+)\.txt(\.gz)?$/',$file,$match)){
        $revs[]=$match[1];
      }
    }
    closedir($dh);
  }
  if($sorted) rsort($revs);
  return $revs;
}

If I am not mistaken, this function does a readdir() of the attic directory, runs a preg_match() on every file to see whether it matches $id, and returns the list of revisions for $id. Two things:

yann@dongo:/srv/www/fr/doc.ubuntu-fr.org/htdocs/data/attic$ ls -l | wc -l
46426

That's on a 3-year-old wiki :) And also:

yann@dongo:/srv/www/fr/doc.ubuntu-fr.org/htdocs$ grep -R getRevisionsFromAttic bin/ conf/ inc/ lib/
inc/changelog.php: * @see    getRevisionsFromAttic()
inc/changelog.php:  $revs = array_merge($revs,getRevisionsFromAttic($id,false));
inc/changelog.php:function getRevisionsFromAttic($id,$sorted=true){

This means the function is called only once, in changelog.php, in the function getRevisions($id, $first, $num, $chunk_size=8192), right at the end:

  $revs = array_merge($revs,getRevisionsFromAttic($id,false));
  $revs = array_unique($revs);

So what happens exactly: the changes to a particular page are stored in data/meta/ in a file called file.changes, which is a split version of the old changes.log. DokuWiki parses that file to find out the latest changes that happened to that page. But DokuWiki *also* checks for existing files in attic/, possibly older revisions, and merges those revisions with the ones found in the changelog.
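
To make the distinction concrete, here is a minimal sketch of reading revisions from the per-page changelog alone, assuming the tab-separated .changes format with the timestamp as the first field; the helper name is made up, and getRevisions() essentially already does this before the attic merge:

// Hypothetical helper, for illustration only: collect revision timestamps
// from a page's .changes file instead of scanning the attic directory.
function getRevisionsFromChangelog($id, $sorted=true){
  $file = metaFN($id, '.changes');       // data/meta/<id>.changes
  $revs = array();
  if(!@file_exists($file)) return $revs;
  foreach(file($file) as $line){
    $fields = explode("\t", rtrim($line, "\n"));
    if($fields[0] !== '') $revs[] = $fields[0];  // first field is the timestamp
  }
  if($sorted) rsort($revs);
  return $revs;
}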

My suggestion: should we just get rid of those two array_merge()/array_unique() lines? If someone deleted a revision from the changelog, it could be intentional, but that revision would still be displayed as the file is still in attic/ :) As this is also the only call to that function, I'd suggest we just get rid of it, or keep it to rebuild changes.log somewhere, but call it only from the admin panel… Right now, for me, this function is doing 45000 getattr() calls, and as many regexp checks :)

This function no longer exists in the latest release of DokuWiki (2008-05-05). Furthermore, if you delete a revision from file.changes, it is no longer visible. DidRocks 2008/05/14 10:44

Second point, the following function:

/**
 * returns an array of full paths to all metafiles of a given ID
 *
 * @author Esther Brunner <esther@kaffeehaus.ch>
 */
function metaFiles($id){
  $name  = noNS($id);
  $dir   = metaFN(getNS($id),'');
  $files = array();

  $dh = @opendir($dir);
  if(!$dh) return $files;
  while(($file = readdir($dh)) !== false){
    if(strpos($file,$name.'.') === 0 && !is_dir($dir.$file))
      $files[] = $dir.$file;
  }
  closedir($dh);

  return $files;
}

If I understand it right, it returns an array containing all the meta files for a specific page. For this it reads data/meta, checks every entry against the name we are looking for, and adds matches to the array it then returns. Comments:

yann@dongo:/srv/www/fr/doc.ubuntu-fr.org/htdocs/data/meta$ ls -l | wc -l
6203

Suggestion: having a quick look at meta/, it seems that for a given ID you get 3 different files: a .meta, a .changes and an .indexed. So, just check whether these 3 files exist and return them in an array? Seems simple; maybe I am missing something, don't be too harsh if that's the case :P
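
A rough sketch of that suggestion, probing a fixed list of extensions with file_exists() instead of reading the whole directory (the function name is made up, and as the reply below points out, plugins can add further meta files that this would miss):

// Sketch of the suggestion above: probe known extensions directly.
// NOTE: meta files created by plugins would be missed (see reply below).
function metaFilesDirect($id){
  $files = array();
  foreach(array('.meta', '.changes', '.indexed') as $ext){
    $file = metaFN($id, $ext);
    if(@file_exists($file)) $files[] = $file;
  }
  return $files;
}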

Plugins can have their own files in the meta dir, like the linkback plugin, the discussion plugin or the task plugin, for example. Also, this function is only referenced once, namely when a page is deleted, to wipe out remaining meta files ;-). — Michael Klier 2008/05/05 16:09

I will continue to look for improvements - but I think that any readdir() working on attic/, pages/ or meta/ should be gotten rid of, as it means linear complexity in the number of pages and therefore no scalability as your wiki grows :(


Some more ideas

In io_readFile(), it should be possible to get rid of the file_exists() call and to rely on the return values of gzfile(), bzfile() etc… if they are coded properly :) This would save unnecessary file accesses.
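
A minimal sketch of that idea, with the name made up and the handling simplified compared to the real io_readFile():

// Sketch only: skip the separate file_exists() check and rely on the read
// call itself returning false when the file is missing (warning suppressed).
function io_readFile_sketch($file){
  if(substr($file, -3) == '.gz'){
    $lines = @gzfile($file);            // false if the file cannot be read
    return ($lines === false) ? '' : join('', $lines);
  }
  $ret = @file_get_contents($file);     // false if the file cannot be read
  return ($ret === false) ? '' : $ret;
}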

In the pagemove plugin, the function _pm_movemeta() should use the metaFiles() function pointed out earlier instead of reimplementing it: that would make the plugin a lot faster, since any improvement to metaFiles() would then benefit the plugin as well.
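
I do not know _pm_movemeta()'s exact signature, so this is only a rough illustration of the idea, assuming the plugin simply needs to move every meta file of the old page to the new ID:

// Rough illustration only: move all meta files of $old_id to $new_id by
// reusing metaFiles() instead of re-scanning the meta directory.
function movemeta_sketch($old_id, $new_id){
  $name = utf8_encodeFN(noNS($old_id));
  foreach(metaFiles($old_id) as $file){
    $ext    = substr(basename($file), strlen($name));  // ".meta", ".changes", ...
    $target = metaFN($new_id, $ext);
    io_makeFileDir($target);   // make sure the target namespace directory exists
    @rename($file, $target);
  }
}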

SimplePie

Let's consider the following function in inc/SimplePie.php:

    /****************************************************
    DELETE OUTDATED CACHE FILES
    Copyright 2004 by "adam at roomvoter dot com". This material
    may be distributed only subject to the terms and conditions set
    forth in the Open Publication License, v1.0 or later (the latest
    version is presently available at http://www.opencontent.org/openpub/).
    This function deletes cache files that have not been used in a hour.
    ****************************************************/
    function clear_cache($path, $max_minutes=60) {
        if (is_dir($path) ) {
            $handle = opendir($path);
            while (false !== ($file = readdir($handle))) {
                if ($file != '.' && $file != '..' && pathinfo($file, PATHINFO_EXTENSION) == 'spc') {
                    $diff = (time() - filemtime("$path/$file"))/60;
                    if ($diff > $max_minutes) unlink("$path/$file");
                }
            }
            closedir($handle);
        }
    }

This function reads all the files in the cache directory and deletes the ones that are older than 60 minutes. I don't know which directory it is cleaning, but if it is one in data/cache/, then this represents many files again. This function is called in init() in SimplePie, so every time an RSS feed is requested?

Several other options:

  • Run this via a cron script every hour
  • Create a timestamp file; clear the cache again only once this file is older than 60 minutes, then update the file to refresh its timestamp (see the sketch after this list).
  • Is this needed at all? Can't the timestamp of a cache file be checked each time it is accessed?
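
A minimal sketch of the second option, using a made-up marker file so the expensive readdir() only runs once per hour:

    // Sketch of the timestamp-file idea: only walk the cache directory when
    // the marker file says the last cleanup is more than $max_minutes old.
    // The marker file name is made up for this example.
    function clear_cache_throttled($path, $max_minutes=60) {
        $marker = "$path/.last_cache_clear";
        if (@file_exists($marker) && (time() - filemtime($marker))/60 < $max_minutes) {
            return; // cleaned recently, skip the readdir() entirely
        }
        clear_cache($path, $max_minutes);  // the original function above
        @touch($marker);                   // remember when we last cleaned
    }
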
This function has changed in the latest revision of DokuWiki. Now the cache is a class declared in inc/cache.php. The removeCache() function seems to delete the file referenced by $cache directly (@unlink), after having retrieved the concerned file path from $cache (no more readdir).
This function is called from saveWikiText() in inc/common.php (apparently every time a modification is saved) after the call
 $cache = new cache_instructions($id, $file);
which retrieves the cache file associated with this page and removes it, so that a new one is created the next time the page is loaded.
The second call is in the p_cached_output() function in inc/parserutils.php and is executed when the cache is no longer up to date (function useCache() in inc/cache.php, using the common function trigger_event()).
This second call happens when an export is done (act_export) or when a plugin requests it (locale_xhtml, called many times in every action).
DidRocks 2008/05/14 - 11:40
