Metadata Storage

If the content stored in a wiki page is data, things like the time of last update, who updated it, the filesize etc. could all be regarded as metadata for the wiki page. This page describes where and how such additional data is stored in DokuWiki.

Plugins can also use metadata for other purposes. Besides storing obvious metadata for the page, it can hold data used to determine whether a cache can be used, or settings such as whether a certain feature of the plugin should be enabled on a page.

Storage

DokuWiki does not store all metadata in a central place (like a database or registry). Some metadata are simply properties of the page's own data file (e.g. file size, last modified date); the other metadata are kept by DokuWiki within the meta directory, in the .meta file corresponding to the wiki page name. There is also an index in which selected metadata can be searched.

Metadata Renderer

Info in the meta directory is initially written by the metadata renderer. It creates a parallel file for each page named <pageid>.meta in the meta directory. The file is a serialized multi-dimensional PHP array whose keys follow the Dublin Core element names.

Data Structure

Currently, the following metadata is saved by the core metadata renderer:

  • title – string, first heading
  • creator – string, full name of the user who created the page
  • user – string, the login name of the user who created the page
  • description – array
    • abstract – raw text abstract (250 to 500 chars) of the page
    • tableofcontents – array, list of arrays with header id ('hid'), title ('title'), list item type ('type') and header level ('level')
  • contributor – array, list of user ID ⇒ full name of users who have contributed to the page
  • date – array
    • created – timestamp, creation date
    • modified – timestamp, date of last non-minor modification
    • valid
      • age – seconds, period in seconds before the page should be refreshed (used by 'rss' syntax only)
  • last_change – array, the last changelog entry
    • date – timestamp, date of the last change
    • ip – ip of the user editing
    • type – type of the edit (C create, E edit, e minor edit, D delete, R revert)
    • id – id of the page
    • user – username of the user editing
    • sum – summary of the editor
    • extra – extra data, used for storing the revision (timestamp) in the case of a revert
  • relation – array
    • isreferencedby – array, list of pages that link to this page: ID ⇒ boolean exists, this is not used or written by DokuWiki core
    • references – array, list of linked pages: page ID ⇒ boolean exists
    • media – array, list of linked media files: media ID ⇒ boolean exists
    • firstimage – id or url of the first image in the page
    • haspart – array, list of included rss feeds (and more, see below)
  • internal – array
    • cache – boolean, if the cache may be used
    • toc – boolean, if the toc shall be displayed

Additionally, plugins can support more metadata elements. Currently used:

  • relation – array
    • haspart – array, list of included pages: ID ⇒ boolean exists (include plugin) or rss feeds
    • odt – array, list with properties for ODT plugin
      • template – media id of ODT-file used as template
  • subject – array, lists of tags (tag plugin, blogtng plugin, flattr plugin); this is used by feed.php, if present
  • type – string, 'draft' for drafts (blog plugin)
  • geo – array, list of geographic tags (geotag, openlayersmap, socialcards and spatialhelper plugins)
    • lat – number, latitude of this location in decimal degrees
    • lon – number, longitude of this location in decimal degrees
    • alt – number, altitude in meters above sea level
    • region – string, region of this location, eg. a province or state
    • country – string, the country of this location
    • placename – string, placename describing this location or area
    • geohash – string, geohash of this location

It's recommended to use keys from the Dublin Core element set for any metadata that might be interesting for external use.

For plugin internal data it is recommended to store your keys under the plugin key:

  • plugin – array, contains keys for all plugins storing metadata
    • yourplugin – array, the keys you need for your plugin

All of this data is stored in an associative array with two keys: 'current' for all current data (including the persistent data), and 'persistent' for data that shall be kept across metadata rendering.
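As a rough illustration of this layout, the sketch below builds a small array with the 'current' and 'persistent' keys, serializes it in the way a .meta file is described as being stored, and reads a value back. The sample values are made up for illustration; real files contain many more keys.

```php
<?php
// Hypothetical sample data; real .meta files hold many more keys.
$meta = [
    'current' => [
        'title'   => 'Metadata Storage',
        'creator' => 'Jane Doe',
        'date'    => ['created' => 1300000000, 'modified' => 1300003600],
        'plugin'  => ['yourplugin' => ['enabled' => true]],
    ],
    'persistent' => [
        'creator' => 'Jane Doe',
        'date'    => ['created' => 1300000000],
    ],
];

// The file content is a serialized PHP array.
$fileContent = serialize($meta);

// Reading it back; allowed_classes=false refuses to instantiate objects.
$loaded = unserialize($fileContent, ['allowed_classes' => false]);

echo $loaded['current']['title'], "\n";
echo $loaded['persistent']['date']['created'], "\n";
```

Inside DokuWiki you would normally not read the file yourself but go through p_get_metadata(), which returns values from the 'current' part.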

Metadata Persistence

Internally DokuWiki maintains two arrays of metadata, current & persistent. The persistent array holds duplicates of those key/values which should not be cleared during the rendering process. All requests for metadata values using p_get_metadata() are met using the current array.

Examples of persistent metadata keys are:

  • 'creator'
  • 'contributor'

Running of metadata rendering

Metadata rendering is only started by p_get_metadata() and p_set_metadata(). This differs from the xhtml renderer. The wiki page parsing process has two stages: the Handler generates the instructions, and a renderer then generates output (such as xhtml) with these instructions as input. Like all renderers, the metadata renderer uses the same instructions as input. In the metadata renderer the metadata can be accessed directly at $renderer->meta and $renderer->persistent. Some examples and a bit of explanation can be found in the syntax plugins development documentation.

The metadata renderer also creates a short raw text abstract. The abstract is created from the rendered instructions by adding compact text without HTML to $this->doc. Check $this->capture to see whether the renderer is still collecting text for the abstract.

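// Add text only while the abstract is still being collected
// (the first few sections). Capturing is also switched off elsewhere,
// e.g. by the section handling of the metadata renderer.
if ($this->capture) {
    if ($linktitle) {
        $this->doc .= $linktitle;
    } else {
        $this->doc .= '<' . $url . '>';
    }
}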

The timing thus differs from that of the xhtml renderer: it depends on the render flags passed to p_get_metadata() and on the cache status. The aim is to guarantee that the metadata renderer runs when needed, but not unnecessarily. Read more about render flags in Functions to Get and Set Metadata below.

Metadata and Plugins

There are two ways for plugins to interact with metadata rendering:

  • Syntax Plugins can create metadata for the rendered page in their render() method by handling the case $format=="metadata". The current metadata can be accessed and modified in the $renderer->meta array, and persistent values are in the $renderer->persistent array. When persistent metadata is modified, its copy in the current metadata should be modified, too.
  • Action Plugins can register for the PARSER_METADATA_RENDER event to inspect or modify metadata before or after metadata rendering.
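The first option can be sketched as follows. To keep the example runnable outside DokuWiki, the logic that would live in a syntax plugin's render() method is written as a plain function, and a stdClass object stands in for the metadata renderer. The 'example' plugin key and the 'enabled' and 'owner' fields are hypothetical.

```php
<?php
/**
 * Body of a syntax plugin's render() method, written as a plain function so
 * it can be run outside DokuWiki. $renderer stands in for the metadata
 * renderer and only needs public $meta and $persistent arrays.
 * The 'example' plugin key and the 'enabled'/'owner' fields are hypothetical.
 */
function example_render(string $format, object $renderer, array $data): bool
{
    if ($format !== 'metadata') {
        return false; // other formats (e.g. xhtml) are handled elsewhere
    }

    // Current metadata: rebuilt on every metadata rendering.
    $renderer->meta['plugin']['example']['enabled'] = $data['enabled'];

    // Persistent metadata: kept across rendering; mirror it into current too.
    if (isset($data['owner'])) {
        $renderer->persistent['plugin']['example']['owner'] = $data['owner'];
        $renderer->meta['plugin']['example']['owner']       = $data['owner'];
    }
    return true;
}

// Stand-in renderer for trying the function outside DokuWiki.
$renderer = new stdClass();
$renderer->meta = [];
$renderer->persistent = [];

example_render('metadata', $renderer, ['enabled' => true, 'owner' => 'alice']);
print_r($renderer->meta);
```

Only the 'owner' value is written to both arrays; the 'enabled' flag is recomputed on every rendering and therefore does not need to be persistent.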

Persistent metadata can also be set at any time using the p_set_metadata function described below. Current metadata should only be set in the context of the renderer, as it will be overwritten the next time metadata is rendered.

Metadata can be retrieved using the p_get_metadata function that is described below. Plugins can also add metadata to the metadata index and search the indexed metadata. This is used in the tag plugin.

Note that persistent metadata is never cleaned and is always used as the basis for the current metadata. When switching from persistent to non-persistent metadata in a plugin, make sure you implement a cleanup routine that removes your plugin's persistent metadata wherever it exists. For this reason, non-persistent metadata should be preferred whenever possible.

If you want to make sure that your plugin's metadata doesn't interfere with other plugins or DokuWiki itself consider using plugin_$plugin as prefix/top level key (especially for persistent metadata, current metadata that fits in the Dublin Core element set should be stored as outlined above).

It is very difficult to cleanly update persistent metadata properties that are arrays from various places: in most cases you don't know which entries are old metadata that should be cleaned up and which are metadata from other plugins that should be kept (or not, because the plugin was disabled). For this case, consider using keys that are unique to your plugin and merging them manually into the current metadata using the PARSER_METADATA_RENDER event. That way you can, for example, store custom tags in the persistent metadata and add them to the subject metadata. Your plugin's metadata then also stops being used when your plugin is disabled.
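A minimal sketch of that merging step, written as a plain function so it runs on its own. It assumes the metadata array has the 'current' and 'persistent' keys described earlier; 'plugin_example_tags' is a hypothetical plugin-unique key, and in a real plugin this logic would sit in a PARSER_METADATA_RENDER handler.

```php
<?php
/**
 * Logic for a PARSER_METADATA_RENDER handler, as a plain function.
 * $meta is assumed to have 'current' and 'persistent' keys;
 * 'plugin_example_tags' is a hypothetical plugin-unique key.
 */
function example_merge_tags(array $meta): array
{
    $own = $meta['persistent']['plugin_example_tags'] ?? [];
    if (!is_array($own) || $own === []) {
        return $meta; // nothing stored by this plugin
    }

    $subject = $meta['current']['subject'] ?? [];
    if (!is_array($subject)) {
        $subject = [$subject];
    }

    // Append our tags to whatever was rendered already, without duplicates.
    $meta['current']['subject'] = array_values(array_unique(array_merge($subject, $own)));
    return $meta;
}

$meta = [
    'current'    => ['subject' => ['wiki']],
    'persistent' => ['plugin_example_tags' => ['howto', 'wiki']],
];
$meta = example_merge_tags($meta);
print_r($meta['current']['subject']);
```

Because the tags live under their own persistent key, the shared subject array is never written persistently, and nothing needs to be cleaned up there when the plugin is disabled.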

Functions to Get and Set Metadata

There are two functions in inc/parserutils.php to deal with metadata:

  • p_get_metadata($id, $key, $render) returns a metadata value for a page.
    • $id is the ID of a wiki page; required
    • $key is the name of the metadata item to be retrieved. Defaults to false. If empty, an array of all the metadata items is returned. To retrieve items that are stored in sub-arrays, separate the keys of the different levels by spaces, e.g. relation references for the data stored in $meta['relation']['references'] in the renderer.
    • $render int, determines whether the page metadata should be generated by the renderer when the metadata cache indicates that the stored metadata shouldn't be used and p_get_metadata isn't already being called from within p_get_metadata. There are several possibilities:
      • METADATA_DONT_RENDER means the metadata won't be generated/updated on request, use this when you request metadata for a lot of pages in a row as p_get_metadata can trigger the parsing and rendering of the requested page.
      • METADATA_RENDER_USING_CACHE is the default, it uses the standard DokuWiki caching system, the behavior can be changed using the PARSER_CACHE_USE event. Below you can find more details on metadata and caching.
      • METADATA_RENDER_USING_SIMPLE_CACHE means a much simpler caching will be used: it only considers the modification time of the page and can't be changed by plugins. Use this when you request very simple properties of the page like its title.
      • METADATA_RENDER_UNLIMITED means that metadata for an unlimited number of pages should be rendered. Normally only P_GET_METADATA_RENDER_LIMIT (default: 5) pages are rendered for metadata in one request. This should be used in locations like the cli indexer where time doesn't really matter but metadata should always be fresh. This option can be combined with the previous two options using a bitwise OR (|).
      • false is interpreted as METADATA_DONT_RENDER (this parameter used to be a boolean before the 2011-05-25 release)
      • true is interpreted as METADATA_RENDER_USING_CACHE
  • p_set_metadata($id, $data, $render, $persistent) sets some properties in the metadata, uses the metadata inside the renderer when there is a renderer for the specified page.
    • $id is the ID of a wiki page; required
    • $data is an array with key => value pairs to be set in the metadata; required. Note that the keys here are top-level keys only. If the key is description, date or contributor, the value is expected to be an array and is merged with the existing data. If the key is relation, all sub-keys are merged when array data already exists for them. Other keys are not merged as arrays but simply stored as values, which overwrites any existing subkeys.
    • $render boolean, whether or not the page metadata should be generated with the renderer before the metadata is set; optional, default is false
    • $persistent a boolean which indicates whether or not the particular metadata value will persist through the next metadata rendering. The default value is true.
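The calls below sketch how these two functions might be used from plugin code. The helper functions only work inside DokuWiki, where p_get_metadata(), p_set_metadata() and the METADATA_* constants are defined; the helper names and the 'plugin_example' key are hypothetical.

```php
<?php
// These helpers only work inside DokuWiki, where p_get_metadata(),
// p_set_metadata() and the METADATA_* constants are defined.
// The function names and the 'plugin_example' key are hypothetical.

/** Read the outgoing links of many pages without triggering any rendering. */
function example_collect_links(array $ids): array
{
    $links = [];
    foreach ($ids as $id) {
        // Keys of nested levels are separated by a space.
        $refs = p_get_metadata($id, 'relation references', METADATA_DONT_RENDER);
        $links[$id] = is_array($refs) ? array_keys($refs) : [];
    }
    return $links;
}

/** Get fresh metadata in a CLI script where run time is not a concern. */
function example_fresh_meta(string $id)
{
    // An empty key requests the array of all metadata items.
    return p_get_metadata($id, '', METADATA_RENDER_USING_CACHE | METADATA_RENDER_UNLIMITED);
}

/** Store a plugin value that should be kept across the next rendering. */
function example_remember(string $id, string $value)
{
    // Only top-level keys are addressed here; $render=false, $persistent=true.
    return p_set_metadata($id, ['plugin_example' => ['note' => $value]], false, true);
}
```

METADATA_DONT_RENDER is used in the loop because each call could otherwise trigger parsing and rendering of a page, which adds up quickly over many pages.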

Metadata and caching

In general, metadata is rendered on demand when p_get_metadata is called. This normally happens right after the redirect that follows saving a page, but also from time to time when the cache expires, when it is expired by a plugin using the PARSER_CACHE_USE event, or when caching has been disabled in the renderer (but at most once per request). The cache file itself only stores a timestamp. The timestamp is always updated when metadata is rendered; the .meta file is only updated when the metadata has actually changed (the xhtml cache depends on it, so that way it is only updated when really needed).
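The "always touch the timestamp, rewrite the data only on change" behaviour can be modelled in a few lines. This is a simplified stand-alone model, not DokuWiki's actual code; the function name and both file paths are made up for the demonstration.

```php
<?php
/**
 * Simplified model, not DokuWiki's actual code: the cache timestamp is
 * always refreshed, the metadata file is rewritten only when the data differs.
 * Both file paths are supplied by the caller.
 */
function example_store(string $metaFile, string $cacheFile, array $old, array $new): bool
{
    $changed = ($old != $new);
    if ($changed) {
        file_put_contents($metaFile, serialize($new));
    }
    file_put_contents($cacheFile, (string) time()); // always touched
    return $changed;
}

$dir    = sys_get_temp_dir();
$metaF  = tempnam($dir, 'meta');
$cacheF = tempnam($dir, 'cache');

$old = ['current' => ['title' => 'A'], 'persistent' => []];
$first = example_store($metaF, $cacheF, $old, $old);   // unchanged data

$new = $old;
$new['current']['title'] = 'B';
$second = example_store($metaF, $cacheF, $old, $new);  // changed data

var_dump($first, $second);

unlink($metaF);
unlink($cacheF);
```

Leaving the metadata file untouched when nothing changed is what keeps dependent caches, such as the xhtml cache, from being invalidated needlessly.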

When metadata is requested inside the cache handler, the old metadata is returned. That way you can compare new data to the old stored metadata in order to decide whether to use the cache or not. In the xhtml cache handler you get the new metadata, but because the xhtml cache depends on the metadata, the xhtml is updated whenever you change the metadata.

In versions prior to 2011, metadata was only rendered when the xhtml was rendered. Back then you got the old metadata in the xhtml cache handler, plugins that still rely on this need to be updated.

Metadata index

Since the 2011-05-25 (“Rincewind”) release there is an index where metadata properties can be stored. It is organized in a similar manner to the fulltext index and uses the same page list, but with separate word indexes for each indexed metadata property; they are named $metaname_w.idx, $metaname_i.idx and $metaname_p.idx. In DokuWiki itself the properties relation_references and title are currently indexed. Plugins can add their own metadata keys, and it is also possible to add arbitrary data to the index. This can be done with the INDEXER_PAGE_ADD event. Plugins need to make sure they add themselves to the indexer version using the INDEXER_VERSION_GET event; the index of a page is re-created when this version differs from the version with which it was indexed before. All metadata indexes are recorded in the metadata.idx index so that deleted pages can be removed from all metadata indexes.
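The two events can be sketched as plain functions that operate on the event data array. The data layout used here ('page' and 'metadata' keys for INDEXER_PAGE_ADD, one version entry per plugin for INDEXER_VERSION_GET) is an assumption for this sketch, as are all 'plugin_example' names; check the event documentation before relying on it.

```php
<?php
/**
 * Handler logic as plain functions working on the event data array.
 * The data layout ('page' and 'metadata' keys for INDEXER_PAGE_ADD, one
 * version entry per plugin for INDEXER_VERSION_GET) is an assumption here,
 * as are the 'plugin_example' names.
 */
function example_page_add(array &$data, array $colors): void
{
    // Values stored under this key are meant for a metadata index of that name.
    $data['metadata']['plugin_example_color'] = array_values(array_unique($colors));
}

function example_version_get(array &$data): void
{
    // Change this string whenever the format of the indexed data changes.
    $data['plugin_example'] = '1';
}

$pageAdd = ['page' => 'wiki:start', 'metadata' => []];
example_page_add($pageAdd, ['red', 'red', 'blue']);

$version = [];
example_version_get($version);

print_r($pageAdd['metadata']);
print_r($version);
```

Registering a version entry is what allows pages indexed before the plugin was installed or changed to be picked up for re-indexing.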

The data is updated right after the fulltext index, so it can be regenerated in the same way. When a plugin wants to force an update of the index for a certain page, it can delete the .indexed meta file of that page (the index is not updated automatically when metadata changes, only when the page itself changes).
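Deleting the marker file could look like the sketch below. The function name is hypothetical, and the path is passed in so the code runs on its own; inside DokuWiki the path would typically be obtained with metaFN($id, '.indexed').

```php
<?php
/**
 * Remove the ".indexed" marker so the page is indexed again.
 * Inside DokuWiki the path would typically come from metaFN($id, '.indexed');
 * it is passed in here so the function can run on its own.
 */
function example_force_reindex(string $indexedFile): bool
{
    if (!file_exists($indexedFile)) {
        return true; // nothing to remove
    }
    return unlink($indexedFile);
}

// Stand-alone demonstration with a temporary file as the marker.
$marker  = tempnam(sys_get_temp_dir(), 'idx');
$removed = example_force_reindex($marker);
var_dump($removed, file_exists($marker));
```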

The indexer object (which can be obtained by using idx_get_indexer) supports the following methods for metadata:

  • addMetaKeys($page, $key, $value=null) - adds one or more metadata entries to a page (normally this should be done using INDEXER_PAGE_ADD but if plugins want to update the index explicitly and immediately this function can be used)
  • lookupKey($key, &$value, $func=null) - for looking up all pages where a certain metadata key has the specified value. It is possible to pass multiple keys as array, then an array with matches for each key is returned. Additionally with the $func parameter it is possible to pass a comparison function like preg_match.
  • getPages($key=null) - if the $key parameter is set only pages where the metadata key $key is set to at least one value are returned.

Example for getting the ids of all pages that link to a certain page:

$result = idx_get_indexer()->lookupKey('relation_references', $id);

(note that this functionality including an ACL check is available as ft_backlinks($id)).
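Two further queries, sketched under assumptions: the helpers only work inside DokuWiki, where idx_get_indexer() is available, and the helper names are hypothetical. The regular-expression lookup assumes that the comparison function receives the search value first and the indexed value second, which is what passing preg_match suggests.

```php
<?php
// Works only inside DokuWiki, where idx_get_indexer() is available.
// Function names are hypothetical.

/** IDs of pages whose indexed title matches a regular expression. */
function example_pages_by_title(string $pattern)
{
    // lookupKey() takes the value by reference, so it must be a variable.
    $value = $pattern;
    return idx_get_indexer()->lookupKey('title', $value, 'preg_match');
}

/** All pages that have at least one value indexed for the given key. */
function example_pages_with_key(string $key)
{
    return idx_get_indexer()->getPages($key);
}
```

Unlike ft_backlinks(), these direct index lookups include no ACL check, so results should be filtered before they are shown to users.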

For more advanced queries (like getting all values stored for a certain metadata property) it may be necessary to access the index files directly using idx_getIndex. Feel free to suggest additional features for the metadata index in the bug tracker.

The tag plugin uses the metadata index. Its helper part contains examples of how the index can be queried, and its action part shows how the index is written.

devel/metadata.txt · Last modified: 2023-02-28 07:43 by saggi

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International